
OR-Bench: An Over-Refusal Benchmark for Large Language Models

31 May 2024
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
Main: 9 pages · 8 figures · 15 tables · Bibliography: 6 pages · Appendix: 13 pages
Abstract

Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, this enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, systematic measurement is challenging due to the difficulty of crafting prompts that elicit the over-refusal behaviors of LLMs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 over-refusal prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 32 popular LLMs across 8 model families. Our datasets are publicly available at this https URL and our codebase is open-sourced at this https URL. We hope this benchmark can help the community develop better safety-aligned models.

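To make the evaluation described in the abstract concrete, the sketch below shows one way to estimate an over-refusal rate on a set of seemingly safe prompts. The Hugging Face dataset id, config name, and column name are assumptions for illustration (check the OR-Bench repository for the actual identifiers), and the keyword-based refusal check is a rough stand-in for the LLM judge the paper uses.

# Minimal sketch: estimating over-refusal on OR-Bench-style prompts.
# Dataset id, config, and column names are assumptions; the refusal
# detector below is a crude keyword heuristic, not the paper's judge.

from datasets import load_dataset

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")

def looks_like_refusal(response: str) -> bool:
    # Treat a response containing a common refusal phrase as a refusal.
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def over_refusal_rate(prompts, generate) -> float:
    # Fraction of seemingly safe prompts the model refuses to answer.
    # `generate` is any callable mapping a prompt string to a model response.
    refusals = sum(looks_like_refusal(generate(p)) for p in prompts)
    return refusals / len(prompts)

if __name__ == "__main__":
    # Assumed dataset id, config, split, and column name (hypothetical).
    ds = load_dataset("bench-llm/or-bench", "or-bench-hard-1k", split="train")
    prompts = ds["prompt"]

    def generate(prompt: str) -> str:
        # Plug in your own LLM client here.
        raise NotImplementedError("wire up your model call")

    # print(f"over-refusal rate: {over_refusal_rate(prompts, generate):.2%}")

A parallel run over the toxic subset (refusals there are desirable) guards against models that simply answer everything, which is why the benchmark includes the 600 toxic prompts.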
@article{cui2025_2405.20947,
  title={OR-Bench: An Over-Refusal Benchmark for Large Language Models},
  author={Justin Cui and Wei-Lin Chiang and Ion Stoica and Cho-Jui Hsieh},
  journal={arXiv preprint arXiv:2405.20947},
  year={2025}
}