OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
ALM · 31 May 2024 · arXiv:2405.20947

Papers citing "OR-Bench: An Over-Refusal Benchmark for Large Language Models"

50 / 93 papers shown

Multi-objective Large Language Model Alignment with Hierarchical Experts
Zhuo Li, Guodong DU, Weiyang Guo, Yigeng Zhou, Xiucheng Li, ..., Fangming Liu, Yequan Wang, Deheng Ye, Min Zhang, Jing Li
ALM, MoE · 27 May 2025

Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models
Baihui Zheng, Boren Zheng, Kerui Cao, Y. Tan, Zhendong Liu, ..., Jian Yang, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
ELM · 26 May 2025

Relative Bias: A Comparative Framework for Quantifying Bias in LLMs
Alireza Arbabi, Florian Kerschbaum
22 May 2025

RAR: Setting Knowledge Tripwires for Retrieval Augmented Rejection
T. M. Buonocore, Enea Parimbelli
AAML · 19 May 2025

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang
17 May 2025

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
12 May 2025

aiXamine: Simplified LLM Safety and Security
Fatih Deniz, Dorde Popovic, Yazan Boshmaf, Euisuh Jeong, M. Ahmad, Sanjay Chawla, Issa M. Khalil
ELM · 21 Apr 2025

Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi
Monojit Choudhury, Shivam Chauhan, Rocktim Jyoti Das, Dhruv Sahnan, Xudong Han, ..., Rituraj Joshi, Gurpreet Gosal, Avraham Sheinin, Natalia Vassilieva, Preslav Nakov
08 Apr 2025

Arti-"fickle" Intelligence: Using LLMs as a Tool for Inference in the Political and Social Sciences
Lisa P. Argyle, Ethan C. Busby, Joshua R Gubler, Bryce Hepner, Alex Lyman, David Wingate
LLMAG · 04 Apr 2025

Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification
Yingjie Zhang, Tong Liu, Zhe Zhao, Guozhu Meng, Kai Chen
AAML · 14 Mar 2025

Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Ling Team, B. Zeng, Chenyu Huang, Chao Zhang, Changxin Tian, ..., Zhaoxin Huan, Zujie Wen, Zhenhang Sun, Zhuoxuan Du, Z. He
MoE, ALM · 07 Mar 2025

LLM-Safety Evaluations Lack Robustness
Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann
ALM, ELM · 04 Mar 2025

Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh
Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yanjie Wang, Zhuohan Xie, ..., Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Larry Murray, Preslav Nakov
ALM, KELM · 03 Mar 2025

Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries
David Noever, Grant Rosario
20 Feb 2025

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang
AAML · 11 Feb 2025

Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Javier Rando, Jie Zhang, Nicholas Carlini, F. Tramèr
AAML, ELM · 04 Feb 2025

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song
AI4CE · 16 Oct 2024

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Andrey Anurin, Jonathan Ng, Kibo Schaffer, Jason Schreiber, Esben Kran
ELM · 10 Oct 2024

You Know What I'm Saying: Jailbreak Attack via Implicit Reference
Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo
04 Oct 2024

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
04 Oct 2024

The Perfect Blend: Redefining RLHF with Mixture of Judges
Tengyu Xu, Eryk Helenowski, Karthik Abinav Sankararaman, Di Jin, Kaiyan Peng, ..., Gabriel Cohen, Yuandong Tian, Hao Ma, Sinong Wang, Han Fang
30 Sep 2024

Towards Safe Multilingual Frontier AI
Artūrs Kanepajs, Vladimir Ivanov, Richard Moulange
06 Sep 2024

Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan N. Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
LLMSV · 06 Sep 2024

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
AAML · 01 Sep 2024

ShieldGemma: Generative AI Content Moderation Based on Gemma
Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, ..., Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, O. Wahltinez
AI4MH · 31 Jul 2024

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, ..., Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
26 Jun 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
26 Jun 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, ..., Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ALM, ELM · 20 Jun 2024

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
ELM, ALM · 18 Jun 2024

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, A. A. Awan, J. Aneja, Ahmed Hassan Awadallah, ..., Li Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou
LRM, ALM · 22 Apr 2024

Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie, Jiayang Song, Zhehua Zhou, Yuheng Huang, Da Song, Lei Ma
OffRL · 12 Apr 2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Simone Tedeschi, Felix Friedrich, P. Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li
ELM · 06 Apr 2024

Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models
Yi-Lin Tuan, Xilun Chen, Eric Michael Smith, Louis Martin, Soumya Batra, Asli Celikyilmaz, William Yang Wang, Daniel M. Bikel
01 Apr 2024

Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, ..., Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, Kathleen Kenealy
VLM, LLMAG · 13 Mar 2024

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, ..., Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktaschel, Roberta Raileanu
SyDa · 26 Feb 2024

Defending LLMs against Jailbreaking Attacks via Backtranslation
Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
AAML · 26 Feb 2024

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
AAML · 25 Feb 2024

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence
Timothy R. McIntosh, Teo Susnjak, Tong Liu, Paul Watters, Malka N. Halgamuge
ALM, ELM · 15 Feb 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
ELM · 07 Feb 2024

On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
AAML · 31 Jan 2024

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, ..., Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
ELM · 18 Jan 2024

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
12 Jan 2024

Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, A. Mensch, Blanche Savary, ..., Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
MoE, LLMAG · 08 Jan 2024

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, ..., Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
AI4MH · 07 Dec 2023

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
AAML, LRM · 13 Nov 2023

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation
Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang
AI4MH · 26 Oct 2023

Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
19 Oct 2023

Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao, Alexander Robey, Yan Sun, Hamed Hassani, George J. Pappas, Eric Wong
AAML · 12 Oct 2023

Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang
10 Oct 2023

SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
Liang Xu, Kangkang Zhao, Lei Zhu, Hang Xue
ELM, ALM · 09 Oct 2023