ResearchTrend.AI


arXiv: 2411.01111
Rule Based Rewards for Language Model Safety

2 November 2024
Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng

Papers citing "Rule Based Rewards for Language Model Safety"

23 papers shown
EQA-RM: A Generative Embodied Reward Model with Test-time Scaling
Yuhang Chen, Zhen Tan, Tianlong Chen (12 Jun 2025)

Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints
Yaswanth Chittepu, Blossom Metevier, Will Schwarzer, Austin Hoag, S. Niekum, Philip S Thomas (09 Jun 2025)

From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment
Kyubyung Chae, Hyunbin Jin, Taesup Kim (07 Jun 2025)

Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong (06 Jun 2025)

R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu (04 Jun 2025)

Contrastive Distillation of Emotion Knowledge from LLMs for Zero-Shot Emotion Recognition
Minxue Niu, E. Provost (23 May 2025)

Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy, Mostafa Elhoushi, Amr Alanwar (22 May 2025)

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, Hao Peng (16 May 2025)

RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
Yuanhang Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong (14 Apr 2025)

Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu (06 Apr 2025)

Unity RL Playground: A Versatile Reinforcement Learning Framework for Mobile Robots
Linqi Ye, Rankun Li, Xiaowen Hu, Jiayi Li, Boyang Xing, Yan Peng, Bin Liang (07 Mar 2025)

Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges
Francisco Eiras, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan (06 Mar 2025)

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha (18 Feb 2025)

Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Yingshui Tan, Yilei Jiang, Yongbin Li, Qingbin Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng (17 Feb 2025)

"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Isha Gupta, David Khachaturov, Robert D. Mullins (02 Feb 2025)

Reinforcement Learning Enhanced LLMs: A Survey
Shuhe Wang, Shengyu Zhang, Jing Zhang, Runyi Hu, Xiaoya Li, Tianwei Zhang, Jiwei Li, Leilei Gan, G. Wang, Eduard H. Hovy (05 Dec 2024)

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yong Liu, ..., Steve Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou (27 Nov 2024)

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization
Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, M. Sabuncu, Xia Song (16 Oct 2024)

RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, ..., Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang (13 Oct 2024)

Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han (09 Oct 2024)

AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
Zhe Su, Xuhui Zhou, Sanketh Rangreji, Anubha Kabra, Julia Mendelsohn, Faeze Brahman, Maarten Sap (13 Sep 2024)

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny, Adel Bibi (27 Aug 2024)

Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins (02 Jun 2024)