Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
arXiv:2204.05862, 12 April 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, S. E. Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott R. Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
Papers citing "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (50 of 654 papers shown)
Doubly Robust Alignment for Large Language Models. Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi. 01 Jun 2025.
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models. Narmeen Oozeer, Luke Marks, Fazl Barez, Amirali Abdullah. [LLMSV]. 30 May 2025.
Adversarial Preference Learning for Robust LLM Alignment. Yuanfu Wang, Pengyu Wang, Chenyang Xi, Bo Tang, Junyi Zhu, ..., Keming Mao, Zhiyu Li, Feiyu Xiong, Jie Hu, Mingchuan Yang. [AAML]. 30 May 2025.
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards. Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Kopf. [OffRL, LRM]. 30 May 2025.
On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective. Gengze Xu, Wei Yao, Ziqiao Wang, Yong Liu. 30 May 2025.
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning. Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Boyao Wang, Tong Zhang, Han Zhao. 30 May 2025.
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences. Soichiro Nishimori, Yu Zhang, Thanawat Lodkaew, Masashi Sugiyama. [NoLa]. 30 May 2025.
Accelerating RLHF Training with Reward Variance Increase. Zonglin Yang, Zhexuan Gu, Houduo Qi, Yancheng Yuan. 29 May 2025.
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective. Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, Yong Liu. 29 May 2025.
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO. Kaiyang Guo, Yinchuan Li, Zhitang Chen. 29 May 2025.
MAP: Revisiting Weight Decomposition for Low-Rank Adaptation. Chongjie Si, Zhiyi Shi, Yadao Wang, Xiaokang Yang, Susanto Rahardja, Wei Shen. 29 May 2025.
Text2Grad: Reinforcement Learning from Natural Language Feedback. Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang. 28 May 2025.
Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced (R²)GRPO. Ran Li, Shimin Di, Yuchen Liu, Chen Jing, Yu Qiu, Lei Chen. [LRM]. 28 May 2025.
Large Language Models Often Know When They Are Being Evaluated. Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn. [LLMAG, ELM, ALM]. 28 May 2025.
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition. Hanting Chen, Yasheng Wang, Kai Han, Dong Li, Lin Li, ..., Hailin Hu, Yehui Tang, Dacheng Tao, Xinghao Chen, Yunhe Wang. [LRM]. 28 May 2025.
Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations. Zeinab Dehghani, Koorosh Aslansefat, Adil Khan, Mohammed Naveed Akram. [MILM, LRM]. 27 May 2025.
Can Large Reasoning Models Self-Train? Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, J. Schneider, Andrea Zanette. [ReLM, OffRL, LRM]. 27 May 2025.
RRO: LLM Agent Optimization Through Rising Reward Trajectories. Zilong Wang, Jingfeng Yang, Sreyashi Nag, Samarth Varshney, Xianfeng Tang, Haoming Jiang, Jingbo Shang, Sheikh Sarwar. [LRM]. 27 May 2025.
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge. Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran. [ALM, ELM]. 27 May 2025.
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space. Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei. [AAML]. 27 May 2025.
Token-Importance Guided Direct Preference Optimization. Yang Ning, Lin Hai, Liu Yibo, Tian Baoliang, Liu Guoqing, Zhang Haijun. 26 May 2025.
Learning a Pessimistic Reward Model in RLHF. Yinglun Xu, Hangoo Kang, Tarun Suresh, Yuxuan Wan, Gagandeep Singh. [OffRL]. 26 May 2025.
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs. Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee. [AAML]. 26 May 2025.
Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts. H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim. 26 May 2025.
Multi-Domain Explainability of Preferences. Nitay Calderon, Liat Ein-Dor, Roi Reichart. [LRM]. 26 May 2025.
Accelerating Nash Learning from Human Feedback via Mirror Prox. D. Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Ménard. 26 May 2025.
Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models. Y. Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, ..., Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung. 26 May 2025.
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models. Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao. 26 May 2025.
What Can RL Bring to VLA Generalization? An Empirical Study. Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang. [OffRL]. 26 May 2025.
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models. Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, ..., Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li. 25 May 2025.
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment. Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low. 25 May 2025.
Incentivizing High-Quality Human Annotations with Golden Questions. Shang Liu, Zhongze Cai, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li. 25 May 2025.
SATORI-R1: Incentivizing Multimodal Reasoning with Spatial Grounding and Verifiable Rewards. Chuming Shen, Wei Wei, Xiaoye Qu, Yu Cheng. [LRM]. 25 May 2025.
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas. Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin. 25 May 2025.
Optimal Transport-Based Token Weighting Scheme for Enhanced Preference Optimization. Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng. 24 May 2025.
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects. Reva Schwartz, Rumman Chowdhury, Akash Kundu, Heather Frase, Marzieh Fadaee, ..., Andrew Thompson, Maya Carlyle, Qinghua Lu, Matthew Holmes, Theodora Skeadas. 24 May 2025.
Flex-Judge: Think Once, Judge Anywhere. Jongwoo Ko, S. Kim, Sungwoo Cho, Se-Young Yun. [ELM, LRM]. 24 May 2025.
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms. Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Ningyu Zhang, N. Zhang. [LLMSV]. 23 May 2025.
Stable Reinforcement Learning for Efficient Reasoning. Muzhi Dai, Shixuan Liu, Qingyi Si. [OffRL, LRM]. 23 May 2025.
Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios. Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun. [LLMAG]. 23 May 2025.
AI-Augmented LLMs Achieve Therapist-Level Responses in Motivational Interviewing. Yinghui Huang, Yuxuan Jiang, Hui Liu, Yixin Cai, Weiqing Li, Xiangen Hu. [AI4MH]. 23 May 2025.
Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models. Vijeta Deshpande, Debasmita Ghose, John D. Patterson, Roger Beaty, Anna Rumshisky. 22 May 2025.
Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision. Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, ..., Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu. [LRM]. 21 May 2025.
Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis. Hong Huang, Dapeng Wu. 20 May 2025.
Self-Evolving Curriculum for LLM Reasoning. Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Angelard-Gontier, Yoshua Bengio, Ehsan Kamalloo. [ReLM, LRM]. 20 May 2025.
Safety Alignment Can Be Not Superficial With Explicit Safety Signals. Jianwei Li, Jung-Eng Kim. [AAML]. 19 May 2025.
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization. Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang. 18 May 2025.
ExpertSteer: Intervening in LLMs through Expert Knowledge. Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch. [LLMSV]. 18 May 2025.
SPIRIT: Patching Speech Language Models against Jailbreak Attacks. Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, Nils Lukas. [AAML]. 18 May 2025.
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment. Wenqiao Zhu, Ji Liu, Lulu Wang, Jun Wu, Yulun Zhang. 18 May 2025.