Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
arXiv:2204.05862 · 12 April 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, S. El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott R. Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan
ArXiv (abs) · PDF · HTML

Papers citing "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (50 of 654 papers shown)

Better Language Model Inversion by Compactly Representing Next-Token Distributions
Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta · 20 Jun 2025

No Free Lunch: Rethinking Internal Feedback for LLM Reasoning
Yanzhi Zhang, Zhaoxi Zhang, Haoxiang Guan, Yilin Cheng, Yitong Duan, Chen Wang, Yue Wang, Shuxin Zheng, Jiyan He · 20 Jun 2025 · Tags: ReLM, LRM

Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples
Soumya Suvra Ghosal, Vaibhav Singh, Akash Ghosh, Soumyabrata Pal, Subhadip Baidya, Sriparna Saha, Dinesh Manocha · 19 Jun 2025

Reranking-based Generation for Unbiased Perspective Summarization
Narutatsu Ri, Nicholas Deas, Kathleen McKeown · 19 Jun 2025 · Tags: OffRL

Can structural correspondences ground real world representational content in Large Language Models?
Iwan Williams · 19 Jun 2025

Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Jing Yang Lee, Kong-Aik Lee, Woon-Seng Gan · 18 Jun 2025

Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, S. Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine · 18 Jun 2025 · Tags: OffRL

Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan L. Boyd-Graber · 18 Jun 2025

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
Gabriel J. Perin, Runjin Chen, Xuxi Chen, Nina S. T. Hirata, Zhangyang Wang, Junyuan Hong · 18 Jun 2025 · Tags: AAML

GRAM: A Generative Foundation Reward Model for Reward Generalization
Chenglong Wang, Yang Gan, Yifu Huo, Yongyu Mu, Qiaozhi He, ..., Bei Li, Tong Xiao, Chunliang Zhang, Tongran Liu, Jingbo Zhu · 17 Jun 2025 · Tags: ALM, OffRL, LRM

Collaborative Editable Model
Kaiwen Tang, Aitong Wu, Yao Lu, Guangda Sun · 17 Jun 2025 · Tags: KELM

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael · 17 Jun 2025 · Tags: AAML, ELM

The Safety Reminder: A Soft Prompt to Reactivate Delayed Safety Awareness in Vision-Language Models
Peiyuan Tang, Haojie Xin, Xiaodong Zhang, Jun Sun, Qin Xia, Zijiang Yang · 15 Jun 2025 · Tags: VLM

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025
Zonghao Ying, Siyang Wu, Run Hao, Peng Ying, Shixuan Sun, ..., Xianglong Liu, Dawn Song, Alan Yuille, Philip Torr, Dacheng Tao · 14 Jun 2025

Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases
Bonam Mingole, Aditya Majumdar, Firdaus Ahmed Choudhury, Jennifer L. Kraschnewski, S. Shyam Sundar, A. Yadav · 13 Jun 2025 · Tags: LM&MA, ELM, AI4MH

Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin · 13 Jun 2025 · Tags: AAML

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, E. Barsoum · 11 Jun 2025 · Tags: LRM

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai · 11 Jun 2025 · Tags: AI4CE

VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li · 11 Jun 2025 · Tags: OffRL

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li, Qiang Sheng, Yehan Yang, Xueyao Zhang, Juan Cao · 11 Jun 2025

Intra-Trajectory Consistency for Reward Modeling
Chaoyang Zhou, Shunyu Liu, Zengmao Wang, Di Wang, Rong-Cheng Tu, Bo Du, Dacheng Tao · 10 Jun 2025

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
Phuc Minh Nguyen, Ngoc-Hieu Nguyen, Duy Nguyen, Anji Liu, An Mai, Binh T. Nguyen, Daniel Sonntag, Khoa D. Doan · 10 Jun 2025

GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
Yiyang Zhao, Huiyu Bai, Xuejiao Zhao · 10 Jun 2025 · Tags: OffRL

Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Feifan Song, Shaohang Wei, Wen Luo, Yuxuan Fan, Tianyu Liu, Guoyin Wang, Houfeng Wang · 09 Jun 2025

Explicit Preference Optimization: No Need for an Implicit Reward Model
Xiangkun Hu, Lemin Kong, Tong He, David Wipf · 09 Jun 2025

Robotic Policy Learning via Human-assisted Action Preference Optimization
Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, Di Hu · 08 Jun 2025

History-Aware Cross-Attention Reinforcement: Self-Supervised Multi Turn and Chain-of-Thought Fine-Tuning with vLLM
Andrew Kiruluta, Andreas Lemos, Priscilla Burity · 08 Jun 2025 · Tags: LRM

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He · 08 Jun 2025

AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization
Zixuan Jiang, Renjing Xu · 08 Jun 2025

Tokenized Bandit for LLM Decoding and Alignment
Suho Shin, Chenghao Yang, Haifeng Xu, Mohammad T. Hajiaghayi · 08 Jun 2025

SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Z. Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo · 07 Jun 2025 · Tags: AILaw, ALM, ELM

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak · 06 Jun 2025

Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, ..., Lin Qu, Wenbo Su, Wei Wang, Jiamang Wang, Bo Zheng · 06 Jun 2025 · Tags: OffRL

Saffron-1: Safety Inference Scaling
Ruizhong Qiu, Gaotang Li, Tianxin Wei, Jingrui He, Hanghang Tong · 06 Jun 2025 · Tags: LRM

Distillation Robustifies Unlearning
Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner · 06 Jun 2025 · Tags: MU

The Lock-in Hypothesis: Stagnation by Algorithm
Tianyi Qiu, Zhonghao He, Tejasveer Chugh, Max Kleiman-Weiner · 06 Jun 2025

SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat
Yuru Jiang, Wenxuan Ding, Shangbin Feng, Greg Durrett, Yulia Tsvetkov · 05 Jun 2025

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
Y. Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, Roy Ka-wei Lee · 04 Jun 2025

Robust Preference Optimization via Dynamic Target Margins
Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang · 04 Jun 2025

Misalignment or misuse? The AGI alignment tradeoff
Max Hellrigel-Holderbaum, Leonard Dung · 04 Jun 2025

Crowd-SFT: Crowdsourcing for LLM Alignment
Alex Sotiropoulos, Sulyab Thottungal Valapu, Linus Lei, J. Coleman, Bhaskar Krishnamachari · 04 Jun 2025 · Tags: ALM

RewardAnything: Generalizable Principle-Following Reward Models
Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye · 04 Jun 2025 · Tags: LRM

Think Twice, Act Once: A Co-Evolution Framework of LLM and RL for Large-Scale Decision Making
Xu Wan, Wenyue Xu, Chao Yang, Mingyang Sun · 03 Jun 2025

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
Shenghua He, Tian Xia, Xuan Zhou, Hui Wei · 03 Jun 2025 · Tags: OffRL

SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien · 01 Jun 2025 · Tags: LLMSV

Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models
Bumjin Park, Jinsil Lee, Jaesik Choi · 01 Jun 2025

HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
Songtao Jiang, Yan Zhang, Yeying Jin, Zhihang Tang, Y. Wu, Yang Feng, Jian Wu, Zuozhu Liu · 01 Jun 2025

Generalizable LLM Learning of Graph Synthetic Data with Reinforcement Learning
Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xinyun Liu, Yulia Tsvetkov · 01 Jun 2025 · Tags: OffRL

Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models
William Overman, Mohsen Bayati · 01 Jun 2025

Aligning VLM Assistants with Personalized Situated Cognition
Yongqi Li, Shen Zhou, Xiaohu Li, Xin Miao, Jintao Wen, ..., Birong Pan, Hankun Kang, Yuanyuan Zhu, Ming Zhong, T. Qian · 01 Jun 2025