Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2204.05862
Cited By
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
12 April 2022
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
Nova Dassarma
Dawn Drain
Stanislav Fort
Deep Ganguli
T. Henighan
Nicholas Joseph
Saurav Kadavath
John Kernion
Tom Conerly
S. E. Showk
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
Tristan Hume
Scott R. Johnston
Shauna Kravec
Liane Lovitt
Neel Nanda
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
50 / 655 papers shown
Title
PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
Linqing Chen
Weilei Wang
Zilong Bai
Peng Xu
Yan Fang
...
Lisha Zhang
Fu Bian
Zhongkai Ye
Lidong Pei
Changyang Tu
AI4MH
LM&MA
107
3
0
26 Jun 2024
From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake
Eunsol Choi
Greg Durrett
117
14
0
25 Jun 2024
Large Language Models Assume People are More Rational than We Really are
Ryan Liu
Jiayi Geng
Joshua C. Peterson
Ilia Sucholutsky
Thomas Griffiths
164
20
0
24 Jun 2024
PORT: Preference Optimization on Reasoning Traces
Salem Lahlou
Abdalgader Abubaker
Hakim Hacid
LRM
122
5
0
23 Jun 2024
Pareto-Optimal Learning from Preferences with Hidden Context
Ryan Boldi
Li Ding
Lee Spector
S. Niekum
153
6
0
21 Jun 2024
Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Han Jiang
Xiaoyuan Yi
Zhihua Wei
Ziang Xiao
Shu Wang
Xing Xie
ELM
ALM
166
8
0
20 Jun 2024
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
Seungbeen Lee
Seungwon Lim
Seungju Han
Giyeong Oh
Hyungjoo Chae
...
Beong-woo Kwak
Yeonsoo Lee
Dongha Lee
Jinyoung Yeo
Youngjae Yu
101
16
0
20 Jun 2024
FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering
Tianchi Cai
Zhiwen Tan
Xierui Song
Tao Sun
Jiyan Jiang
Yunqi Xu
Yinger Zhang
Jinjie Gu
82
7
0
19 Jun 2024
[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
Abhinav Rao
Monojit Choudhury
Somak Aditya
77
0
0
18 Jun 2024
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li
Zhangchen Xu
Fengqing Jiang
Luyao Niu
D. Sahabandu
Bhaskar Ramasubramanian
Radha Poovendran
SILM
AAML
124
10
0
18 Jun 2024
Is poisoning a real threat to LLM alignment? Maybe more so than you think
Pankayaraj Pathmanathan
Souradip Chakraborty
Xiangyu Liu
Yongyuan Liang
Furong Huang
AAML
132
17
0
17 Jun 2024
Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini
Somnath Basu Roy Chowdhury
Snigdha Chaturvedi
187
9
0
17 Jun 2024
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
Yongting Zhang
Lu Chen
Guodong Zheng
Yifeng Gao
Rui Zheng
...
Yu Qiao
Xuanjing Huang
Feng Zhao
Tao Gui
Jing Shao
VLM
228
33
0
17 Jun 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Zhi Gong
Yankai Lin
Ji-Rong Wen
138
16
0
17 Jun 2024
Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models
Jie Chen
Xintian Han
Yu Ma
Xun Zhou
Liang Xiang
ALM
LRM
76
2
0
14 Jun 2024
Teaching Language Models to Self-Improve by Learning from Language Feedback
Chi Hu
Yimin Hu
Hang Cao
Tong Xiao
Jingbo Zhu
LRM
VLM
83
5
0
11 Jun 2024
3D-Properties: Identifying Challenges in DPO and Charting a Path Forward
Yuzi Yan
Yibo Miao
J. Li
Yipin Zhang
Jian Xie
Zhijie Deng
Dong Yan
112
13
0
11 Jun 2024
Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art
Chen Cecilia Liu
Iryna Gurevych
Anna Korhonen
191
6
0
06 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
95
1
0
04 Jun 2024
AI as Decision-Maker: Ethics and Risk Preferences of LLMs
Shumiao Ouyang
Hayong Yun
Xingjian Zheng
89
4
0
03 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
123
12
0
02 Jun 2024
Phased Instruction Fine-Tuning for Large Language Models
Wei Pang
Chuan Zhou
Xiao-Hua Zhou
Xiaojie Wang
ALM
85
5
0
01 Jun 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
165
55
0
31 May 2024
Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries
Jiahao Yu
Haozheng Luo
Jerry Yao-Chieh Hu
Wenbo Guo
Han Liu
Xinyu Xing
113
21
0
31 May 2024
Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation
Guillaume Huguet
James Vuckovic
Kilian Fatras
Eric Thibodeau-Laufer
Pablo Lemos
...
Jarrid Rector-Brooks
Tara Akhound-Sadegh
Michael M. Bronstein
Alexander Tong
A. Bose
105
32
0
30 May 2024
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong
Xiangyu Qi
Pin-Yu Chen
Tsung-Yi Ho
AAML
124
22
0
30 May 2024
Offline Regularised Reinforcement Learning for Large Language Models Alignment
Pierre Harvey Richemond
Yunhao Tang
Daniel Guo
Daniele Calandriello
M. G. Azar
...
Gil Shamir
Rishabh Joshi
Tianqi Liu
Rémi Munos
Bilal Piot
OffRL
123
29
0
29 May 2024
Robust Preference Optimization through Reward Model Distillation
Adam Fisch
Jacob Eisenstein
Vicky Zayats
Alekh Agarwal
Ahmad Beirami
Chirag Nagpal
Peter Shaw
Jonathan Berant
160
37
0
29 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
160
20
0
28 May 2024
BWArea Model: Learning World Model, Inverse Dynamics, and Policy for Controllable Language Generation
Chengxing Jia
Pengyuan Wang
Ziniu Li
Yi-Chen Li
Zhilong Zhang
Nan Tang
Yang Yu
OffRL
64
2
0
27 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu
Yu-Lin Tsai
Chih-Hsun Lin
Pin-Yu Chen
Chia-Mu Yu
Chun-ying Huang
150
56
0
27 May 2024
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu
Miao Lu
Shenao Zhang
Boyi Liu
Hongyi Guo
Yingxiang Yang
Jose H. Blanchet
Zhaoran Wang
147
62
0
26 May 2024
Synthesizing Programmatic Reinforcement Learning Policies with Large Language Model Guided Search
Max Liu
Chan-Hung Yu
Wei-Hsu Lee
Cheng-Wei Hung
Yen-Chun Chen
Shao-Hua Sun
145
5
0
26 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
151
15
0
26 May 2024
Bayesian WeakS-to-Strong from Text Classification to Generation
Ziyun Cui
Ziyang Zhang
Wen Wu
Wen Wu
Chao Zhang
130
3
0
24 May 2024
Extracting Prompts by Inverting LLM Outputs
Collin Zhang
John X. Morris
Vitaly Shmatikov
71
22
0
23 May 2024
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang
Chong Wang
Xiaojun Jia
Qing Guo
Felix Juefei Xu
Jian Zhang
G. Pu
Yang Liu
111
9
0
23 May 2024
LIRE: listwise reward enhancement for preference alignment
Mingye Zhu
Yi Liu
Lei Zhang
Junbo Guo
Zhendong Mao
58
7
0
22 May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim
Gary Geunbae Lee
AAML
131
3
0
21 May 2024
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging
Xiaobo Liang
Haoke Zhang
Helan hu
Juntao Li
Jun Xu
Min Zhang
ALM
77
3
0
20 May 2024
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback
Ruitao Chen
Liwei Wang
148
1
0
18 May 2024
HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants
Milan Gritta
Gerasimos Lampouras
Ignacio Iacobacci
ALM
69
2
0
15 May 2024
Automating Creativity
Ming-Hui Huang
R. Rust
105
0
0
11 May 2024
Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation
JoonHo Lee
Jae Oh Woo
Juree Seok
Parisa Hassanzadeh
Wooseok Jang
...
Hankyu Moon
Wenjun Hu
Yeong-Dae Kwon
Taehee Lee
Seungjai Min
148
2
0
10 May 2024
SUTRA: Scalable Multilingual Language Model Architecture
Abhijit Bendale
Michael Sapienza
Steven Ripplinger
Simon Gibbs
Jaewon Lee
Pranav Mistry
LRM
ELM
71
5
0
07 May 2024
The Elephant in the Room -- Why AI Safety Demands Diverse Teams
David Rostcheck
Lara Scheibling
63
1
0
07 May 2024
MetaRM: Shifted Distributions Alignment via Meta-Learning
Shihan Dou
Yan Liu
Enyu Zhou
Changze Lv
Haoxiang Jia
...
Junjie Ye
Rui Zheng
Tao Gui
Qi Zhang
Xuanjing Huang
OOD
160
2
0
01 May 2024
Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment
Zhili Liu
Yunhao Gou
Kai Chen
Lanqing Hong
Jiahui Gao
...
Yu Zhang
Zhenguo Li
Xin Jiang
Qiang Liu
James T. Kwok
MoE
247
10
0
01 May 2024
RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
Chanwoo Park
Mingyang Liu
Dingwen Kong
Kaiqing Zhang
Asuman Ozdaglar
153
41
0
30 Apr 2024
DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong
Zikang Shan
Guhao Feng
Wei Xiong
Xinle Cheng
Li Zhao
Di He
Jiang Bian
Liwei Wang
158
72
0
29 Apr 2024
Previous
1
2
3
...
8
9
10
...
12
13
14
Next