Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
arXiv:2307.15217, 27 July 2023
Stephen Casper, Xander Davies, Claudia Shi, T. Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Ségerie, Micah Carroll, Andi Peng, Phillip J. K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, J. Pfau, Dmitrii Krasheninnikov, Xin Chen, L. Langosco, Peter Hase, Erdem Biyik, Anca Dragan, David M. Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Tags: ALM, OffRL
Papers citing "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" (50 of 95 papers shown)
Calibrating Translation Decoding with Quality Estimation on LLMs
Di Wu, Yibin Lei, Christof Monz. 26 Apr 2025.

Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs
Marina Sakharova, Abhinav Anand, Mira Mezini. 21 Apr 2025.

Adversarial Training of Reward Models
Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Ziyi Wang, Oleksii Kuchaiev, Olivier Delalleau, T. Zhao. Tags: AAML. 08 Apr 2025.

Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning
Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, Çağlar Gülçehre. 07 Apr 2025.

OASST-ETC Dataset: Alignment Signals from Eye-tracking Analysis of LLM Responses
Angela Lopez-Cardona, Sebastian Idesis, Miguel Barreda-Ángeles, Sergi Abadal, Ioannis Arapakis. 13 Mar 2025.

Superintelligence Strategy: Expert Version
Dan Hendrycks, Eric Schmidt, Alexandr Wang. 07 Mar 2025.

Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
Rui Li, Peiyi Wang, Jingyuan Ma, Di Zhang, Lei Sha, Zhifang Sui. Tags: LLMAG. 22 Feb 2025.

ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy
Yuhui Chen, Shuai Tian, Shugao Liu, Yingting Zhou, Haoran Li, Dongbin Zhao. Tags: OffRL. 08 Feb 2025.

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi, Xiao-Chang Liu, Iat Long Iong, Hanyu Lai, Xingwu Sun, ..., Shuntian Yao, Tianjie Zhang, Wei Xu, J. Tang, Yuxiao Dong. 28 Jan 2025.

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He, Rui Mao, Qika Lin, Yucheng Ruan, Xiang Lan, Mengling Feng, Min Zhang. Tags: LM&MA, AILaw. 28 Jan 2025.

BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation
Suvodip Dey, M. Desarkar. Tags: OffRL. 20 Jan 2025.

Learning to Assist Humans without Inferring Rewards
Vivek Myers, Evan Ellis, Sergey Levine, Benjamin Eysenbach, Anca Dragan. 17 Jan 2025.

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, ..., Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, Sinong Wang. 16 Jan 2025.

AI Agent for Education: von Neumann Multi-Agent System Framework
Yuan-Hao Jiang, Ruijia Li, Yizhou Zhou, Changyong Qi, Hanglei Hu, Yuang Wei, Bo Jiang, Yonghe Wu. Tags: LLMAG. 03 Jan 2025.

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, Xianglong Liu, Tianqi Liu, Zhipeng Li, Xin Liu, Xiaoxuan Huang. Tags: AILaw, LM&MA, LRM. 31 Dec 2024.

VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni, Pooyan Fazli. Tags: VLM. 01 Dec 2024.

Efficient Alignment of Large Language Models via Data Sampling
Amrit Khera, Rajat Ghosh, Debojyoti Dutta. 15 Nov 2024.

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
Khaoula Chehbouni, Jonathan Colaço-Carr, Yash More, Jackie CK Cheung, G. Farnadi. 12 Nov 2024.

L3Ms -- Lagrange Large Language Models
Guneet S. Dhillon, Xingjian Shi, Yee Whye Teh, Alex Smola. 28 Oct 2024.

Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina
Yuan Gao, Dokyun Lee, Gordon Burtch, Sina Fazelpour. Tags: LRM. 25 Oct 2024.

VideoAgent: Self-Improving Video Generation
Achint Soni, Sreyas Venkataraman, Abhranil Chandra, Sebastian Fischmeister, Percy Liang, Bo Dai, Sherry Yang. Tags: LM&Ro, VGen. 14 Oct 2024.

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
Abhijnan Nath, Changsoo Jung, Ethan Seefried, Nikhil Krishnaswamy. 11 Oct 2024.

GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh. 10 Oct 2024.

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
Angela Lopez-Cardona, Carlos Segura, Alexandros Karatzoglou, Sergi Abadal, Ioannis Arapakis. Tags: ALM. 02 Oct 2024.

Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi. 02 Oct 2024.

An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, F. Tramèr, Javier Rando. Tags: MU, AAML. 26 Sep 2024.

Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits
Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu. 22 Sep 2024.

Uncovering Latent Chain of Thought Vectors in Language Models
Jason Zhang, Scott Viteri. Tags: LLMSV, LRM. 21 Sep 2024.

Problem Solving Through Human-AI Preference-Based Cooperation
Subhabrata Dutta, Timo Kaufmann, Goran Glavaš, Ivan Habernal, Kristian Kersting, Frauke Kreuter, Mira Mezini, Iryna Gurevych, Eyke Hüllermeier, Hinrich Schuetze. 14 Aug 2024.

Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Yuxin Jiang, Bo Huang, Yufei Wang, Xingshan Zeng, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang. 14 Aug 2024.

A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. 02 Jul 2024.

From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake, Eunsol Choi, Greg Durrett. 25 Jun 2024.

Pareto-Optimal Learning from Preferences with Hidden Context
Ryan Boldi, Li Ding, Lee Spector, S. Niekum. 21 Jun 2024.

Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation
JoonHo Lee, Jae Oh Woo, Juree Seok, Parisa Hassanzadeh, Wooseok Jang, ..., Hankyu Moon, Wenjun Hu, Yeong-Dae Kwon, Taehee Lee, Seungjai Min. 10 May 2024.

DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang. 29 Apr 2024.

AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
Minbeom Kim, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung. 18 Apr 2024.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller. 28 Mar 2024.

Towards Efficient Risk-Sensitive Policy Gradient: An Iteration Complexity Analysis
Rui Liu, Erfaun Noorani, Pratap Tokekar, John S. Baras. 13 Mar 2024.

Reinforcement Learning from Human Feedback with Active Queries
Kaixuan Ji, Jiafan He, Quanquan Gu. 14 Feb 2024.

The Typing Cure: Experiences with Large Language Model Chatbots for Mental Health Support
Inhwa Song, Sachin R. Pendse, Neha Kumar, Munmun De Choudhury. Tags: AI4MH. 25 Jan 2024.

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xiaohua Xuan. Tags: MLLM, LLMAG. 19 Jan 2024.

Crowd-PrefRL: Preference-Based Reward Learning from Crowds
David Chhan, Ellen R. Novoseller, Vernon J. Lawhern. 17 Jan 2024.

Active teacher selection for reinforcement learning from human feedback
Rachel Freedman, Justin Svegliato, K. H. Wray, Stuart J. Russell. 23 Oct 2023.

Eliciting Human Preferences with Language Models
Belinda Z. Li, Alex Tamkin, Noah D. Goodman, Jacob Andreas. Tags: RALM. 17 Oct 2023.

Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation
Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy. Tags: OffRL, ALM. 03 Oct 2023.

Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, ..., Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert F. Trager, Kevin J. Wolf. Tags: SILM. 06 Jul 2023.

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi. Tags: ALM. 02 Jun 2023.

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
Lyne Tchapmi, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen. Tags: SILM. 24 May 2023.

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang. Tags: ALM. 13 Apr 2023.

Whose Opinions Do Language Models Reflect?
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto. 30 Mar 2023.