1706.03741
Deep reinforcement learning from human preferences
12 June 2017
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
Papers citing "Deep reinforcement learning from human preferences" (showing 50 of 216)
When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Steffen Backmann
David Guzman Piedrahita
Emanuel Tewolde
Rada Mihalcea
Bernhard Schölkopf
Zhijing Jin
62
0
0
25 May 2025
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Xiaoqiang Lin
Arun Verma
Zhongxiang Dai
Daniela Rus
See-Kiong Ng
Bryan Kian Hsiang Low
253
0
0
25 May 2025
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey
Preeti Lamba
Kiran Ravish
Ankita Kushwaha
Pawan Kumar
EGVM
MedIm
97
0
0
23 May 2025
Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Huayu Chen
Kaiwen Zheng
Qinsheng Zhang
Ganqu Cui
Yin Cui
Haotian Ye
Tsung-Yi Lin
Ming-Yu Liu
Jun Zhu
Haoxiang Wang
OffRL
LRM
224
2
0
23 May 2025
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank
210
1
0
22 May 2025
Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Bartosz Cywiński
Emil Ryd
Senthooran Rajamanoharan
Neel Nanda
69
0
0
20 May 2025
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung
Sangyeon Yoon
Minsuk Kahng
Albert No
LRM
LLMSV
161
1
0
20 May 2025
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza
Hamed Babaei Giglou
Quentin Münch
ELM
84
0
0
20 May 2025
ExpertSteer: Intervening in LLMs through Expert Knowledge
Weixuan Wang
Minghao Wu
Barry Haddow
Alexandra Birch
LLMSV
143
0
0
18 May 2025
Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social Equity
Qi Zhou
Jie Zhang
Dongxia Wang
Qiang Liu
Tianlin Li
Jin Song Dong
Wenhai Wang
Qing Guo
SyDa
86
0
0
17 May 2025
Online Iterative Self-Alignment for Radiology Report Generation
Ting Xiao
Lei Shi
Yang Zhang
HaoFeng Yang
Zhe Wang
Chenjia Bai
63
0
0
17 May 2025
ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization
Wenhao Shen
Wanqi Yin
Xiaofeng Yang
Cheng Chen
Chaoyue Song
Zhongang Cai
Lei Yang
Hao Wang
Guosheng Lin
100
0
0
15 May 2025
InfoPO: On Mutual Information Maximization for Large Language Model Alignment
Teng Xiao
Zhen Ge
Sujay Sanghavi
Tian Wang
Julian Katz-Samuels
Marc Versage
Qingjun Cui
Trishul Chilimbi
166
1
0
13 May 2025
Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models
Rei Higuchi
Taiji Suzuki
95
0
0
12 May 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Jen-tse Huang
Joey Tianyi Zhou
AAML
MU
235
4
0
01 May 2025
Real-World Gaps in AI Governance Research
Ilan Strauss
Isobel Moure
Tim O'Reilly
Sruly Rosenblat
124
1
0
30 Apr 2025
Adaptive 3D UI Placement in Mixed Reality Using Deep Reinforcement Learning
Feiyu Lu
Mengyu Chen
Hsiang Hsu
Pranav Deshpande
Cheng Yao Wang
Blair MacIntyre
84
4
0
30 Apr 2025
A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification
Junichiro Niimi
106
1
0
26 Apr 2025
Do Large Language Models know who did what to whom?
Joseph M. Denning
Xiaohan
Bryor Snefjella
Idan A. Blank
203
1
0
23 Apr 2025
Benchmarking LLM-based Relevance Judgment Methods
Negar Arabzadeh
Charles L. A. Clarke
70
0
0
17 Apr 2025
Better Estimation of the KL Divergence Between Language Models
Afra Amini
Tim Vieira
Ryan Cotterell
91
0
0
14 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
Jian Wu
Hao Yang
Xinhua Zeng
Guibing He
Zhe Chen
Zhu Li
Xinming Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
343
0
0
12 Apr 2025
2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization
Mengyang Li
Zhong Zhang
64
1
0
10 Apr 2025
FuseRL: Dense Preference Optimization for Heterogeneous Model Fusion
Longguang Zhong
Fanqi Wan
Ziyi Yang
Guosheng Liang
Tianyuan Shi
Xiaojun Quan
MoMe
100
0
0
09 Apr 2025
Adversarial Training of Reward Models
Alexander Bukharin
Haifeng Qian
Shengyang Sun
Adithya Renduchintala
Soumye Singhal
Ziyi Wang
Oleksii Kuchaiev
Olivier Delalleau
T. Zhao
AAML
149
2
0
08 Apr 2025
Information-Theoretic Reward Decomposition for Generalizable RLHF
Liyuan Mao
Haoran Xu
Amy Zhang
Weinan Zhang
Chenjia Bai
86
0
0
08 Apr 2025
LLM-based Automated Grading with Human-in-the-Loop
Hang Li
Yucheng Chu
Kaiqi Yang
Yasemin Copur-Gencturk
Jiliang Tang
AI4Ed
ELM
136
2
0
07 Apr 2025
Not All Data Are Unlearned Equally
Aravind Krishnan
Siva Reddy
Marius Mosbach
MU
346
2
0
07 Apr 2025
FISH-Tuning: Enhancing PEFT Methods with Fisher Information
Kang Xue
Ming Dong
Xinhui Tu
Tingting He
172
0
0
05 Apr 2025
Prompt Optimization with Logged Bandit Data
Haruka Kiyohara
Daniel Yiming Cao
Yuta Saito
Thorsten Joachims
212
0
0
03 Apr 2025
Large (Vision) Language Models are Unsupervised In-Context Learners
Artyom Gadetsky
Andrei Atanov
Yulun Jiang
Zhitong Gao
Ghazal Hosseini Mighan
Amir Zamir
Maria Brbić
VLM
MLLM
LRM
234
0
0
03 Apr 2025
Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute
Jianhao Chen
Zishuo Xun
Bocheng Zhou
Han Qi
Qiaosheng Zhang
...
Wei Hu
Yuzhong Qu
W. Ouyang
Wanli Ouyang
Shuyue Hu
126
2
0
01 Apr 2025
Reasoning Beyond Limits: Advances and Open Problems for LLMs
M. Ferrag
Norbert Tihanyi
Merouane Debbah
ELM
OffRL
LRM
AI4CE
375
4
0
26 Mar 2025
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Jian Guan
Jian Wu
Jia-Nan Li
Chuanqi Cheng
Wei Wu
LM&MA
131
2
0
21 Mar 2025
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai
Shitian Zhao
Ming Li
Jike Zhong
Xiaofeng Yang
OffRL
LRM
LM&MA
VLM
123
25
0
18 Mar 2025
Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Zijing Hu
Fengda Zhang
Long Chen
Kun Kuang
Jiahui Li
Kaifeng Gao
Jun Xiao
X. Wang
Wenwu Zhu
EGVM
202
4
0
14 Mar 2025
Safe Explicable Policy Search
Akkamahadevi Hanni
Jonathan Montaño
Yu Zhang
100
0
0
10 Mar 2025
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Zelei Cheng
Xin-Qiang Cai
Yuting Tang
Pushi Zhang
Boming Yang
Masashi Sugiyama
Xinyu Xing
130
0
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
111
3
0
10 Mar 2025
Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
Zhaowei Zhang
Fengshuo Bai
Qizhi Chen
Chengdong Ma
Mingzhi Wang
Haoran Sun
Zilong Zheng
Yaodong Yang
125
4
0
26 Feb 2025
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang
Bingcong Li
Christoph Dann
Niao He
OffRL
234
3
0
26 Feb 2025
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Jiazhen Pan
Che Liu
Junde Wu
Fenglin Liu
Jiayuan Zhu
Hongwei Bran Li
Chen Chen
Cheng Ouyang
Daniel Rueckert
LRM
LM&MA
VLM
134
35
0
26 Feb 2025
What is the Alignment Objective of GRPO?
Milan Vojnovic
Se-Young Yun
105
5
0
25 Feb 2025
Can Large Language Models Extract Customer Needs as well as Professional Analysts?
Artem Timoshenko
Chengfeng Mao
J. Hauser
ELM
104
0
0
25 Feb 2025
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Yuheng Zhang
Dian Yu
Tao Ge
Linfeng Song
Zhichen Zeng
Haitao Mi
Nan Jiang
Dong Yu
114
4
0
24 Feb 2025
Training a Generally Curious Agent
Fahim Tajwar
Yiding Jiang
Abitha Thankaraj
Sumaita Sadia Rahman
J. Zico Kolter
Jeff Schneider
Ruslan Salakhutdinov
183
3
0
24 Feb 2025
Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Michal Bravansky
Vaclav Kubon
Suhas Hariharan
Robert Kirk
105
1
0
24 Feb 2025
Post-edits Are Preferences Too
Nathaniel Berger
Stefan Riezler
M. Exel
Matthias Huck
103
1
0
24 Feb 2025
Is Free Self-Alignment Possible?
Dyah Adila
Changho Shin
Yijing Zhang
Frederic Sala
MoMe
149
2
0
24 Feb 2025
IPO: Your Language Model is Secretly a Preference Classifier
Shivank Garg
Ayush Singh
Shweta Singh
Paras Chopra
409
1
0
22 Feb 2025