Language Models Learn to Mislead Humans via RLHF
arXiv:2409.12822 · 19 September 2024
Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, Shi Feng
Papers citing "Language Models Learn to Mislead Humans via RLHF" (25 papers shown)
An alignment safety case sketch based on debate
Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving · 06 May 2025
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Tuhina Tripathi, Manya Wadhwa, Greg Durrett, S. Niekum · 20 Apr 2025
AI Safety Should Prioritize the Future of Work
Sanchaita Hazra, Bodhisattwa Prasad Majumder, Tuhin Chakrabarty · 16 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou · 12 Apr 2025 · LRM
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, J. Li, Chunxin Fang, Yibo Ma, ..., Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao · 10 Apr 2025 · VLM, LRM
Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Pedro Ferreira, Wilker Aziz, Ivan Titov · 07 Apr 2025 · LRM
Understanding the Effects of RLHF on the Quality and Detectability of LLM-Generated Texts
Beining Xu, Arkaitz Zubiaga · 23 Mar 2025 · DeLMO
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, Hailei Gong · 16 Mar 2025 · LRM
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi · 14 Mar 2025 · LRM
Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models
Afrar Jahin, Arif Hassan Zidan, Yu Bao, Shizhe Liang, T. Liu, W. Zhang · 13 Mar 2025 · LRM
How to Mitigate Overfitting in Weak-to-strong Generalization?
Junhao Shi, Qinyuan Cheng, Zhaoye Fei, Y. Zheng, Qipeng Guo, Xipeng Qiu · 06 Mar 2025
Adaptively Evaluating Models with Task Elicitation
Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong · 03 Mar 2025 · ALM, ELM
Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu, Xuandong Zhao, Chengyuan Yao, H. Wang, Qi Han, Yanghua Xiao · 26 Feb 2025
Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL
Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, Jordan Boyd-Graber · 18 Feb 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah · 22 Jan 2025
Observation Interference in Partially Observable Assistance Games
Scott Emmons, Caspar Oesterheld, Vincent Conitzer, Stuart Russell · 23 Dec 2024
Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models
Cameron R. Jones, Benjamin Bergen · 22 Dec 2024
The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang, Yingkang Wang, Shiyao Cui, Pei Ke, J. Tang · 15 Dec 2024
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, ..., Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan · 26 Nov 2024 · AAML
Sycophancy in Large Language Models: Causes and Mitigations
Lars Malmqvist · 22 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan, Yanjiang Liu, Xinyu Lu, Boxi Cao, Ben He, ..., Le Sun, Jie Lou, Bowen Yu, Y. Lu, Hongyu Lin · 18 Nov 2024 · ALM
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz, Paweł Morawiecki · 30 Oct 2024 · ELM
Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review
Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Frank J. Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar · 26 Sep 2024 · HILM, ELM, LM&MA
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Samuel Arnesen, David Rein, Julian Michael · 25 Sep 2024 · ELM
Understanding Generative AI Content with Embedding Models
Max Vargas, Reilly Cannon, A. Engel, Anand D. Sarwate, Tony Chiang · 19 Aug 2024