1908.04734
Cited By
Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective
13 August 2019
Tom Everitt
Marcus Hutter
Ramana Kumar
Victoria Krakovna
Papers citing "Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective"
28 / 28 papers shown
Title
Reasoning Models Don't Always Say What They Think
Yanda Chen
Joe Benton
Ansh Radhakrishnan
Jonathan Uesato
Carson E. Denison
...
Vlad Mikulik
Samuel R. Bowman
Jan Leike
Jared Kaplan
E. Perez
ReLM
LRM
76
16
1
08 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
81
2
0
05 May 2025
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao
Yufei Wang
Enmeng Lu
Dongcheng Zhao
Bing Han
...
Chao Liu
Yaodong Yang
Yi Zeng
Boyuan Chen
Jinyu Fan
83
0
0
24 Apr 2025
Think2SQL: Reinforce LLM Reasoning Capabilities for Text2SQL
Simone Papicchio
Simone Rossi
Luca Cagliero
Paolo Papotti
ReLM
LMTD
AI4TS
LRM
70
1
0
21 Apr 2025
SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
X. Zhang
Rongxiang Weng
Zifei Cheng
Wenhao Zhuang
Zheng Lin
...
Shouyu Yin
Chaohang Wen
Haotian Zhang
Bin Chen
Bing Yu
LRM
43
6
0
19 Apr 2025
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu
Zhe Zhang
Ruofei Zhu
Yufeng Yuan
Xiaochen Zuo
...
Ya Zhang
Lin Yan
Mu Qiao
Yonghui Wu
Mingxuan Wang
OffRL
LRM
78
69
0
18 Mar 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team
Angang Du
Bofei Gao
Bowei Xing
Changjiu Jiang
...
Zhilin Yang
Zhiqi Huang
Zihao Huang
Ziyao Xu
Zheng Yang
VLM
ALM
OffRL
AI4TS
LRM
120
167
0
22 Jan 2025
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar
Vikrant Varma
David Lindner
David Elson
Caleb Biddulph
Ian Goodfellow
Rohin Shah
96
1
0
22 Jan 2025
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Chaoqi Wang
Zhuokai Zhao
Yibo Jiang
Zhaorun Chen
Chen Zhu
...
Jiayi Liu
Lizhu Zhang
Xiangjun Fan
Hao Ma
Sinong Wang
82
4
0
17 Jan 2025
Best Practices and Lessons Learned on Synthetic Data for Language Models
Ruibo Liu
Jerry W. Wei
Fangyu Liu
Chenglei Si
Yanzhe Zhang
...
Steven Zheng
Daiyi Peng
Diyi Yang
Denny Zhou
Andrew M. Dai
SyDa
EgoV
48
87
0
11 Apr 2024
SHAPE: A Framework for Evaluating the Ethicality of Influence
Elfia Bezou-Vrakatseli
Benedikt Brückner
Luke Thorburn
TDI
34
3
0
08 Sep 2023
Benchmarks for Detecting Measurement Tampering
Fabien Roger
Ryan Greenblatt
Max Nadeau
Buck Shlegeris
Nate Thomas
33
2
0
29 Aug 2023
Designing Fiduciary Artificial Intelligence
Sebastian Benthall
David Shekman
51
4
0
27 Jul 2023
On Imperfect Recall in Multi-Agent Influence Diagrams
James Fox
Matt MacDermott
Lewis Hammond
Paul Harrenstein
Alessandro Abate
Michael Wooldridge
32
3
0
11 Jul 2023
Learning to Participate through Trading of Reward Shares
Michael Kölle
Tim Matheis
Philipp Altmann
Kyrill Schmid
36
7
0
18 Jan 2023
Reward Gaming in Conditional Text Generation
Richard Yuanzhe Pang
Vishakh Padmakumar
Thibault Sellam
Ankur P. Parikh
He He
35
24
0
16 Nov 2022
Defining and Characterizing Reward Hacking
Joar Skalse
Nikolaus H. R. Howe
Dmitrii Krasheninnikov
David M. Krueger
59
56
0
27 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
68
183
0
30 Aug 2022
Discovering Agents
Zachary Kenton
Ramana Kumar
Sebastian Farquhar
Jonathan G. Richens
Matt MacDermott
Tom Everitt
CML
49
31
0
17 Aug 2022
Reinforcement Learning For Survival, A Clinically Motivated Method For Critically Ill Patients
Thesath Nanayakkara
OOD
OffRL
24
0
0
17 Jul 2022
Counterfactual harm
Jonathan G. Richens
R. Beard
Daniel H. Thompson
34
27
0
27 Apr 2022
A Complete Criterion for Value of Information in Soluble Influence Diagrams
Chris van Merwijk
Ryan Carey
Tom Everitt
26
5
0
23 Feb 2022
Safe Deep RL in 3D Environments using Human Feedback
Matthew Rahtz
Vikrant Varma
Ramana Kumar
Zachary Kenton
Shane Legg
Jan Leike
34
4
0
20 Jan 2022
Visual Adversarial Imitation Learning using Variational Models
Rafael Rafailov
Tianhe Yu
Aravind Rajeswaran
Chelsea Finn
SSL
33
49
0
16 Jul 2021
Avoiding Tampering Incentives in Deep RL via Decoupled Approval
J. Uesato
Ramana Kumar
Victoria Krakovna
Tom Everitt
Richard Ngo
Shane Legg
28
14
0
17 Nov 2020
REALab: An Embedded Perspective on Tampering
Ramana Kumar
J. Uesato
Richard Ngo
Tom Everitt
Victoria Krakovna
Shane Legg
30
10
0
17 Nov 2020
Hidden Incentives for Auto-Induced Distributional Shift
David M. Krueger
Tegan Maharaj
Jan Leike
13
49
0
19 Sep 2020
Learning Representations for Counterfactual Inference
Fredrik D. Johansson
Uri Shalit
David Sontag
CML
OOD
BDL
232
722
0
12 May 2016