STARC: A General Framework For Quantifying Differences Between Reward Functions


26 September 2023

Adam Gleave

Alessandro Abate

Papers citing "STARC: A General Framework For Quantifying Differences Between Reward Functions"


Title | Authors | Date
Subtask-Aware Visual Reward Learning from Segmented Demonstrations | Changyeon Kim, Minho Heo, Doohyun Lee, Jinwoo Shin, Honglak Lee, Joseph J. Lim, Kimin Lee | 28 Feb 2025
Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? | Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Xianpei Han, Debing Zhang, Le Sun (ALM) | 17 Feb 2025
On the Partial Identifiability in Reward Learning: Choosing the Best Reward | Filippo Lazzati, Alberto Maria Metelli | 10 Jan 2025
Can a Bayesian Oracle Prevent Harm from an Agent? | Yoshua Bengio, Michael K. Cohen, Nikolay Malkin, Matt MacDermott, Damiano Fornasiere, Pietro Greiner, Younesse Kaddar | 09 Aug 2024
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems | David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart J. Russell, Max Tegmark, ..., Clark Barrett, Ding Zhao, Zhi-Xuan Tan, Jeannette Wing, Joshua Tenenbaum | 10 May 2024
Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification | Joar Skalse, Alessandro Abate | 11 Mar 2024
Cooperation and Control in Delegation Games | Oliver Sourbut, Lewis Hammond, Harriet Wood | 24 Feb 2024
Goodhart's Law in Reinforcement Learning | Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Skalse | 13 Oct 2023
Defining and Characterizing Reward Hacking | Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, David M. Krueger | 27 Sep 2022