Scalable agent alignment via reward modeling: a research direction

arXiv:1811.07871 · 19 November 2018
Jan Leike, David M. Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg

Papers citing "Scalable agent alignment via reward modeling: a research direction"

50 / 85 papers shown
Soft Best-of-n Sampling for Model Alignment
  C. M. Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
  06 May 2025 · BDL
An alignment safety case sketch based on debate
  Marie Davidsen Buhl, Jacob Pfau, Benjamin Hilton, Geoffrey Irving
  06 May 2025
Scaling Laws For Scalable Oversight
  Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
  25 Apr 2025 · ELM
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
  Feifei Zhao, Y. Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, ..., Chao Liu, Yaodong Yang, Yi Zeng, Boyuan Chen, Jinyu Fan
  24 Apr 2025
Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution
  Zen Kit Heng, Zimeng Zhao, Tianhao Wu, Yuanfei Wang, Mingdong Wu, Yangang Wang, Hao Dong
  10 Apr 2025
Information-Theoretic Reward Decomposition for Generalizable RLHF
  Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
  08 Apr 2025
Adversarial Training of Reward Models
  Alexander Bukharin, Haifeng Qian, Shengyang Sun, Adithya Renduchintala, Soumye Singhal, Zhilin Wang, Oleksii Kuchaiev, Olivier Delalleau, T. Zhao
  08 Apr 2025 · AAML
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
  Qiyuan Deng, X. Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang
  13 Mar 2025 · OffRL
Societal Alignment Frameworks Can Improve LLM Alignment
  Karolina Stańczak, Nicholas Meade, Mehar Bhatia, Hattie Zhou, Konstantin Böttinger, ..., Timothy P. Lillicrap, Ana Marasović, Sylvie Delacroix, Gillian K. Hadfield, Siva Reddy
  27 Feb 2025
Learning from Active Human Involvement through Proxy Value Propagation
  Zhenghao Peng, Wenjie Mo, Chenda Duan, Quanyi Li, Bolei Zhou
  05 Feb 2025
COS(M+O)S: Curiosity and RL-Enhanced MCTS for Exploring Story Space via Language Models
  Tobias Materzok
  28 Jan 2025 · LRM
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
  Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah
  22 Jan 2025
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
  Jiaqi Han, Mingjian Jiang, Yuxuan Song, J. Leskovec, Stefano Ermon
  29 Oct 2024
SPIN: Self-Supervised Prompt INjection
  Leon Zhou, Junfeng Yang, Chengzhi Mao
  17 Oct 2024 · AAML, SILM
Preference Optimization with Multi-Sample Comparisons
  Chaoqi Wang, Zhuokai Zhao, Chen Zhu, Karthik Abinav Sankararaman, Michal Valko, ..., Zhaorun Chen, Madian Khabsa, Yuxin Chen, Hao Ma, Sinong Wang
  16 Oct 2024
CREAM: Consistency Regularized Self-Rewarding Language Models
  Zekun Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao
  16 Oct 2024 · ALM
GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
  Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
  10 Oct 2024
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design
  Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, William Yang Wang
  08 Oct 2024 · VGen
Prompt Baking
  Aman Bhargava, Cameron Witkowski, Alexander Detkov, Matt W. Thomson
  04 Sep 2024 · AI4CE
Emergence in Multi-Agent Systems: A Safety Perspective
  Philipp Altmann, Julian Schonberger, Steffen Illium, Maximilian Zorn, Fabian Ritz, Tom Haider, Simon Burton, Thomas Gabor
  08 Aug 2024
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
  Tianhao Wu, Weizhe Yuan, O. Yu. Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
  28 Jul 2024 · ALM, KELM, LRM
Training Foundation Models as Data Compression: On Information, Model Weights and Copyright Law
  Giorgio Franceschelli, Claudia Cevenini, Mirco Musolesi
  18 Jul 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
  Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak
  12 Jun 2024
CLoG: Benchmarking Continual Learning of Image Generation Models
  Haotian Zhang, Junting Zhou, Haowei Lin, Hang Ye, Jianhua Zhu, Zihao Wang, Liangcai Gao, Yizhou Wang, Yitao Liang
  07 Jun 2024 · DiffM, VLM
Diffusion-Reward Adversarial Imitation Learning
  Chun-Mao Lai, Hsiang-Chun Wang, Ping-Chun Hsieh, Yu-Chiang Frank Wang, Min-Hung Chen, Shao-Hua Sun
  25 May 2024
Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation
  JoonHo Lee, Jae Oh Woo, Juree Seok, Parisa Hassanzadeh, Wooseok Jang, ..., Hankyu Moon, Wenjun Hu, Yeong-Dae Kwon, Taehee Lee, Seungjai Min
  10 May 2024
LLM Evaluators Recognize and Favor Their Own Generations
  Arjun Panickssery, Samuel R. Bowman, Shi Feng
  15 Apr 2024
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
  Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie
  07 Mar 2024 · OffRL
Aligning Human Intent from Imperfect Demonstrations with Confidence-based Inverse soft-Q Learning
  Xizhou Bu, Wenjuan Li, Zhengxiong Liu, Zhiqiang Ma, Panfeng Huang
  18 Dec 2023
Vision-Language Models as a Source of Rewards
  Kate Baumli, Satinder Baveja, Feryal M. P. Behbahani, Harris Chan, Gheorghe Comanici, ..., Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang
  14 Dec 2023 · VLM, LRM
Scalable AI Safety via Doubly-Efficient Debate
  Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
  23 Nov 2023
From "Thumbs Up" to "10 out of 10": Reconsidering Scalar Feedback in Interactive Reinforcement Learning
  Hang Yu, Reuben M. Aronson, Katherine H. Allen, E. Short
  17 Nov 2023
When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks
  Hao Peng, Xiaozhi Wang, Jianhui Chen, Weikai Li, Y. Qi, ..., Zhili Wu, Kaisheng Zeng, Bin Xu, Lei Hou, Juanzi Li
  15 Nov 2023
Fake Alignment: Are LLMs Really Aligned Well?
  Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yu-Gang Jiang, Yu Qiao, Yingchun Wang
  10 Nov 2023
Towards Understanding Sycophancy in Language Models
  Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
  20 Oct 2023
SemiReward: A General Reward Model for Semi-supervised Learning
  Siyuan Li, Weiyang Jin, Zedong Wang, Fang Wu, Zicheng Liu, Cheng Tan, Stan Z. Li
  04 Oct 2023
SELF: Self-Evolution with Language Feedback
  Jianqiao Lu, Wanjun Zhong, Wenyong Huang, Yufei Wang, Qi Zhu, ..., Weichao Wang, Xingshan Zeng, Lifeng Shang, Xin Jiang, Qun Liu
  01 Oct 2023 · LRM, SyDa
Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints
  Chaoqi Wang, Yibo Jiang, Yuguang Yang, Han Liu, Yuxin Chen
  28 Sep 2023
Designing Fiduciary Artificial Intelligence
  Sebastian Benthall, David Shekman
  27 Jul 2023
Of Models and Tin Men: A Behavioural Economics Study of Principal-Agent Problems in AI Alignment using Large-Language Models
  S. Phelps, Rebecca E. Ranson
  20 Jul 2023 · LLMAG
PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge
  Laura Cabello, Jiaang Li, Ilias Chalkidis
  05 Jun 2023 · ELM, AI4MH, LRM
Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback
  Tom Bewley, J. Lawry, Arthur G. Richards
  26 May 2023
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
  Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang
  13 Apr 2023 · ALM
A Human-Centered Safe Robot Reinforcement Learning Framework with Interactive Behaviors
  Shangding Gu, Alap Kshirsagar, Yali Du, Guang Chen, Jan Peters, Alois C. Knoll
  25 Feb 2023
COACH: Cooperative Robot Teaching
  Cunjun Yu, Yiqing Xu, Linfeng Li, David Hsu
  13 Feb 2023
Goal Alignment: A Human-Aware Account of Value Alignment Problem
  Malek Mechergui, S. Sreedharan
  02 Feb 2023
Discovering Latent Knowledge in Language Models Without Supervision
  Collin Burns, Haotian Ye, Dan Klein, Jacob Steinhardt
  07 Dec 2022
Actively Learning Costly Reward Functions for Reinforcement Learning
  André Eberhard, Houssam Metni, G. Fahland, A. Stroh, Pascal Friederich
  23 Nov 2022 · OffRL
Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning
  Katherine Metcalf, Miguel Sarabia, B. Theobald
  12 Nov 2022 · OffRL
Measuring Progress on Scalable Oversight for Large Language Models
  Sam Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, ..., Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Benjamin Mann, Jared Kaplan
  04 Nov 2022 · ALM, ELM