arXiv: 2301.03652
On The Fragility of Learned Reward Functions
9 January 2023
Lev McKinney, Yawen Duan, David M. Krueger, Adam Gleave
Papers citing "On The Fragility of Learned Reward Functions"
5 / 5 papers shown
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, Lin Yan
151
14
0
28 Mar 2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue, Zengfeng Huang, Xuanjing Huang, Zhongyu Wei
124
1
0
04 Jul 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim, Gary Geunbae Lee
AAML
124
3
0
21 May 2024
Learning to Watermark LLM-generated Text via Reinforcement Learning
Xiaojun Xu, Yuanshun Yao, Yang Liu
94
14
0
13 Mar 2024
Compositional preference models for aligning LMs
Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
90
20
0
17 Oct 2023