The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
24 March 2024
Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall
VLM
ArXiv (abs) · PDF · HTML · Github (139★)
Papers citing "The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization" (16 of 16 papers shown)
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
Youssef Mroueh
OffRL · 09 Mar 2025
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, Satya Narayan Shukla
02 Mar 2025
Cautious Optimizers: Improving Training with One Line of Code
Kaizhao Liang, Lizhang Chen, B. Liu, Qiang Liu
ODL · 25 Nov 2024
Cross-lingual Transfer of Reward Models in Multilingual Alignment
Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne
23 Oct 2024
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Rameswar Panda
OffRL · 23 Oct 2024
Science Out of Its Ivory Tower: Improving Accessibility with Reinforcement Learning
Haining Wang, Jason Clark, Hannah McKelvey, Leila Sterman, Zheng Gao, Zuoyu Tian, Sandra Kübler, Xiaozhong Liu
22 Oct 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei, Shenghua He, Tian Xia, Andy H. Wong, Jingyang Lin, Mei Han
ALM · ELM · 23 Aug 2024
Variational Best-of-N Alignment
Afra Amini, Tim Vieira, Ryan Cotterell
BDL · 08 Jul 2024
Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, A. Mensch, Blanche Savary, ..., Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
MoE · LLMAG · 08 Jan 2024
Language Model Alignment with Elastic Reset
Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville
KELM · 06 Dec 2023
A General Theoretical Paradigm to Understand Learning from Human Preferences
M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
18 Oct 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
ALM · OSLM · ELM · 09 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM · 29 May 2023
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto
ALM · 22 May 2023
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM · ALM · 04 Mar 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM · 18 Sep 2019