arXiv:2410.20727
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
20 February 2025
Tong Yang, Jincheng Mei, H. Dai, Zixin Wen, Shicong Cen, Dale Schuurmans, Yuejie Chi, Bo Dai

Papers citing "Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment" (45 / 45 papers shown)

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai (20 Feb 2025) [OffRL]

Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi (13 Feb 2025) [OffRL]

In-Context Learning with Representations: Contextual Generalization of Trained Transformers
Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi (19 Aug 2024) [MLT]

Variational Best-of-N Alignment
Afra Amini, Tim Vieira, Ryan Cotterell (08 Jul 2024) [BDL]

Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu (30 Jun 2024)

BoNBoN Alignment for Large Language Models and the Sweetness of Best-of-n Sampling
Lin Gui, Cristina Garbacea, Victor Veitch (02 Jun 2024) [BDL, LM&MA]

SimPO: Simple Preference Optimization with a Reference-Free Reward
Yu Meng, Mengzhou Xia, Danqi Chen (23 May 2024)

Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu (01 May 2024)

REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao, Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, J. Andrew Bagnell, Jason D. Lee, Wen Sun (25 Apr 2024) [OffRL]

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Hassan Awadallah, Tengyang Xie (04 Apr 2024)

Asymptotics of Language Model Alignment
Joy Qiping Yang, Salman Salamatian, Ziteng Sun, A. Suresh, Ahmad Beirami (02 Apr 2024)

Generalized Preference Optimization: A Unified Approach to Offline Alignment
Yunhao Tang, Z. Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Avila-Pires, Bilal Piot (08 Feb 2024)

KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela (02 Feb 2024)

Transforming and Combining Rewards for Aligning Large Language Models
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Oluwasanmi Koyejo, Victor Veitch (01 Feb 2024)

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton W. Murray, Young Jin Kim (16 Jan 2024) [ALM]

A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal (08 Jan 2024) [OffRL]

Theoretical guarantees on the best-of-n alignment policy
Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alex D'Amour, Jacob Eisenstein, Chirag Nagpal, A. Suresh (03 Jan 2024)

Nash Learning from Human Feedback
Rémi Munos, Michal Valko, Daniele Calandriello, M. G. Azar, Mark Rowland, ..., Nikola Momchev, Olivier Bachem, D. Mankowitz, Doina Precup, Bilal Piot (01 Dec 2023)

On the Convergence of Encoder-only Shallow Transformers
Yongtao Wu, Fanghui Liu, Grigorios G. Chrysos, Volkan Cevher (02 Nov 2023)

A General Theoretical Paradigm to Understand Learning from Human Preferences
M. G. Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos (18 Oct 2023)

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica (09 Jun 2023) [ALM, OSLM, ELM]

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
Dongfu Jiang, Xiang Ren, Bill Yuchen Lin (05 Jun 2023) [ELM]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn (29 May 2023) [ALM]

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto (22 May 2023) [ALM]

SLiC-HF: Sequence Likelihood Calibration with Human Feedback
Yao-Min Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, Peter J. Liu (17 May 2023)

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Boyao Wang, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang (13 Apr 2023) [ALM]

RRHF: Rank Responses to Align Language Models with Human Feedback without tears
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Feiran Huang (11 Apr 2023) [ALM]

GPT-4 Technical Report
OpenAI: Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, ..., Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph (15 Mar 2023) [LLMAG, MLLM]

Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence
S. Pattathil, Kai Zhang, Asuman Ozdaglar (23 Oct 2022)

Scaling Laws for Reward Model Overoptimization
Leo Gao, John Schulman, Jacob Hilton (19 Oct 2022) [ALM]

Sharper Utility Bounds for Differentially Private Models
Yilin Kang, Yong Liu, Jian Li, Weiping Wang (22 Apr 2022) [FedML]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, ..., Jack Clark, Sam McCandlish, C. Olah, Benjamin Mann, Jared Kaplan (12 Apr 2022)

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe (04 Mar 2022) [OSLM, ALM]

WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, S. Balaji, Jeff Wu, Ouyang Long, ..., Gretchen Krueger, Kevin Button, Matthew Knight, B. Chess, John Schulman (17 Dec 2021) [ALM, RALM]

A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, ..., Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Jared Kaplan (01 Dec 2021) [ALM]

Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman (27 Oct 2021) [ReLM, OffRL, LRM]

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization
Shicong Cen, Yuting Wei, Yuejie Chi (31 May 2021)

Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$
Yegor Klochkov, Nikita Zhivotovskiy (22 Mar 2021)

Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, D. Song, Jacob Steinhardt (07 Sep 2020) [ELM, RALM]

Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano (02 Sep 2020) [ALM]

Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Chaoyue Liu, Libin Zhu, M. Belkin (29 Feb 2020) [ODL]

HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi (19 May 2019)

Stability and Generalization of Learning Algorithms that Converge to Global Optima
Zachary B. Charles, Dimitris Papailiopoulos (23 Oct 2017) [MLT]

Deep reinforcement learning from human preferences
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei (12 Jun 2017)

Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
Hamed Karimi, J. Nutini, Mark Schmidt (16 Aug 2016)