Iterative Reasoning Preference Optimization (arXiv:2404.19733)
30 April 2024
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
Tags: LRM
Papers citing "Iterative Reasoning Preference Optimization" (31 of 81 papers shown)
Boosting Deductive Reasoning with Step Signals In RLHF
Jiajun Li, Yipin Zhang, Wei Shen, Yuzi Yan, Jian Xie, Dong Yan
Tags: LRM, ReLM
12 Oct 2024

Language Imbalance Driven Rewarding for Multilingual Self-improving
Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, J.N. Zhang
Tags: ALM, LRM
11 Oct 2024

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
11 Oct 2024

Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, Maosong Sun
Tags: LLMAG
10 Oct 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang, Zhihan Liu, Boyi Liu, Wenjie Qu, Yingxiang Yang, Y. Liu, Liyu Chen, Tao Sun, Ziyi Wang
10 Oct 2024

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Z. Z. Ren
10 Oct 2024

Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback
Sanjiban Choudhury, Paloma Sodhi
Tags: LLMAG
07 Oct 2024

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
Guanzhen Li, Yuxi Xie, Min-Yen Kan
Tags: VLM
06 Oct 2024

LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Joey Tianyi Zhou
02 Oct 2024

Evaluation of Large Language Models for Summarization Tasks in the Medical Domain: A Narrative Review
Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Frank J. Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar
Tags: HILM, ELM, LM&MA
26 Sep 2024

Direct Judgement Preference Optimization
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
Tags: ELM
23 Sep 2024

Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
Wei Shen, Chuheng Zhang
Tags: OffRL
11 Sep 2024

Sparse Rewards Can Self-Train Dialogue Agents
B. Lattimer, Varun Gangal, Ryan McDonald, Yi Yang
Tags: LLMAG
06 Sep 2024

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi
Tags: SyDa, OffRL, LRM
29 Aug 2024

Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation
Shiming Xie, Hong Chen, Fred Yu, Zeye Sun, Xiuyu Wu
20 Aug 2024

Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
26 Jul 2024

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Amrith Rajagopal Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, Aviral Kumar
20 Jun 2024

Bootstrapping Language Models with DPO Implicit Rewards
Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min-Bin Lin
Tags: SyDa, ALM
14 Jun 2024

Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min-Bin Lin
Tags: LRM, AI4CE
13 Jun 2024

Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit S. Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, S. Niekum
05 Jun 2024

Self-Improving Robust Preference Optimization
Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, M. G. Azar
03 Jun 2024

Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training
Maximillian Chen, Ruoxi Sun, Sercan Ö. Arik, Tomas Pfister
Tags: LLMAG
31 May 2024

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie, Dylan J. Foster, Akshay Krishnamurthy, Corby Rosset, Ahmed Hassan Awadallah, Alexander Rakhlin
31 May 2024

Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment
Yueqin Yin, Zhendong Wang, Yujia Xie, Weizhu Chen, Mingyuan Zhou
31 May 2024

Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang
Tags: SyDa, VLM, LRM
30 May 2024

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang
Tags: LRM
29 May 2024

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Hassan Awadallah, Tengyang Xie
04 Apr 2024

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Tags: OSLM
20 Feb 2024

Self-Rewarding Language Models
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
Tags: ReLM, SyDa, ALM, LRM
18 Jan 2024

MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization
Shuaijie She, Wei Zou, Shujian Huang, Wenhao Zhu, Xiang Liu, Xiang Geng, Jiajun Chen
Tags: LRM
12 Jan 2024

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, ..., Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Narain Sohl-Dickstein, Noah Fiedel
Tags: ALM, LRM, ReLM, SyDa
11 Dec 2023