Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.19733
Cited By
Iterative Reasoning Preference Optimization
30 April 2024
Richard Yuanzhe Pang
Weizhe Yuan
Kyunghyun Cho
He He
Sainbayar Sukhbaatar
Jason Weston
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Iterative Reasoning Preference Optimization"
50 / 81 papers shown
Title
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen
Xiaopeng Li
Z. Li
Xi Chen
Tianyi Lin
0
0
0
16 May 2025
Reasoning Models Don't Always Say What They Think
Yanda Chen
Joe Benton
Ansh Radhakrishnan
Jonathan Uesato
Carson E. Denison
...
Vlad Mikulik
Samuel R. Bowman
Jan Leike
Jared Kaplan
E. Perez
ReLM
LRM
68
12
1
08 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
39
0
0
05 May 2025
Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization
Wataru Kawakami
Keita Suzuki
Junichiro Iwasawa
LRM
75
0
0
25 Apr 2025
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao
B. Zhu
Qianru Sun
Hanwang Zhang
MLLM
LRM
86
0
0
25 Apr 2025
LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
Yuan-Hong Liao
Sven Elflein
Liu He
Laura Leal-Taixe
Yejin Choi
Sanja Fidler
David Acuna
ReLM
LRM
VLM
129
0
0
21 Apr 2025
Teaching Large Language Models to Reason through Learning and Forgetting
Tianwei Ni
Allen Nie
Sapana Chaudhary
Yao Liu
Huzefa Rangwala
Rasool Fakoor
ReLM
CLL
LRM
142
0
0
15 Apr 2025
ReZero: Enhancing LLM search ability by trying one-more-time
Alan Dao
Thinh Le
RALM
LRM
40
1
0
15 Apr 2025
Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Shuai Zhao
Linchao Zhu
Yi Yang
39
2
0
14 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
45
2
0
12 Apr 2025
Playpen: An Environment for Exploring Learning Through Conversational Interaction
Nicola Horst
Davide Mazzaccara
Antonia Schmidt
Michael Sullivan
Filippo Momentè
...
Alexander Koller
Oliver Lemon
David Schlangen
Mario Giulianelli
Alessandro Suglia
OffRL
34
0
0
11 Apr 2025
Algorithm Discovery With LLMs: Evolutionary Search Meets Reinforcement Learning
Anja Surina
Amin Mansouri
Lars Quaedvlieg
Amal Seddas
Maryna Viazovska
Emmanuel Abbe
Çağlar Gülçehre
38
1
0
07 Apr 2025
Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Ran Xu
W. Shi
Yuchen Zhuang
Yue Yu
Joyce C. Ho
Haoyu Wang
Carl Yang
26
1
0
07 Apr 2025
Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Pedro Ferreira
Wilker Aziz
Ivan Titov
LRM
26
0
0
07 Apr 2025
Entropy-Based Adaptive Weighting for Self-Training
Xiaoxuan Wang
Yihe Deng
Mingyu Derek Ma
Wei Wang
LRM
52
0
0
31 Mar 2025
CONGRAD:Conflicting Gradient Filtering for Multilingual Preference Alignment
Jiangnan Li
Thuy-Trang Vu
Christian Herold
Amirhossein Tebbifakhr
Shahram Khadivi
Gholamreza Haffari
33
0
0
31 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
54
0
0
29 Mar 2025
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Songjun Tu
Jiahao Lin
Xiangyu Tian
Qichao Zhang
Linjing Li
...
Nan Xu
Wei He
Xiangyuan Lan
D. Jiang
Dongbin Zhao
LRM
58
3
0
17 Mar 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Weiyun Wang
Zhangwei Gao
L. Chen
Zhe Chen
Jinguo Zhu
...
Lewei Lu
Haodong Duan
Yu Qiao
Jifeng Dai
Wenhai Wang
LRM
65
11
0
13 Mar 2025
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin
Hansi Zeng
Zhenrui Yue
Dong Wang
Sercan Ö. Arik
Dong Wang
Hamed Zamani
J. Han
RALM
ReLM
KELM
OffRL
AI4TS
LRM
84
29
0
12 Mar 2025
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Sara Rajaee
Kumar Pratik
Gabriele Cesa
Arash Behboodi
OffRL
LRM
61
0
0
12 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko
Tianyi Chen
Sungnyun Kim
Tianyu Ding
Luming Liang
Ilya Zharkov
Se-Young Yun
VLM
168
0
0
10 Mar 2025
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Ling Team
B. Zeng
Chenyu Huang
Chao Zhang
Changxin Tian
...
Zhaoxin Huan
Zujie Wen
Zhenhang Sun
Zhuoxuan Du
Z. He
MoE
ALM
109
2
0
07 Mar 2025
Preference Learning Unlocks LLMs' Psycho-Counseling Skills
Mian Zhang
S. Eack
Zhiyu Zoey Chen
77
1
0
27 Feb 2025
Self-rewarding correction for mathematical reasoning
Wei Xiong
Hanning Zhang
Chenlu Ye
Lichang Chen
Nan Jiang
Tong Zhang
ReLM
KELM
LRM
75
9
0
26 Feb 2025
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang
Bingcong Li
Christoph Dann
Niao He
OffRL
85
0
0
26 Feb 2025
AMPO: Active Multi-Preference Optimization
Taneesh Gupta
Rahul Madhavan
Xuchao Zhang
Chetan Bansal
Saravan Rajmohan
55
0
0
25 Feb 2025
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Weizhe Yuan
Jane Dwivedi-Yu
Song Jiang
Karthik Padthe
Yang Li
...
Ilia Kulikov
Kyunghyun Cho
Yuandong Tian
Jason Weston
Xian Li
ReLM
LRM
64
10
0
24 Feb 2025
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao
Yige Yuan
Ziyang Chen
Mingxiao Li
Shangsong Liang
Z. Ren
V. Honavar
97
5
0
21 Feb 2025
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Shicong Cen
Jincheng Mei
Katayoon Goshvadi
Hanjun Dai
Tong Yang
Sherry Yang
Dale Schuurmans
Yuejie Chi
Bo Dai
OffRL
65
23
0
20 Feb 2025
Uncertainty-Aware Step-wise Verification with Generative Reward Models
Zihuiwen Ye
Luckeciano C. Melo
Younesse Kaddar
Phil Blunsom
Shivalika Singh
Yarin Gal
LRM
49
0
0
16 Feb 2025
Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models
Xin Zhou
Yiwen Guo
Ruotian Ma
Tao Gui
Qi Zhang
Xuanjing Huang
LRM
92
2
0
13 Feb 2025
PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Junbo Li
Zhangyang Wang
Qiang Liu
OffRL
104
0
0
09 Feb 2025
ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy
Yuhui Chen
Shuai Tian
Shugao Liu
Yingting Zhou
Haoran Li
Dongbin Zhao
OffRL
106
1
0
08 Feb 2025
Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping
Pu Yang
Yunzhen Feng
Ziyuan Chen
Yuhang Wu
Zhuoyuan Li
DiffM
101
0
0
31 Jan 2025
SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
Ran Xu
Hui Liu
Sreyashi Nag
Zhenwei Dai
Yaochen Xie
...
Chen Luo
Yang Li
Joyce C. Ho
Carl Yang
Qi He
RALM
73
8
0
28 Jan 2025
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi
Xiao-Chang Liu
Iat Long Iong
Hanyu Lai
Xingwu Sun
...
Shuntian Yao
Tianjie Zhang
Wei Xu
J. Tang
Yuxiao Dong
103
14
0
28 Jan 2025
Geometric-Averaged Preference Optimization for Soft Preference Labels
Hiroki Furuta
Kuang-Huei Lee
Shixiang Shane Gu
Y. Matsuo
Aleksandra Faust
Heiga Zen
Izzeddin Gur
58
7
0
31 Dec 2024
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Xingyu Chen
Jiahao Xu
Tian Liang
Zhiwei He
Jianhui Pang
...
Z. Zhang
Rui Wang
Zhaopeng Tu
Haitao Mi
Dong Yu
LRM
ReLM
58
95
0
30 Dec 2024
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu
Junlong Li
Xiwen Zhang
Fan Zhou
Yu Cheng
Junxian He
ReLM
LRM
41
11
0
23 Dec 2024
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Weihao Zeng
Yuzhen Huang
Lulu Zhao
Yijun Wang
Zifei Shan
Junxian He
LRM
43
7
0
23 Dec 2024
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
Umar Khalid
Hasan Iqbal
Azib Farooq
Nazanin Rahnavard
Jing Hua
...
H. Iqbal
Azib Farooq
Nazanin Rahnavard
Jing Hua
Chen Chen
77
0
0
13 Dec 2024
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
Yogesh Kulkarni
Pooyan Fazli
VLM
103
2
0
01 Dec 2024
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu
Zhengxing Chen
Aston Zhang
L Tan
Chenguang Zhu
...
Suchin Gururangan
Chao-Yue Zhang
Melanie Kambadur
Dhruv Mahajan
Rui Hou
LRM
ALM
96
16
0
25 Nov 2024
Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning
Song Jiang
Da JU
Andrew Cohen
Sasha Mitts
Aaron Foss
Justine T Kao
Xian Li
Yuandong Tian
67
2
0
21 Nov 2024
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang
Zhe Chen
Wenhai Wang
Yue Cao
Yangzhou Liu
...
Jinguo Zhu
X. Zhu
Lewei Lu
Yu Qiao
Jifeng Dai
LRM
62
48
1
15 Nov 2024
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
Yihe Deng
Paul Mineiro
LRM
26
3
0
29 Oct 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Haodong Duan
Conghui He
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
34
7
0
23 Oct 2024
Understanding Likelihood Over-optimisation in Direct Alignment Algorithms
Zhengyan Shi
Sander Land
Acyr Locatelli
Matthieu Geist
Max Bartolo
46
4
0
15 Oct 2024
Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu
Janice Lan
Weizhe Yuan
Jiantao Jiao
Jason Weston
Sainbayar Sukhbaatar
LRM
24
15
0
14 Oct 2024
1
2
Next