Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.00782
Cited By
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
30 June 2024
Zimu Lu
Aojun Zhou
Ke Wang
Houxing Ren
Weikang Shi
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning"
23 / 23 papers shown
Title
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Pengxiang Li
Zhi Gao
Bofei Zhang
Yapeng Mi
Xiaojian Ma
...
Tao Yuan
Yuwei Wu
Yunde Jia
Song-Chun Zhu
Qing Li
LLMAG
75
0
0
30 Apr 2025
LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
Zhuoshi Pan
Yu Li
Honglin Lin
Qizhi Pei
Zinan Tang
Wei Wu
Chenlin Ming
H. Vicky Zhao
Zeang Sheng
Lijun Wu
LRM
59
2
0
21 Mar 2025
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Sara Rajaee
Kumar Pratik
Gabriele Cesa
Arash Behboodi
OffRL
LRM
63
0
0
12 Mar 2025
Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models
Joykirat Singh
Tanmoy Chakraborty
A. Nambi
AI4Cl
LRM
ReLM
60
1
0
04 Mar 2025
Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Jiazheng Li
Yuxiang Zhou
Junru Lu
Gladys Tyen
Lin Gui
Cesare Aloisi
Yulan He
LRM
39
2
0
26 Feb 2025
Self-rewarding correction for mathematical reasoning
Wei Xiong
Hanning Zhang
Chenlu Ye
Lichang Chen
Nan Jiang
Tong Zhang
ReLM
KELM
LRM
75
10
0
26 Feb 2025
CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Shuming Shi
Ruobing Zuo
Gaolei He
Jianlin Wang
Chenyang Xu
Zhengfeng Yang
71
0
0
25 Feb 2025
Eeyore: Realistic Depression Simulation via Supervised and Preference Optimization
Siyang Liu
Bianca Brie
Wenda Li
Laura Biester
Andrew Lee
J. Pennebaker
Rada Mihalcea
45
0
0
21 Feb 2025
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu
Luca Viano
Yihang Chen
Zhenyu Zhu
Kimon Antonakopoulos
Quanquan Gu
V. Cevher
62
0
0
18 Feb 2025
PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Junbo Li
Zhangyang Wang
Qiang Liu
OffRL
108
0
0
09 Feb 2025
Mars-PO: Multi-Agent Reasoning System Preference Optimization
Xiaoxuan Lou
Chaojie Wang
Bo An
LLMAG
LRM
74
0
0
28 Nov 2024
Process Reward Model with Q-Value Rankings
W. Li
Yixuan Li
LRM
62
15
0
15 Oct 2024
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
Zimu Lu
Aojun Zhou
Ke Wang
Houxing Ren
Weikang Shi
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
79
10
0
10 Oct 2024
Subtle Errors Matter: Preference Learning via Error-injected Self-editing
Kaishuai Xu
Tiezheng YU
Wenjun Hou
Yi Cheng
Chak Tou Leong
Liangyou Li
Xin Jiang
Lifeng Shang
Qun Liu
Wenjie Li
LRM
223
0
0
09 Oct 2024
Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards
Heejin Do
Sangwon Ryu
Gary Geunbae Lee
34
2
0
26 Sep 2024
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Pranav Putta
Edmund Mills
Naman Garg
S. Motwani
Chelsea Finn
Divyansh Garg
Rafael Rafailov
LLMAG
LRM
31
69
0
13 Aug 2024
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs
Zimu Lu
Aojun Zhou
Houxing Ren
Ke Wang
Weikang Shi
Junting Pan
Mingjie Zhan
Hongsheng Li
SyDa
LRM
53
45
0
26 Feb 2024
KTO: Model Alignment as Prospect Theoretic Optimization
Kawin Ethayarajh
Winnie Xu
Niklas Muennighoff
Dan Jurafsky
Douwe Kiela
182
463
0
02 Feb 2024
Self-Rewarding Language Models
Weizhe Yuan
Richard Yuanzhe Pang
Kyunghyun Cho
Xian Li
Sainbayar Sukhbaatar
Jing Xu
Jason Weston
ReLM
SyDa
ALM
LRM
244
304
0
18 Jan 2024
Language Models are Multilingual Chain-of-Thought Reasoners
Freda Shi
Mirac Suzgun
Markus Freitag
Xuezhi Wang
Suraj Srivats
...
Yi Tay
Sebastian Ruder
Denny Zhou
Dipanjan Das
Jason W. Wei
ReLM
LRM
174
337
0
06 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
384
12,150
0
04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
447
8,699
0
28 Jan 2022
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
301
1,620
0
18 Sep 2019
1