ResearchTrend.AI

Multi-turn Reinforcement Learning from Preference Human Feedback
23 May 2024
Lior Shani, Aviv Rosenberg, Asaf B. Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos

Papers citing "Multi-turn Reinforcement Learning from Preference Human Feedback"

30 / 30 papers shown
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Daniel Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu · OffRL · 26 Nov 2025

BD-Net: Has Depth-Wise Convolution Ever Been Applied in Binary Neural Networks?
DoYoung Kim, Jin-Seop Lee, Noo-Ri Kim, SungJoon Lee, Jee-Hyong Lee · MQ · 19 Nov 2025

Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques · 31 Oct 2025

Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding · OffRL, LLMAG · 29 Oct 2025

Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
Abhijnan Nath, Nikhil Krishnaswamy · 26 Oct 2025
GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, ..., Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang · 25 Oct 2025

Fine-Grained GRPO for Precise Preference Alignment in Flow Models
Yujie Zhou, Pengyang Ling, Jiazi Bu, Yibin Wang, Yuhang Zang, Jiaqi Wang, Li Niu, Guangtao Zhai · DiffM · 02 Oct 2025

POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization
Ziqing Wang, Yibo Wen, William Pattie, Xiao Luo, Weimin Wu, Jerry Yao-Chieh Hu, Abhishek Pandey, Han Liu, Kaize Ding · 26 Sep 2025

ReSURE: Regularizing Supervision Unreliability for Multi-turn Dialogue Fine-tuning
Yiming Du, Yifan Xiang, Bin Liang, Dahua Lin, Kam-Fai Wong, Fei Tan · OffRL · 27 Aug 2025
Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang · ELM · 08 Aug 2025

Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
Shanbo Cheng, Yu Bao, Longxiang Zhang, Yu Lu, Ningxin Peng, ..., Wenhao Zhu, Liehao Zou, Lu Lu, Yuping Wang, Yonghui Wu · VLM · 23 Jul 2025

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning
Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li · KELM, ReLM, LRM · 18 Jul 2025

PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Y. Jiang, Yuwen Xiong, Yufeng Yuan, Chao Xin, Wenyuan Xu, Yu Yue, Qianchuan Zhao, Lin Yan · LRM · 12 Jun 2025
Reinforce LLM Reasoning through Multi-Agent Reflection
Yurun Yuan, Tengyang Xie · LRM · 10 Jun 2025

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective
Shenghua He, Tian Xia, Xuan Zhou, Hui Wei · OffRL · 03 Jun 2025

Accelerating Nash Learning from Human Feedback via Mirror Prox
D. Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Ménard · 26 May 2025

Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging
Hongjin Qian, Zhengyang Liang · RALM, LRM · 14 May 2025

On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, ..., Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Hamid Palangi, Tomas Pfister · 02 Apr 2025
Don't lie to your friends: Learning what you know from collaborative self-play
Jacob Eisenstein, Reza Aghajani, Adam Fisch, Dheeru Dua, Fantine Huot, Mirella Lapata, Vicky Zayats, Jonathan Berant · 18 Mar 2025

Learning from Failures in Multi-Attempt Reinforcement Learning
Stephen Chung, Wenyu Du, Jie Fu · LRM · 04 Mar 2025

M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality
Ziyan Wang, Zhicheng Zhang, Fei Fang, Yali Du · 03 Mar 2025

Self-rewarding correction for mathematical reasoning
Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang · ReLM, KELM, LRM · 26 Feb 2025

Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, Dong Yu · 24 Feb 2025
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher · 18 Feb 2025

CollabLLM: From Passive Responders to Active Collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, J. Leskovec, Jianfeng Gao · 02 Feb 2025

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking
Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, Rohin Shah · 22 Jan 2025

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models · International Conference on Learning Representations (ICLR), 2024
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Rameswar Panda · OffRL · 23 Oct 2024
Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization
Guanlin Liu, Kaixuan Ji, Ning Dai, Zheng Wu, Chen Dun, Quanquan Gu, Lin Yan · OffRL, LRM · 11 Oct 2024

Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF · International Conference on Learning Representations (ICLR), 2024
Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun · OffRL · 06 Oct 2024

Reinforcement Learning for Generative AI: A Survey
Yuanjiang Cao, Quan.Z Sheng, Julian McAuley, Lina Yao · SyDa · 28 Aug 2023