Learning to summarize from human feedback

2 September 2020
Nisan Stiennon
Long Ouyang
Jeff Wu
Daniel M. Ziegler
Ryan J. Lowe
Chelsea Voss
Alec Radford
Dario Amodei
Paul Christiano
    ALM
arXiv: 2009.01325 (abs / PDF / HTML)

Papers citing "Learning to summarize from human feedback"

50 / 1,548 papers shown
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences
Soichiro Nishimori
Yu Zhang
Thanawat Lodkaew
Masashi Sugiyama
NoLa
46
0
0
30 May 2025
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
Jingyan Shen
Jiarui Yao
Rui Yang
Yifan Sun
Feng Luo
Boyao Wang
Tong Zhang
Han Zhao
38
0
0
30 May 2025
Whispers of Many Shores: Cultural Alignment through Collaborative Cultural Expertise
Shuai Feng
Wei-Chuang Chan
Srishti Chouhan
Junior Francisco Garcia Ayala
Srujananjali Medicherla
Kyle Clark
Mingwei Shi
40
0
0
30 May 2025
Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models
Lang Cao
Jingxian Xu
Hanbing Liu
Jinyu Wang
Mengyu Zhou
Haoyu Dong
Shi Han
Dongmei Zhang
LRM, OffRL, LMTD, ReLM
68
0
0
29 May 2025
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten
M. E. Ildiz
Xuechen Zhang
Hrayr Harutyunyan
A. S. Rawat
Samet Oymak
LRM
79
0
0
29 May 2025
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective
Sheng Ouyang
Yulan Hu
Ge Chen
Qingyang Li
Fuzheng Zhang
Yong Liu
41
0
0
29 May 2025
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Yuu Jinnai
OT
50
0
0
29 May 2025
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Yunqiao Yang
Houxing Ren
Zimu Lu
Ke Wang
Weikang Shi
A-Long Zhou
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
60
0
0
29 May 2025
Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data
Seohyeong Lee
Eunwon Kim
Hwaran Lee
Buru Chang
86
0
0
29 May 2025
Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Kaiyang Guo
Yinchuan Li
Zhitang Chen
75
0
0
29 May 2025
Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Tian Qin
Core Francisco Park
Mujin Kwun
Aaron Walsman
Eran Malach
Nikhil Anand
Hidenori Tanaka
David Alvarez-Melis
ReLM, OffRL, LRM
93
0
0
28 May 2025
ValueSim: Generating Backstories to Model Individual Value Systems
Bangde Du
Ziyi Ye
Zhijing Wu
Jankowska Monika
Shuqi Zhu
Qingyao Ai
Yujia Zhou
Yiqun Liu
23
0
0
28 May 2025
Text2Grad: Reinforcement Learning from Natural Language Feedback
Hanyang Wang
Lu Wang
Chaoyun Zhang
Tianjun Mao
Si Qin
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
88
0
0
28 May 2025
Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy
Saleh Afzoon
Zahra Jahanandish
Phuong Thao Huynh
Amin Beheshti
Usman Naseem
58
0
0
28 May 2025
SquareχPO: Differentially Private and Robust χ²-Preference Optimization in Offline Direct Alignment
Xingyu Zhou
Yulian Wu
Wenqian Weng
Francesco Orabona
85
0
0
27 May 2025
The Multilingual Divide and Its Impact on Global AI Safety
Aidan Peppin
Julia Kreutzer
Alice Schoenauer Sebag
Kelly Marchisio
Beyza Ermis
...
Wei-Yin Ko
Ahmet Üstün
Matthias Gallé
Marzieh Fadaee
Sara Hooker
ELM
79
1
0
27 May 2025
Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities
Junyan Zhang
Yubo Gao
Yibo Yan
Jungang Li
Zhaorui Hou
...
Shuliang Liu
Song Dai
Yonghua Hei
Junzhuo Li
Xuming Hu
67
0
0
27 May 2025
Multi-objective Large Language Model Alignment with Hierarchical Experts
Zhuo Li
Guodong DU
Weiyang Guo
Yigeng Zhou
Xiucheng Li
...
Fangming Liu
Yequan Wang
Deheng Ye
Min Zhang
Jing Li
ALM, MoE
89
0
0
27 May 2025
Breaking the Performance Ceiling in Complex Reinforcement Learning requires Inference Strategies
Félix Chalumeau
Daniel Rajaonarivonivelomanantsoa
Ruan de Kock
Claude Formanek
Sasha Abramowitz
...
Refiloe Shabe
Arnol Fokam
Siddarth S. Singh
Ulrich A. Mbou Sob
Arnu Pretorius
72
0
0
27 May 2025
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Geon-hyeong Kim
Youngsoo Jang
Yu Jin Kim
Byoungjip Kim
Honglak Lee
Kyunghoon Bae
Moontae Lee
36
2
0
26 May 2025
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi
Minhak Song
Runlong Zhou
Zihan Zhang
Maryam Fazel
S. S. Du
79
0
0
26 May 2025
Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
Mengdi Li
Jiaye Lin
Xufeng Zhao
Wenhao Lu
P. Zhao
S. Wermter
Di Wang
52
0
0
26 May 2025
Frictional Agent Alignment Framework: Slow Down and Don't Break Things
Abhijnan Nath
Carine Graff
Andrei Bachinin
Nikhil Krishnaswamy
123
1
0
26 May 2025
What Can RL Bring to VLA Generalization? An Empirical Study
Jijia Liu
Feng Gao
Bingwen Wei
Xinlei Chen
Qingmin Liao
Yi Wu
Chao Yu
Yu Wang
OffRL
317
0
0
26 May 2025
Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Yi Liu
Dianqing Liu
Mingye Zhu
Junbo Guo
Yongdong Zhang
Zhendong Mao
112
0
0
26 May 2025
Learning a Pessimistic Reward Model in RLHF
Yinglun Xu
Hangoo Kang
Tarun Suresh
Yuxuan Wan
Gagandeep Singh
OffRL
71
0
0
26 May 2025
Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
Y. Zhang
Yu Yu
Bo Tang
Yu Zhu
Chuxiong Sun
...
Jie Hu
Zipeng Xie
Zhiyu Li
Feiyu Xiong
Edward Chung
108
0
0
26 May 2025
Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers
Rihui Xin
Han Liu
Zecheng Wang
Yupeng Zhang
Dianbo Sui
Xiaolin Hu
Bingning Wang
SyDa
73
1
0
26 May 2025
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
Fengqi Zhu
Rongzhen Wang
Shen Nie
Xiaolu Zhang
Chunwei Wu
...
Jun Zhou
Jianfei Chen
Yankai Lin
Ji-Rong Wen
Chongxuan Li
197
2
0
25 May 2025
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Xiaoqiang Lin
Arun Verma
Zhongxiang Dai
Daniela Rus
See-Kiong Ng
Bryan Kian Hsiang Low
275
0
0
25 May 2025
Incentivizing High-Quality Human Annotations with Golden Questions
Shang Liu
Zhongze Cai
Hanzhao Wang
Zhongyao Ma
Xiaocheng Li
84
0
0
25 May 2025
Flex-Judge: Think Once, Judge Anywhere
Jongwoo Ko
S. Kim
Sungwoo Cho
Se-Young Yun
ELM, LRM
220
0
0
24 May 2025
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Meng Li
Guangda Huzhang
Haibo Zhang
Xiting Wang
Anxiang Zeng
47
0
0
24 May 2025
MOSLIM: Align with diverse preferences in prompts through reward classification
Yu Zhang
Wanli Jiang
Zhengyu Yang
30
1
0
24 May 2025
GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
C. Wang
Xiaoran Pan
Zihao Pan
Haofan Wang
Yiren Song
LRM
158
0
0
24 May 2025
Towards Analyzing and Understanding the Limitations of VAPO: A Theoretical Perspective
Jintian Shao
YiMing Cheng
Hongyi Huang
Beiwen Zhang
ZhiYu Wu
You Shan
Mingkai Zheng
LRM
83
0
0
23 May 2025
Dynamic Risk Assessments for Offensive Cybersecurity Agents
Boyi Wei
Benedikt Stroebl
Jiacen Xu
Joie Zhang
Zhou Li
Peter Henderson
88
0
0
23 May 2025
Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Hongji Yang
Yucheng Zhou
Wencheng Han
Jianbing Shen
55
0
0
22 May 2025
Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation
Seamus Somerstep
Vinod Raman
Unique Subedi
Yuekai Sun
76
0
0
22 May 2025
Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator
Beier Luo
Shuoyuan Wang
Yixuan Li
Jianguo Huang
70
0
0
22 May 2025
Latent Principle Discovery for Language Model Self-Improvement
Keshav Ramji
Tahira Naseem
Ramón Fernandez Astudillo
LRM
113
0
0
22 May 2025
Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
Ilgee Hong
Changlong Yu
Liang Qiu
Weixiang Yan
Zhenghao Xu
...
Qingru Zhang
Qin Lu
Xin Liu
Chao Zhang
Tuo Zhao
OffRL, ReLM, LRM
88
0
0
22 May 2025
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Shivam Agarwal
Zimin Zhang
Lifan Yuan
Jiawei Han
Hao Peng
180
8
0
21 May 2025
Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision
Eric Hanchen Jiang
Haozheng Luo
Shengyuan Pang
Xiaomin Li
Zhenting Qi
...
Zongyu Lin
Xinfeng Li
Hao Xu
Kai-Wei Chang
Ying Nian Wu
LRM
130
0
0
21 May 2025
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song
Amir Moeini
Peng Wang
Lei Gong
Rohan Chandra
Yanjun Qi
Shangtong Zhang
ReLM, LRM
40
3
0
21 May 2025
Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition
Dong Won Lee
Hae Won Park
C. Breazeal
Louis-Philippe Morency
60
0
0
21 May 2025
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan
Jin Jiang
Zhenbang Ren
Yijun Li
Xudong Cai
...
Mengdi Zhang
Jian Shao
Yongliang Shen
Jun Xiao
Yueting Zhuang
OffRL, ALM, LRM
141
0
0
21 May 2025
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Jiaer Xia
Yuhang Zang
Peng Gao
Yixuan Li
Kaiyang Zhou
OffRL, ReLM, AI4TS, VLM, LRM
119
0
0
20 May 2025
Preference Learning with Lie Detectors can Induce Honesty or Evasion
Chris Cundy
Adam Gleave
53
0
0
20 May 2025
WikiPersonas: What Can We Learn From Personalized Alignment to Famous People?
Zilu Tang
Afra Feyza Akyürek
Ekin Akyürek
Derry Wijaya
121
0
0
19 May 2025