Learning to summarize from human feedback
arXiv: 2009.01325
2 September 2020
Nisan Stiennon
Long Ouyang
Jeff Wu
Daniel M. Ziegler
Ryan J. Lowe
Chelsea Voss
Alec Radford
Dario Amodei
Paul Christiano
ALM
Papers citing "Learning to summarize from human feedback"
50 / 1,440 papers shown
Empirical Privacy Variance
Yuzheng Hu
Fan Wu
Ruicheng Xian
Yuhang Liu
Lydia Zakynthinou
Pritish Kamath
Chiyuan Zhang
David A. Forsyth
64
0
0
16 Mar 2025
From Demonstrations to Rewards: Alignment Without Explicit Human Preferences
Siliang Zeng
Yao Liu
Huzefa Rangwala
George Karypis
Mingyi Hong
Rasool Fakoor
49
2
0
15 Mar 2025
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartáč
Mateusz Lango
Ondrej Dusek
ELM
51
1
0
14 Mar 2025
RankPO: Preference Optimization for Job-Talent Matching
Yuyao Zhang
Hao Wu
Yu Wang
Xiaohui Wang
51
0
0
13 Mar 2025
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model
Qiyuan Deng
X. Bai
Kehai Chen
Yaowei Wang
Liqiang Nie
Min Zhang
OffRL
66
0
0
13 Mar 2025
Ensemble Learning for Large Language Models in Text and Code Generation: A Survey
Mari Ashiga
Wei Jie
Fan Wu
Vardan K. Voskanyan
Fateme Dinmohammadi
P. Brookes
Jingzhi Gong
Zheng Wang
44
0
0
13 Mar 2025
Take Off the Training Wheels! Progressive In-Context Learning for Effective Alignment
Zhenyu Liu
Dongfang Li
Xinshuo Hu
X. Zhao
Yibin Chen
Baotian Hu
Min-Ling Zhang
49
1
0
13 Mar 2025
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei
Yijun Yang
Junliang Xing
Yuanchun Shi
Zongqing Lu
Deheng Ye
OffRL
LRM
49
1
0
11 Mar 2025
Robust Multi-Objective Controlled Decoding of Large Language Models
Seongho Son
William Bankes
Sangwoong Yoon
Shyam Sundhar Ramesh
Xiaohang Tang
Ilija Bogunovic
39
0
0
11 Mar 2025
LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model
Xinxin Zhao
Haoyang Li
J. Zhang
Xinmei Huang
Tieying Zhang
Jianjun Chen
Rui Shi
C. Li
Hong Chen
52
0
0
10 Mar 2025
Mitigating Preference Hacking in Policy Optimization with Pessimism
Dhawal Gupta
Adam Fisch
Christoph Dann
Alekh Agarwal
76
0
0
10 Mar 2025
Sometimes the Model doth Preach: Quantifying Religious Bias in Open LLMs through Demographic Analysis in Asian Nations
Hari Shankar
Vedanta S P
Tejas Cavale
Ponnurangam Kumaraguru
Abhijnan Chakraborty
63
0
0
10 Mar 2025
Combinatorial Optimization via LLM-driven Iterated Fine-tuning
Pranjal Awasthi
Sreenivas Gollapudi
Ravi Kumar
Kamesh Munagala
68
0
0
10 Mar 2025
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Zelei Cheng
Xin-Qiang Cai
Yuting Tang
Pushi Zhang
Boming Yang
Masashi Sugiyama
Xinyu Xing
49
0
0
10 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko
Tianyi Chen
Sungnyun Kim
Tianyu Ding
Luming Liang
Ilya Zharkov
Se-Young Yun
VLM
186
0
0
10 Mar 2025
Alignment for Efficient Tool Calling of Large Language Models
Hongshen Xu
Zihan Wang
Zichen Zhu
Lei Pan
Xingyu Chen
Lu Chen
Kai Yu
49
0
0
09 Mar 2025
Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting
Yufei Li
John Nham
Ganesh Jawahar
Lei Shu
David C. Uthus
Yun-hsuan Sung
Chengrun Yang
Itai Rolnick
Yi Qiao
Cong Liu
OffRL
65
0
0
09 Mar 2025
ROCM: RLHF on consistency models
Shivanshu Shekhar
Tong Zhang
40
0
0
08 Mar 2025
Language Model Personalization via Reward Factorization
Idan Shenfeld
Felix Faltings
Pulkit Agrawal
Aldo Pacchiano
48
1
0
08 Mar 2025
Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
Hyungkyu Kang
Min-hwan Oh
OffRL
47
0
0
07 Mar 2025
L²M: Mutual Information Scaling Law for Long-Context Language Modeling
Zhuo Chen
Oriol Mayné i Comas
Zhuotao Jin
Di Luo
Marin Soljacic
67
1
0
06 Mar 2025
An Empirical Study on Eliciting and Improving R1-like Reasoning Models
Z. Chen
Yingqian Min
Beichen Zhang
Jie Chen
Jinhao Jiang
...
Xu Miao
Yunfan LU
Lei Fang
Zhongyuan Wang
Zhicheng Dou
ReLM
OffRL
LRM
83
17
0
06 Mar 2025
Mixed Likelihood Variational Gaussian Processes
Kaiwen Wu
Craig Sanders
Benjamin Letham
Phillip Guan
79
0
0
06 Mar 2025
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Ruizhe Chen
Wenhao Chai
Zhifei Yang
Xiaotian Zhang
Qiufeng Wang
Tony Q.S. Quek
Soujanya Poria
Zuozhu Liu
50
0
0
06 Mar 2025
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
Erik Jones
Arjun Patrawala
Jacob Steinhardt
49
0
0
06 Mar 2025
LLMs Can Generate a Better Answer by Aggregating Their Own Responses
Zichong Li
Xinyu Feng
Yuheng Cai
Zixuan Zhang
Tianyi Liu
Chen Liang
Weizhu Chen
Haoyu Wang
T. Zhao
LRM
55
1
0
06 Mar 2025
Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Mahfuz Ahmed Anik
Abdur Rahman
Azmine Toushik Wasi
Md Manjurul Ahsan
47
0
0
05 Mar 2025
Human Implicit Preference-Based Policy Fine-tuning for Multi-Agent Reinforcement Learning in USV Swarm
H. Kim
Kanghoon Lee
J. Park
Jiachen Li
Jinkyoo Park
62
1
0
05 Mar 2025
Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Alessio Galatolo
Zhenbang Dai
Katie Winkle
Meriem Beloucif
55
0
0
05 Mar 2025
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Songming Zhang
Xue Zhang
Tong Zhang
Bojie Hu
Yufeng Chen
Jinan Xu
52
1
0
04 Mar 2025
Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang
Yuhan Liu
Haryadi S. Gunawi
Beibin Li
Changho Hwang
CLL
OnRL
103
0
0
03 Mar 2025
Dynamic Search for Inference-Time Alignment in Diffusion Models
Xiner Li
Masatoshi Uehara
Xingyu Su
Gabriele Scalia
Tommaso Biancalani
Aviv Regev
Sergey Levine
Shuiwang Ji
47
0
0
03 Mar 2025
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
Gokul Swamy
Sanjiban Choudhury
Wen Sun
Zhiwei Steven Wu
J. Andrew Bagnell
OffRL
47
7
0
03 Mar 2025
PABBO: Preferential Amortized Black-Box Optimization
Xinyu Zhang
Daolang Huang
Samuel Kaski
Julien Martinelli
34
0
0
02 Mar 2025
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
Wenjie Qiu
Yi-Chen Li
Xuqin Zhang
Tianyi Zhang
Y. Zhang
Zongzhang Zhang
Yang Yu
ALM
51
0
0
01 Mar 2025
Distributionally Robust Reinforcement Learning with Human Feedback
Debmalya Mandal
Paulius Sasnauskas
Goran Radanović
39
1
0
01 Mar 2025
Robust Multi-Objective Preference Alignment with Online DPO
Raghav Gupta
Ryan Sullivan
Yunxuan Li
Samrat Phatale
Abhinav Rastogi
42
0
0
01 Mar 2025
Plan2Align: Predictive Planning Based Test-Time Preference Alignment in Paragraph-Level Machine Translation
Kuang-Da Wang
Teng-Ruei Chen
Yu-Heng Hung
Shuoyang Ding
Yueh-Hua Wu
Yu-Chun Wang
Chao-Han Huck Yang
Wen-Chih Peng
Ping-Chun Hsieh
74
0
0
28 Feb 2025
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
Shalev Lifshitz
Sheila A. McIlraith
Yilun Du
LRM
57
6
0
27 Feb 2025
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stańczak
Nicholas Meade
Mehar Bhatia
Hattie Zhou
Konstantin Böttinger
...
Timothy P. Lillicrap
Ana Marasović
Sylvie Delacroix
Gillian K. Hadfield
Siva Reddy
164
0
0
27 Feb 2025
Multi-Turn Code Generation Through Single-Step Rewards
A. Jain
Gonzalo Gonzalez-Pumariega
Wayne Chen
Alexander M. Rush
Wenting Zhao
Sanjiban Choudhury
LRM
47
1
0
27 Feb 2025
OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
Jiaxin Deng
Shiyao Wang
Kuo Cai
Lejian Ren
Qigen Hu
Weifeng Ding
Qiang Luo
Guorui Zhou
79
3
0
26 Feb 2025
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Jiani Zheng
Lu Wang
Fangkai Yang
C. Zhang
Lingrui Mei
Wenjie Yin
Qingwei Lin
Dongmei Zhang
Saravan Rajmohan
Qi Zhang
OffRL
64
2
0
26 Feb 2025
Controlled Diversity: Length-optimized Natural Language Generation
Diana Marie Schenke
Timo Baumann
49
0
0
26 Feb 2025
When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
Yijiang River Dong
Tiancheng Hu
Yinhong Liu
Ahmet Üstün
Nigel Collier
86
1
0
26 Feb 2025
What is the Alignment Objective of GRPO?
Milan Vojnovic
Se-Young Yun
70
2
0
25 Feb 2025
Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data
Siqi Guo
Ilgee Hong
Vicente Balmaseda
Changlong Yu
Liang Qiu
Xin Liu
Haoming Jiang
Tuo Zhao
Tianbao Yang
50
0
0
25 Feb 2025
CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
Shuming Shi
Ruobing Zuo
Gaolei He
Jianlin Wang
Chenyang Xu
Zhengfeng Yang
65
0
0
25 Feb 2025
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
Tianze Wang
Dongnan Gui
Yifan Hu
Shuhang Lin
Linjun Zhang
40
0
0
25 Feb 2025
NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
Yashan Wang
Shangda Wu
Jianhuai Hu
Xingjian Du
Yueqi Peng
Yongxin Huang
Shuai Fan
Xiaobing Li
Feng Yu
Maosong Sun
107
2
0
25 Feb 2025