Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (arXiv:2204.05862)

12 April 2022
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
Nova Dassarma
Dawn Drain
Stanislav Fort
Deep Ganguli
T. Henighan
Nicholas Joseph
Saurav Kadavath
John Kernion
Tom Conerly
S. E. Showk
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
Tristan Hume
Scott R. Johnston
Shauna Kravec
Liane Lovitt
Neel Nanda
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
ArXiv (abs) · PDF · HTML

Papers citing "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"

Showing 50 of 655 citing papers.
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Wei Shen
Guanlin Liu
Zheng Wu
Ruofei Zhu
Qingping Yang
Chao Xin
Yu Yue
Lin Yan
164
14
0
28 Mar 2025
3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
Yize Zhang
Mengchen Zhang
Tong Wu
Tengfei Wang
Gordon Wetzstein
Dahua Lin
Ziwei Liu
ELM
200
1
0
27 Mar 2025
MultiScale Contextual Bandits for Long Term Objectives
Richa Rastogi
Yuta Saito
Thorsten Joachims
OffRL
86
0
0
22 Mar 2025
A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications
Jian Guan
Jian Wu
Jia-Nan Li
Chuanqi Cheng
Wei Wu
LM&MA
181
3
0
21 Mar 2025
HAPI: A Model for Learning Robot Facial Expressions from Human Preferences
Dongsheng Yang
Qianying Liu
Wataru Sato
Takashi Minato
Chaoran Liu
Shin’ya Nishida
63
0
0
21 Mar 2025
Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning
Chen Li
Nazhou Liu
Kai Yang
138
10
0
20 Mar 2025
From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment
Jia-Nan Li
Jian Guan
Songhao Wu
Wei Wu
Rui Yan
175
3
0
19 Mar 2025
Rolling Forward: Enhancing LightGCN with Causal Graph Convolution for Credit Bond Recommendation
Ashraf Ghiye
Baptiste Barreau
Laurent Carlier
Michalis Vazirgiannis
135
7
0
18 Mar 2025
MAP: Multi-user Personalization with Collaborative LLM-powered Agents
Christine P. Lee
Jihye Choi
Bilge Mutlu
LLMAG
180
1
1
17 Mar 2025
D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Jia Zhang
Chen-Xi Zhang
Yang Liu
Yi-Xuan Jin
Xiao-Wen Yang
Bo Zheng
Yi Liu
Lan-Zhe Guo
145
3
0
14 Mar 2025
DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran
Hieu Minh "Jord" Nguyen
Akash Kundu
Sami Jawhar
Jinsuk Park
Mateusz Maria Jurewicz
105
3
0
13 Mar 2025
Fine-Tuning Diffusion Generative Models via Rich Preference Optimization
Hanyang Zhao
Haoxian Chen
Yucheng Guo
Genta Indra Winata
Tingting Ou
Ziyu Huang
D. Yao
Wenpin Tang
141
0
0
13 Mar 2025
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Qiyuan Deng
X. Bai
Kehai Chen
Yaowei Wang
Liqiang Nie
Min Zhang
OffRL
123
0
0
13 Mar 2025
Prompt Inversion Attack against Collaborative Inference of Large Language Models
Wenjie Qu
Yuguang Zhou
Yongji Wu
Tingsong Xiao
Binhang Yuan
Yongbin Li
Jiaheng Zhang
138
0
0
12 Mar 2025
Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
Kechun Xu
Xunlong Xia
Kaixuan Wang
Yifei Yang
Yunxuan Mao
Bing Deng
R. Xiong
Yansen Wang
OffRL
193
0
0
12 Mar 2025
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Sara Rajaee
Kumar Pratik
Gabriele Cesa
Arash Behboodi
OffRL, LRM
125
0
0
12 Mar 2025
A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models
Miao Zhang
Zhenlong Fang
Tianyi Wang
Qin Zhang
Shuai Lu
Junfeng Jiao
Tianyu Shi
AI4CE
123
5
0
11 Mar 2025
UC-MOA: Utility-Conditioned Multi-Objective Alignment for Distributional Pareto-Optimality
Zelei Cheng
Xin-Qiang Cai
Yuting Tang
Pushi Zhang
Boming Yang
Masashi Sugiyama
Xinyu Xing
157
0
0
10 Mar 2025
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko
Tianyi Chen
Sungnyun Kim
Tianyu Ding
Luming Liang
Ilya Zharkov
Se-Young Yun
VLM
464
2
0
10 Mar 2025
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Wenzhuo Xu
Zhipeng Wei
Xiongtao Sun
Deyue Zhang
Dongdong Yang
Quanchen Zou
Xinming Zhang
AAML
92
0
0
10 Mar 2025
Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
Anar Yeginbergen
Maite Oronoz
Rodrigo Agerri
138
0
0
07 Mar 2025
Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
Hyungkyu Kang
Min-hwan Oh
OffRL
121
0
0
07 Mar 2025
Superintelligence Strategy: Expert Version
Dan Hendrycks
Eric Schmidt
Alexandr Wang
118
3
0
07 Mar 2025
DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models
Ruizhe Chen
Wenhao Chai
Zhifei Yang
Xiaotian Zhang
Qiufeng Wang
Tony Q.S. Quek
Soujanya Poria
Zuozhu Liu
144
1
0
06 Mar 2025
SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning
Chen Li
Yinyi Luo
Anudeep Bolimera
Uzair Ahmed
Siyang Song
Hrishikesh Gokhale
Marios Savvides
LRM, AI4CE
128
1
0
06 Mar 2025
Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Mahfuz Ahmed Anik
Abdur Rahman
Azmine Toushik Wasi
Md Manjurul Ahsan
96
5
0
05 Mar 2025
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Songming Zhang
Xue Zhang
Tong Zhang
Bojie Hu
Yufeng Chen
Jinan Xu
125
1
0
04 Mar 2025
Instruct-of-Reflection: Enhancing Large Language Models Iterative Reflection Capabilities via Dynamic-Meta Instruction
Liping Liu
Chunhong Zhang
Likang Wu
Chuang Zhao
Zheng Hu
Ming He
Jianping Fan
LLMAG, LRM
73
2
0
02 Mar 2025
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
Wenjie Qiu
Yi-Chen Li
Xuqin Zhang
Tianyi Zhang
Yiming Zhang
Zongzhang Zhang
Yang Yu
ALM
111
1
0
01 Mar 2025
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Zachary Yahn
Yichang Xu
Ling Liu
137
22
0
01 Mar 2025
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng
Xiaolong Jin
Jinyuan Jia
Xinsong Zhang
AAML
387
1
0
27 Feb 2025
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde
Alasdair Paren
Preetham Arvind
Maxime Kayser
Tom Rainforth
Thomas Lukasiewicz
Guohao Li
Philip Torr
Adel Bibi
122
2
0
26 Feb 2025
Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu
Xuandong Zhao
Chengyuan Yao
Han Wang
Qi Han
Yanghua Xiao
205
14
0
26 Feb 2025
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
Jiawei Huang
Bingcong Li
Christoph Dann
Niao He
OffRL
271
3
0
26 Feb 2025
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
Taneesh Gupta
Rahul Madhavan
Xuchao Zhang
Chetan Bansal
Saravan Rajmohan
115
0
0
25 Feb 2025
Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
Matthew Barker
Andrew Bell
Evan Thomas
James Carr
Thomas Andrews
Umang Bhatt
167
2
0
25 Feb 2025
Advantage-Guided Distillation for Preference Alignment in Small Language Models
Shiping Gao
Fanqi Wan
Jiajian Guo
Xiaojun Quan
Qifan Wang
ALM
159
0
0
25 Feb 2025
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models
Xu Chu
Zhixin Zhang
Tianyu Jia
Yujie Jin
145
0
0
25 Feb 2025
Aligning Compound AI Systems via System-level DPO
Xiangwen Wang
Yibo Jacky Zhang
Zhoujie Ding
Katherine Tsai
Haolun Wu
Sanmi Koyejo
71
1
0
24 Feb 2025
Is Free Self-Alignment Possible?
Dyah Adila
Changho Shin
Yijing Zhang
Frederic Sala
MoMe
201
2
0
24 Feb 2025
Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction
Michal Bravansky
Vaclav Kubon
Suhas Hariharan
Robert Kirk
136
1
0
24 Feb 2025
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents
Taiyi Wang
Zhihao Wu
Jianheng Liu
Jianye Hao
Jun Wang
Kun Shao
OffRL
126
29
0
24 Feb 2025
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Yuheng Zhang
Dian Yu
Tao Ge
Linfeng Song
Zhichen Zeng
Haitao Mi
Nan Jiang
Dong Yu
138
4
0
24 Feb 2025
RLTHF: Targeted Human Feedback for LLM Alignment
Yifei Xu
Tusher Chakraborty
Emre Kıcıman
Bibek Aryal
Eduardo Rodrigues
...
Rafael Padilha
Leonardo Nunes
Shobana Balakrishnan
Songwu Lu
Ranveer Chandra
169
2
0
24 Feb 2025
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Chenghua Huang
Lu Wang
Fangkai Yang
Pu Zhao
Hao Sun
Qingwei Lin
Dongmei Zhang
Saravan Rajmohan
Qi Zhang
OffRL
87
1
0
24 Feb 2025
PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning
Shuo Yang
Yu-Yang Liu
Jia-Yu Yao
Zhen-Hui Liu
Yu Wang
Ming Pang
Li Yuan
ALM
217
9
0
24 Feb 2025
ATEB: Evaluating and Improving Advanced NLP Tasks for Text Embedding Models
Simeng Han
Frank Palma Gomez
Tu Vu
Zefei Li
Daniel Cer
Hansi Zeng
Chris Tar
Arman Cohan
Gustavo Hernández Ábrego
120
3
0
24 Feb 2025
Spontaneous Giving and Calculated Greed in Language Models
Yuxuan Li
Hirokazu Shirado
ReLM, LRM, AI4CE
108
2
0
24 Feb 2025
Post-edits Are Preferences Too
Nathaniel Berger
Stefan Riezler
M. Exel
Matthias Huck
133
2
0
24 Feb 2025
Do LLMs Understand the Safety of Their Inputs? Training-Free Moderation via Latent Prototypes
Maciej Chrabąszcz
Filip Szatkowski
Bartosz Wójcik
Jan Dubiński
Tomasz Trzciński
Sebastian Cygert
88
0
0
22 Feb 2025