ResearchTrend.AI
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

12 April 2022
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
Nova Dassarma
Dawn Drain
Stanislav Fort
Deep Ganguli
T. Henighan
Nicholas Joseph
Saurav Kadavath
John Kernion
Tom Conerly
S. E. Showk
Nelson Elhage
Zac Hatfield-Dodds
Danny Hernandez
Tristan Hume
Scott R. Johnston
Shauna Kravec
Liane Lovitt
Neel Nanda
Catherine Olsson
Dario Amodei
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
arXiv:2204.05862

Papers citing "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"

50 / 654 papers shown
Title
Exploring Empty Spaces: Human-in-the-Loop Data Augmentation
Catherine Yeh
Donghao Ren
Yannick Assogba
Dominik Moritz
Fred Hohman
105
0
0
01 Oct 2024
The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi
Runlong Zhou
Simon S. Du
128
11
0
29 Sep 2024
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng
Chi Zhang
Zilingfeng Ye
Xibin Wu
Wang Zhang
Ru Zhang
Size Zheng
Haibin Lin
Chuan Wu
AI4CE
238
240
0
28 Sep 2024
AI Delegates with a Dual Focus: Ensuring Privacy and Strategic Self-Disclosure
Xi Chen
Zhiyang Zhang
Fangkai Yang
Xiaoting Qin
Chao Du
Xi Cheng
Hangxin Liu
Qingwei Lin
Saravan Rajmohan
Dongmei Zhang
39
1
0
26 Sep 2024
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
Kaden Uhlig
Joern Wuebker
Raphael Reinauer
John DeNero
109
0
0
26 Sep 2024
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki
Boyi Wei
Yangsibo Huang
Peter Henderson
F. Tramèr
Javier Rando
MU, AAML
209
53
0
26 Sep 2024
PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs
Jiahao Yu
Yangguang Shao
Hanwen Miao
Junzheng Shi
SILM, AAML
174
11
0
23 Sep 2024
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu
Wei Xiong
Jie Jessie Ren
Lichang Chen
Junru Wu
...
Yuan Liu
Bilal Piot
Abe Ittycheriah
Aviral Kumar
Mohammad Saleh
AAML
97
23
0
20 Sep 2024
CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair
Mingjie Liu
Yun-Da Tsai
Wenfei Zhou
Haoxing Ren
SyDa, 3DV
122
17
0
19 Sep 2024
From Lists to Emojis: How Format Bias Affects Model Alignment
Xuanchang Zhang
Wei Xiong
Lichang Chen
Dinesh Manocha
Heng Huang
Tong Zhang
ALM
110
13
0
18 Sep 2024
REAL: Response Embedding-based Alignment for LLMs
Honggen Zhang
Xufeng Zhao
Igor Molybog
June Zhang
87
2
0
17 Sep 2024
Your Weak LLM is Secretly a Strong Teacher for Alignment
Leitian Tao
Yixuan Li
153
9
0
13 Sep 2024
Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data
Atilla Akkus
Mingjie Li
Junjie Chu
Michael Backes
Sinem Sav
SILM, SyDa
130
4
0
12 Sep 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
257
32
0
10 Sep 2024
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
Yong Lin
Skyler Seto
Maartje ter Hoeve
Katherine Metcalf
B. Theobald
Xuan Wang
Yizhe Zhang
Chen Huang
Tong Zhang
107
15
0
05 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM, AAML
135
2
0
05 Sep 2024
From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning
Wei Chen
Zhen Huang
Liang Xie
Binbin Lin
Houqiang Li
...
Deng Cai
Yonggang Zhang
Wenxiao Wang
Xu Shen
Jieping Ye
152
10
0
03 Sep 2024
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An
Sicheng Zhu
Ruiyi Zhang
Michael-Andrei Panaitescu-Liess
Yuancheng Xu
Furong Huang
AAML
140
18
0
01 Sep 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
Tong Nie
Junlin He
Yuewen Mei
Guoyang Qin
Guilong Li
Jian Sun
Wei Ma
139
4
0
30 Aug 2024
HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications
Rishi Kalra
Zekun Wu
Ayesha Gulley
Airlie Hilliard
Xin Guan
Adriano Soares Koshiyama
Philip C. Treleaven
RALM, AILaw
112
7
0
29 Aug 2024
ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang
Zhihui Wang
Siqi Yin
Guangxiao Ma
Peng Zhang
Boxi Wu
DiffM
149
0
0
28 Aug 2024
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang
Philip Torr
Mohamed Elhoseiny
Adel Bibi
207
15
0
27 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM, ELM
199
32
0
23 Aug 2024
RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Chenglong Wang
Yang Gan
Yifu Huo
Yongyu Mu
Murun Yang
...
Chunliang Zhang
Tongran Liu
Quan Du
Di Yang
Jingbo Zhu
VLM
175
6
0
22 Aug 2024
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
ALM
134
8
0
21 Aug 2024
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Shachar Don-Yehiya
Leshem Choshen
Omri Abend
78
2
0
15 Aug 2024
Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization
Yuxin Jiang
Bo Huang
Yufei Wang
Xingshan Zeng
Liangyou Li
Yasheng Wang
Xin Jiang
Lifeng Shang
Ruiming Tang
Wei Wang
131
7
0
14 Aug 2024
Natural Language Outlines for Code: Literate Programming in the LLM Era
Kensen Shi
Deniz Altınbüken
Saswat Anand
Mihai Christodorescu
Katja Grünwedel
...
Tobias Welp
Pengcheng Yin
Manzil Zaheer
Satish Chandra
Charles Sutton
157
7
0
09 Aug 2024
Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models
Zi Liang
Haibo Hu
Qingqing Ye
Yaxin Xiao
Haoyang Li
AAML, ELM, SILM
146
9
0
05 Aug 2024
A Survey on Self-play Methods in Reinforcement Learning
Chao Yu
Zelai Xu
Chengdong Ma
Chao Yu
Weijuan Tu
...
Deheng Ye
Wenbo Ding
Yaodong Yang
Yu Wang
SyDa, SSL, OnRL
185
9
0
02 Aug 2024
Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
...
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML, MU
133
63
0
01 Aug 2024
ShieldGemma: Generative AI Content Moderation Based on Gemma
Wenjun Zeng
Yuchi Liu
Ryan Mullins
Ludovic Peran
Joe Fernandez
...
Drew Proud
Piyush Kumar
Bhaktipriya Radharapu
Olivia Sturman
O. Wahltinez
AI4MH
115
49
0
31 Jul 2024
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son
William Bankes
Sayak Ray Chowdhury
Brooks Paige
Ilija Bogunovic
131
4
0
26 Jul 2024
Ontology of Belief Diversity: A Community-Based Epistemological Approach
Tyler Fischella
Erin van Liemt
Qiuyi Zhang
51
0
0
25 Jul 2024
Towards Aligning Language Models with Textual Feedback
Sauc Abadal Lloret
Shehzaad Dhuliawala
K. Murugesan
Mrinmaya Sachan
VLM
120
1
0
24 Jul 2024
Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
Arun Verma
Zhongxiang Dai
Xiaoqiang Lin
Patrick Jaillet
K. H. Low
193
6
0
24 Jul 2024
A Survey on Employing Large Language Models for Text-to-SQL Tasks
Liang Shi
Zhengju Tang
Nan Zhang
Xiaotong Zhang
Zhi Yang
212
32
0
21 Jul 2024
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko
Nicolas Flammarion
147
36
0
16 Jul 2024
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
Wenxiang Jiao
Wenxuan Wang
Jen-tse Huang
Jiahao Xu
Tian Liang
Pinjia He
Zhaopeng Tu
120
32
0
12 Jul 2024
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang
Yang Yue
Rui Lu
Jingxin Shi
Andrew Zhao
Shenzhi Wang
Shiji Song
Gao Huang
LM&Ro, KELM
143
0
0
11 Jul 2024
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song
Yuheng Huang
Zhehua Zhou
Lei Ma
112
13
0
10 Jul 2024
Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment
Qizhang Feng
Siva Rajesh Kasa
Santhosh Kumar Kasa
Hyokun Yun
C. Teo
S. Bodapati
149
8
0
08 Jul 2024
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu
Xiaoyu Shen
Yuhang Lai
Siyuan Wang
Shengbin Yue
Zengfeng Huang
Xuanjing Huang
Zhongyu Wei
124
1
0
04 Jul 2024
Aligning Human Motion Generation with Human Perceptions
Haoru Wang
Wentao Zhu
Luyi Miao
Yishu Xu
Feng Gao
Qi Tian
Yizhou Wang
EGVM
137
4
0
02 Jul 2024
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning
Yifang Chen
Shuohang Wang
Ziyi Yang
Hiteshi Sharma
Nikos Karampatziakis
Donghan Yu
Kevin Jamieson
Simon Shaolei Du
Yelong Shen
OffRL
102
5
0
02 Jul 2024
To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models
Bozhong Tian
Xiaozhuan Liang
Siyuan Cheng
Qingbin Liu
Mengru Wang
Dianbo Sui
Xi Chen
Huajun Chen
Xin Xu
MU
89
14
0
02 Jul 2024
Badllama 3: removing safety finetuning from Llama 3 in minutes
Dmitrii Volkov
51
5
0
01 Jul 2024
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Yuheng Zhang
Dian Yu
Baolin Peng
Linfeng Song
Ye Tian
Mingyue Huo
Nan Jiang
Haitao Mi
Dong Yu
230
18
0
30 Jun 2024
When Search Engine Services meet Large Language Models: Visions and Challenges
Haoyi Xiong
Jiang Bian
Yuchen Li
Xuhong Li
Jundong Li
Shuaiqiang Wang
D. Yin
Sumi Helal
141
36
0
28 Jun 2024
PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
Linqing Chen
Weilei Wang
Zilong Bai
Peng Xu
Yan Fang
...
Lisha Zhang
Fu Bian
Zhongkai Ye
Lidong Pei
Changyang Tu
AI4MH, LM&MA
107
3
0
26 Jun 2024