Constitutional AI: Harmlessness from AI Feedback

15 December 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, A. Chen, Anna Goldie, Azalia Mirhoseini, C. McKinnon, Carol Chen, Catherine Olsson, C. Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E. Perez, Jamie Kerr, J. Mueller, Jeff Ladish, J. Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova Dassarma, R. Lasenby, Robin Larson, Sam Ringer, Scott R. Johnston, Shauna Kravec, S. E. Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
SyDa · MoMe
arXiv: 2212.08073

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

Showing 50 of 1,202 citing papers.

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Rameswar Panda
OffRL · 23 Oct 2024

Navigating Noisy Feedback: Enhancing Reinforcement Learning with Error-Prone Language Models
Muhan Lin, Shuyang Shi, Yue (Sophie) Guo, Behdad Chalaki, Vaishnav Tadiparthi, Ehsan Moradi-Pari, Simon Stepputtis, Joseph Campbell, Katia Sycara
22 Oct 2024

Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning
Zongmeng Zhang, Yufeng Shi, Jinhua Zhu, Wengang Zhou, Xiang Qi, Peng Zhang, Haoyang Li
RALM · HILM · 22 Oct 2024

MiniPLM: Knowledge Distillation for Pre-Training Language Models
Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
22 Oct 2024

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
OffRL · ALM · 21 Oct 2024

On The Global Convergence Of Online RLHF With Neural Parametrization
Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathy, Vaneet Aggarwal
21 Oct 2024

M-RewardBench: Evaluating Reward Models in Multilingual Settings
Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, Marzieh Fadaee
LRM · 20 Oct 2024

Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning
H. Fernando, Han Shen, Parikshit Ram, Yi Zhou, Horst Samulowitz, Nathalie Baracaldo, Tianyi Chen
CLL · 20 Oct 2024

How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica
ALM · OffRL · 18 Oct 2024

Enabling Scalable Evaluation of Bias Patterns in Medical LLMs
Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti
18 Oct 2024

Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models
Chengyu Du, Jinyi Han, Yizhou Ying, Aili Chen, Qianyu He, ..., Haoran Guo, Jiaqing Liang, Zulong Chen, Liangyue Li, Yanghua Xiao
KELM · CLL · LRM · 17 Oct 2024

Anchored Alignment for Self-Explanations Enhancement
Luis Felipe Villa-Arenas, Ata Nizamoglu, Qianli Wang, Sebastian Möller, Vera Schmitt
17 Oct 2024

Retrospective Learning from Interactions
Zizhao Chen, Mustafa Omer Gul, Yiwei Chen, Gloria Geng, Anne Wu, Yoav Artzi
LRM · 17 Oct 2024

Negative-Prompt-driven Alignment for Generative Language Model
Shiqi Qiao, Ning Xv, Biao Liu, Xin Geng
ALM · SyDa · 16 Oct 2024

Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
Jared Joselowitz, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo, Nyal Patel
16 Oct 2024

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo
AAML · 15 Oct 2024

SEER: Self-Aligned Evidence Extraction for Retrieval-Augmented Generation
Xinping Zhao, Dongfang Li, Yan Zhong, Boren Hu, Yibin Chen, Baotian Hu, Min Zhang
15 Oct 2024

Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search
Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yicheng Li, Hao Chen, Fei Yu, Yin Zhang
SyDa · 14 Oct 2024

Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Bokai Hu, Sai Ashish Somayajula, Xin Pan, Zihan Huang
OffRL · 14 Oct 2024

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei-dong Zhai, Yang Cao, Zheng-jun Zha
DiffM · 14 Oct 2024

VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, L. Chen, Yazheng Yang, Benyou Wang, Dianbo Sui, Qiang Liu
VLM · ALM · 12 Oct 2024

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, J. Mei, Bingchen Zhao, Yanjie Wang, Cihang Xie
11 Oct 2024

From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating UI Operation Impacts
Zhuohao Jerry Zhang, E. Schoop, Jeffrey Nichols, Anuj Mahajan, Amanda Swearngin
LLMAG · 11 Oct 2024

RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
Peiran Wang, Xiaogeng Liu, Chaowei Xiao
AAML · 11 Oct 2024

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
11 Oct 2024

SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction
L. Yang, Zhaochen Yu, Tianze Zhang, Minkai Xu, Joseph E. Gonzalez, Tengjiao Wang, Shuicheng Yan
ELM · ReLM · LRM · 11 Oct 2024

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme
11 Oct 2024

Agents Thinking Fast and Slow: A Talker-Reasoner Architecture
Konstantina Christakopoulou, Shibl Mourad, Maja Matarić
LLMAG · 10 Oct 2024

Evolutionary Contrastive Distillation for Language Model Alignment
Julian Katz-Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi
ALM · SyDa · 10 Oct 2024

MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu, Lingyong Yan, Zihan Wang, Dawei Yin, Pengjie Ren, Maarten de Rijke, Zhaochun Ren
10 Oct 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Shenao Zhang, Zhihan Liu, Boyi Liu, Yanzhe Zhang, Yingxiang Yang, Yunxing Liu, Liyu Chen, Tao Sun, Ziyi Wang
10 Oct 2024

ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan
09 Oct 2024

Self-Boosting Large Language Models with Synthetic Preference Data
Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
SyDa · 09 Oct 2024

Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning
Xiyao Wang, Linfeng Song, Ye Tian, Dian Yu, Baolin Peng, Haitao Mi, Furong Huang, Dong Yu
LRM · 09 Oct 2024

Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, Sercan Ö. Arık
RALM · 09 Oct 2024

On the Modeling Capabilities of Large Language Models for Sequential Decision Making
Martin Klissarov, Devon Hjelm, Alexander Toshev, Bogdan Mazoure
LM&Ro · ELM · OffRL · LRM · 08 Oct 2024

Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback
Sanjiban Choudhury, Paloma Sodhi
LLMAG · 07 Oct 2024

Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
ALM · 07 Oct 2024

Rationale-Aware Answer Verification by Pairwise Self-Evaluation
Akira Kawabata, Saku Sugawara
LRM · 07 Oct 2024

Rule-based Data Selection for Large Language Models
Xiaomin Li, Mingye Gao, Zhiwei Zhang, Chang Yue, Hong Hu
07 Oct 2024

OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang, Peter Just, Krishna Narayanan, Chao Tian
06 Oct 2024

Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment
Chengfeng Dou, Y. Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao
05 Oct 2024

TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students
Hyoungwook Jin, Minju Yoo, Jeongeon Park, Yokyung Lee, Xu Wang, Juho Kim
ELM · 05 Oct 2024

Misinformation with Legal Consequences (MisLC): A New Task Towards Harnessing Societal Harm of Misinformation
Chu Fei Luo, Radin Shayanfar, R. Bhambhoria, Samuel Dahan, Xiaodan Zhu
AILaw · 04 Oct 2024

System 2 Reasoning Capabilities Are Nigh
Scott C. Lowe
VLM · LRM · 04 Oct 2024

Aligning LLMs with Individual Preferences via Interaction
Shujin Wu, May Fung, Cheng Qian, Jeonghwan Kim, Dilek Z. Hakkani-Tür, Heng Ji
04 Oct 2024

TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation
Jonathan Cook, Tim Rocktaschel, Jakob Foerster, Dennis Aumiller, Alex Wang
ALM · 04 Oct 2024

Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, S. Villata, Marco Guerini
04 Oct 2024

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee
04 Oct 2024

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, ..., Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh Chawla, Xiangliang Zhang
ELM · 03 Oct 2024