Constitutional AI: Harmlessness from AI Feedback (arXiv 2212.08073)
15 December 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeff Ladish, Joshua Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott R. Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan

Papers citing "Constitutional AI: Harmlessness from AI Feedback"
Showing 50 of 1,202 citing papers.

Online Iterative Reinforcement Learning from Human Feedback with General Preference Model
Chen Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang (OffRL). 11 Feb 2024.

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, T. Sumers, Ishita Dasgupta, Thomas Griffiths (LLMAG). 11 Feb 2024.

OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning
Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen (ALM, FedML, AIFin). 10 Feb 2024.

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt (KELM). 09 Feb 2024.

V-STaR: Training Verifiers for Self-Taught Reasoners
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Rameswar Panda, Alessandro Sordoni, Rishabh Agarwal (ReLM, LRM). 09 Feb 2024.

Fight Back Against Jailbreaking via Prompt Adversarial Tuning
Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang (AAML, SILM). 09 Feb 2024.

Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen. 08 Feb 2024.

Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia
Guangyu Shen, Shuyang Cheng, Kai-xian Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang. 08 Feb 2024.

Permute-and-Flip: An optimally stable and watermarkable decoder for LLMs
Xuandong Zhao, Lei Li, Yu-Xiang Wang. 08 Feb 2024.

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson (AAML). 07 Feb 2024.

A Roadmap to Pluralistic Alignment
Taylor Sorensen, Jared Moore, Jillian R. Fisher, Mitchell L. Gordon, Niloofar Mireshghallah, ..., Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, Yejin Choi. 07 Feb 2024.

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning
Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion (ALM). 07 Feb 2024.

Direct Language Model Alignment from Online AI Feedback
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, ..., Thomas Mesnard, Yao-Min Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel (ALM). 07 Feb 2024.

The Future of Cognitive Strategy-enhanced Persuasive Dialogue Agents: New Perspectives and Trends
Mengqi Chen, Bin Guo, Hao Wang, Haoyu Li, Qian Zhao, Jingqi Liu, Yasan Ding, Yan Pan, Zhiwen Yu (LLMAG). 07 Feb 2024.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks (AAML). 06 Feb 2024.

"Task Success" is not Enough: Investigating the Use of Video-Language Models as Behavior Critics for Catching Undesirable Agent Behaviors
L. Guan, Yifan Zhou, Denis Liu, Yantian Zha, H. B. Amor, Subbarao Kambhampati (LM&Ro). 06 Feb 2024.

Measuring Implicit Bias in Explicitly Unbiased Large Language Models
Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas Griffiths. 06 Feb 2024.

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback
Yufei Wang, Zhanyi Sun, Jesse Zhang, Zhou Xian, Erdem Biyik, David Held, Zackory M. Erickson (VLM). 06 Feb 2024.

Nevermind: Instruction Override and Moderation in Large Language Models
Edward Kim (ALM). 05 Feb 2024.

DeAL: Decoding-time Alignment for Large Language Models
James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-An Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth. 05 Feb 2024.

Position: What Can Large Language Models Tell Us about Time Series Analysis
Ming Jin, Yifan Zhang, Wei Chen, Kexin Zhang, Yuxuan Liang, Bin Yang, Jindong Wang, Shirui Pan, Qingsong Wen (AI4TS). 05 Feb 2024.

Aligner: Efficient Alignment by Learning to Correct
Jiaming Ji, Boyuan Chen, Hantao Lou, Chongye Guo, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang. 04 Feb 2024.

Jailbreaking Attack against Multimodal Large Language Model
Zhenxing Niu, Haoxuan Ji, Xinbo Gao, Gang Hua, Rong Jin. 04 Feb 2024.

Calibration and Correctness of Language Models for Code
Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Amin Alipour, Susmit Jha, Prem Devanbu, Toufique Ahmed. 03 Feb 2024.

Panacea: Pareto Alignment via Preference Adaptation for LLMs
Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, Yaodong Yang. 03 Feb 2024.

The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models
M. Pternea, Prerna Singh, Abir Chakraborty, Y. Oruganti, M. Milletarí, Sayli Bapat, Kebei Jiang (OffRL). 02 Feb 2024.

Foundation Model Sherpas: Guiding Foundation Models through Knowledge and Reasoning
D. Bhattacharjya, Junkyu Lee, Don Joven Agravante, Balaji Ganesan, Radu Marinescu (LLMAG). 02 Feb 2024.

TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution
Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, Yongfeng Zhang (LLMAG). 02 Feb 2024.

Human-like Category Learning by Injecting Ecological Priors from Large Language Models into Neural Networks
Akshay K. Jagadish, Julian Coda-Forno, Mirko Thalmann, Eric Schulz, Marcel Binz. 02 Feb 2024.

Developing and Evaluating a Design Method for Positive Artificial Intelligence
Willem van der Maden, Derek Lomas, P. Hekkert. 02 Feb 2024.

Rethinking the Role of Proxy Rewards in Language Model Alignment
Sungdong Kim, Minjoon Seo (SyDa, ALM). 02 Feb 2024.

Dense Reward for Free in Reinforcement Learning from Human Feedback
Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar. 01 Feb 2024.

Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing
Fangkai Jiao, Chengwei Qin, Zhengyuan Liu, Nancy F. Chen, Shafiq Joty (LRM). 01 Feb 2024.

Computational Experiments Meet Large Language Model Based Agents: A Survey and Perspective
Qun Ma, Xiao Xue, Deyu Zhou, Xiangning Yu, Donghua Liu, ..., Yifan Shen, Peilin Ji, Juanjuan Li, Gang Wang, Wanpeng Ma (AI4CE, LM&Ro, LLMAG). 01 Feb 2024.

On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng (AAML). 31 Jan 2024.

Biospheric AI
Marcin Korecki. 31 Jan 2024.

Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Y. Wang. 30 Jan 2024.

MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
Xiaolong Jin, Zhuo Zhang, Xiangyu Zhang. 25 Jan 2024.

TPD: Enhancing Student Language Model Reasoning via Principle Discovery and Guidance
Haorui Wang, Rongzhi Zhang, Yinghao Li, Lingkai Kong, Yuchen Zhuang, Xiusi Chen, Chao Zhang (LRM). 24 Jan 2024.

ULTRA: Unleash LLMs' Potential for Event Argument Extraction through Hierarchical Modeling and Pair-wise Self-Refinement
Xinliang Frederick Zhang, Carter Blum, Temma Choji, Shalin S Shah, Alakananda Vempala. 24 Jan 2024.

Visibility into AI Agents
Alan Chan, Carson Ezell, Max Kaufmann, K. Wei, Lewis Hammond, ..., Nitarshan Rajkumar, David M. Krueger, Noam Kolt, Lennart Heim, Markus Anderljung. 23 Jan 2024.

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi. 23 Jan 2024.

DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom, Axel Feldmann, Aleksander Madry (OODD). 23 Jan 2024.

GRATH: Gradual Self-Truthifying for Large Language Models
Weixin Chen, Basel Alomair, Yue Liu (HILM, SyDa). 22 Jan 2024.

WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret. 22 Jan 2024.

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, ..., Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin. 21 Jan 2024.

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Adib Hasan, Ileana Rugina, Alex Wang (AAML). 19 Jan 2024.

Self-Rewarding Language Models
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston (ReLM, SyDa, ALM, LRM). 18 Jan 2024.

Aligning Large Language Models with Counterfactual DPO
Bradley Butcher (ALM). 17 Jan 2024.

Canvil: Designerly Adaptation for LLM-Powered User Experiences
K. J. Kevin Feng, Q. V. Liao, Ziang Xiao, Jennifer Wortman Vaughan, Amy X. Zhang, David W. McDonald. 17 Jan 2024.