Constitutional AI: Harmlessness from AI Feedback
arXiv:2212.08073 · 15 December 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, A. Chen, Anna Goldie, Azalia Mirhoseini, C. McKinnon, Carol Chen, Catherine Olsson, C. Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E. Perez, Jamie Kerr, J. Mueller, Jeff Ladish, J. Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova Dassarma, R. Lasenby, Robin Larson, Sam Ringer, Scott R. Johnston, Shauna Kravec, S. E. Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
Tags: SyDa, MoMe
Papers citing "Constitutional AI: Harmlessness from AI Feedback" (50 of 1,202 shown)
- Multi-objective Reinforcement learning from AI Feedback. Marcus Williams. 11 Jun 2024
- Merging Improves Self-Critique Against Jailbreak Attacks. Victor Gallego. (AAML, MoMe) 11 Jun 2024
- Teaching Language Models to Self-Improve by Learning from Language Feedback. Chi Hu, Yimin Hu, Hang Cao, Tong Xiao, Jingbo Zhu. (LRM, VLM) 11 Jun 2024
- Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies. Junlin Wang, Siddhartha Jain, Dejiao Zhang, Baishakhi Ray, Varun Kumar, Ben Athiwaratkun. 10 Jun 2024
- Annotation alignment: Comparing LLM and human annotations of conversational safety. Rajiv Movva, Pang Wei Koh, Emma Pierson. (ALM) 10 Jun 2024
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep. Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson. 10 Jun 2024
- Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching. Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng. (CLL, KELM) 10 Jun 2024
- Language Models Resist Alignment: Evidence From Data Compression. Yalan Qin, Kaile Wang, Tianyi Qiu, Boyuan Chen, Jiayi Zhou, Changye Li, Hantao Lou, Juntao Dai, Y. Liu, Yaodong Yang. 10 Jun 2024
- Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples. Fangxu Yu, Lai Jiang, Haoqiang Kang, Shibo Hao, Lianhui Qin. (LRM, AI4CE) 09 Jun 2024
- Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas. Chengyuan Deng, Yiqun Duan, Xin Jin, Heng Chang, Yijun Tian, ..., Kuofeng Gao, Sihong He, Jun Zhuang, Lu Cheng, Haohan Wang. (AILaw) 08 Jun 2024
- FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models. Guangyi Liu, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen. (FedML) 07 Jun 2024
- Learning Task Decomposition to Assist Humans in Competitive Programming. Jiaxin Wen, Ruiqi Zhong, Pei Ke, Zhihong Shao, Hongning Wang, Minlie Huang. (ReLM) 07 Jun 2024
- Proofread: Fixes All Errors with One Tap. Renjie Liu, Yanxiang Zhang, Yun Zhu, Haicheng Sun, Yuanbo Zhang, Michael Xuelin Huang, Shanqing Cai, Lei Meng, Shumin Zhai. (ALM) 06 Jun 2024
- Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models. Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie. (OffRL) 06 Jun 2024
- Open-Endedness is Essential for Artificial Superhuman Intelligence. Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal M. P. Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktaschel. (LRM) 06 Jun 2024
- Prototypical Reward Network for Data-Efficient RLHF. Jinghan Zhang, Xiting Wang, Yiqiao Jin, Changyu Chen, Xinhao Zhang, Kunpeng Liu. (ALM) 06 Jun 2024
- Aligning Large Language Models via Fine-grained Supervision. Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak, Jaeyoung Do. 04 Jun 2024
- Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller. Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu. 04 Jun 2024
- Process-Driven Autoformalization in Lean 4. Jianqiao Lu, Zhengying Liu, Yingjia Wan, Yinya Huang, Haiming Wang, Zhicheng YANG, Jing Tang, Zhijiang Guo. (AI4CE) 04 Jun 2024
- Dishonesty in Helpful and Harmless Alignment. Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn. (LLMSV) 04 Jun 2024
- The Life Cycle of Large Language Models: A Review of Biases in Education. Jinsook Lee, Yann Hicke, Renzhe Yu, Christopher A. Brooks, René F. Kizilcec. (AI4Ed) 03 Jun 2024
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, ..., Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen. (LRM, ELM) 03 Jun 2024
- Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function. Keyon Vafa, Ashesh Rambachan, S. Mullainathan. (ELM, ALM) 03 Jun 2024
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang. (LRM) 03 Jun 2024
- Inverse Constitutional AI: Compressing Preferences into Principles. Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins. (SyDa) 02 Jun 2024
- Aligning Language Models with Demonstrated Feedback. Omar Shaikh, Michelle S. Lam, Joey Hejna, Yijia Shao, Michael S. Bernstein, Diyi Yang. (ALM) 02 Jun 2024
- A Survey on Large Language Models for Code Generation. Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim. 01 Jun 2024
- On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots. Christine Herlihy, Jennifer Neville, Tobias Schnabel, Adith Swaminathan. 01 Jun 2024
- Standards for Belief Representations in LLMs. Daniel A. Herrmann, B. Levinstein. 31 May 2024
- Group Robust Preference Optimization in Reward-free RLHF. Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou-Ammar, Ilija Bogunovic. 30 May 2024
- TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models. Chen Zhang, Chengguang Tang, Dading Chong, Ke Shi, Guohua Tang, Feng Jiang, Haizhou Li. 30 May 2024
- Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding. Kuo Liao, Shuang Li, Meng Zhao, Liqun Liu, Mengge Xue, Zhenyu Hu, Honglin Han, Chengguo Yin. 30 May 2024
- X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions. Chong Li, Wen Yang, Jiajun Zhang, Jinliang Lu, Shaonan Wang, Chengqing Zong. 30 May 2024
- Is In-Context Learning Sufficient for Instruction Following in LLMs? Hao Zhao, Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion. 30 May 2024
- AI Risk Management Should Incorporate Both Safety and Security. Xiangyu Qi, Yangsibo Huang, Yi Zeng, Edoardo Debenedetti, Jonas Geiping, ..., Chaowei Xiao, Yue Liu, Dawn Song, Peter Henderson, Prateek Mittal. (AAML) 29 May 2024
- Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models. Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao. (ALM) 29 May 2024
- Offline Regularised Reinforcement Learning for Large Language Models Alignment. Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, M. G. Azar, ..., Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, Bilal Piot. (OffRL) 29 May 2024
- Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets. Peter Devine. (ALM) 29 May 2024
- Efficient Model-agnostic Alignment via Bayesian Persuasion. Fengshuo Bai, Mingzhi Wang, Zhaowei Zhang, Boyuan Chen, Yinda Xu, Ying Wen, Yaodong Yang. 29 May 2024
- A Theoretical Understanding of Self-Correction through In-context Alignment. Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang. (LRM) 28 May 2024
28 May 2024
Aligning to Thousands of Preferences via System Message Generalization
Seongyun Lee
Sue Hyun Park
Seungone Kim
Minjoon Seo
ALM
113
49
0
28 May 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li
Yiwen Guo
Wangmeng Zuo
Hao Chen
AAML
SILM
85
7
0
28 May 2024
The Impossibility of Fair LLMs
Jacy Reese Anthis
Kristian Lum
Michael Ekstrand
Avi Feller
Alexander D’Amour
FaML
130
14
0
28 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
158
20
0
28 May 2024
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Sheng-Hsuan Peng
Pin-Yu Chen
Matthew Hull
Duen Horng Chau
102
30
0
27 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu
Yu-Lin Tsai
Chih-Hsun Lin
Pin-Yu Chen
Chia-Mu Yu
Chun-ying Huang
143
56
0
27 May 2024
CHESS: Contextual Harnessing for Efficient SQL Synthesis
Shayan Talaei
Mohammadreza Pourreza
Yu-Chen Chang
Azalia Mirhoseini
Amin Saberi
85
76
0
27 May 2024
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity
Shanghaoran Quan
79
4
0
26 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
138
24
0
25 May 2024
Bayesian WeakS-to-Strong from Text Classification to Generation
Ziyun Cui
Ziyang Zhang
Wen Wu
Wen Wu
Chao Zhang
127
3
0
24 May 2024