2212.08073
Cited By
Constitutional AI: Harmlessness from AI Feedback
15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
Papers citing
"Constitutional AI: Harmlessness from AI Feedback"
50 / 1,202 papers shown
Title
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
308
1,528
0
27 Jul 2023
Evaluating the Moral Beliefs Encoded in LLMs
Nino Scherrer
Claudia Shi
Amir Feder
David M. Blei
89
140
0
26 Jul 2023
Embedding Democratic Values into Social Media AIs via Societal Objective Functions
Chenyan Jia
Michelle S. Lam
Minh Chau Mai
Jeffrey T. Hancock
Michael S. Bernstein
54
32
0
26 Jul 2023
Evaluating Large Language Models for Radiology Natural Language Processing
Zheng Liu
Tianyang Zhong
Yiwei Li
Yutong Zhang
Yirong Pan
...
Shijie Zhao
Quanzheng Li
Hongtu Zhu
Dinggang Shen
Tianming Liu
LM&MA
ELM
126
6
0
25 Jul 2023
ARB: Advanced Reasoning Benchmark for Large Language Models
Tomohiro Sawada
Daniel Paleka
Alexander Havrilla
Pranav Tadepalli
Paula Vidas
Alexander Kranias
John J. Nay
Kshitij Gupta
Aran Komatsuzaki
ELM
LRM
81
39
0
25 Jul 2023
Predicting Code Coverage without Execution
Michele Tufano
Shubham Chandel
Anisha Agarwal
Neel Sundaresan
Colin B. Clement
55
10
0
25 Jul 2023
RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment
Kevin Kaichuang Yang
Dan Klein
Asli Celikyilmaz
Nanyun Peng
Yuandong Tian
ALM
108
29
0
24 Jul 2023
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Seonghyeon Ye
Doyoung Kim
Sungdong Kim
Hyeonbin Hwang
Seungone Kim
Yongrae Jo
James Thorne
Juho Kim
Minjoon Seo
ALM
134
108
0
20 Jul 2023
Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa
Shriyash Upadhyay
Etan Ginsberg
SyDa
LRM
49
0
0
20 Jul 2023
Deceptive Alignment Monitoring
Andres Carranza
Dhruv Pai
Rylan Schaeffer
Arnuv Tandon
Oluwasanmi Koyejo
76
9
0
20 Jul 2023
Can Instruction Fine-Tuned Language Models Identify Social Bias through Prompting?
O. Dige
Jacob-Junqi Tian
David B. Emerson
Faiza Khan Khattak
ALM
57
5
0
19 Jul 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Tom Lieberum
Matthew Rahtz
János Kramár
Neel Nanda
G. Irving
Rohin Shah
Vladimir Mikulik
103
115
0
18 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
510
12,128
0
18 Jul 2023
Emotional Intelligence of Large Language Models
Xuena Wang
Xueting Li
Zi Yin
Yue Wu
Tsinghua University
120
87
0
18 Jul 2023
AlpaGasus: Training A Better Alpaca with Fewer Data
Lichang Chen
Shiyang Li
Jun Yan
Hai Wang
Kalpa Gunaratna
...
Zheng Tang
Vijay Srinivasan
Dinesh Manocha
Heng-Chiao Huang
Hongxia Jin
ALM
125
0
0
17 Jul 2023
Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
Huachuan Qiu
Shuai Zhang
Anqi Li
Hongliang He
Zhenzhong Lan
ALM
94
53
0
17 Jul 2023
Large Language Models as Superpositions of Cultural Perspectives
Grgur Kovač
Masataka Sawayama
Rémy Portelas
Cédric Colas
Peter Ford Dominey
Pierre-Yves Oudeyer
LLMAG
85
37
0
15 Jul 2023
Leveraging Contextual Counterfactuals Toward Belief Calibration
Qiuyi Zhang
Michael S. Lee
Sherol Chen
65
1
0
13 Jul 2023
A Comprehensive Overview of Large Language Models
Humza Naveed
Asad Ullah Khan
Shi Qiu
Muhammad Saqib
Saeed Anwar
Muhammad Usman
Naveed Akhtar
Nick Barnes
Ajmal Mian
OffRL
257
622
0
12 Jul 2023
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang
Chen Wang
Ruohan Zhang
Yunzhu Li
Jiajun Wu
Li Fei-Fei
LM&Ro
132
519
0
12 Jul 2023
Secrets of RLHF in Large Language Models Part I: PPO
Rui Zheng
Shihan Dou
Songyang Gao
Yuan Hua
Wei Shen
...
Hang Yan
Tao Gui
Qi Zhang
Xipeng Qiu
Xuanjing Huang
ALM
OffRL
122
177
0
11 Jul 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji
Mickel Liu
Juntao Dai
Xuehai Pan
Chi Zhang
Ce Bian
Chi Zhang
Ruiyang Sun
Yizhou Wang
Yaodong Yang
ALM
98
506
0
10 Jul 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety
Markus Anderljung
Joslyn Barnhart
Anton Korinek
Jade Leung
Cullen O'Keefe
...
Jonas Schuett
Yonadav Shavit
Divya Siddarth
Robert F. Trager
Kevin J. Wolf
SILM
145
125
0
06 Jul 2023
Style Over Substance: Evaluation Biases for Large Language Models
Minghao Wu
Alham Fikri Aji
ALM
ELM
143
47
0
06 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
236
1,005
0
05 Jul 2023
Causal Discovery with Language Models as Imperfect Experts
Stephanie Long
Alexandre Piché
Valentina Zantedeschi
Tibor Schuster
Alexandre Drouin
CML
115
40
0
05 Jul 2023
SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions
Sameera Horawalavithana
Sai Munikoti
Ian Stewart
Henry Kvinge
MLLM
93
19
0
03 Jul 2023
Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
Teun van der Weij
Simon Lermen
Leon Lang
LLMAG
67
4
0
03 Jul 2023
BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer
Z. Li
Shitou Zhang
Hai Zhao
Yifei Yang
Dongjie Yang
LM&MA
116
17
0
01 Jul 2023
Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
Beatriz Borges
Niket Tandon
Tanja Käser
Antoine Bosselut
142
4
0
01 Jul 2023
Stay on topic with Classifier-Free Guidance
Guillaume Sanchez
Honglu Fan
Alexander Spangher
Elad Levi
Pawan Sasanka Ammanamanchi
Stella Biderman
3DV
105
55
0
30 Jun 2023
Preference Ranking Optimization for Human Alignment
Feifan Song
Yu Bowen
Minghao Li
Haiyang Yu
Fei Huang
Yongbin Li
Houfeng Wang
ALM
86
272
0
30 Jun 2023
On the Exploitability of Instruction Tuning
Manli Shu
Jiong Wang
Chen Zhu
Jonas Geiping
Chaowei Xiao
Tom Goldstein
SILM
141
99
0
28 Jun 2023
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Esin Durmus
Karina Nguyen
Thomas I. Liao
Nicholas Schiefer
Amanda Askell
...
Alex Tamkin
Janel Thamkul
Jared Kaplan
Jack Clark
Deep Ganguli
147
245
0
28 Jun 2023
Next Steps for Human-Centered Generative AI: A Technical Perspective
Xiang 'Anthony' Chen
Jeff Burke
Andrea Colaço
Matthew K. Hong
Jennifer Jacobs
...
Dingzeyu Li
Nanyun Peng
Karl D. D. Willis
Chien-Sheng Wu
Bolei Zhou
LLMAG
91
35
0
27 Jun 2023
Visual Adversarial Examples Jailbreak Aligned Large Language Models
Xiangyu Qi
Kaixuan Huang
Ashwinee Panda
Peter Henderson
Mengdi Wang
Prateek Mittal
AAML
118
172
0
22 Jun 2023
Apolitical Intelligence? Auditing Delphi's responses on controversial political issues in the US
J. H. Rystrøm
53
0
0
22 Jun 2023
Towards Regulatable AI Systems: Technical Gaps and Policy Opportunities
Xudong Shen
H. Brown
Jiashu Tao
Martin Strobel
Yao Tong
Akshay Narayan
Harold Soh
Finale Doshi-Velez
98
3
0
22 Jun 2023
Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning
Huiguo He
Tianfu Wang
Huan Yang
Jianlong Fu
N. Yuan
Jian Yin
Hongyang Chao
Qi Zhang
EGVM
152
10
0
20 Jun 2023
TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models
Yue Huang
Qihui Zhang
Philip S. Y
Lichao Sun
68
54
0
20 Jun 2023
CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation
Xiwen Liang
Liang Ma
Shanshan Guo
Jianhua Han
Hang Xu
Shikui Ma
Xiaodan Liang
LM&Ro
LLMAG
159
4
0
17 Jun 2023
Aligning Synthetic Medical Images with Clinical Knowledge using Human Feedback
Shenghuan Sun
Gregory M. Goldgof
A. Butte
Ahmed Alaa
MedIm
67
14
0
16 Jun 2023
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Stephen Casper
Jason Lin
Joe Kwon
Gatlen Culp
Dylan Hadfield-Menell
AAML
60
99
0
15 Jun 2023
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng
Rongpin Wang
Xinrun Wang
Bo An
LLMAG
101
73
0
13 Jun 2023
Evaluating the Social Impact of Generative AI Systems in Systems and Society
Irene Solaiman
Zeerak Talat
William Agnew
Lama Ahmad
Dylan K. Baker
...
Marie-Therese Png
Shubham Singh
A. Strait
Lukas Struppek
Arjun Subramonian
ELM
EGVM
139
117
0
09 Jun 2023
Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation
Zhouhong Gu
Xiaoxuan Zhu
Haoning Ye
Lin Zhang
Jianchen Wang
...
Zili Wang
Shusen Wang
Weiguo Zheng
Hongwei Feng
Yanghua Xiao
ALM
ELM
148
60
0
09 Jun 2023
Improving Open Language Models by Learning from Organic Interactions
Jing Xu
Da Ju
Joshua Lane
M. Komeili
Eric Michael Smith
...
Rashel Moritz
Sainbayar Sukhbaatar
Y-Lan Boureau
Jason Weston
Kurt Shuster
66
9
0
07 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Alexandre Ramé
Guillaume Couairon
Mustafa Shukor
Corentin Dancette
Jean-Baptiste Gaya
Laure Soulier
Matthieu Cord
MoMe
120
157
0
07 Jun 2023
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li
Oam Patel
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
KELM
HILM
160
584
0
06 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang
Xin Li
Lidong Bing
MLLM
209
1,067
0
05 Jun 2023