arXiv:2212.08073
Constitutional AI: Harmlessness from AI Feedback
15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova DasSarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
Papers citing "Constitutional AI: Harmlessness from AI Feedback" (showing 50 of 1,202)
Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability
Simon Lermen
Ondřej Kvapil
ELM
AAML
44
3
0
26 Nov 2023
Large Language Models as Automated Aligners for benchmarking Vision-Language Models
Yuanfeng Ji
Chongjian Ge
Weikai Kong
Enze Xie
Zhengying Liu
Zhengguo Li
Ping Luo
MLLM
ELM
91
7
0
24 Nov 2023
Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language
Di Jin
Shikib Mehri
Devamanyu Hazarika
Aishwarya Padmakumar
Sungjin Lee
Yang Liu
Mahdi Namazifar
ALM
80
17
0
24 Nov 2023
Scalable AI Safety via Doubly-Efficient Debate
Jonah Brown-Cohen
Geoffrey Irving
Georgios Piliouras
67
18
0
23 Nov 2023
Data Diversity Matters for Robust Instruction Tuning
Alexander Bukharin
Tuo Zhao
164
44
0
21 Nov 2023
Diffusion Model Alignment Using Direct Preference Optimization
Bram Wallace
Meihua Dang
Rafael Rafailov
Linqi Zhou
Aaron Lou
Senthil Purushwalkam
Stefano Ermon
Caiming Xiong
Shafiq Joty
Nikhil Naik
EGVM
159
288
0
21 Nov 2023
Applications of Large Scale Foundation Models for Autonomous Driving
Yu Huang
Yue Chen
Zhu Li
ELM
AI4CE
LRM
ALM
LM&Ro
139
16
0
20 Nov 2023
FinanceBench: A New Benchmark for Financial Question Answering
Pranab Islam
Anand Kannappan
Douwe Kiela
Rebecca Qian
Nino Scherrer
Bertie Vidgen
RALM
65
92
0
20 Nov 2023
System 2 Attention (is something you might need too)
Jason Weston
Sainbayar Sukhbaatar
RALM
OffRL
LRM
93
65
0
20 Nov 2023
Case Repositories: Towards Case-Based Reasoning for AI Alignment
K. J. Kevin Feng
Quan Ze Chen
Inyoung Cheong
King Xia
Amy X. Zhang
72
9
0
18 Nov 2023
Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge
Genglin Liu
Xingyao Wang
Lifan Yuan
Yangyi Chen
Hao Peng
92
19
0
16 Nov 2023
Trustworthy Large Models in Vision: A Survey
Ziyan Guo
Li Xu
Jun Liu
MU
126
0
0
16 Nov 2023
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
Jiong Wang
Junlin Wu
Muhao Chen
Yevgeniy Vorobeychik
Chaowei Xiao
AAML
94
15
0
16 Nov 2023
JAB: Joint Adversarial Prompting and Belief Augmentation
Ninareh Mehrabi
Palash Goyal
Anil Ramakrishna
Jwala Dhamala
Shalini Ghosh
Richard Zemel
Kai-Wei Chang
Aram Galstyan
Rahul Gupta
AAML
65
8
0
16 Nov 2023
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao
Bochuan Cao
Jinghui Chen
92
28
0
15 Nov 2023
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
Lingbo Mo
Boshi Wang
Muhao Chen
Huan Sun
82
29
0
15 Nov 2023
When does In-context Learning Fall Short and Why? A Study on Specification-Heavy Tasks
Hao Peng
Xiaozhi Wang
Jianhui Chen
Weikai Li
Yunjia Qi
...
Zhili Wu
Kaisheng Zeng
Bin Xu
Lei Hou
Juanzi Li
90
33
0
15 Nov 2023
An Empathetic User-Centric Chatbot for Emotional Support
Yanting Pan
Yixuan Tang
Yuchen Niu
28
4
0
15 Nov 2023
Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values
Jing Yao
Xiaoyuan Yi
Xiting Wang
Yifan Gong
Xing Xie
126
29
0
15 Nov 2023
Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment
Philippe Laban
Lidiya Murakhovs'ka
Caiming Xiong
Chien-Sheng Wu
LRM
93
23
0
14 Nov 2023
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
Bhaktipriya Radharapu
Kevin Robinson
Lora Aroyo
Preethi Lahoti
106
41
0
14 Nov 2023
LLMs cannot find reasoning errors, but can correct them given the error location
Gladys Tyen
Hassan Mansoor
Victor Carbune
Peter Chen
Tony Mak
LRM
146
79
0
14 Nov 2023
Functionality learning through specification instructions
Pedro Henrique Luz de Araujo
Benjamin Roth
ELM
76
0
0
14 Nov 2023
Extrinsically-Focused Evaluation of Omissions in Medical Summarization
Elliot Schumacher
Daniel Rosenthal
Varun Nair
Luladay Price
Geoffrey Tso
Anitha Kannan
44
2
0
14 Nov 2023
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning
Ruixin Hong
Hongming Zhang
Xinyu Pang
Dong Yu
Changshui Zhang
LRM
93
27
0
14 Nov 2023
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge
Chunting Zhou
Rui Hou
Madian Khabsa
Yi-Chia Wang
Qifan Wang
Jiawei Han
Yuning Mao
AAML
LRM
88
104
0
13 Nov 2023
A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models
He Cao
Zhenwei An
Jiazhan Feng
Kun Xu
Liwei Chen
Dongyan Zhao
HILM
43
2
0
13 Nov 2023
Past as a Guide: Leveraging Retrospective Learning for Python Code Completion
Seunggyoon Shin
Seunggyu Chang
Sungjoon Choi
KELM
59
1
0
13 Nov 2023
Flames: Benchmarking Value Alignment of LLMs in Chinese
Kexin Huang
Xiangyang Liu
Qianyu Guo
Tianxiang Sun
Jiawei Sun
...
Yixu Wang
Yan Teng
Xipeng Qiu
Yingchun Wang
Dahua Lin
ALM
190
17
0
12 Nov 2023
Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang
Yan Teng
Kexin Huang
Chengqi Lyu
Songyang Zhang
Wenwei Zhang
Xingjun Ma
Yu-Gang Jiang
Yu Qiao
Yingchun Wang
64
23
0
10 Nov 2023
Hallucination-minimized Data-to-answer Framework for Financial Decision-makers
Sohini Roychowdhury
Andres Alvarez
Brian Moore
Marko Krema
Maria Paz Gelpi
...
Angel Rodriguez
Jose Ramon Cabrejas
Pablo Martinez Serrano
Punit Agrawal
Arijit Mukherjee
68
9
0
09 Nov 2023
Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Jiale Cheng
Xiao Liu
Kehan Zheng
Pei Ke
Hongning Wang
Yuxiao Dong
Jie Tang
Minlie Huang
77
88
0
07 Nov 2023
Can LLMs Follow Simple Rules?
Norman Mu
Sarah Chen
Zifan Wang
Sizhe Chen
David Karamardian
Lulwa Aljeraisy
Basel Alomair
Dan Hendrycks
David Wagner
ALM
91
32
0
06 Nov 2023
LLMs grasp morality in concept
Mark Pock
Andre Ye
Jared Moore
FaML
76
2
0
04 Nov 2023
Conditions on Preference Relations that Guarantee the Existence of Optimal Policies
Jonathan Colaco Carr
Prakash Panangaden
Doina Precup
86
2
0
03 Nov 2023
Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review
Mingze Yuan
Peng Bao
Jiajia Yuan
Yunhao Shen
Zi Chen
...
Jie Zhao
Yang Chen
Li Zhang
Lin Shen
Bin Dong
ELM
LM&MA
103
16
0
03 Nov 2023
Contextual Confidence and Generative AI
Shrey Jain
Zoe Hitzig
Pamela Mishkin
112
4
0
02 Nov 2023
Making Harmful Behaviors Unlearnable for Large Language Models
Xin Zhou
Yi Lu
Ruotian Ma
Tao Gui
Qi Zhang
Xuanjing Huang
MU
77
12
0
02 Nov 2023
Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
Sam Toyer
Olivia Watkins
Ethan Mendes
Justin Svegliato
Luke Bailey
...
Karim Elmaaroufi
Pieter Abbeel
Trevor Darrell
Alan Ritter
Stuart J. Russell
111
79
0
02 Nov 2023
On The Open Prompt Challenge In Conditional Audio Generation
Ernie Chang
Sidd Srinivasan
Mahi Luthra
Pin-Jie Lin
Varun K. Nagaraja
...
Zechun Liu
Zhaoheng Ni
Changsheng Zhao
Yangyang Shi
Vikas Chandra
69
4
0
01 Nov 2023
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
Yang Deng
Wenxuan Zhang
Wai Lam
See-Kiong Ng
Tat-Seng Chua
LM&Ro
LLMAG
140
43
0
01 Nov 2023
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield
Jinhwa Kim
Ali Derakhshan
Ian G. Harris
AAML
187
18
0
31 Oct 2023
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
ALM
77
92
0
31 Oct 2023
MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks
Allen Nie
Yuhui Zhang
Atharva Amdekar
Chris Piech
Tatsunori Hashimoto
Tobias Gerstenberg
82
40
0
30 Oct 2023
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions
Stefano Massaroli
Michael Poli
Daniel Y. Fu
Hermann Kumbong
Rom N. Parnichkun
...
Atri Rudra
Ce Zhang
Christopher Ré
Stefano Ermon
Yoshua Bengio
106
22
0
28 Oct 2023
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
Bobby Azad
Reza Azad
Sania Eskandari
Afshin Bozorgpour
Amirhossein Kazerouni
I. Rekik
Dorit Merhof
VLM
MedIm
144
68
0
28 Oct 2023
Social Contract AI: Aligning AI Assistants with Implicit Group Norms
Jan-Philipp Fränken
Sam Kwok
Peixuan Ye
Kanishk Gandhi
Dilip Arumugam
Jared Moore
Alex Tamkin
Tobias Gerstenberg
Noah D. Goodman
70
9
0
26 Oct 2023
Managing extreme AI risks amid rapid progress
Yoshua Bengio
Geoffrey Hinton
Andrew Yao
Dawn Song
Pieter Abbeel
...
Philip Torr
Stuart J. Russell
Daniel Kahneman
J. Brauner
Sören Mindermann
99
67
0
26 Oct 2023
Unpacking the Ethical Value Alignment in Big Models
Xiaoyuan Yi
Jing Yao
Xiting Wang
Xing Xie
79
13
0
26 Oct 2023
Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting
Preethi Lahoti
Nicholas Blumm
Xiao Ma
Raghavendra Kotikalapudi
Sahitya Potluri
...
Hansa Srinivasan
Ben Packer
Ahmad Beirami
Alex Beutel
Jilin Chen
112
32
0
25 Oct 2023