Constitutional AI: Harmlessness from AI Feedback
arXiv:2212.08073 · 15 December 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, Andy Jones, A. Chen, Anna Goldie, Azalia Mirhoseini, C. McKinnon, Carol Chen, Catherine Olsson, C. Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, E. Perez, Jamie Kerr, J. Mueller, Jeff Ladish, J. Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova Dassarma, R. Lasenby, Robin Larson, Sam Ringer, Scott R. Johnston, Shauna Kravec, S. E. Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, T. Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
Tags: SyDa, MoMe
Papers citing "Constitutional AI: Harmlessness from AI Feedback" (50 of 1,116 shown)
Prospect Personalized Recommendation on Large Language Model-based Agent Platform
Jizhi Zhang, Keqin Bao, Wenjie Wang, Yang Zhang, Wentao Shi, Wanhong Xu, Fuli Feng, Tat-Seng Chua · LLMAG · 51 / 17 / 0 · 28 Feb 2024

LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
Akash Gupta, Ivaxi Sheth, Vyas Raina, Mark Gales, Mario Fritz · 43 / 4 / 0 · 28 Feb 2024

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen · AAML · 51 / 47 / 0 · 28 Feb 2024

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
Bin Cao, Jianhao Yuan, Yexin Liu, Jian Li, Shuyang Sun, Jing Liu, Bo Zhao · DiffM · 43 / 7 / 0 · 28 Feb 2024

On the Challenges and Opportunities in Generative AI
Laura Manduchi, Kushagra Pandey, Robert Bamler, Ryan Cotterell, Sina Daubener, ..., F. Wenzel, Frank Wood, Stephan Mandt, Vincent Fortuin · 56 / 17 / 0 · 28 Feb 2024

Self-Refinement of Language Models from External Proxy Metrics Feedback
Keshav Ramji, Young-Suk Lee, R. Astudillo, M. Sultan, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos · HILM · 39 / 3 / 0 · 27 Feb 2024

SoFA: Shielded On-the-fly Alignment via Priority Rule Following
Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li · 70 / 13 / 0 · 27 Feb 2024

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su · 37 / 20 / 0 · 27 Feb 2024

Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models
Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, Jianshu Chen · HILM, LRM · 32 / 4 / 0 · 27 Feb 2024

CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Huijie Lv, Xiao Wang, Yuan Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang · AAML · 44 / 29 / 0 · 26 Feb 2024

DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh · AAML · 52 / 37 / 0 · 25 Feb 2024

Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration
Xin Mao, Fengming Li, Huimin Xu, Wei Zhang, A. Luu · ALM · 45 / 6 / 0 · 25 Feb 2024

Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware
Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Keheliya Gallaba, F. Côgo, ..., Kishanthan Thangarajah, G. Oliva, Jiahuei Lin, Wali Mohammad Abdullah, Zhen Ming Jiang · 37 / 7 / 0 · 25 Feb 2024

How Do Humans Write Code? Large Models Do It the Same Way Too
Long Li, Xuzheng He · LRM · 43 / 4 / 0 · 24 Feb 2024

Fast Adversarial Attacks on Language Models In One GPU Minute
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Malemir Chegini, S. Feizi · MIALM · 43 / 34 / 0 · 23 Feb 2024

Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, N. Diamant, Alex Tseng, Tommaso Biancalani, Sergey Levine · 42 / 42 / 0 · 23 Feb 2024

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang · LRM · 42 / 44 / 0 · 22 Feb 2024

A Language Model's Guide Through Latent Space
Dimitri von Rutte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann · 45 / 24 / 0 · 22 Feb 2024

Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
Jinyi Liu, Yifu Yuan, Jianye Hao, Fei Ni, Lingzhi Fu, Yibin Chen, Yan Zheng · LM&Ro · 163 / 4 / 0 · 22 Feb 2024

Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein · AAML · 48 / 43 / 0 · 21 Feb 2024

Large Language Models for Data Annotation: A Survey
Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Wenlin Yao, Lu Cheng, Huan Liu · SyDa · 56 / 53 / 0 · 21 Feb 2024

The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative
Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, Huan Liu · 44 / 15 / 0 · 20 Feb 2024

Learning and Sustaining Shared Normative Systems via Bayesian Rule Induction in Markov Games
Ninell Oldenburg, Zhi-Xuan Tan · 40 / 4 / 0 · 20 Feb 2024

Is the System Message Really Important to Jailbreaks in Large Language Models?
Xiaotian Zou, Yongkang Chen, Ke Li · 30 / 13 / 0 · 20 Feb 2024

A Survey on Knowledge Distillation of Large Language Models
Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Dinesh Manocha · KELM, VLM · 44 / 103 / 0 · 20 Feb 2024

Incentive Compatibility for AI Alignment in Sociotechnical Systems: Positions and Prospects
Zhaowei Zhang, Fengshuo Bai, Mingzhi Wang, Haoyang Ye, Chengdong Ma, Yaodong Yang · 35 / 4 / 0 · 20 Feb 2024

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models
Loka Li, Zhenhao Chen, Guan-Hong Chen, Yixuan Zhang, Yusheng Su, Eric P. Xing, Kun Zhang · LRM · 44 / 16 / 0 · 19 Feb 2024

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Mengsi Cao, Lijie Wen · ALM · 34 / 12 / 0 · 19 Feb 2024

Ask Optimal Questions: Aligning Large Language Models with Retriever's Preference in Conversational Search
Chanwoong Yoon, Gangwoo Kim, Byeongguk Jeon, Sungdong Kim, Yohan Jo, Jaewoo Kang · RALM, KELM · 39 / 12 / 0 · 19 Feb 2024

FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema
Junru Lu, Siyu An, Min Zhang, Yulan He, Di Yin, Xing Sun · 56 / 2 / 0 · 19 Feb 2024

Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
M. Sultan, Jatin Ganhotra, Ramón Fernández Astudillo · LRM · 32 / 3 / 0 · 19 Feb 2024

Learning to Learn Faster from Human Feedback with Language Model Predictive Control
Jacky Liang, Fei Xia, Wenhao Yu, Andy Zeng, Montse Gonzalez Arenas, ..., N. Heess, Kanishka Rao, Nik Stewart, Jie Tan, Carolina Parada · LM&Ro · 61 / 34 / 0 · 18 Feb 2024

Aligning Large Language Models by On-Policy Self-Judgment
Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu · OSLM · 33 / 11 / 0 · 17 Feb 2024

KnowTuning: Knowledge-aware Fine-tuning for Large Language Models
Yougang Lyu, Lingyong Yan, Shuaiqiang Wang, Haibo Shi, Dawei Yin, Pengjie Ren, Zhumin Chen, Maarten de Rijke, Zhaochun Ren · 24 / 5 / 0 · 17 Feb 2024

Whose Emotions and Moral Sentiments Do Language Models Reflect?
Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman · 47 / 12 / 0 · 16 Feb 2024

Multi-modal preference alignment remedies regression of visual instruction tuning on language model
Shengzhi Li, Rongyu Lin, Shichao Pei · 40 / 21 / 0 · 16 Feb 2024

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang · LLMAG · 38 / 17 / 0 · 16 Feb 2024

Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning
Jun Zhuang, C. Kennington · 28 / 9 / 0 · 16 Feb 2024

DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Ajay Patel, Colin Raffel, Chris Callison-Burch · SyDa, AI4CE · 33 / 25 / 0 · 16 Feb 2024

A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents
Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu-Chuan Su, Chaowei Xiao, Huan Sun · AAML, LLMAG · 49 / 15 / 0 · 15 Feb 2024

Reward Generalization in RLHF: A Topological Perspective
Tianyi Qiu, Fanzhi Zeng, Jiaming Ji, Dong Yan, Kaile Wang, Jiayi Zhou, Yang Han, Josef Dai, Xuehai Pan, Yaodong Yang · AI4CE · 32 / 3 / 0 · 15 Feb 2024

Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Jiuxiang Gu, Dinesh Manocha · 29 / 52 / 0 · 15 Feb 2024

Aligning Crowd Feedback via Distributional Preference Reward Modeling
Dexun Li, Cong Zhang, Kuicai Dong, Derrick-Goh-Xin Deik, Ruiming Tang, Yong Liu · 21 / 15 / 0 · 15 Feb 2024

Instruction Tuning for Secure Code Generation
Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev · 21 / 16 / 0 · 14 Feb 2024

Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar · AI4CE · 91 / 21 / 0 · 14 Feb 2024

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang · ALM · 49 / 48 / 0 · 14 Feb 2024

Rethinking Machine Unlearning for Large Language Models
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, ..., Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu · AILaw, MU · 79 / 84 / 0 · 13 Feb 2024

GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
Alex Havrilla, Sharath Raparthy, Christoforus Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Railneau · ReLM, LRM · 41 / 51 / 0 · 13 Feb 2024

COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
Xing-ming Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu · AAML · 117 / 73 / 0 · 13 Feb 2024

PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou · EGVM · 25 / 15 / 0 · 13 Feb 2024