Constitutional AI: Harmlessness from AI Feedback
arXiv:2212.08073 · 15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
Papers citing "Constitutional AI: Harmlessness from AI Feedback" (50 of 1,116 shown)
Efficient Model-agnostic Alignment via Bayesian Persuasion
Fengshuo Bai
Mingzhi Wang
Zhaowei Zhang
Boyuan Chen
Yinda Xu
Ying Wen
Yaodong Yang
58
3
0
29 May 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang
Yuyang Wu
Zeming Wei
Stefanie Jegelka
Yisen Wang
LRM
47
14
0
28 May 2024
Aligning to Thousands of Preferences via System Message Generalization
Seongyun Lee
Sue Hyun Park
Seungone Kim
Minjoon Seo
ALM
44
38
0
28 May 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li
Yiwen Guo
Wangmeng Zuo
Hao Chen
AAML
SILM
28
5
0
28 May 2024
Learning diverse attacks on large language models for robust red-teaming and safety tuning
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
...
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
AAML
63
12
0
28 May 2024
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Sheng-Hsuan Peng
Pin-Yu Chen
Matthew Hull
Duen Horng Chau
50
23
0
27 May 2024
Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models
Chia-Yi Hsu
Yu-Lin Tsai
Chih-Hsun Lin
Pin-Yu Chen
Chia-Mu Yu
Chun-ying Huang
49
34
0
27 May 2024
CHESS: Contextual Harnessing for Efficient SQL Synthesis
Shayan Talaei
Mohammadreza Pourreza
Yu-Chen Chang
Azalia Mirhoseini
Amin Saberi
46
52
0
27 May 2024
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity
Shanghaoran Quan
43
4
0
26 May 2024
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Chak Tou Leong
Yi Cheng
Kaishuai Xu
Jian Wang
Hanlin Wang
Wenjie Li
AAML
51
17
0
25 May 2024
Bayesian WeakS-to-Strong from Text Classification to Generation
Ziyun Cui
Ziyang Zhang
Wen Wu
Chao Zhang
39
2
0
24 May 2024
Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input
Andi Peng
Yuying Sun
Tianmin Shu
David Abel
46
3
0
23 May 2024
Multi-turn Reinforcement Learning from Preference Human Feedback
Lior Shani
Aviv Rosenberg
Asaf B. Cassel
Oran Lang
Daniele Calandriello
...
Bilal Piot
Idan Szpektor
Avinatan Hassidim
Yossi Matias
Rémi Munos
49
26
0
23 May 2024
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Jingnan Zheng
Han Wang
An Zhang
Tai D. Nguyen
Jun Sun
Tat-Seng Chua
LLMAG
40
14
0
23 May 2024
Online Self-Preferring Language Models
Yuanzhao Zhai
Zhuo Zhang
Kele Xu
Hanyang Peng
Yue Yu
Dawei Feng
Cheng Yang
Bo Ding
Huaimin Wang
56
0
0
23 May 2024
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
Tianrong Zhang
Bochuan Cao
Yuanpu Cao
Lu Lin
Prasenjit Mitra
Jinghui Chen
AAML
45
9
0
22 May 2024
LIRE: listwise reward enhancement for preference alignment
Mingye Zhu
Yi Liu
Lei Zhang
Junbo Guo
Zhendong Mao
26
7
0
22 May 2024
Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
Rheeya Uppaal
Apratim De
Yiting He
Yiqiao Zhong
Junjie Hu
37
7
0
22 May 2024
Curriculum Direct Preference Optimization for Diffusion and Consistency Models
Florinel-Alin Croitoru
Vlad Hondru
Radu Tudor Ionescu
N. Sebe
Mubarak Shah
EGVM
89
6
0
22 May 2024
Babysit A Language Model From Scratch: Interactive Language Learning by Trials and Demonstrations
Ziqiao Ma
Zekun Wang
Joyce Chai
60
2
0
22 May 2024
Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in LLMs
Bilgehan Sel
Priya Shanmugasundaram
Mohammad Kachuee
Kun Zhou
Ruoxi Jia
Ming Jin
LRM
40
2
0
21 May 2024
Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents
San Kim
Gary Geunbae Lee
AAML
43
3
0
21 May 2024
SPO: Multi-Dimensional Preference Sequential Alignment With Implicit Reward Modeling
Xingzhou Lou
Junge Zhang
Jian Xie
Lifeng Liu
Dong Yan
Kaiqi Huang
45
11
0
21 May 2024
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Jiaxu Liu
Xiangyu Yin
Sihao Wu
Jianhong Wang
Meng Fang
Xinping Yi
Xiaowei Huang
34
4
0
21 May 2024
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
Jiachen Sun
Changsheng Wang
Jiong Wang
Yiwei Zhang
Chaowei Xiao
AAML
VLM
39
3
0
17 May 2024
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Andrea V. Bajcsy
J. F. Fisac
40
7
0
16 May 2024
Understanding the performance gap between online and offline alignment algorithms
Yunhao Tang
Daniel Guo
Zeyu Zheng
Daniele Calandriello
Yuan Cao
...
Rémi Munos
Bernardo Avila-Pires
Michal Valko
Yong Cheng
Will Dabney
OffRL
OnRL
27
61
0
14 May 2024
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Raghuveer Peri
Sai Muralidhar Jayanthi
S. Ronanki
Anshu Bhatia
Karel Mundnich
...
Srikanth Vishnubhotla
Daniel Garcia-Romero
S. Srinivasan
Kyu J. Han
Katrin Kirchhoff
AAML
34
3
0
14 May 2024
Divergent Creativity in Humans and Large Language Models
Antoine Bellemare-Pepin
François Lespinasse
Philipp Thölke
Y. Harel
K. Mathewson
Jay A. Olson
Yoshua Bengio
AI4CE
56
9
0
13 May 2024
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang
Qizhen Zhang
Jakob N. Foerster
AAML
43
18
0
13 May 2024
RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong
Wei Xiong
Bo Pang
Haoxiang Wang
Han Zhao
Yingbo Zhou
Nan Jiang
Doyen Sahoo
Caiming Xiong
Tong Zhang
OffRL
29
98
0
13 May 2024
METAREFLECTION: Learning Instructions for Language Agents using Past Reflections
Priyanshu Gupta
Shashank Kirtania
Ananya Singha
Sumit Gulwani
Arjun Radhakrishna
Sherry Shi
Gustavo Soares
LLMAG
40
4
0
13 May 2024
DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Bamberger
Ofek Glick
Chaim Baskin
Yonatan Belinkov
67
0
0
13 May 2024
MathDivide: Improved mathematical reasoning by large language models
S. Srivastava
Ashutosh Gandhi
LRM
ReLM
38
0
0
12 May 2024
Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation
JoonHo Lee
Jae Oh Woo
Juree Seok
Parisa Hassanzadeh
Wooseok Jang
...
Hankyu Moon
Wenjun Hu
Yeong-Dae Kwon
Taehee Lee
Seungjai Min
47
2
0
10 May 2024
Truthful Aggregation of LLMs with an Application to Online Advertising
Ermis Soumalias
Michael J. Curry
Sven Seuken
41
11
0
09 May 2024
BiasKG: Adversarial Knowledge Graphs to Induce Bias in Large Language Models
Chunyan Luo
Ahmad Ghawanmeh
Xiaodan Zhu
Faiza Khan Khattak
KELM
41
0
0
08 May 2024
Large Language Models for Cyber Security: A Systematic Literature Review
HanXiang Xu
Shenao Wang
Ningke Li
Kaidi Wang
Yanjie Zhao
Kai Chen
Ting Yu
Yang Liu
Haoyu Wang
37
23
0
08 May 2024
The Elephant in the Room -- Why AI Safety Demands Diverse Teams
David Rostcheck
Lara Scheibling
33
0
0
07 May 2024
MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization
Massimiliano Pappa
Luca Collorone
Giovanni Ficarra
Indro Spinelli
Fabio Galasso
54
1
0
06 May 2024
PICLe: Eliciting Diverse Behaviors from Large Language Models with Persona In-Context Learning
Hyeong Kyu Choi
Yixuan Li
69
17
0
03 May 2024
AI Governance and Accountability: An Analysis of Anthropic's Claude
Aman Priyanshu
Yash Maurya
Zuofei Hong
44
3
0
02 May 2024
FLAME: Factuality-Aware Alignment for Large Language Models
Sheng-Chieh Lin
Luyu Gao
Barlas Oğuz
Wenhan Xiong
Jimmy Lin
Wen-tau Yih
Xilun Chen
HILM
41
16
0
02 May 2024
NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment
Gerald Shen
Zhilin Wang
Olivier Delalleau
Jiaqi Zeng
Yi Dong
...
Sahil Jain
Ali Taghibakhshi
Markel Sanz Ausin
Ashwath Aithal
Oleksii Kuchaiev
43
13
0
02 May 2024
DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection
Yanjing Yang
Xin Zhou
Runfeng Mao
Jinwei Xu
Lanxin Yang
Yu Zhang
Haifeng Shen
He Zhang
24
13
0
02 May 2024
The Real, the Better: Aligning Large Language Models with Online Human Behaviors
Guanying Jiang
Lingyong Yan
Haibo Shi
Dawei Yin
33
2
0
01 May 2024
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
Yuxi Xie
Anirudh Goyal
Wenyue Zheng
Min-Yen Kan
Timothy Lillicrap
Kenji Kawaguchi
Michael Shieh
ReLM
LRM
52
87
0
01 May 2024
General Purpose Verification for Chain of Thought Prompting
Robert Vacareanu
Anurag Pratik
Evangelia Spiliopoulou
Zheng Qi
Giovanni Paolini
Neha Ann John
Jie Ma
Yassine Benajiba
Miguel Ballesteros
LRM
32
8
0
30 Apr 2024
RepEval: Effective Text Evaluation with LLM Representation
Shuqian Sheng
Yi Xu
Tianhang Zhang
Zanwei Shen
Luoyi Fu
Jiaxin Ding
Lei Zhou
Xinbing Wang
Cheng Zhou
27
1
0
30 Apr 2024
From Persona to Personalization: A Survey on Role-Playing Language Agents
Jiangjie Chen
Xintao Wang
Rui Xu
Siyu Yuan
Yikai Zhang
...
Caiyu Hu
Siye Wu
Scott Ren
Ziquan Fu
Yanghua Xiao
62
79
0
28 Apr 2024