2212.08073
Cited By
Constitutional AI: Harmlessness from AI Feedback
15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
Papers citing
"Constitutional AI: Harmlessness from AI Feedback"
50 / 1,202 papers shown
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection
Yachao Zhao
Bo Wang
Yan Wang
Dongming Zhao
Ruifang He
Yuexian Hou
142
4
0
04 Jan 2025
From Generalist to Specialist: A Survey of Large Language Models for Chemistry
Yang Han
Ziping Wan
Lu Chen
Kai Yu
Xin Chen
LM&MA
97
3
0
31 Dec 2024
MLLM-as-a-Judge for Image Safety without Human Labeling
Zhenting Wang
Shuming Hu
Shiyu Zhao
Xiaowen Lin
F. Xu
...
Nan Jiang
Lingjuan Lyu
Shiqing Ma
Dimitris N. Metaxas
Ankit Jain
403
5
0
31 Dec 2024
ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
Weilong Dong
Xinwei Wu
Renren Jin
Shaoyang Xu
Deyi Xiong
141
9
0
31 Dec 2024
Geometric-Averaged Preference Optimization for Soft Preference Labels
Hiroki Furuta
Kuang-Huei Lee
Shixiang Shane Gu
Y. Matsuo
Aleksandra Faust
Heiga Zen
Izzeddin Gur
144
13
0
31 Dec 2024
A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao
Feizhong Zhou
Xianglong Liu
Tianqi Liu
Zhipeng Li
Xin Liu
Xiaoxuan Huang
AILaw
LM&MA
LRM
145
30
0
31 Dec 2024
Malware Classification using a Hybrid Hidden Markov Model-Convolutional Neural Network
Ritik Mehta
Olha Jurecková
Mark Stamp
128
0
0
25 Dec 2024
Multimodal Preference Data Synthetic Alignment with Reward Model
Robert Wijaya
Ngoc-Bao Nguyen
Ngai-Man Cheung
MLLM
SyDa
133
4
0
23 Dec 2024
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum
Christoph Schnabl
Gabor Hollbeck
Silas Alberti
Philip Blinde
Marvin von Hagen
142
2
0
22 Dec 2024
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu
Michael Vaiana
Judd Rosenblatt
Cameron Berg
Diogo Schwerz de Lucena
105
0
0
20 Dec 2024
Gradual Vigilance and Interval Communication: Enhancing Value Alignment in Multi-Agent Debates
Rui Zou
Mengqi Wei
Jintian Feng
Qian Wan
Jianwen Sun
Sannyuya Liu
89
0
0
18 Dec 2024
A Method for Detecting Legal Article Competition for Korean Criminal Law Using a Case-augmented Mention Graph
Seonho An
Young Yik Rhim
Min-Soo Kim
AILaw
107
0
0
16 Dec 2024
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
Boyang Xue
Fei Mi
Qi Zhu
Hongru Wang
Rui Wang
Sheng Wang
Erxin Yu
Xuming Hu
Kam-Fai Wong
HILM
218
2
0
16 Dec 2024
The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang
Yingkang Wang
Shiyao Cui
Pei Ke
J. Tang
176
1
0
15 Dec 2024
TapeAgents: a Holistic Framework for Agent Development and Optimization
Dzmitry Bahdanau
Nicolas Angelard-Gontier
Gabriel Huang
Ehsan Kamalloo
Rafael Pardinas
...
Jordan Prince Tremblay
Karam Ghanem
S. Parikh
Mitul Tiwari
Quaizar Vohra
157
4
0
11 Dec 2024
Coverage-based Fairness in Multi-document Summarization
Haoyuan Li
Yusen Zhang
Rui Zhang
Snigdha Chaturvedi
181
0
0
11 Dec 2024
Evolutionary Pre-Prompt Optimization for Mathematical Reasoning
Mathurin Videau
Alessandro Leite
Marc Schoenauer
O. Teytaud
ReLM
LRM
114
0
0
05 Dec 2024
Beyond the Binary: Capturing Diverse Preferences With Reward Regularization
Vishakh Padmakumar
Chuanyang Jin
Hannah Rose Kirk
He He
91
6
0
05 Dec 2024
Reinforcement Learning Enhanced LLMs: A Survey
Shuhe Wang
Shengyu Zhang
Jing Zhang
Runyi Hu
Xiaoya Li
Tianwei Zhang
Jiwei Li
Leilei Gan
G. Wang
Eduard H. Hovy
OffRL
245
16
0
05 Dec 2024
Query Performance Explanation through Large Language Model for HTAP Systems
Haibo Xiu
Li Zhang
Tieying Zhang
Jun Yang
Jianjun Chen
83
0
0
02 Dec 2024
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
Shenghui Li
Edith C.H. Ngai
Fanghua Ye
Thiemo Voigt
SILM
197
6
0
28 Nov 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
364
112
0
25 Nov 2024
From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
M. Finkelstein
Dan Deutsch
Parker Riley
Juraj Juraska
Geza Kovacs
Markus Freitag
112
0
0
23 Nov 2024
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
Bethel Melesse Tessema
Akhil Kedia
Tae-Sun Chung
94
0
0
21 Nov 2024
Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets
Ike Obi
Rohan Pant
Srishti Shekhar Agrawal
Maham Ghazanfar
Aaron Basiletti
81
2
0
18 Nov 2024
Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
Xinyan Guan
Yanjiang Liu
Xinyu Lu
Boxi Cao
Xianpei Han
...
Le Sun
Jie Lou
Bowen Yu
Yaojie Lu
Hongyu Lin
ALM
179
5
0
18 Nov 2024
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
Xikang Yang
Xuehai Tang
Jizhong Han
Songlin Hu
116
0
0
18 Nov 2024
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min
Long H. Pham
Yige Li
Jun Sun
AAML
152
5
0
18 Nov 2024
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Blake Bullwinkel
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangde
LLMSV
162
18
0
18 Nov 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Hongrui Jia
Chaoya Jiang
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
MLLM
147
3
0
17 Nov 2024
SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment
Quan Ze Chen
K. J. Kevin Feng
Chan Young Park
Amy X. Zhang
63
0
0
16 Nov 2024
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
Somanshu Singla
Zhen Wang
Tianyang Liu
Abdullah Ashfaq
Zhiting Hu
Eric Xing
70
2
0
13 Nov 2024
PyGen: A Collaborative Human-AI Approach to Python Package Creation
Saikat Barua
Mostafizur Rahman
Md Jafor Sadek
Rafiul Islam
Shehnaz Khaled
Md. Shohrab Hossain
121
2
0
13 Nov 2024
Mitigating Metric Bias in Minimum Bayes Risk Decoding
Geza Kovacs
Daniel Deutsch
Markus Freitag
107
8
0
05 Nov 2024
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
Prashanth Gurunath Shivakumar
Aditya Gourav
Yile Gu
Ankur Gandhe
Hung-yi Lee
I. Bulyko
120
9
0
04 Nov 2024
Rule Based Rewards for Language Model Safety
Tong Mu
Alec Helyar
Johannes Heidecke
Joshua Achiam
Andrea Vallone
Ian Kivlichan
Molly Lin
Alex Beutel
John Schulman
Lilian Weng
ALM
123
49
0
02 Nov 2024
Token-level Proximal Policy Optimization for Query Generation
Yichen Ouyang
Lu Wang
Fangkai Yang
Pu Zhao
Chenghua Huang
...
Saravan Rajmohan
Weiwei Deng
Dongmei Zhang
Feng Sun
Qi Zhang
OffRL
430
5
0
01 Nov 2024
Comparison-based Active Preference Learning for Multi-dimensional Personalization
Minhyeon Oh
Seungjoon Lee
Jungseul Ok
68
1
0
01 Nov 2024
Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation
Bohan Lyu
Yadi Cao
Duncan Watson-Parris
Leon Bergen
Taylor Berg-Kirkpatrick
Rose Yu
131
5
0
01 Nov 2024
Representative Social Choice: From Learning Theory to AI Alignment
Tianyi Qiu
FedML
53
2
0
31 Oct 2024
Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models
Phil Wee
Riyadh Baghdadi
HILM
68
1
0
31 Oct 2024
OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models
Junda Wu
Xintong Li
Ruoyu Wang
Yu Xia
Yuxin Xiong
...
Xiang Chen
Branislav Kveton
Lina Yao
Jingbo Shang
Julian McAuley
OffRL
LRM
80
1
0
31 Oct 2024
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu
Bowen Shi
Avi Caciularu
Idan Szpektor
Arman Cohan
158
4
0
30 Oct 2024
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou
Shikun Zhang
Wei Ye
ELM
84
16
0
29 Oct 2024
Fast Best-of-N Decoding via Speculative Rejection
Hanshi Sun
Momin Haider
Ruiqi Zhang
Huitao Yang
Jiahao Qiu
Ming Yin
Mengdi Wang
Peter L. Bartlett
Andrea Zanette
BDL
117
52
0
26 Oct 2024
An Auditing Test To Detect Behavioral Shift in Language Models
Leo Richter
Xuanli He
Pasquale Minervini
Matt J. Kusner
95
0
0
25 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
Liwen Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
178
7
0
24 Oct 2024
CorrectionLM: Self-Corrections with SLM for Dialogue State Tracking
Chia-Hsuan Lee
Hao Cheng
Mari Ostendorf
LRM
52
0
0
23 Oct 2024
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
Nirav Bhan
Shival Gupta
Sai Manaswini
Ritik Baba
Narun Yadav
Hillori Desai
Yash Choudhary
Aman Pawar
Sarthak Shrivastava
Sudipta Biswas
LLMAG
44
0
0
23 Oct 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu
Yuhang Zang
Xiaoyi Dong
Pan Zhang
Yuhang Cao
Haodong Duan
Zeang Sheng
Yuanjun Xiong
Dahua Lin
Jiaqi Wang
105
12
0
23 Oct 2024