2212.08073
Cited By
Constitutional AI: Harmlessness from AI Feedback
15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova DasSarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
Papers citing
"Constitutional AI: Harmlessness from AI Feedback"
50 / 1,106 papers shown
Title
A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian
Pegah Jandaghi
Negar Mokhberian
Sai Praneeth Karimireddy
Jay Pujara
22
0
0
16 May 2025
WorldPM: Scaling Human Preference Modeling
Binghui Wang
Runji Lin
K. Lu
L. Yu
Z. Zhang
...
Xuanjing Huang
Yu-Gang Jiang
Bowen Yu
J. Zhou
Junyang Lin
24
0
0
15 May 2025
Demystifying AI Agents: The Final Generation of Intelligence
Kevin J McNamara
Rhea Pritham Marpu
29
0
0
15 May 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Yidan Wang
Yanan Cao
Yubing Ren
Fang Fang
Zheng-Shen Lin
Binxing Fang
PILM
44
0
0
15 May 2025
Atomic Consistency Preference Optimization for Long-Form Question Answering
Jingfeng Chen
Raghuveer Thirukovalluru
Junlin Wang
Kaiwei Luo
Bhuwan Dhingra
KELM
HILM
20
0
0
14 May 2025
Optimized Couplings for Watermarking Large Language Models
Dor Tsur
Carol Xuan Long
C. M. Verdun
Hsiang Hsu
H. Permuter
Flavio du Pin Calmon
WaLM
35
0
0
13 May 2025
Evaluating LLM Metrics Through Real-World Capabilities
Justin K Miller
Wenjia Tang
ELM
ALM
42
0
0
13 May 2025
Towards Artificial General or Personalized Intelligence? A Survey on Foundation Models for Personalized Federated Intelligence
Yu Qiao
Huy Q. Le
Avi Deb Raha
Phuong-Nam Tran
Apurba Adhikary
Mengchun Zhang
Loc X. Nguyen
Eui-nam Huh
Dusit Niyato
Choong Seon Hong
AI4CE
31
0
0
11 May 2025
Multi-Agent Systems for Robotic Autonomy with LLMs
Junhong Chen
Ziqi Yang
Haoyuan G Xu
Dandan Zhang
George Mylonas
LLMAG
49
0
0
09 May 2025
G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness
Jaehyun Jeon
Janghan Yoon
Minsoo Kim
Sumin Shim
Yejin Choi
Hanbin Kim
Youngjae Yu
AAML
47
0
0
08 May 2025
RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection
Yixin Yang
Qingxiu Dong
Linli Yao
Fangwei Zhu
Zhifang Sui
48
0
0
08 May 2025
Optimization Problem Solving Can Transition to Evolutionary Agentic Workflows
Wenhao Li
Bo Jin
Mingyi Hong
Changhong Lu
Xiangfeng Wang
48
0
0
07 May 2025
RM-R1: Reward Modeling as Reasoning
Xiusi Chen
Gaotang Li
Zehua Wang
Bowen Jin
Cheng Qian
...
Y. Zhang
D. Zhang
Tong Zhang
Hanghang Tong
Heng Ji
ReLM
OffRL
LRM
165
1
0
05 May 2025
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
Miaomiao Ji
Yanqiu Wu
Zhibin Wu
Shoujin Wang
Jian Yang
Mark Dras
Usman Naseem
39
0
0
05 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
72
1
0
05 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
55
0
0
05 May 2025
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang
Ke Ma
Xiaojun Jia
Yingfei Sun
Qianqian Xu
Q. Huang
AAML
159
0
0
03 May 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil
Yi-Lin Sung
Peter Hase
Jie Peng
Jen-tse Huang
Joey Tianyi Zhou
AAML
MU
83
3
0
01 May 2025
Real-World Gaps in AI Governance Research
Ilan Strauss
Isobel Moure
Tim O'Reilly
Sruly Rosenblat
63
0
0
30 Apr 2025
Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning
Pengxiang Li
Zhi Gao
Bofei Zhang
Yapeng Mi
Xiaojian Ma
...
Tao Yuan
Yuwei Wu
Yunde Jia
Song-Chun Zhu
Qing Li
LLMAG
70
0
0
30 Apr 2025
PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations
Haowen Sun
Haoran Wang
Chengzhong Ma
Shaolong Zhang
Jiawei Ye
Xingyu Chen
Xuguang Lan
OffRL
53
1
0
29 Apr 2025
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang
Chin-Ting Hsu
Chan-Hung Yu
Saransh Agrawal
Shih-Cheng Huang
Shang-Tse Chen
Kuan-Hao Huang
Shao-Hua Sun
81
0
0
27 Apr 2025
Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society
Feifei Zhao
Y. Wang
Enmeng Lu
Dongcheng Zhao
Bing Han
...
Chao Liu
Yaodong Yang
Yi Zeng
Boyuan Chen
Jinyu Fan
83
0
0
24 Apr 2025
Cognitive Silicon: An Architectural Blueprint for Post-Industrial Computing Systems
Christoforus Yoga Haryanto
Emily Lomempow
27
0
0
23 Apr 2025
Safety Pretraining: Toward the Next Generation of Safe AI
Pratyush Maini
Sachin Goyal
Dylan Sam
Alex Robey
Yash Savani
Yiding Jiang
Andy Zou
Zachary C. Lipton
J. Zico Kolter
63
0
0
23 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David E. Evans
LLMSV
76
0
0
23 Apr 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Minghao Wu
Weixuan Wang
Sinuo Liu
Huifeng Yin
Xintong Wang
Yu Zhao
Chenyang Lyu
Longyue Wang
Weihua Luo
Kaifu Zhang
ELM
79
0
0
22 Apr 2025
Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability
Daniel Hendriks
Philipp Spitzer
Niklas Kühl
G. Satzger
27
1
0
22 Apr 2025
SUDO: Enhancing Text-to-Image Diffusion Models with Self-Supervised Direct Preference Optimization
Liang Peng
Boxi Wu
Haoran Cheng
Yibo Zhao
Xiaofei He
36
0
0
20 Apr 2025
LoRe: Personalizing LLMs via Low-Rank Reward Modeling
Avinandan Bose
Zhihan Xiong
Yuejie Chi
Simon S. Du
Lin Xiao
Maryam Fazel
28
0
0
20 Apr 2025
Harnessing Generative LLMs for Enhanced Financial Event Entity Extraction Performance
Soo-joon Choi
Ji-jun Park
41
0
0
20 Apr 2025
Remedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
Shaomu Tan
Christof Monz
37
0
0
18 Apr 2025
Image-Editing Specialists: An RLAIF Approach for Diffusion Models
Elior Benarous
Yilun Du
Heng Yang
22
0
0
17 Apr 2025
Aligning Constraint Generation with Design Intent in Parametric CAD
Evan Casey
Tianyu Zhang
Shu Ishida
John Roger Thompson
Amir Hosein Khasahmadi
Joseph George Lambourne
P. Jayaraman
K. Willis
35
0
0
17 Apr 2025
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Benjamin LeBrun
Li Du
Ben Lipkin
Clemente Pasti
...
Ryan Cotterell
Vikash K. Mansinghka
Alexander K. Lew
Tim Vieira
Timothy J. O'Donnell
34
1
0
17 Apr 2025
Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex
Azadeh Beiranvand
Seyed Mehdi Vahidipour
34
0
0
16 Apr 2025
REWARD CONSISTENCY: Improving Multi-Objective Alignment from a Data-Centric Perspective
Zhihao Xu
Yongqi Tong
Xin Zhang
Jun Zhou
Xiting Wang
35
0
0
15 Apr 2025
Teaching Large Language Models to Reason through Learning and Forgetting
Tianwei Ni
Allen Nie
Sapana Chaudhary
Yao Liu
Huzefa Rangwala
Rasool Fakoor
ReLM
CLL
LRM
142
0
0
15 Apr 2025
SaRO: Enhancing LLM Safety through Reasoning-based Alignment
Yutao Mou
Yuxiao Luo
Shikun Zhang
Wei Ye
LLMSV
LRM
36
0
0
13 Apr 2025
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Zongxian Yang
Jiayu Qian
Z. Huang
Kay Chen Tan
LM&MA
LRM
31
0
0
13 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
43
0
0
13 Apr 2025
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
Jian Wu
Hao Yang
Xinhua Zeng
Guibing He
Zhengzhang Chen
Z. Li
Xinming Zhang
Yangyang Ma
Run Fang
Yang Liu
LRM
133
0
0
12 Apr 2025
A Short Survey on Small Reasoning Models: Training, Inference, Applications and Research Directions
Chengyu Wang
Taolin Zhang
Richang Hong
Jun Huang
ReLM
LRM
42
1
0
12 Apr 2025
AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
Tuhin Chakrabarty
Philippe Laban
C. Wu
32
1
0
10 Apr 2025
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization
Jing Yao
Xiaoyuan Yi
Jindong Wang
Zhicheng Dou
Xing Xie
28
0
0
09 Apr 2025
Bypassing Safety Guardrails in LLMs Using Humor
Pedro Cisneros-Velarde
31
0
0
09 Apr 2025
HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
Bibek Paudel
Alexander Lyzhov
Preetam Joshi
Puneet Anand
HILM
49
0
0
09 Apr 2025
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
Murali Annavaram
LRM
31
0
0
08 Apr 2025
Lightweight and Direct Document Relevance Optimization for Generative Information Retrieval
Kidist Amde Mekonnen
Yubao Tang
Maarten de Rijke
60
0
0
07 Apr 2025
Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
Jiawei Lian
Jianhong Pan
L. Wang
Yi Wang
Shaohui Mei
Lap-Pui Chau
AAML
29
0
0
07 Apr 2025