Constitutional AI: Harmlessness from AI Feedback

15 December 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott R. Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Sam Bowman, Zac Hatfield-Dodds, Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Jared Kaplan
SyDa, MoMe
arXiv (abs) · PDF · HTML

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

Showing 50 of 1,202 citing papers.
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, A-Long Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
LRM
52 · 0 · 0 · 29 May 2025

Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment
Krti Tallam, Emma Miller
40 · 0 · 0 · 28 May 2025

SquareχPO: Differentially Private and Robust χ²-Preference Optimization in Offline Direct Alignment
Xingyu Zhou, Yulian Wu, Wenqian Weng, Francesco Orabona
83 · 0 · 0 · 27 May 2025

Sparsified State-Space Models are Efficient Highway Networks
Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
Mamba
39 · 0 · 0 · 27 May 2025

EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models
Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
VLM
45 · 0 · 0 · 27 May 2025

SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran
ALM, ELM
70 · 1 · 0 · 27 May 2025

Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Tharindu Kumarage, Ninareh Mehrabi, Anil Ramakrishna, Xinyan Zhao, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris
LRM
30 · 0 · 0 · 27 May 2025

SGM: A Framework for Building Specification-Guided Moderation Filters
M. Fatehkia, Enes Altinisik, Husrev Taha Sencar
51 · 1 · 0 · 26 May 2025

Agents Require Metacognitive and Strategic Reasoning to Succeed in the Coming Labor Markets
Simpson Zhang, Tennison Liu, Mihaela van der Schaar
LRM, LLMAG
53 · 0 · 0 · 26 May 2025

Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
Y. Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, ..., Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
99 · 0 · 0 · 26 May 2025

Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback
Mengdi Li, Jiaye Lin, Xufeng Zhao, Wenhao Lu, P. Zhao, Stefan Wermter, Di Wang
45 · 0 · 0 · 26 May 2025

Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models
Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao
102 · 0 · 0 · 26 May 2025

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Geon-hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Moontae Lee
24 · 2 · 0 · 26 May 2025

Learning a Pessimistic Reward Model in RLHF
Yinglun Xu, Hangoo Kang, Tarun Suresh, Yuxuan Wan, Gagandeep Singh
OffRL
61 · 0 · 0 · 26 May 2025

Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim
27 · 0 · 0 · 26 May 2025

Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models
Baihui Zheng, Boren Zheng, Kerui Cao, Y. Tan, Zhendong Liu, ..., Jian Yang, Wenbo Su, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
ELM
88 · 0 · 0 · 26 May 2025

RECAST: Strengthening LLMs' Complex Instruction Following with Constraint-Verifiable Data
Wenhao Liu, Zhengkang Guo, Mingchen Xie, Jingwen Xu, Zisu Huang, ..., Changze Lv, He-Da Wang, Hu Yao, Xiaoqing Zheng, Xuanjing Huang
181 · 0 · 0 · 25 May 2025

The Price of Format: Diversity Collapse in LLMs
Longfei Yun, Chenyang An, Zilong Wang, Letian Peng, Jingbo Shang
47 · 0 · 0 · 25 May 2025

Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous Thought Machines (CTM) and Model Context Protocol (MCP)
Libo Wang
46 · 0 · 0 · 25 May 2025

Generative RLHF-V: Learning Principles from Multi-modal Human Preference
Jiayi Zhou, Jiaming Ji, Boyuan Chen, Jiapeng Sun, Wenqi Chen, Donghai Hong, Sirui Han, Yike Guo, Yaodong Yang
21 · 0 · 0 · 24 May 2025

Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
Hongzheng Yang, Yongqiang Chen, Zeyu Qin, Tongliang Liu, Chaowei Xiao, Kun Zhang, Bo Han
LLMSV
44 · 0 · 0 · 24 May 2025

Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu
MoMe
90 · 1 · 0 · 23 May 2025

Bridging Supervised Learning and Reinforcement Learning in Math Reasoning
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
OffRL, LRM
251 · 3 · 0 · 23 May 2025

An Example Safety Case for Safeguards Against Misuse
Joshua Clymer, Jonah Weinbaum, Robert Kirk, Kimberly Mai, Selena Zhang, Xander Davies
61 · 0 · 0 · 23 May 2025

Multi-Scale Probabilistic Generation Theory: A Hierarchical Framework for Interpreting Large Language Models
Yukin Zhang, Qi Dong
104 · 0 · 0 · 23 May 2025

Value-Guided Search for Efficient Chain-of-Thought Reasoning
Kaiwen Wang, Jin Peng Zhou, Jonathan D. Chang, Zhaolin Gao, Nathan Kallus, Kianté Brantley, Wen Sun
LRM
90 · 1 · 0 · 23 May 2025

EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
AI4Ed, ELM
212 · 0 · 0 · 22 May 2025

Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment
Weixiang Zhao, Xingyu Sui, Yulin Hu, Jiahe Guo, Haixiao Liu, Biye Li, Yanyan Zhao, Bing Qin, Ting Liu
OffRL
110 · 1 · 0 · 21 May 2025

Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision
Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, ..., Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu
LRM
120 · 0 · 0 · 21 May 2025

Aligning Explanations with Human Communication
Jacopo Teneggi, Zhenzhen Wang, Paul H. Yi, Tianmin Shu, Jeremias Sulam
175 · 0 · 0 · 21 May 2025

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
162 · 8 · 0 · 21 May 2025

Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models
Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz
AAML, ELM
109 · 0 · 0 · 20 May 2025

YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch
ELM
109 · 0 · 0 · 20 May 2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, Kaiyang Zhou
OffRL, ReLM, AI4TS, VLM, LRM
107 · 0 · 0 · 20 May 2025

Krikri: Advancing Open Large Language Models for Greek
Dimitris Roussis, Leon Voukoutis, Georgios Paraskevopoulos, Sokratis Sofianopoulos, Prokopis Prokopidis, Vassilis Papavasileiou, Athanasios Katsamanis, Stelios Piperidis, Vassilis Katsouros
ALM
89 · 1 · 0 · 19 May 2025

Safety Alignment Can Be Not Superficial With Explicit Safety Signals
Jianwei Li, Jung-Eng Kim
AAML
187 · 1 · 0 · 19 May 2025

Walking the Tightrope: Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning
Xiaoyu Yang, Jie Lu, En Yu
59 · 1 · 0 · 19 May 2025

Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space
Hengli Li, Chenxi Li, Tong Wu, Xuekai Zhu, Yuxuan Wang, ..., Eric Hanchen Jiang, Song-Chun Zhu, Zixia Jia, Ying Nian Wu, Zilong Zheng
LRM
119 · 1 · 0 · 19 May 2025

Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Narek Maloyan, Bislan Ashinov, Dmitry Namiot
AAML, ELM
85 · 0 · 0 · 19 May 2025

The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models
Linghan Huang, Haolin Jin, Zhaoge Bi, Pengyue Yang, Peizhou Zhao, Taozhao Chen, Xiongfei Wu, Lei Ma, Huaming Chen
AAML
64 · 0 · 0 · 18 May 2025

SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang, Jiaxin Song, Yifeng Gao, Xin Wang, Yang Yao, Yan Teng, Xingjun Ma, Yingchun Wang, Yu-Gang Jiang
134 · 0 · 0 · 17 May 2025

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets
Ning Lu, Shengcai Liu, Jiahao Wu, Weiyu Chen, Zhirui Zhang, Yew-Soon Ong, Qi Wang, Ke Tang
106 · 3 · 0 · 17 May 2025

Pairwise Calibrated Rewards for Pluralistic Alignment
Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira
23 · 0 · 0 · 17 May 2025

A Systematic Analysis of Base Model Choice for Reward Modeling
Kian Ahrabian, Pegah Jandaghi, Negar Mokhberian, Sai Praneeth Karimireddy, Jay Pujara
134 · 0 · 0 · 16 May 2025

Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO
Peter Chen, Xiaopeng Li, Zhiyu Li, Xi Chen, Tianyi Lin
85 · 0 · 0 · 16 May 2025

WorldPM: Scaling Human Preference Modeling
Binghai Wang, Runji Lin, Keming Lu, Le Yu, Zizhuo Zhang, ..., Xuanjing Huang, Yu-Gang Jiang, Bowen Yu, Jingren Zhou, Junyang Lin
106 · 1 · 0 · 15 May 2025

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Yidan Wang, Yanan Cao, Yubing Ren, Fang Fang, Zheng Lin, Binxing Fang
PILM
122 · 0 · 0 · 15 May 2025

Demystifying AI Agents: The Final Generation of Intelligence
Kevin J McNamara, Rhea Pritham Marpu
75 · 0 · 0 · 15 May 2025

Atomic Consistency Preference Optimization for Long-Form Question Answering
Jingfeng Chen, Raghuveer Thirukovalluru, Junlin Wang, Kaiwei Luo, Bhuwan Dhingra
KELM, HILM
69 · 0 · 0 · 14 May 2025

Optimized Couplings for Watermarking Large Language Models
Dor Tsur, Carol Xuan Long, C. M. Verdun, Hsiang Hsu, Haim Permuter, Flavio du Pin Calmon
WaLM
90 · 1 · 0 · 13 May 2025