ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2212.08073
  4. Cited By
Constitutional AI: Harmlessness from AI Feedback

Constitutional AI: Harmlessness from AI Feedback

15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
    SyDaMoMe
ArXiv (abs)PDFHTML

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

50 / 1,202 papers shown
Title
Purple-teaming LLMs with Adversarial Defender Training
Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou
Kun Li
Junan Li
Jiawen Kang
Minda Hu
Xixin Wu
Helen Meng
AAML
63
1
0
01 Jul 2024
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
Shihan Deng
Weikai Xu
Hongda Sun
Wei Liu
Tao Tan
...
Ang Li
Jian Luan
Bin Wang
Rui Yan
Shuo Shang
LLMAG
96
21
0
01 Jul 2024
Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients
  via Eliciting and Adhering to Principles
Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles
Ryan Louie
Ananjan Nandi
William Fang
Cheng Chang
Emma Brunskill
Diyi Yang
104
44
0
01 Jul 2024
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical
  Reasoning
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Zimu Lu
Aojun Zhou
Ke Wang
Houxing Ren
Weikang Shi
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
100
25
0
30 Jun 2024
LLM Critics Help Catch LLM Bugs
LLM Critics Help Catch LLM Bugs
Nat McAleese
Rai Michael Pokorny
Juan Felipe Cerón Uribe
Evgenia Nitishinskaya
Maja Trebacz
Jan Leike
ALMLRM
83
83
0
28 Jun 2024
ProgressGym: Alignment with a Millennium of Moral Progress
ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu
Yang Zhang
Xuchuan Huang
Jasmine Xinze Li
Yalan Qin
Yaodong Yang
AI4TS
106
7
0
28 Jun 2024
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Sujan Dutta
Sayantan Mahinder
R. Anantha
Bortik Bandyopadhyay
ALM
72
7
0
28 Jun 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILMAAML
103
35
0
28 Jun 2024
Rethinking harmless refusals when fine-tuning foundation models
Rethinking harmless refusals when fine-tuning foundation models
Florin Pop
Judd Rosenblatt
Diogo Schwerz de Lucena
Michael Vaiana
30
0
0
27 Jun 2024
Suri: Multi-constraint Instruction Following for Long-form Text
  Generation
Suri: Multi-constraint Instruction Following for Long-form Text Generation
Chau Minh Pham
Simeng Sun
Mohit Iyyer
ALMLRM
124
23
0
27 Jun 2024
Diminishing Stereotype Bias in Image Generation Model using
  Reinforcemenlent Learning Feedback
Diminishing Stereotype Bias in Image Generation Model using Reinforcemenlent Learning Feedback
Xin Chen
Virgile Foussereau
EGVM
81
0
0
27 Jun 2024
Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Austen Liao
Nicholas Tomlin
Dan Klein
105
1
0
27 Jun 2024
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology
  Report Simplification
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Ziyu Yang
Santhosh Cherian
Slobodan Vucetic
MedIm
96
0
0
27 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences
  to Reduce Harm
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
119
39
0
26 Jun 2024
AI Alignment through Reinforcement Learning from Human Feedback?
  Contradictions and Limitations
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
Adam Dahlgren Lindstrom
Leila Methnani
Lea Krause
Petter Ericson
Ínigo Martínez de Rituerto de Troya
Dimitri Coelho Mollo
Roel Dobbe
ALM
86
2
0
26 Jun 2024
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in
  Retrieval-Augmented Generative Models
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Song Wang
Jundong Li
Tianlong Chen
Huan Liu
SILM
101
25
0
26 Jun 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for
  Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLMLRM
132
3
0
25 Jun 2024
From Decoding to Meta-Generation: Inference-time Algorithms for Large
  Language Models
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck
Amanda Bertsch
Matthew Finlayson
Hailey Schoelkopf
Alex Xie
Graham Neubig
Ilia Kulikov
Zaid Harchaoui
151
77
0
24 Jun 2024
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large
  Language Models via Opposite Prompt Optimization
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Zhengyue Zhao
Xiaoyun Zhang
Kaidi Xu
Xing Hu
Rui Zhang
Zidong Du
Qi Guo
Yunji Chen
71
8
0
24 Jun 2024
LionGuard: Building a Contextualized Moderation Classifier to Tackle
  Localized Unsafe Content
LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo
Shaun Khoo
82
4
0
24 Jun 2024
On the Transformations across Reward Model, Parameter Update, and
  In-Context Prompt
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Deng Cai
Huayang Li
Tingchen Fu
Siheng Li
Weiwen Xu
...
Leyang Cui
Yan Wang
Lemao Liu
Taro Watanabe
Shuming Shi
KELM
78
2
0
24 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li
Yifan Wang
A. Grama
Ruqi Zhang
Ruqi Zhang
AI4TS
143
15
0
24 Jun 2024
INDICT: Code Generation with Internal Dialogues of Critiques for Both
  Security and Helpfulness
INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness
Hung Le
Yingbo Zhou
Caiming Xiong
Silvio Savarese
Doyen Sahoo
123
3
0
23 Jun 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang
Zhuohan Long
Zhihao Fan
Zhongyu Wei
88
12
0
21 Jun 2024
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue
Avishree Khare
Rajeev Alur
Surbhi Goel
Eric Wong
161
3
0
21 Jun 2024
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu
Yi R. Fung
Sha Li
Yixin Wan
Kai-Wei Chang
Heng Ji
95
7
0
20 Jun 2024
Global Human-guided Counterfactual Explanations for Molecular Properties
  via Reinforcement Learning
Global Human-guided Counterfactual Explanations for Molecular Properties via Reinforcement Learning
Danqing Wang
Antonis Antoniades
Kha-Dinh Luong
Edwin Zhang
Mert Kosan
Jiachen Li
Ambuj Singh
William Yang Wang
Lei Li
AI4CE
75
0
0
19 Jun 2024
Supporting Human Raters with the Detection of Harmful Content using
  Large Language Models
Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas
Patrick Gage Kelley
David Tao
Sarah Meiklejohn
Owen Vallis
Shunwen Tan
Blaz Bratanic
Felipe Tiengo Ferreira
Vijay Eranti
Elie Bursztein
82
2
0
18 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large
  Language Models
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
95
7
0
18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
98
8
0
17 Jun 2024
Improving Multi-Agent Debate with Sparse Communication Topology
Improving Multi-Agent Debate with Sparse Communication Topology
Yunxuan Li
Yibing Du
Jiageng Zhang
Le Hou
Peter Grabowski
Yeqing Li
Eugene Ie
LLMAG
98
25
0
17 Jun 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Zhewen Shen
Aditya Joshi
Ruey-Cheng Chen
CLL
92
2
0
17 Jun 2024
A Complete Survey on LLM-based AI Chatbots
A Complete Survey on LLM-based AI Chatbots
Sumit Kumar Dam
Choong Seon Hong
Yu Qiao
Chaoning Zhang
104
62
0
17 Jun 2024
A Survey on Human Preference Learning for Large Language Models
A Survey on Human Preference Learning for Large Language Models
Ruili Jiang
Kehai Chen
Xuefeng Bai
Zhixuan He
Juntao Li
Muyun Yang
Tiejun Zhao
Liqiang Nie
Min Zhang
134
9
0
17 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
101
11
0
17 Jun 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Zhi Gong
Yankai Lin
Ji-Rong Wen
123
15
0
17 Jun 2024
garak: A Framework for Security Probing Large Language Models
garak: A Framework for Security Probing Large Language Models
Leon Derczynski
Erick Galinkin
Jeffrey Martin
Subho Majumdar
Nanna Inie
AAMLELM
95
20
0
16 Jun 2024
Toward Optimal LLM Alignments Using Two-Player Games
Toward Optimal LLM Alignments Using Two-Player Games
Rui Zheng
Hongyi Guo
Zhihan Liu
Xiaoying Zhang
Yuanshun Yao
...
Tao Gui
Qi Zhang
Xuanjing Huang
Hang Li
Yang Liu
116
6
0
16 Jun 2024
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for
  Cartoon Captioning
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
Jifan Zhang
Lalit P. Jain
Yang Guo
Jiayi Chen
Kuan Lok Zhou
...
Scott Sievert
Timothy T. Rogers
Kevin Jamieson
Robert Mankoff
Robert Nowak
105
6
0
15 Jun 2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large
  Language Models
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson E. Denison
M. MacDiarmid
Fazl Barez
David Duvenaud
Shauna Kravec
...
Jared Kaplan
Buck Shlegeris
Samuel R. Bowman
Ethan Perez
Evan Hubinger
132
44
0
14 Jun 2024
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A
  Survey
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Lin Long
Rui Wang
Ruixuan Xiao
Junbo Zhao
Xiao Ding
Gang Chen
Haobo Wang
SyDa
112
126
0
14 Jun 2024
From Text to Life: On the Reciprocal Relationship between Artificial
  Life and Large Language Models
From Text to Life: On the Reciprocal Relationship between Artificial Life and Large Language Models
Eleni Nisioti
Claire Glanois
Elias Najarro
Andrew Dai
Elliot Meyerson
J. Pedersen
Laetitia Teodorescu
Conor F. Hayes
Shyam Sudhakaran
Sebastian Risi
AI4CELM&Ro
103
4
0
14 Jun 2024
Understanding Jailbreak Success: A Study of Latent Space Dynamics in
  Large Language Models
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball
Frauke Kreuter
Nina Rimsky
87
18
0
13 Jun 2024
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning
  in LLMs
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Xuan Zhang
Chao Du
Tianyu Pang
Qian Liu
Wei Gao
Min Lin
LRMAI4CE
101
64
0
13 Jun 2024
Bayesian Statistical Modeling with Predictors from LLMs
Bayesian Statistical Modeling with Predictors from LLMs
Michael Franke
Polina Tsvilodub
Fausto Carcassi
87
6
0
13 Jun 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous
  Preferences
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen
Yi Chen
Aniket Rege
Ramya Korlakai Vinayak
114
23
0
12 Jun 2024
Legend: Leveraging Representation Engineering to Annotate Safety Margin
  for Preference Datasets
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng
Bowen Qin
Chen Huang
Youcheng Huang
Zheng Zhang
Wenqiang Lei
79
3
0
12 Jun 2024
Collective Constitutional AI: Aligning a Language Model with Public
  Input
Collective Constitutional AI: Aligning a Language Model with Public Input
Saffron Huang
Divya Siddarth
Liane Lovitt
Thomas I. Liao
Esin Durmus
Alex Tamkin
Deep Ganguli
ELM
137
83
0
12 Jun 2024
UICoder: Finetuning Large Language Models to Generate User Interface
  Code through Automated Feedback
UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
Jason Wu
E. Schoop
Alan Leung
Titus Barik
Jeffrey P. Bigham
Jeffrey Nichols
68
14
0
11 Jun 2024
Beyond Model Collapse: Scaling Up with Synthesized Data Requires
  Reinforcement
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement
Yunzhen Feng
Elvis Dohmatob
Pu Yang
Francois Charton
Julia Kempe
91
17
0
11 Jun 2024
Previous
123...91011...232425
Next