ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2212.08073
  4. Cited By
Constitutional AI: Harmlessness from AI Feedback

Constitutional AI: Harmlessness from AI Feedback

15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
    SyDa
    MoMe
ArXivPDFHTML

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

50 / 1,107 papers shown
Title
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
  Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Pan Zhang
Xiaoyi Dong
Yuhang Zang
Yuhang Cao
Rui Qian
...
Kai Chen
Jifeng Dai
Yu Qiao
Dahua Lin
Jiaqi Wang
45
100
0
03 Jul 2024
Single Character Perturbations Break LLM Alignment
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
149
2
0
03 Jul 2024
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content
  Moderation of Large Language Models
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy
Pedro M. Esperança
Silviu Vlad Oprea
Mete Ozay
KELM
36
2
0
03 Jul 2024
Purple-teaming LLMs with Adversarial Defender Training
Purple-teaming LLMs with Adversarial Defender Training
Jingyan Zhou
Kun Li
Junan Li
Jiawen Kang
Minda Hu
Xixin Wu
Helen Meng
AAML
36
1
0
01 Jul 2024
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
Shihan Deng
Weikai Xu
Hongda Sun
Wei Liu
Tao Tan
...
Ang Li
Jian Luan
Bin Wang
Rui Yan
Shuo Shang
LLMAG
44
6
0
01 Jul 2024
Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients
  via Eliciting and Adhering to Principles
Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles
Ryan Louie
Ananjan Nandi
William Fang
Cheng Chang
Emma Brunskill
Diyi Yang
50
38
0
01 Jul 2024
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical
  Reasoning
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Zimu Lu
Aojun Zhou
Ke Wang
Houxing Ren
Weikang Shi
Junting Pan
Mingjie Zhan
Hongsheng Li
LRM
42
23
0
30 Jun 2024
LLM Critics Help Catch LLM Bugs
LLM Critics Help Catch LLM Bugs
Nat McAleese
Rai Michael Pokorny
Juan Felipe Cerón Uribe
Evgenia Nitishinskaya
Maja Trebacz
Jan Leike
ALM
LRM
38
60
0
28 Jun 2024
ProgressGym: Alignment with a Millennium of Moral Progress
ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu
Yang Zhang
Xuchuan Huang
Jasmine Xinze Li
Yalan Qin
Yaodong Yang
AI4TS
38
4
0
28 Jun 2024
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Sujan Dutta
Sayantan Mahinder
R. Anantha
Bortik Bandyopadhyay
ALM
39
4
0
28 Jun 2024
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi
Alexander Wei
Eric Wallace
Tony T. Wang
Nika Haghtalab
Jacob Steinhardt
SILM
AAML
38
30
0
28 Jun 2024
Rethinking harmless refusals when fine-tuning foundation models
Rethinking harmless refusals when fine-tuning foundation models
Florin Pop
Judd Rosenblatt
Diogo Schwerz de Lucena
Michael Vaiana
18
0
0
27 Jun 2024
Suri: Multi-constraint Instruction Following for Long-form Text
  Generation
Suri: Multi-constraint Instruction Following for Long-form Text Generation
Chau Minh Pham
Simeng Sun
Mohit Iyyer
ALM
LRM
50
15
0
27 Jun 2024
Diminishing Stereotype Bias in Image Generation Model using
  Reinforcemenlent Learning Feedback
Diminishing Stereotype Bias in Image Generation Model using Reinforcemenlent Learning Feedback
Xin Chen
Virgile Foussereau
EGVM
49
0
0
27 Jun 2024
Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Austen Liao
Nicholas Tomlin
Dan Klein
69
1
0
27 Jun 2024
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology
  Report Simplification
Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Ziyu Yang
Santhosh Cherian
Slobodan Vucetic
MedIm
33
0
0
27 Jun 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences
  to Reduce Harm
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Aakanksha
Arash Ahmadian
Beyza Ermis
Seraphina Goldfarb-Tarrant
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
40
28
0
26 Jun 2024
AI Alignment through Reinforcement Learning from Human Feedback?
  Contradictions and Limitations
AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
Adam Dahlgren Lindstrom
Leila Methnani
Lea Krause
Petter Ericson
Ínigo Martínez de Rituerto de Troya
Dimitri Coelho Mollo
Roel Dobbe
ALM
45
2
0
26 Jun 2024
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in
  Retrieval-Augmented Generative Models
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
Zhen Tan
Chengshuai Zhao
Raha Moraffah
Yifan Li
Song Wang
Jundong Li
Tianlong Chen
Huan Liu
SILM
54
16
0
26 Jun 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for
  Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLM
LRM
39
1
0
25 Jun 2024
From Decoding to Meta-Generation: Inference-time Algorithms for Large
  Language Models
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck
Amanda Bertsch
Matthew Finlayson
Hailey Schoelkopf
Alex Xie
Graham Neubig
Ilia Kulikov
Zaid Harchaoui
33
50
0
24 Jun 2024
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large
  Language Models via Opposite Prompt Optimization
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
Zhengyue Zhao
Xiaoyun Zhang
Kaidi Xu
Xing Hu
Rui Zhang
Zidong Du
Qi Guo
Yunji Chen
27
6
0
24 Jun 2024
LionGuard: Building a Contextualized Moderation Classifier to Tackle
  Localized Unsafe Content
LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
Jessica Foo
Shaun Khoo
38
4
0
24 Jun 2024
On the Transformations across Reward Model, Parameter Update, and
  In-Context Prompt
On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Deng Cai
Huayang Li
Tingchen Fu
Siheng Li
Weiwen Xu
...
Leyang Cui
Yan Wang
Lemao Liu
Taro Watanabe
Shuming Shi
KELM
30
2
0
24 Jun 2024
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li
Yifan Wang
A. Grama
Ruqi Zhang
Ruqi Zhang
AI4TS
49
9
0
24 Jun 2024
INDICT: Code Generation with Internal Dialogues of Critiques for Both
  Security and Helpfulness
INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness
Hung Le
Yingbo Zhou
Caiming Xiong
Silvio Savarese
Doyen Sahoo
52
2
0
23 Jun 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang
Zhuohan Long
Zhihao Fan
Zhongyu Wei
42
6
0
21 Jun 2024
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference
Anton Xue
Avishree Khare
Rajeev Alur
Surbhi Goel
Eric Wong
58
2
0
21 Jun 2024
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu
Yi Ren Fung
Sha Li
Yixin Wan
Kai-Wei Chang
Heng Ji
47
5
0
20 Jun 2024
Global Human-guided Counterfactual Explanations for Molecular Properties
  via Reinforcement Learning
Global Human-guided Counterfactual Explanations for Molecular Properties via Reinforcement Learning
Danqing Wang
Antonis Antoniades
Kha-Dinh Luong
Edwin Zhang
Mert Kosan
Jiachen Li
Ambuj Singh
William Yang Wang
Lei Li
AI4CE
42
0
0
19 Jun 2024
Supporting Human Raters with the Detection of Harmful Content using
  Large Language Models
Supporting Human Raters with the Detection of Harmful Content using Large Language Models
Kurt Thomas
Patrick Gage Kelley
David Tao
Sarah Meiklejohn
Owen Vallis
Shunwen Tan
Blaz Bratanic
Felipe Tiengo Ferreira
Vijay Eranti
Elie Bursztein
46
2
0
18 Jun 2024
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large
  Language Models
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
Somnath Banerjee
Soham Tripathy
Sayan Layek
Shanu Kumar
Animesh Mukherjee
Rima Hazra
33
1
0
18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
44
7
0
17 Jun 2024
Improving Multi-Agent Debate with Sparse Communication Topology
Improving Multi-Agent Debate with Sparse Communication Topology
Yunxuan Li
Yibing Du
Jiageng Zhang
Le Hou
Peter Grabowski
Yeqing Li
Eugene Ie
LLMAG
36
18
0
17 Jun 2024
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Zhewen Shen
Aditya Joshi
Ruey-Cheng Chen
CLL
52
2
0
17 Jun 2024
A Complete Survey on LLM-based AI Chatbots
A Complete Survey on LLM-based AI Chatbots
Sumit Kumar Dam
Choong Seon Hong
Yu Qiao
Chaoning Zhang
59
51
0
17 Jun 2024
A Survey on Human Preference Learning for Large Language Models
A Survey on Human Preference Learning for Large Language Models
Ruili Jiang
Kehai Chen
Xuefeng Bai
Zhixuan He
Juntao Li
Muyun Yang
Tiejun Zhao
Liqiang Nie
Min Zhang
49
8
0
17 Jun 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Wenkai Yang
Shiqi Shen
Guangyao Shen
Zhi Gong
Yankai Lin
Zhi Gong
Yankai Lin
Ji-Rong Wen
58
13
0
17 Jun 2024
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
Lingrui Mei
Shenghua Liu
Yiwei Wang
Baolong Bi
Jiayi Mao
Xueqi Cheng
AAML
47
9
0
17 Jun 2024
garak: A Framework for Security Probing Large Language Models
garak: A Framework for Security Probing Large Language Models
Leon Derczynski
Erick Galinkin
Jeffrey Martin
Subho Majumdar
Nanna Inie
AAML
ELM
38
16
0
16 Jun 2024
Toward Optimal LLM Alignments Using Two-Player Games
Toward Optimal LLM Alignments Using Two-Player Games
Rui Zheng
Hongyi Guo
Zhihan Liu
Xiaoying Zhang
Yuanshun Yao
...
Tao Gui
Qi Zhang
Xuanjing Huang
Hang Li
Yang Liu
60
5
0
16 Jun 2024
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for
  Cartoon Captioning
Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
Jifan Zhang
Lalit P. Jain
Yang Guo
Jiayi Chen
Kuan Lok Zhou
...
Scott Sievert
Timothy T. Rogers
Kevin Jamieson
Robert Mankoff
Robert Nowak
39
5
0
15 Jun 2024
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large
  Language Models
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson E. Denison
M. MacDiarmid
Fazl Barez
David Duvenaud
Shauna Kravec
...
Jared Kaplan
Buck Shlegeris
Samuel R. Bowman
Ethan Perez
Evan Hubinger
56
36
0
14 Jun 2024
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A
  Survey
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
Lin Long
Rui Wang
Ruixuan Xiao
Junbo Zhao
Xiao Ding
Gang Chen
Haobo Wang
SyDa
59
93
0
14 Jun 2024
From Text to Life: On the Reciprocal Relationship between Artificial
  Life and Large Language Models
From Text to Life: On the Reciprocal Relationship between Artificial Life and Large Language Models
Eleni Nisioti
Claire Glanois
Elias Najarro
Andrew Dai
Elliot Meyerson
J. Pedersen
Laetitia Teodorescu
Conor F. Hayes
Shyam Sudhakaran
Sebastian Risi
AI4CE
LM&Ro
51
3
0
14 Jun 2024
Understanding Jailbreak Success: A Study of Latent Space Dynamics in
  Large Language Models
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball
Frauke Kreuter
Nina Rimsky
40
13
0
13 Jun 2024
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning
  in LLMs
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Xuan Zhang
Chao Du
Tianyu Pang
Qian Liu
Wei Gao
Min-Bin Lin
LRM
AI4CE
44
34
0
13 Jun 2024
Bayesian Statistical Modeling with Predictors from LLMs
Bayesian Statistical Modeling with Predictors from LLMs
Michael Franke
Polina Tsvilodub
Fausto Carcassi
45
4
0
13 Jun 2024
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous
  Preferences
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen
Yi Chen
Aniket Rege
Ramya Korlakai Vinayak
46
17
0
12 Jun 2024
Legend: Leveraging Representation Engineering to Annotate Safety Margin
  for Preference Datasets
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng
Bowen Qin
Chen Huang
Youcheng Huang
Zheng-Wei Zhang
Wenqiang Lei
44
2
0
12 Jun 2024
Previous
123...789...212223
Next