Constitutional AI: Harmlessness from AI Feedback
arXiv:2212.08073
15 December 2022
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
Andy Jones
A. Chen
Anna Goldie
Azalia Mirhoseini
C. McKinnon
Carol Chen
Catherine Olsson
C. Olah
Danny Hernandez
Dawn Drain
Deep Ganguli
Dustin Li
Eli Tran-Johnson
E. Perez
Jamie Kerr
J. Mueller
Jeff Ladish
J. Landau
Kamal Ndousse
Kamilė Lukošiūtė
Liane Lovitt
Michael Sellitto
Nelson Elhage
Nicholas Schiefer
Noemí Mercado
Nova Dassarma
R. Lasenby
Robin Larson
Sam Ringer
Scott R. Johnston
Shauna Kravec
S. E. Showk
Stanislav Fort
Tamera Lanham
Timothy Telleen-Lawton
Tom Conerly
T. Henighan
Tristan Hume
Sam Bowman
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
    SyDa
    MoMe

Papers citing "Constitutional AI: Harmlessness from AI Feedback"

Showing 50 of 1,122 citing papers.
Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao
Xianjun Yang
Tianyu Pang
Chao Du
Lei Li
Yu-Xiang Wang
William Y. Wang
34
54
0
30 Jan 2024
MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds
Xiaolong Jin
Zhuo Zhang
Xiangyu Zhang
18
3
0
25 Jan 2024
TPD: Enhancing Student Language Model Reasoning via Principle Discovery and Guidance
Haorui Wang
Rongzhi Zhang
Yinghao Li
Lingkai Kong
Yuchen Zhuang
Xiusi Chen
Chao Zhang
LRM
43
5
0
24 Jan 2024
ULTRA: Unleash LLMs' Potential for Event Argument Extraction through Hierarchical Modeling and Pair-wise Self-Refinement
Xinliang Frederick Zhang
Carter Blum
Temma Choji
Shalin S Shah
Alakananda Vempala
63
12
0
24 Jan 2024
Visibility into AI Agents
Alan Chan
Carson Ezell
Max Kaufmann
K. Wei
Lewis Hammond
...
Nitarshan Rajkumar
David M. Krueger
Noam Kolt
Lennart Heim
Markus Anderljung
20
32
0
23 Jan 2024
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen
Weiting Tan
Sihao Chen
Yunmo Chen
Jingyu Zhang
Haoran Xu
Boyuan Zheng
Philipp Koehn
Daniel Khashabi
34
38
0
23 Jan 2024
DsDm: Model-Aware Dataset Selection with Datamodels
Logan Engstrom
Axel Feldmann
A. Madry
OODD
25
49
0
23 Jan 2024
GRATH: Gradual Self-Truthifying for Large Language Models
Weixin Chen
D. Song
Bo-wen Li
HILM
SyDa
33
5
0
22 Jan 2024
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Ramé
Nino Vieillard
Léonard Hussenot
Robert Dadashi
Geoffrey Cideron
Olivier Bachem
Johan Ferret
123
94
0
22 Jan 2024
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Songyang Gao
Qiming Ge
Wei Shen
Shihan Dou
Junjie Ye
...
Yicheng Zou
Zhi Chen
Hang Yan
Qi Zhang
Dahua Lin
57
11
0
21 Jan 2024
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Adib Hasan
Ileana Rugina
Alex Wang
AAML
54
22
0
19 Jan 2024
Self-Rewarding Language Models
Weizhe Yuan
Richard Yuanzhe Pang
Kyunghyun Cho
Xian Li
Sainbayar Sukhbaatar
Jing Xu
Jason Weston
ReLM
SyDa
ALM
LRM
244
301
0
18 Jan 2024
Aligning Large Language Models with Counterfactual DPO
Bradley Butcher
ALM
28
1
0
17 Jan 2024
Canvil: Designerly Adaptation for LLM-Powered User Experiences
K. J. Kevin Feng
Q. V. Liao
Ziang Xiao
Jennifer Wortman Vaughan
Amy X. Zhang
David W. McDonald
59
17
0
17 Jan 2024
Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability
Afra Feyza Akyürek
Ekin Akyürek
Leshem Choshen
Derry Wijaya
Jacob Andreas
HILM
SyDa
54
16
0
16 Jan 2024
Interrogating AI: Characterizing Emergent Playful Interactions with ChatGPT
Mohammad Ronagh Nikghalb
Jinghui Cheng
33
1
0
16 Jan 2024
RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning
Junjie Ye
Yilong Wu
Songyang Gao
Caishuang Huang
Sixian Li
Guanyu Li
Xiaoran Fan
Qi Zhang
Tao Gui
Xuanjing Huang
AAML
35
16
0
16 Jan 2024
Small Language Model Can Self-correct
Haixia Han
Jiaqing Liang
Jie Shi
Qi He
Yanghua Xiao
LRM
SyDa
ReLM
KELM
40
11
0
14 Jan 2024
Generative Ghosts: Anticipating Benefits and Risks of AI Afterlives
Meredith Ringel Morris
Jed R. Brubaker
42
10
0
14 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng
Hongpeng Lin
Jingwen Zhang
Diyi Yang
Ruoxi Jia
Weiyan Shi
18
257
0
12 Jan 2024
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Tianyu Cui
Yanling Wang
Chuanpu Fu
Yong Xiao
Sijia Li
...
Junwu Xiong
Xinyu Kong
Zujie Wen
Ke Xu
Qi Li
63
57
0
11 Jan 2024
Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback
Chengfeng Dou
Zhi Jin
Wenpin Jiao
Haiyan Zhao
Yongqiang Zhao
Zhenwei Tao
LM&MA
82
5
0
11 Jan 2024
Towards Conversational Diagnostic AI
Tao Tu
Anil Palepu
M. Schaekermann
Khaled Saab
Jan Freyberg
...
Katherine Chou
Greg S. Corrado
Yossi Matias
Alan Karthikesalingam
Vivek Natarajan
AI4MH
LM&MA
34
93
0
11 Jan 2024
Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
Dennis Ulmer
Elman Mansimov
Kaixiang Lin
Justin Sun
Xibin Gao
Yi Zhang
LLMAG
35
27
0
10 Jan 2024
Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs
Harvey Lederman
Kyle Mahowald
21
10
0
10 Jan 2024
Agent Alignment in Evolving Social Norms
Shimin Li
Tianxiang Sun
Qinyuan Cheng
Xipeng Qiu
LLMAG
43
8
0
09 Jan 2024
Evaluating Language Model Agency through Negotiations
Tim R. Davidson
V. Veselovsky
Martin Josifoski
Maxime Peyrard
Antoine Bosselut
Michal Kosinski
Robert West
LLMAG
36
22
0
09 Jan 2024
The Critique of Critique
Shichao Sun
Junlong Li
Weizhe Yuan
Ruifeng Yuan
Wenjie Li
Pengfei Liu
ELM
42
0
0
09 Jan 2024
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Gokul Swamy
Christoph Dann
Rahul Kidambi
Zhiwei Steven Wu
Alekh Agarwal
OffRL
43
96
0
08 Jan 2024
Human-Instruction-Free LLM Self-Alignment with Limited Samples
Hongyi Guo
Yuanshun Yao
Wei Shen
Jiaheng Wei
Xiaoying Zhang
Zhaoran Wang
Yang Liu
95
21
0
06 Jan 2024
AccidentGPT: Large Multi-Modal Foundation Model for Traffic Accident Analysis
Kebin Wu
Wenbin Li
Xiaofei Xiao
21
3
0
05 Jan 2024
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
Renjie Pi
Tianyang Han
Jianshu Zhang
Yueqi Xie
Rui Pan
Qing Lian
Hanze Dong
Jipeng Zhang
Tong Zhang
AAML
31
60
0
05 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
27
9
0
03 Jan 2024
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Zixiang Chen
Yihe Deng
Huizhuo Yuan
Kaixuan Ji
Quanquan Gu
SyDa
46
277
0
02 Jan 2024
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
Neeraj Varshney
Pavel Dolin
Agastya Seth
Chitta Baral
AAML
ELM
25
47
0
30 Dec 2023
Preference as Reward, Maximum Preference Optimization with Importance Sampling
Zaifan Jiang
Xing Huang
Chao Wei
36
2
0
27 Dec 2023
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
Chen Zhang
L. F. D’Haro
Yiming Chen
Malu Zhang
Haizhou Li
ELM
21
29
0
24 Dec 2023
Reasons to Reject? Aligning Language Models with Judgments
Weiwen Xu
Deng Cai
Zhisong Zhang
Wai Lam
Shuming Shi
ALM
21
14
0
22 Dec 2023
Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
Desai Xie
Jiahao Li
Hao Tan
Xin Sun
Zhixin Shu
Yi Zhou
Sai Bi
Soren Pirk
Arie E. Kaufman
37
8
0
21 Dec 2023
Speech Translation with Large Language Models: An Industrial Practice
Zhichao Huang
Rong Ye
Tom Ko
Qianqian Dong
Shanbo Cheng
Mingxuan Wang
Hang Li
70
15
0
21 Dec 2023
Learning and Forgetting Unsafe Examples in Large Language Models
Jiachen Zhao
Zhun Deng
David Madras
James Zou
Mengye Ren
MU
KELM
CLL
94
16
0
20 Dec 2023
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
Wei Xiong
Hanze Dong
Chen Ye
Ziqi Wang
Han Zhong
Heng Ji
Nan Jiang
Tong Zhang
OffRL
38
164
0
18 Dec 2023
Retrieval-Augmented Generation for Large Language Models: A Survey
Yunfan Gao
Yun Xiong
Xinyu Gao
Kangxiang Jia
Jinliu Pan
Yuxi Bi
Yi Dai
Jiawei Sun
Meng Wang
Haofen Wang
3DV
RALM
64
1,552
1
18 Dec 2023
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
Zhi Gao
Yuntao Du
Xintong Zhang
Xiaojian Ma
Wenjuan Han
Song-Chun Zhu
Qing Li
LLMAG
VLM
33
22
0
18 Dec 2023
A Survey of Reasoning with Foundation Models
Jiankai Sun
Chuanyang Zheng
Enze Xie
Zhengying Liu
Ruihang Chu
...
Xipeng Qiu
Yi-Chen Guo
Hui Xiong
Qun Liu
Zhenguo Li
ReLM
LRM
AI4CE
30
77
0
17 Dec 2023
Silkie: Preference Distillation for Large Visual Language Models
Lei Li
Zhihui Xie
Mukai Li
Shunian Chen
Peiyi Wang
Liang Chen
Yazheng Yang
Benyou Wang
Lingpeng Kong
MLLM
117
69
0
17 Dec 2023
Distilling Large Language Models for Matching Patients to Clinical Trials
Mauro Nievas
Aditya Basu
Yanshan Wang
Hrituraj Singh
ELM
LM&MA
28
30
0
15 Dec 2023
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns
Pavel Izmailov
Jan Hendrik Kirchner
Bowen Baker
Leo Gao
...
Adrien Ecoffet
Manas Joglekar
Jan Leike
Ilya Sutskever
Jeff Wu
ELM
50
262
0
14 Dec 2023
Self-Evaluation Improves Selective Generation in Large Language Models
Jie Jessie Ren
Yao-Min Zhao
Tu Vu
Peter J. Liu
Balaji Lakshminarayanan
ELM
36
34
0
14 Dec 2023
Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
Anand Siththaranjan
Cassidy Laidlaw
Dylan Hadfield-Menell
34
58
0
13 Dec 2023