The Capacity for Moral Self-Correction in Large Language Models
arXiv: 2302.07459 · 15 February 2023

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen,
Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li,
Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau,
Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemí Mercado,
Nova DasSarma, Oliver Rausch, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu,
Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham,
Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds,
Benjamin Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom B. Brown, Christopher Olah,
Jack Clark, Sam Bowman, Jared Kaplan

Tags: LRM, ReLM

Papers citing "The Capacity for Moral Self-Correction in Large Language Models" (50 of 115 shown)

Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei · UQCV · 25 Apr 2025

Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
Zhouhao Sun, Xiao Ding, Li Du, Yunpeng Xu, Yixuan Ma, Yang Zhao, Bing Qin, Ting Liu · 17 Apr 2025

Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning
Sanchit Kabra, Akshita Jha, Chandan K. Reddy · LRM · 08 Apr 2025

The LLM Wears Prada: Analysing Gender Bias and Stereotypes through Online Shopping Data
Massimiliano Luca, Ciro Beneduce, Bruno Lepri, Jacopo Staiano · 02 Apr 2025

Multi-head Reward Aggregation Guided by Entropy
Xiaomin Li, Xupeng Chen, Jingxuan Fan, Eric Hanchen Jiang, Mingye Gao · AAML · 26 Mar 2025

Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie · 08 Mar 2025

Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts
Akito Nakanishi, Yukie Sano, Geng Liu, Francesco Pierri · 03 Mar 2025

Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLMs
Hannah Cyberey, Yangfeng Ji, David E. Evans · LLMSV · 27 Feb 2025

A Three-Branch Checks-and-Balances Framework for Context-Aware Ethical Alignment of Large Language Models
Edward Y. Chang · AILaw · 31 Jan 2025

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora · ALM · 20 Jan 2025

Surveying Attitudinal Alignment Between Large Language Models Vs. Humans Towards 17 Sustainable Development Goals
Qingyang Wu, Ying Xu, Tingsong Xiao, Yunze Xiao, Yitong Li, ..., Yichi Zhang, Shanghai Zhong, Yuwei Zhang, Wei Lu, Yifan Yang · 17 Jan 2025

Smaller Large Language Models Can Do Moral Self-Correction
Guangliang Liu, Zhiyu Xue, Rongrong Wang, Kristen Marie Johnson · LRM · 30 Oct 2024

Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore, Yejin Choi, Sydney Levine · 07 Oct 2024

Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback
Sanjiban Choudhury, Paloma Sodhi · LLMAG · 07 Oct 2024

Co-Learning: Code Learning for Multi-Agent Reinforcement Collaborative Framework with Conversational Natural Language Interfaces
Jiapeng Yu, Yuqian Wu, Yajing Zhan, Wenhao Guo, Zhou Xu, Raymond S. T. Lee · LLMAG · 02 Sep 2024

Critique-out-Loud Reward Models
Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D. Chang, Prithviraj Ammanabrolu · ALM, LRM · 21 Aug 2024

Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
Guang-Da Liu, Haitao Mao, Jiliang Tang, K. Johnson · LRM · 21 Jul 2024

LLM Critics Help Catch LLM Bugs
Nat McAleese, Rai Michael Pokorny, Juan Felipe Cerón Uribe, Evgenia Nitishinskaya, Maja Trebacz, Jan Leike · ALM, LRM · 28 Jun 2024

ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang · AI4TS · 28 Jun 2024

Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models
Daniel Lopez-Martinez · MedIm · 24 Jun 2024

Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen · UQCV · 30 May 2024

Expert-Guided Extinction of Toxic Tokens for Debiased Generation
Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li · MU · 29 May 2024

A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang · LRM · 28 May 2024

White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs
Yixin Wan, Kai-Wei Chang · 16 Apr 2024

Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Ruoxi Cheng, Haoxuan Ma, Shuirong Cao, Jiaqi Li, Aihua Pei, Zhiqiang Wang, Pengliang Ji, Haoyu Wang, Jiaqi Huo · AI4CE · 15 Apr 2024

When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
Yanhong Li, Chenghao Yang, Allyson Ettinger · ReLM, LRM, LLMAG · 14 Apr 2024

Frontier AI Ethics: Anticipating and Evaluating the Societal Impacts of Generative Agents
Seth Lazar · SILM · 10 Apr 2024

ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
Simone Tedeschi, Felix Friedrich, P. Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, Bo Li · ELM · 06 Apr 2024

The Impact of Unstated Norms in Bias Analysis of Language Models
Farnaz Kohankhaki, David B. Emerson, Laleh Seyyed-Kalantari, Faiza Khan Khattak · 04 Apr 2024

Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game
Qianqiao Xu, Zhiliang Tian, Hongyan Wu, Zhen Huang, Yiping Song, Feng Liu, Dongsheng Li · LLMAG, AAML · 03 Apr 2024

Reinforcement Learning from Reflective Feedback (RLRF): Aligning and Improving LLMs via Fine-Grained Self-Reflection
Kyungjae Lee, Dasol Hwang, Sunghyun Park, Youngsoo Jang, Moontae Lee · 21 Mar 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin · LRM · 08 Mar 2024

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, Yu-Chuan Su · CLL · 07 Mar 2024

A challenge in A(G)I, cybernetics revived in the Ouroboros Model as one algorithm for all thinking
Knud Thomsen · 07 Mar 2024

On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie · OffRL · 07 Mar 2024

Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
Shitong Duan, Xiaoyuan Yi, Peng Zhang, T. Lu, Xing Xie, Ning Gu · 06 Mar 2024

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu · LLMAG, AAML · 02 Mar 2024

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
Saurabh Srivastava, Annarose M B, Anto P V, Shashank Menon, Ajay Sukumar, Adwaith Samod T, Alan Philipose, Stevin Prince, Sooraj Thomas · ELM, ReLM, LRM · 29 Feb 2024

Do Large Language Models Mirror Cognitive Language Processing?
Yuqi Ren, Renren Jin, Tongxuan Zhang, Deyi Xiong · 28 Feb 2024

Chain-of-Thought Unfaithfulness as Disguised Accuracy
Oliver Bentham, Nathan Stringham, Ana Marasović · LRM, HILM · 22 Feb 2024

Harnessing Large Language Models as Post-hoc Correctors
Zhiqiang Zhong, Kuangyu Zhou, Davide Mottin · 20 Feb 2024

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models
Loka Li, Zhenhao Chen, Guan-Hong Chen, Yixuan Zhang, Yusheng Su, Eric P. Xing, Kun Zhang · LRM · 19 Feb 2024

ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao · LM&MA · 19 Feb 2024

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
Ryan Liu, T. Sumers, Ishita Dasgupta, Thomas L. Griffiths · LLMAG · 11 Feb 2024

Measuring Implicit Bias in Explicitly Unbiased Large Language Models
Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths · 06 Feb 2024

Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting
Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, Timothy Baldwin · LRM · 28 Jan 2024

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi · 23 Jan 2024

JumpCoder: Go Beyond Autoregressive Coder via Online Modification
Mouxiang Chen, Hao Tian, Zhongxi Liu, Xiaoxue Ren, Jianling Sun · SyDa, KELM · 15 Jan 2024

Small Language Model Can Self-correct
Haixia Han, Jiaqing Liang, Jie Shi, Qi He, Yanghua Xiao · LRM, SyDa, ReLM, KELM · 14 Jan 2024

Intention Analysis Makes LLMs A Good Jailbreak Defender
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao · LLMSV · 12 Jan 2024