ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2302.07459
31
158

The Capacity for Moral Self-Correction in Large Language Models

15 February 2023
Deep Ganguli
Amanda Askell
Nicholas Schiefer
Thomas I. Liao
Kamil.e Lukovsiut.e
Anna Chen
Anna Goldie
Azalia Mirhoseini
Catherine Olsson
Danny Hernandez
Dawn Drain
Dustin Li
Eli Tran-Johnson
Ethan Perez
John Kernion
Jamie Kerr
J. Mueller
J. Landau
Kamal Ndousse
Karina Nguyen
Liane Lovitt
Michael Sellitto
Nelson Elhage
Noemí Mercado
Nova Dassarma
Oliver Rausch
R. Lasenby
Robin Larson
Sam Ringer
Sandipan Kundu
Saurav Kadavath
Scott Johnston
Shauna Kravec
S. E. Showk
Tamera Lanham
Timothy Telleen-Lawton
T. Henighan
Tristan Hume
Yuntao Bai
Zac Hatfield-Dodds
Benjamin Mann
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
C. Olah
Jack Clark
Sam Bowman
Jared Kaplan
    LRM
    ReLM
ArXivPDFHTML
Abstract

We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

View on arXiv
Comments on this paper