Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.15286
Cited By
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
21 July 2024
Guang-Da Liu
Haitao Mao
Jiliang Tang
K. Johnson
LRM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis"
8 / 8 papers shown
Title
Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Angelina Wang
Michelle Phan
Daniel E. Ho
Sanmi Koyejo
49
2
0
04 Feb 2025
Smaller Large Language Models Can Do Moral Self-Correction
Guangliang Liu
Zhiyu Xue
Rongrong Wang
K. Johnson
Kristen Marie Johnson
LRM
32
0
0
30 Oct 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee
Xiaoyan Bai
Itamar Pres
Martin Wattenberg
Jonathan K. Kummerfeld
Rada Mihalcea
74
96
0
03 Jan 2024
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
Kaya Stechly
Matthew Marquez
Subbarao Kambhampati
LRM
165
84
0
19 Oct 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
191
261
0
28 Apr 2023
BBQ: A Hand-Built Bias Benchmark for Question Answering
Alicia Parrish
Angelica Chen
Nikita Nangia
Vishakh Padmakumar
Jason Phang
Jana Thompson
Phu Mon Htut
Sam Bowman
217
367
0
15 Oct 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick
Sahana Udupa
Hinrich Schütze
259
374
0
28 Feb 2021
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
226
405
0
24 Feb 2021
1