Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis

21 July 2024

Papers citing "Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis"

8 / 8 papers shown

Title
Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs Angelina Wang Michelle Phan Daniel E. Ho Sanmi Koyejo 49 2 0 04 Feb 2025
Smaller Large Language Models Can Do Moral Self-Correction Guangliang Liu Zhiyu Xue Rongrong Wang K. Johnson Kristen Marie Johnson LRM 32 0 0 30 Oct 2024
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity Andrew Lee Xiaoyan Bai Itamar Pres Martin Wattenberg Jonathan K. Kummerfeld Rada Mihalcea 74 96 0 03 Jan 2024
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems Kaya Stechly Matthew Marquez Subbarao Kambhampati LRM 165 84 0 19 Oct 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 191 261 0 28 Apr 2023
BBQ: A Hand-Built Bias Benchmark for Question Answering Alicia Parrish Angelica Chen Nikita Nangia Vishakh Padmakumar Jason Phang Jana Thompson Phu Mon Htut Sam Bowman 217 367 0 15 Oct 2021
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP Timo Schick Sahana Udupa Hinrich Schütze 259 374 0 28 Feb 2021
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 226 405 0 24 Feb 2021