Exploring and steering the moral compass of Large Language Models

27 May 2024

Papers citing "Exploring and steering the moral compass of Large Language Models"


Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan N. Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, Amit Dhurandhar
LLMSV, 130 20 0, 06 Sep 2024

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang, Neel Nanda
LLMSV, 85 104 0, 27 Sep 2023

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
296 494 0, 24 Sep 2022