Refusal in Language Models Is Mediated by a Single Direction

v1v2 (latest)

Refusal in Language Models Is Mediated by a Single Direction

17 June 2024

Nina Panickssery

ArXiv (abs)PDF HTML

Papers citing "Refusal in Language Models Is Mediated by a Single Direction"

5 / 55 papers shown

Title
LoRA: Low-Rank Adaptation of Large Language Models J. E. Hu Yelong Shen Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang Weizhu Chen OffRL AI4TS AI4CE ALM AIMat 474 10,367 0 17 Jun 2021
Implicit Representations of Meaning in Neural Language Models Belinda Z. Li Maxwell Nye Jacob Andreas NAI MILM 60 176 0 01 Jun 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao Stella Biderman Sid Black Laurence Golding Travis Hoppe ... Horace He Anish Thite Noa Nabeshima Shawn Presser Connor Leahy AIMat 450 2,096 0 31 Dec 2020
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection Shauli Ravfogel Yanai Elazar Hila Gonen Michael Twiton Yoav Goldberg 138 387 0 16 Apr 2020
PyTorch: An Imperative Style, High-Performance Deep Learning Library Adam Paszke Sam Gross Francisco Massa Adam Lerer James Bradbury ... Sasank Chilamkurthy Benoit Steiner Lu Fang Junjie Bai Soumith Chintala ODL 514 42,449 0 03 Dec 2019