Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2302.12461
Cited By
Analyzing And Editing Inner Mechanisms Of Backdoored Language Models
24 February 2023
Max Lamparth
Anka Reuel
KELM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Analyzing And Editing Inner Mechanisms Of Backdoored Language Models"
5 / 5 papers shown
Title
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
Zongru Wu
Zhuosheng Zhang
Pengzhou Cheng
Gongshen Liu
AAML
44
4
0
19 Feb 2024
Poisoning Language Models During Instruction Tuning
Alexander Wan
Eric Wallace
Sheng Shen
Dan Klein
SILM
92
124
0
01 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
494
0
01 Nov 2022
In-context Learning and Induction Heads
Catherine Olsson
Nelson Elhage
Neel Nanda
Nicholas Joseph
Nova Dassarma
...
Tom B. Brown
Jack Clark
Jared Kaplan
Sam McCandlish
C. Olah
250
458
0
24 Sep 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
186
273
0
28 Sep 2021
1