2212.11415
Cited By
Circumventing interpretability: How to defeat mind-readers
21 December 2022
Lee D. Sharkey
Papers citing "Circumventing interpretability: How to defeat mind-readers"
12 / 12 papers shown
Title
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
181
366
0
21 Sep 2022
Adversarially trained neural representations may already be as robust as corresponding biological neural representations
Chong Guo
Michael J. Lee
Guillaume Leclerc
Joel Dapello
Yug Rao
Aleksander Madry
J. DiCarlo
GAN
AAML
32
13
0
19 Jun 2022
Planting Undetectable Backdoors in Machine Learning Models
S. Goldwasser
Michael P. Kim
Vinod Vaikuntanathan
Or Zamir
AAML
45
71
0
14 Apr 2022
Trojan Horse Training for Breaking Defenses against Backdoor Attacks in Deep Learning
Arezoo Rajabi
Bhaskar Ramasubramanian
Radha Poovendran
AAML
102
5
0
25 Mar 2022
An Interpretability Illusion for BERT
Tolga Bolukbasi
Adam Pearce
Ann Yuan
Andy Coenen
Emily Reif
Fernanda Viégas
Martin Wattenberg
MILM
FAtt
67
79
0
14 Apr 2021
An overview of 11 proposals for building safe advanced AI
Evan Hubinger
AAML
51
23
0
04 Dec 2020
Shortcut Learning in Deep Neural Networks
Robert Geirhos
J. Jacobsen
Claudio Michaelis
R. Zemel
Wieland Brendel
Matthias Bethge
Felix Wichmann
201
2,048
0
16 Apr 2020
Weight Poisoning Attacks on Pre-trained Models
Keita Kurita
Paul Michel
Graham Neubig
AAML
SILM
134
451
0
14 Apr 2020
Interval timing in deep reinforcement learning agents
B. Deverett
Ryan Faulkner
Meire Fortunato
Greg Wayne
Joel Z Leibo
39
14
0
31 May 2019
Universal Transformers
Mostafa Dehghani
Stephan Gouws
Oriol Vinyals
Jakob Uszkoreit
Lukasz Kaiser
83
752
0
10 Jul 2018
Building machines that adapt and compute like brains
Brenden M. Lake
J. Tenenbaum
AI4CE
FedML
NAI
AILaw
320
887
0
11 Nov 2017
Concrete Problems in AI Safety
Dario Amodei
C. Olah
Jacob Steinhardt
Paul Christiano
John Schulman
Dandelion Mané
220
2,384
0
21 Jun 2016