Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods from Neuroscience
arXiv:2408.12664 · 22 August 2024
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, Yinzhu Yang, Wes Gurnee, Ilia Sucholutsky, Yuhan Tang, Rebeca Ianov, George Ogden, Chloe Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
AI4CE
Papers citing "Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods from Neuroscience" (50 of 82 papers shown):
- BeHonest: Benchmarking Honesty in Large Language Models (19 Jun 2024). Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu. Tags: HILM, ALM. Metrics: 100 · 3 · 0.
- Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience (03 Jun 2024). Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig. Metrics: 73 · 6 · 0.
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning (28 Feb 2024). Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty. Tags: LRM. Metrics: 66 · 25 · 0.
- Towards Uncovering How Large Language Model Works: An Explainability Perspective (16 Feb 2024). Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Jundong Li. Metrics: 69 · 11 · 0.
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity (03 Jan 2024). Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea. Metrics: 93 · 117 · 0.
- Efficient Large Language Models: A Survey (06 Dec 2023). Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, ..., Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang. Tags: LM&MA. Metrics: 35 · 130 · 0.
- Show Your Work with Confidence: Confidence Bands for Tuning Curves (16 Nov 2023). Nicholas Lourie, Kyunghyun Cho, He He. Metrics: 28 · 2 · 0.
- How do Language Models Bind Entities in Context? (26 Oct 2023). Jiahai Feng, Jacob Steinhardt. Metrics: 84 · 39 · 0.
- Towards Understanding Sycophancy in Language Models (20 Oct 2023). Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez. Metrics: 284 · 226 · 0.
- Getting aligned on representational alignment (18 Oct 2023). Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, ..., Thomas Unterthiner, Andrew Kyle Lampinen, Klaus-Robert Müller, M. Toneva, Thomas Griffiths. Metrics: 98 · 88 · 0.
- Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks (11 Oct 2023). Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark. Metrics: 69 · 12 · 0.
- Copy Suppression: Comprehensively Understanding an Attention Head (06 Oct 2023). Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda. Tags: MILM. Metrics: 49 · 45 · 0.
- Language Models Represent Space and Time (03 Oct 2023). Wes Gurnee, Max Tegmark. Metrics: 104 · 156 · 0.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (27 Sep 2023). Fred Zhang, Neel Nanda. Tags: LLMSV. Metrics: 169 · 108 · 0.
- Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve (24 Sep 2023). R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas Griffiths. Metrics: 46 · 153 · 0.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (15 Sep 2023). Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey. Tags: MILM. Metrics: 90 · 412 · 0.
- Neurons in Large Language Models: Dead, N-gram, Positional (09 Sep 2023). Elena Voita, Javier Ferrando, Christoforos Nalmpantis. Tags: MILM. Metrics: 119 · 54 · 0.
- Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf (09 Sep 2023). Yuzhuang Xu, Shuo Wang, Peng Li, Ziyue Wang, Xiaolong Wang, Weidong Liu, Yang Liu. Tags: LLMAG. Metrics: 40 · 202 · 0.
- AI Deception: A Survey of Examples, Risks, and Potential Solutions (28 Aug 2023). Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks. Metrics: 60 · 153 · 0.
- Deception Abilities Emerged in Large Language Models (31 Jul 2023). Thilo Hagendorff. Tags: LLMAG. Metrics: 54 · 83 · 0.
- The Hydra Effect: Emergent Self-repair in Language Model Computations (28 Jul 2023). Tom McGrath, Matthew Rahtz, János Kramár, Vladimir Mikulik, Shane Legg. Tags: MILM, LRM. Metrics: 38 · 72 · 0.
- Overthinking the Truth: Understanding how Language Models Process False Demonstrations (18 Jul 2023). Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt. Metrics: 63 · 59 · 0.
- Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (18 Jul 2023). Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, G. Irving, Rohin Shah, Vladimir Mikulik. Metrics: 82 · 113 · 0.
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought (22 Jun 2023). L. Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, J. Tenenbaum. Tags: LRM, AI4CE. Metrics: 32 · 106 · 0.
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (06 Jun 2023). Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. Tags: KELM, HILM. Metrics: 85 · 548 · 0.
- Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors (29 May 2023). Paul S. Scotti, Atmadeep Banerjee, J. Goode, Stepan Shabalin, A. Nguyen, ..., Nathalie Verlinde, Elad Yundler, David Weisberg, K. A. Norman, Tanishq Mathew Abraham. Tags: DiffM. Metrics: 73 · 118 · 0.
- Model evaluation for extreme risks (24 May 2023). Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, ..., Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe. Tags: ELM. Metrics: 73 · 159 · 0.
- How Language Model Hallucinations Can Snowball (22 May 2023). Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith. Tags: HILM, LRM. Metrics: 113 · 274 · 0.
- Scaling laws for language encoding models in fMRI (19 May 2023). Richard Antonello, Aditya R. Vaidya, Alexander G. Huth. Tags: MedIm. Metrics: 62 · 64 · 0.
- Finding Neurons in a Haystack: Case Studies with Sparse Probing (02 May 2023). Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas. Tags: MILM. Metrics: 185 · 211 · 0.
- GPT-4 Technical Report (15 Mar 2023). OpenAI (Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, ..., Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph). Tags: LLMAG, MLLM. Metrics: 1.2K · 14,179 · 0.
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (05 Mar 2023). Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman. Tags: CML. Metrics: 96 · 107 · 0.
- Progress measures for grokking via mechanistic interpretability (12 Jan 2023). Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt. Metrics: 71 · 431 · 0.
- Transformers learn in-context by gradient descent (15 Dec 2022). J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. Tags: MLT. Metrics: 91 · 487 · 0.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (01 Nov 2022). Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt. Metrics: 292 · 549 · 0.
- Omnigrok: Grokking Beyond Algorithmic Data (03 Oct 2022). Ziming Liu, Eric J. Michaud, Max Tegmark. Metrics: 83 · 82 · 0.
- In-context Learning and Induction Heads (24 Sep 2022). Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah. Metrics: 305 · 510 · 0.
- Toy Models of Superposition (21 Sep 2022). Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah. Tags: AAML, MILM. Metrics: 172 · 363 · 0.
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (27 Jul 2022). Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell. Tags: AAML, AI4CE. Metrics: 74 · 132 · 0.
- Single-phase deep learning in cortico-cortical networks (23 Jun 2022). Will Greedy, He Zhu, Joe Pemberton, J. Mellor, Rui Ponte Costa. Metrics: 41 · 37 · 0.
- Towards Understanding Grokking: An Effective Theory of Representation Learning (20 May 2022). Ziming Liu, O. Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams. Tags: AI4CE. Metrics: 72 · 152 · 0.
- When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes (20 Apr 2022). Mycal Tucker, Tiwalayo Eisape, Peng Qian, R. Levy, J. Shah. Tags: MILM. Metrics: 38 · 12 · 0.
- Quantifying Memorization Across Neural Language Models (15 Feb 2022). Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang. Tags: PILM. Metrics: 100 · 614 · 0.
- Locating and Editing Factual Associations in GPT (10 Feb 2022). Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov. Tags: KELM. Metrics: 215 · 1,344 · 0.
- Survey of Hallucination in Natural Language Generation (08 Feb 2022). Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, D. Su, ..., Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung. Tags: HILM, LRM. Metrics: 189 · 2,356 · 0.
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (06 Jan 2022). Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra. Metrics: 73 · 354 · 0.
- An Explanation of In-context Learning as Implicit Bayesian Inference (03 Nov 2021). Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma. Tags: ReLM, BDL, VPVLM, LRM. Metrics: 177 · 746 · 0.
- Causal Abstractions of Neural Networks (06 Jun 2021). Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts. Tags: NAI, CML. Metrics: 66 · 241 · 0.
- Examining the Inductive Bias of Neural Language Models with Artificial Languages (02 Jun 2021). Jennifer C. White, Ryan Cotterell. Metrics: 57 · 44 · 0.
- Are Convolutional Neural Networks or Transformers more like human vision? (15 May 2021). Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas Griffiths. Tags: ViT, FaML. Metrics: 54 · 185 · 0.