Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience

22 August 2024
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, Yinzhu Yang, Wes Gurnee, Ilia Sucholutsky, Yuhan Tang, Rebeca Ianov, George Ogden, Chloe Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
AI4CE
arXiv: 2408.12664

Papers citing "Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience"

Showing 50 of 82 citing papers.

BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu
HILM, ALM · 19 Jun 2024

Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig
03 Jun 2024

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty
LRM · 28 Feb 2024

Towards Uncovering How Large Language Model Works: An Explainability Perspective
Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Jundong Li
16 Feb 2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
03 Jan 2024

Efficient Large Language Models: A Survey
Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, ..., Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang
LM&MA · 06 Dec 2023

Show Your Work with Confidence: Confidence Bands for Tuning Curves
Nicholas Lourie, Kyunghyun Cho, He He
16 Nov 2023

How do Language Models Bind Entities in Context?
Jiahai Feng, Jacob Steinhardt
26 Oct 2023

Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023

Getting aligned on representational alignment
Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, ..., Thomas Unterthiner, Andrew Kyle Lampinen, Klaus-Robert Müller, M. Toneva, Thomas Griffiths
18 Oct 2023

Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks
Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark
11 Oct 2023

Copy Suppression: Comprehensively Understanding an Attention Head
Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda
MILM · 06 Oct 2023

Language Models Represent Space and Time
Wes Gurnee, Max Tegmark
03 Oct 2023

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang, Neel Nanda
LLMSV · 27 Sep 2023

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas Griffiths
24 Sep 2023

Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey
MILM · 15 Sep 2023

Neurons in Large Language Models: Dead, N-gram, Positional
Elena Voita, Javier Ferrando, Christoforos Nalmpantis
MILM · 09 Sep 2023

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf
Yuzhuang Xu, Shuo Wang, Peng Li, Ziyue Wang, Xiaolong Wang, Weidong Liu, Yang Liu
LLMAG · 09 Sep 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
28 Aug 2023

Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
LLMAG · 31 Jul 2023

The Hydra Effect: Emergent Self-repair in Language Model Computations
Tom McGrath, Matthew Rahtz, János Kramár, Vladimir Mikulik, Shane Legg
MILM, LRM · 28 Jul 2023

Overthinking the Truth: Understanding how Language Models Process False Demonstrations
Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
18 Jul 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, G. Irving, Rohin Shah, Vladimir Mikulik
18 Jul 2023

From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
L. Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, J. Tenenbaum
LRM, AI4CE · 22 Jun 2023

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
KELM, HILM · 06 Jun 2023

Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
Paul S. Scotti, Atmadeep Banerjee, J. Goode, Stepan Shabalin, A. Nguyen, ..., Nathalie Verlinde, Elad Yundler, David Weisberg, K. A. Norman, Tanishq Mathew Abraham
DiffM · 29 May 2023

Model evaluation for extreme risks
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, ..., Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
ELM · 24 May 2023

How Language Model Hallucinations Can Snowball
Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
HILM, LRM · 22 May 2023

Scaling laws for language encoding models in fMRI
Richard Antonello, Aditya R. Vaidya, Alexander G. Huth
MedIm · 19 May 2023

Finding Neurons in a Haystack: Case Studies with Sparse Probing
Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
MILM · 02 May 2023

GPT-4 Technical Report
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, ..., Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph
LLMAG, MLLM · 15 Mar 2023

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman
CML · 05 Mar 2023

Progress measures for grokking via mechanistic interpretability
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
12 Jan 2023

Transformers learn in-context by gradient descent
J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov
MLT · 15 Dec 2022

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
01 Nov 2022

Omnigrok: Grokking Beyond Algorithmic Data
Ziming Liu, Eric J. Michaud, Max Tegmark
03 Oct 2022

In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah
24 Sep 2022

Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML, MILM · 21 Sep 2022

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell
AAML, AI4CE · 27 Jul 2022

Single-phase deep learning in cortico-cortical networks
Will Greedy, He Zhu, Joe Pemberton, J. Mellor, Rui Ponte Costa
23 Jun 2022

Towards Understanding Grokking: An Effective Theory of Representation Learning
Ziming Liu, O. Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams
AI4CE · 20 May 2022

When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes
Mycal Tucker, Tiwalayo Eisape, Peng Qian, R. Levy, J. Shah
MILM · 20 Apr 2022

Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang
PILM · 15 Feb 2022

Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov
KELM · 10 Feb 2022

Survey of Hallucination in Natural Language Generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, D. Su, ..., Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung
HILM, LRM · 08 Feb 2022

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra
06 Jan 2022

An Explanation of In-context Learning as Implicit Bayesian Inference
Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma
ReLM, BDL, VPVLM, LRM · 03 Nov 2021

Causal Abstractions of Neural Networks
Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts
NAI, CML · 06 Jun 2021

Examining the Inductive Bias of Neural Language Models with Artificial Languages
Jennifer C. White, Ryan Cotterell
02 Jun 2021

Are Convolutional Neural Networks or Transformers more like human vision?
Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas Griffiths
ViT, FaML · 15 May 2021