Compositional Explanations of Neurons
Jesse Mu, Jacob Andreas
24 June 2020 · arXiv:2006.14032
Tags: FAtt, CoGe, MILM

Papers citing "Compositional Explanations of Neurons" (33 papers):

Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs (07 Apr 2025)
Ling Hu, Yuemei Xu, Xiaoyang Gu, Letao Han

HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks (13 Mar 2025)
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
Tags: AI4CE

LaVCa: LLM-assisted Visual Cortex Captioning (20 Feb 2025)
Takuya Matsuyama, Shinji Nishimoto, Yu Takagi

Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts (14 Oct 2024)
Guorui Zheng, Xidong Wang, Juhao Liang, Nuo Chen, Yuping Zheng, Benyou Wang
Tags: MoE

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models (13 Jun 2024)
Jack Merullo, Carsten Eickhoff, Ellie Pavlick

Graphical Perception of Saliency-based Model Explanations (11 Jun 2024)
Yayan Zhao, Mingwei Li, Matthew Berger
Tags: XAI, FAtt

Linear Explanations for Individual Neurons (10 May 2024)
Tuomas P. Oikarinen, Tsui-Wei Weng
Tags: FAtt, MILM

A Multimodal Automated Interpretability Agent (22 Apr 2024)
Tamar Rott Shaham, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, Antonio Torralba

Black-Box Access is Insufficient for Rigorous AI Audits (25 Jan 2024)
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
Tags: AAML

MAMI: Multi-Attentional Mutual-Information for Long Sequence Neuron Captioning (05 Jan 2024)
Alfirsa Damasyifa Fauzulhaq, Wahyu Parwitayasa, Joseph A. Sugihdharma, M. F. Ridhani, N. Yudistira

A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia (04 Dec 2023)
Giovanni Monea, Maxime Peyrard, Martin Josifoski, Vishrav Chaudhary, Jason Eisner, Emre Kiciman, Hamid Palangi, Barun Patra, Robert West
Tags: KELM

Labeling Neural Representations with Inverse Recognition (22 Nov 2023)
Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, Marina M.-C. Höhne
Tags: BDL

Interpreting Pretrained Language Models via Concept Bottlenecks (08 Nov 2023)
Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, Huan Liu
Tags: LRM

Codebook Features: Sparse and Discrete Interpretability for Neural Networks (26 Oct 2023)
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (27 Sep 2023)
Fred Zhang, Neel Nanda
Tags: LLMSV

Identifying Interpretable Subspaces in Image Representations (20 Jul 2023)
N. Kalibhat, S. Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, S. Feizi
Tags: FAtt

N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models (22 Apr 2023)
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Tags: MILM

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models (10 Jan 2023)
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
Tags: MILM

Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification (21 Nov 2022)
Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, Mark Yatskar

Finding Skill Neurons in Pre-trained Transformer-based Language Models (14 Nov 2022)
Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, Juanzi Li
Tags: MILM, MoE

Global Concept-Based Interpretability for Graph Neural Networks via Neuron Analysis (22 Aug 2022)
Xuanyuan Han, Pietro Barbiero, Dobrik Georgiev, Lucie Charlotte Magister, Pietro Liò
Tags: MILM

Towards Explainable Evaluation Metrics for Natural Language Generation (21 Mar 2022)
Christoph Leiter, Piyawat Lertvittayakumjorn, M. Fomicheva, Wei-Ye Zhao, Yang Gao, Steffen Eger
Tags: AAML, ELM

Interpreting Arabic Transformer Models (19 Jan 2022)
Ahmed Abdelali, Nadir Durrani, Fahim Dalvi, Hassan Sajjad

Can Explanations Be Useful for Calibrating Black Box Models? (14 Oct 2021)
Xi Ye, Greg Durrett
Tags: FAtt

Quantifying Local Specialization in Deep Neural Networks (13 Oct 2021)
Shlomi Hod, Daniel Filan, Stephen Casper, Andrew Critch, Stuart J. Russell

Robust Feature-Level Adversaries are Interpretability Tools (07 Oct 2021)
Stephen Casper, Max Nadeau, Dylan Hadfield-Menell, Gabriel Kreiman
Tags: AAML

Neuron-level Interpretation of Deep NLP Models: A Survey (30 Aug 2021)
Hassan Sajjad, Nadir Durrani, Fahim Dalvi
Tags: MILM, AI4CE

FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging (31 Dec 2020)
Han Guo, Nazneen Rajani, Peter Hase, Mohit Bansal, Caiming Xiong
Tags: TDI

Transformer Feed-Forward Layers Are Key-Value Memories (29 Dec 2020)
Mor Geva, R. Schuster, Jonathan Berant, Omer Levy
Tags: KELM

Discovering the Compositional Structure of Vector Representations with Role Learning Networks (21 Oct 2019)
Paul Soulos, R. Thomas McCoy, Tal Linzen, P. Smolensky
Tags: CoGe

e-SNLI: Natural Language Inference with Natural Language Explanations (04 Dec 2018)
Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom
Tags: LRM

Hypothesis Only Baselines in Natural Language Inference (02 May 2018)
Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme

Generating Natural Language Adversarial Examples (21 Apr 2018)
M. Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, Kai-Wei Chang
Tags: AAML