ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2402.17700 · Cited By
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

27 February 2024
Jing Huang
Zhengxuan Wu
Christopher Potts
Mor Geva
Atticus Geiger
arXiv (abs) · PDF · HTML · GitHub (47★)

Papers citing "RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations"

35 / 35 papers shown
Do Language Models Use Their Depth Efficiently?
Róbert Csordás, Christopher D. Manning, Christopher Potts
200 · 2 · 0 · 20 May 2025
Are Sparse Autoencoders Useful for Java Function Bug Detection?
Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso
103 · 0 · 0 · 15 May 2025
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
Jiuding Sun, Jing Huang, Sidharth Baskaran, Karel D'Oosterlinck, Christopher Potts, Michael Sklar, Atticus Geiger
AI4CE
107 · 2 · 0 · 13 Mar 2025
Identifying Sub-networks in Neural Networks via Functionally Similar Representations
Tian Gao, Amit Dhurandhar, Karthikeyan N. Ramamurthy, Dennis L. Wei
84 · 0 · 0 · 21 Oct 2024
Inference and Verbalization Functions During In-Context Learning
Junyi Tao, Xiaoyin Chen, Nelson F. Liu
LRM · ReLM
75 · 1 · 0 · 12 Oct 2024
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison
78 · 5 · 0 · 06 Sep 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao
174 · 33 · 0 · 02 Jul 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
124 · 114 · 0 · 11 Jan 2024
ALMANACS: A Simulatability Benchmark for Language Model Explainability
Edmund Mills, Shiye Su, Stuart J. Russell, Scott Emmons
145 · 9 · 0 · 20 Dec 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks
Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman
78 · 29 · 0 · 26 Oct 2023
How do Language Models Bind Entities in Context?
Jiahai Feng, Jacob Steinhardt
125 · 40 · 0 · 26 Oct 2023
FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzyńska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
40 · 22 · 0 · 07 Sep 2023
Linearity of Relation Decoding in Transformer Language Models
Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau
KELM
82 · 100 · 0 · 17 Aug 2023
Discovering Variable Binding Circuitry with Desiderata
Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau
64 · 15 · 0 · 07 Jul 2023
LEACE: Perfect linear concept erasure in closed form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
KELM · MU
137 · 120 · 0 · 06 Jun 2023
Faithfulness Tests for Natural Language Explanations
Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, J. Simonsen, Isabelle Augenstein
FAtt
114 · 67 · 0 · 29 May 2023
MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, Danqi Chen
KELM
84 · 217 · 0 · 24 May 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
66 · 319 · 0 · 28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
314 · 563 · 0 · 01 Nov 2022
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
Eldar David Abraham, Karel D'Oosterlinck, Amir Feder, Y. Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu
CML
122 · 47 · 0 · 27 May 2022
Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov
KELM
251 · 1,389 · 0 · 10 Feb 2022
Sparse Interventions in Language Models with Differentiable Masking
Nicola De Cao, Leon Schmid, Dieuwke Hupkes, Ivan Titov
70 · 29 · 0 · 13 Dec 2021
Conditional probing: measuring usable information beyond a baseline
John Hewitt, Kawin Ethayarajh, Percy Liang, Christopher D. Manning
72 · 57 · 0 · 19 Sep 2021
Implicit Representations of Meaning in Neural Language Models
Belinda Z. Li, Maxwell Nye, Jacob Andreas
NAI · MILM
67 · 177 · 0 · 01 Jun 2021
An Interpretability Illusion for BERT
Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, Martin Wattenberg
MILM · FAtt
83 · 81 · 0 · 14 Apr 2021
Transformer Feed-Forward Layers Are Key-Value Memories
Mor Geva, R. Schuster, Jonathan Berant, Omer Levy
KELM
182 · 847 · 0 · 29 Dec 2020
Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks
Róbert Csordás, Sjoerd van Steenkiste, Jürgen Schmidhuber
98 · 97 · 0 · 05 Oct 2020
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, Yoav Goldberg
144 · 388 · 0 · 16 Apr 2020
What Does BERT Look At? An Analysis of BERT's Attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning
MILM
235 · 1,605 · 0 · 11 Jun 2019
BERT Rediscovers the Classical NLP Pipeline
Ian Tenney, Dipanjan Das, Ellie Pavlick
MILM · SSeg
145 · 1,482 · 0 · 15 May 2019
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations
Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem
OOD
139 · 1,473 · 0 · 29 Nov 2018
Dissecting Contextual Word Embeddings: Architecture and Representation
Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, Wen-tau Yih
109 · 431 · 0 · 27 Aug 2018
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, Marco Baroni
353 · 897 · 0 · 03 May 2018
Linear Algebraic Structure of Word Senses, with Applications to Polysemy
Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski
95 · 284 · 0 · 14 Jan 2016
Direct and Indirect Effects
Judea Pearl
CML
100 · 2,179 · 0 · 10 Jan 2013