Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2102.12452
Cited By
Probing Classifiers: Promises, Shortcomings, and Advances
24 February 2021
Yonatan Belinkov
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Probing Classifiers: Promises, Shortcomings, and Advances"
50 / 71 papers shown
Title
Designing and Contextualising Probes for African Languages
Wisdom Aduah
Francois Meyer
74
0
0
15 May 2025
Geospatial Mechanistic Interpretability of Large Language Models
Stef De Sabbata
Stefano Mizzaro
Kevin Roitero
AI4CE
103
0
0
06 May 2025
Decoding Vision Transformers: the Diffusion Steering Lens
Ryota Takatsuki
Sonia Joseph
Ippei Fujisawa
Ryota Kanai
DiffM
83
0
0
18 Apr 2025
Linguistic Interpretability of Transformer-based Language Models: a systematic review
Miguel López-Otal
Jorge Gracia
Jordi Bernad
Carlos Bobed
Lucía Pitarch-Ballesteros
Emma Anglés-Herrero
VLM
94
1
0
09 Apr 2025
Learning on LLM Output Signatures for gray-box Behavior Analysis
Guy Bar-Shalom
Fabrizio Frasca
Derek Lim
Yoav Gelberg
Yftah Ziser
Ran El-Yaniv
Gal Chechik
Haggai Maron
113
0
0
18 Mar 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models
Egor Zverev
Evgenii Kortukov
Alexander Panfilov
Soroush Tabesh
Alexandra Volkova
Sebastian Lapuschkin
Wojciech Samek
Christoph H. Lampert
AAML
104
2
0
13 Mar 2025
Gender Encoding Patterns in Pretrained Language Model Representations
Mahdi Zakizadeh
Mohammad Taher Pilehvar
196
0
0
09 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models
Junsol Kim
James Evans
Aaron Schein
124
6
0
03 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Jonathan Jacobi
Gal Niv
LRM
ReLM
119
0
0
03 Mar 2025
Model Lakes
Koyena Pal
David Bau
Renée J. Miller
142
2
0
24 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
Lefei Zhang
Lijie Hu
Di Wang
LRM
161
4
0
17 Feb 2025
Superpose Singular Features for Model Merging
Haiquan Qiu
You Wu
Quanming Yao
MoMe
139
0
0
15 Feb 2025
Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions
H. Fokkema
T. Erven
Sara Magliacane
114
2
0
10 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models
Ala Nekouvaght Tak
Amin Banayeeanzade
Anahita Bolourani
Mina Kian
Robin Jia
Jonathan Gratch
100
0
0
08 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability
Shuchen Wu
Stephan Alaniz
Eric Schulz
Zeynep Akata
82
0
0
03 Feb 2025
The Geometry of Tokens in Internal Representations of Large Language Models
Karthik Viswanathan
Yuri Gardinazzi
Giada Panerai
Alberto Cazzaniga
Matteo Biagetti
AIFin
134
7
0
17 Jan 2025
GPT or BERT: why not both?
Lucas Georges Gabriel Charpentier
David Samuel
142
5
0
31 Dec 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Peng Kuang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
99
7
0
17 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
90
11
0
07 Nov 2024
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
Tom A. Lamb
Adam Davies
Alasdair Paren
Philip Torr
Francesco Pinto
108
0
0
30 Oct 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Yaniv Nikankin
Anja Reusch
Aaron Mueller
Yonatan Belinkov
AIFin
LRM
99
32
0
28 Oct 2024
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
85
10
0
18 Oct 2024
Inference and Verbalization Functions During In-Context Learning
Junyi Tao
Xiaoyin Chen
Nelson F. Liu
LRM
ReLM
75
1
0
12 Oct 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad
Michael Toker
Zorik Gekhman
Roi Reichart
Idan Szpektor
Hadas Kotek
Yonatan Belinkov
HILM
AIFin
99
43
0
03 Oct 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach
Tong Nie
Junlin He
Yuewen Mei
Guoyang Qin
Guilong Li
Jian Sun
Wei Ma
86
4
0
30 Aug 2024
Understanding Generative AI Content with Embedding Models
Max Vargas
Reilly Cannon
A. Engel
Anand D. Sarwate
Tony Chiang
187
3
0
19 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
149
32
0
02 Jul 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
99
2
0
24 Jun 2024
What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages
Nadav Borenstein
Anej Svete
R. Chan
Josef Valvoda
Franz Nowak
Isabelle Augenstein
Eleanor Chodroff
Ryan Cotterell
72
13
0
06 Jun 2024
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
Huiping Zhuang
Jianwei Wang
Zhengdong Lu
Huiping Zhuang
Haoran Li
Huiping Zhuang
Cen Chen
RALM
KELM
79
8
0
03 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
105
13
0
26 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
133
151
0
28 Mar 2024
Do Large Language Models Mirror Cognitive Language Processing?
Yuqi Ren
Renren Jin
Tongxuan Zhang
Deyi Xiong
91
6
0
28 Feb 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
118
7
0
07 Nov 2023
Similarity of Neural Network Models: A Survey of Functional and Representational Measures
Max Klabunde
Tobias Schumacher
M. Strohmaier
Florian Lemmerich
133
73
0
10 May 2023
What if This Modified That? Syntactic Interventions via Counterfactual Embeddings
Mycal Tucker
Peng Qian
R. Levy
53
39
0
28 May 2021
DirectProbe: Studying Representations without Classifiers
Yichu Zhou
Vivek Srikumar
70
29
0
13 Apr 2021
Low-Complexity Probing via Finding Subnetworks
Steven Cao
Victor Sanh
Alexander M. Rush
43
54
0
08 Apr 2021
Picking BERT's Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis
Michael A. Lepori
R. Thomas McCoy
50
24
0
24 Nov 2020
When Do You Need Billions of Words of Pretraining Data?
Yian Zhang
Alex Warstadt
Haau-Sing Li
Samuel R. Bowman
58
141
0
10 Nov 2020
Pareto Probing: Trading Off Accuracy for Complexity
Tiago Pimentel
Naomi Saphra
Adina Williams
Ryan Cotterell
55
60
0
05 Oct 2020
An information theoretic view on selecting linguistic probes
Zining Zhu
Frank Rudzicz
42
19
0
15 Sep 2020
CausaLM: Causal Model Explanation Through Counterfactual Language Models
Amir Feder
Nadav Oved
Uri Shalit
Roi Reichart
CML
LRM
92
161
0
27 May 2020
A Tale of a Probe and a Parser
Rowan Hall Maudslay
Josef Valvoda
Tiago Pimentel
Adina Williams
Ryan Cotterell
51
55
0
04 May 2020
DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
Qingqing Cao
H. Trivedi
A. Balasubramanian
Niranjan Balasubramanian
66
68
0
02 May 2020
Investigating Transferability in Pretrained Language Models
Alex Tamkin
Trisha Singh
D. Giovanardi
Noah D. Goodman
MILM
62
48
0
30 Apr 2020
Asking without Telling: Exploring Latent Ontologies in Contextual Representations
Julian Michael
Jan A. Botha
Ian Tenney
43
43
0
29 Apr 2020
Analyzing analytical methods: The case of phonology in neural models of spoken language
Grzegorz Chrupała
Bertrand Higy
Afra Alishahi
42
20
0
15 Apr 2020
Information-Theoretic Probing with Minimum Description Length
Elena Voita
Ivan Titov
82
275
0
27 Mar 2020
A Primer in BERTology: What we know about how BERT works
Anna Rogers
Olga Kovaleva
Anna Rumshisky
OffRL
87
1,497
0
27 Feb 2020
1
2
Next