Probing Classifiers: Promises, Shortcomings, and Advances

24 February 2021

Papers citing "Probing Classifiers: Promises, Shortcomings, and Advances"

50 / 71 papers shown

Designing and Contextualising Probes for African Languages Wisdom Aduah Francois Meyer 74 0 0 15 May 2025
Geospatial Mechanistic Interpretability of Large Language Models Stef De Sabbata Stefano Mizzaro Kevin Roitero AI4CE 103 0 0 06 May 2025
Decoding Vision Transformers: the Diffusion Steering Lens Ryota Takatsuki Sonia Joseph Ippei Fujisawa Ryota Kanai DiffM 83 0 0 18 Apr 2025
Linguistic Interpretability of Transformer-based Language Models: a systematic review Miguel López-Otal Jorge Gracia Jordi Bernad Carlos Bobed Lucía Pitarch-Ballesteros Emma Anglés-Herrero VLM 94 1 0 09 Apr 2025
Learning on LLM Output Signatures for gray-box Behavior Analysis Guy Bar-Shalom Fabrizio Frasca Derek Lim Yoav Gelberg Yftah Ziser Ran El-Yaniv Gal Chechik Haggai Maron 113 0 0 18 Mar 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models Egor Zverev Evgenii Kortukov Alexander Panfilov Soroush Tabesh Alexandra Volkova Sebastian Lapuschkin Wojciech Samek Christoph H. Lampert AAML 104 2 0 13 Mar 2025
Gender Encoding Patterns in Pretrained Language Model Representations Mahdi Zakizadeh Mohammad Taher Pilehvar 196 0 0 09 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models Junsol Kim James Evans Aaron Schein 124 6 0 03 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation Jonathan Jacobi Gal Niv LRM ReLM 119 0 0 03 Mar 2025
Model Lakes Koyena Pal David Bau Renée J. Miller 142 2 0 24 Feb 2025
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning Lefei Zhang Lijie Hu Di Wang LRM 161 4 0 17 Feb 2025
Superpose Singular Features for Model Merging Haiquan Qiu You Wu Quanming Yao MoMe 139 0 0 15 Feb 2025
Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions H. Fokkema T. Erven Sara Magliacane 114 2 0 10 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models Ala Nekouvaght Tak Amin Banayeeanzade Anahita Bolourani Mina Kian Robin Jia Jonathan Gratch 100 0 0 08 Feb 2025
Discovering Chunks in Neural Embeddings for Interpretability Shuchen Wu Stephan Alaniz Eric Schulz Zeynep Akata 82 0 0 03 Feb 2025
The Geometry of Tokens in Internal Representations of Large Language Models Karthik Viswanathan Yuri Gardinazzi Giada Panerai Alberto Cazzaniga Matteo Biagetti AIFin 134 7 0 17 Jan 2025
GPT or BERT: why not both? Lucas Georges Gabriel Charpentier David Samuel 142 5 0 31 Dec 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit Zeqing He Peng Kuang Zhixuan Chu Huiyu Xu Rui Zheng Kui Ren Chun Chen 99 7 0 17 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 90 11 0 07 Nov 2024
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification Tom A. Lamb Adam Davies Alasdair Paren Philip Torr Francesco Pinto 108 0 0 30 Oct 2024
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics Yaniv Nikankin Anja Reusch Aaron Mueller Yonatan Belinkov AIFin LRM 99 32 0 28 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 85 10 0 18 Oct 2024
Inference and Verbalization Functions During In-Context Learning Junyi Tao Xiaoyin Chen Nelson F. Liu LRM ReLM 75 1 0 12 Oct 2024
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations Hadas Orgad Michael Toker Zorik Gekhman Roi Reichart Idan Szpektor Hadas Kotek Yonatan Belinkov HILM AIFin 99 43 0 03 Oct 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach Tong Nie Junlin He Yuewen Mei Guoyang Qin Guilong Li Jian Sun Wei Ma 86 4 0 30 Aug 2024
Understanding Generative AI Content with Embedding Models Max Vargas Reilly Cannon A. Engel Anand D. Sarwate Tony Chiang 187 3 0 19 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 149 32 0 02 Jul 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation Michal Golovanevsky William Rudman Vedant Palit Ritambhara Singh Carsten Eickhoff 99 2 0 24 Jun 2024
What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages Nadav Borenstein Anej Svete R. Chan Josef Valvoda Franz Nowak Isabelle Augenstein Eleanor Chodroff Ryan Cotterell 72 13 0 06 Jun 2024
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration Huiping Zhuang Jianwei Wang Zhengdong Lu Huiping Zhuang Haoran Li Huiping Zhuang Cen Chen RALM KELM 79 8 0 03 Jun 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories Tianlong Wang Xianfeng Jiao Yifan He Zhongzhi Chen Yinghao Zhu Xu Chu Junyi Gao Yasha Wang Liantao Ma LLMSV 105 13 0 26 May 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models Samuel Marks Can Rager Eric J. Michaud Yonatan Belinkov David Bau Aaron Mueller 133 151 0 28 Mar 2024
Do Large Language Models Mirror Cognitive Language Processing? Yuqi Ren Renren Jin Tongxuan Zhang Deyi Xiong 91 6 0 28 Feb 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 118 7 0 07 Nov 2023
Similarity of Neural Network Models: A Survey of Functional and Representational Measures Max Klabunde Tobias Schumacher M. Strohmaier Florian Lemmerich 133 73 0 10 May 2023
What if This Modified That? Syntactic Interventions via Counterfactual Embeddings Mycal Tucker Peng Qian R. Levy 53 39 0 28 May 2021
DirectProbe: Studying Representations without Classifiers Yichu Zhou Vivek Srikumar 70 29 0 13 Apr 2021
Low-Complexity Probing via Finding Subnetworks Steven Cao Victor Sanh Alexander M. Rush 43 54 0 08 Apr 2021
Picking BERT's Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis Michael A. Lepori R. Thomas McCoy 50 24 0 24 Nov 2020
When Do You Need Billions of Words of Pretraining Data? Yian Zhang Alex Warstadt Haau-Sing Li Samuel R. Bowman 58 141 0 10 Nov 2020
Pareto Probing: Trading Off Accuracy for Complexity Tiago Pimentel Naomi Saphra Adina Williams Ryan Cotterell 55 60 0 05 Oct 2020
An information theoretic view on selecting linguistic probes Zining Zhu Frank Rudzicz 42 19 0 15 Sep 2020
CausaLM: Causal Model Explanation Through Counterfactual Language Models Amir Feder Nadav Oved Uri Shalit Roi Reichart CML LRM 92 161 0 27 May 2020
A Tale of a Probe and a Parser Rowan Hall Maudslay Josef Valvoda Tiago Pimentel Adina Williams Ryan Cotterell 51 55 0 04 May 2020
DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering Qingqing Cao H. Trivedi A. Balasubramanian Niranjan Balasubramanian 66 68 0 02 May 2020
Investigating Transferability in Pretrained Language Models Alex Tamkin Trisha Singh D. Giovanardi Noah D. Goodman MILM 62 48 0 30 Apr 2020
Asking without Telling: Exploring Latent Ontologies in Contextual Representations Julian Michael Jan A. Botha Ian Tenney 43 43 0 29 Apr 2020
Analyzing analytical methods: The case of phonology in neural models of spoken language Grzegorz Chrupała Bertrand Higy Afra Alishahi 42 20 0 15 Apr 2020
Information-Theoretic Probing with Minimum Description Length Elena Voita Ivan Titov 82 275 0 27 Mar 2020
A Primer in BERTology: What we know about how BERT works Anna Rogers Olga Kovaleva Anna Rumshisky OffRL 87 1,497 0 27 Feb 2020