Understanding intermediate layers using linear classifier probes

5 October 2016

Papers citing "Understanding intermediate layers using linear classifier probes"

50 / 187 papers shown

Designing a Dashboard for Transparency and Control of Conversational AI Yida Chen Aoyu Wu Trevor DePodesta Catherine Yeh Kenneth Li ... Jan Riecke Shivam Raval Olivia Seow Martin Wattenberg Fernanda Viégas 44 16 0 12 Jun 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 42 7 0 31 May 2024
On Fairness of Low-Rank Adaptation of Large Models Zhoujie Ding Ken Ziyu Liu Pura Peetathawatchai Berivan Isik Sanmi Koyejo 48 4 0 27 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories Tianlong Wang Xianfeng Jiao Yifan He Zhongzhi Chen Yinghao Zhu Xu Chu Junyi Gao Yasha Wang Liantao Ma LLMSV 68 7 0 26 May 2024
A Multi-Perspective Analysis of Memorization in Large Language Models Bowen Chen Namgi Han Yusuke Miyao 46 1 0 19 May 2024
Linear Explanations for Individual Neurons Tuomas P. Oikarinen Tsui-Wei Weng FAtt MILM 31 6 0 10 May 2024
A separability-based approach to quantifying generalization: which layer is best? Luciano Dyballa Evan Gerritz Steven W. Zucker OOD 37 3 0 02 May 2024
Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition H. Ghaffari Paul Devos 45 0 0 26 Apr 2024
Does Transformer Interpretability Transfer to RNNs? Gonçalo Paulo Thomas Marshall Nora Belrose 63 6 0 09 Apr 2024
Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain Jungwon Choi Hyungi Lee Byung-Hoon Kim Juho Lee 80 0 0 11 Mar 2024
Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations GuanWen Qiu Da Kuang Surbhi Goel 27 8 0 05 Mar 2024
Language Models Represent Beliefs of Self and Others Wentao Zhu Zhining Zhang Yizhou Wang MILM LRM 50 8 0 28 Feb 2024
Descriptive Kernel Convolution Network with Improved Random Walk Kernel Meng-Chieh Lee Lingxiao Zhao L. Akoglu 23 3 0 08 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits Stephen Casper Carson Ezell Charlotte Siegmann Noam Kolt Taylor Lynn Curtis ... Michael Gerovitch David Bau Max Tegmark David M. Krueger Dylan Hadfield-Menell AAML 34 78 0 25 Jan 2024
Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable? Sonia Laguna Ricards Marcinkevics Moritz Vandenhirtz Julia E. Vogt 35 17 0 24 Jan 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 34 87 0 11 Jan 2024
Enhancing Contrastive Learning with Efficient Combinatorial Positive Pairing Jaeill Kim Duhun Hwang Eunjung Lee Jangwon Suh Jimyeong Kim Wonjong Rhee 33 0 0 11 Jan 2024
FlexModel: A Framework for Interpretability of Distributed Large Language Models Matthew Choi Muhammad Adil Asif John Willes David Emerson AI4CE ALM 27 1 0 05 Dec 2023
Revisiting Topic-Guided Language Models Carolina Zheng Keyon Vafa David M. Blei BDL 29 1 0 04 Dec 2023
Identifying Spurious Correlations using Counterfactual Alignment Joseph Paul Cohen Louis Blankemeier Akshay S. Chaudhari CML 55 1 0 01 Dec 2023
Looped Transformers are Better at Learning Learning Algorithms Liu Yang Kangwook Lee Robert D. Nowak Dimitris Papailiopoulos 24 24 0 21 Nov 2023
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots Ruixiang Tang Jiayi Yuan Yiming Li Zirui Liu Rui Chen Xia Hu AAML 36 13 0 28 Oct 2023
Codebook Features: Sparse and Discrete Interpretability for Neural Networks Alex Tamkin Mohammad Taufeeque Noah D. Goodman 35 27 0 26 Oct 2023
Reset It and Forget It: Relearning Last-Layer Weights Improves Continual and Transfer Learning Lapo Frati Neil Traft Jeff Clune Nick Cheney CLL 27 0 0 12 Oct 2023
Language Models Represent Space and Time Wes Gurnee Max Tegmark 47 142 0 03 Oct 2023
Uncovering the Hidden Cost of Model Compression Diganta Misra Muawiz Chaudhary Agam Goyal Bharat Runwal Pin-Yu Chen VLM 36 0 0 29 Aug 2023
Causal Intersectionality and Dual Form of Gradient Descent for Multimodal Analysis: a Case Study on Hateful Memes Yosuke Miyanishi M. Nguyen 34 2 0 19 Aug 2023
Concept backpropagation: An Explainable AI approach for visualising learned concepts in neural network models Patrik Hammersborg Inga Strümke FAtt 26 0 0 24 Jul 2023
Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis Andrew Hryniowski Alexander Wong 16 0 0 16 Jun 2023
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning Jifan Zhang Yifang Chen Gregory H. Canal Stephen Mussmann Arnav M. Das ... Yinglun Zhu Jeffrey Bilmes S. Du Kevin G. Jamieson Robert D. Nowak VLM 33 10 0 16 Jun 2023
From 'Snippet-lects' to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape Severine Guillaume Guillaume Wisniewski Alexis Michaud 23 2 0 29 May 2023
Gaussian Process Probes (GPP) for Uncertainty-Aware Probing Zehao Wang Alexander Ku Jason Baldridge Thomas L. Griffiths Been Kim UQCV 26 11 0 29 May 2023
Reverse Engineering Self-Supervised Learning Ido Ben-Shaul Ravid Shwartz-Ziv Tomer Galanti S. Dekel Yann LeCun SSL 23 34 0 24 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval Arijit Ray Filip Radenovic Abhimanyu Dubey Bryan A. Plummer Ranjay Krishna Kate Saenko CoGe VLM 41 34 0 05 May 2023
VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution Jaeill Kim Suhyun Kang Duhun Hwang Jungwook Shin Wonjong Rhee DRL 13 21 0 04 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens Nora Belrose Zach Furman Logan Smith Danny Halawi Igor V. Ostrovsky Lev McKinney Stella Biderman Jacob Steinhardt 22 193 0 14 Mar 2023
SR-init: An interpretable layer pruning method Hui Tang Yao Lu Qi Xuan 15 8 0 14 Mar 2023
Revisiting Pre-training in Audio-Visual Learning Ruoxuan Feng Wenke Xia Di Hu 30 1 0 07 Feb 2023
Identifiability of latent-variable and structural-equation models: from linear to nonlinear Aapo Hyvarinen Ilyes Khemakhem R. Monti CML 30 41 0 06 Feb 2023
Trustworthy Social Bias Measurement Rishi Bommasani Percy Liang 27 10 0 20 Dec 2022
A Natural Bias for Language Generation Models Clara Meister Wojciech Stokowiec Tiago Pimentel Lei Yu Laura Rimell A. Kuncoro MILM 33 6 0 19 Dec 2022
ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning Shachar Don-Yehiya Elad Venezian Colin Raffel Noam Slonim Yoav Katz Leshem Choshen MoMe 28 52 0 02 Dec 2022
Supervised Pretraining for Molecular Force Fields and Properties Prediction Xiang Gao Weihao Gao Wen Xiao Zhirui Wang Chong Wang Liang Xiang AI4CE 25 8 0 23 Nov 2022
Layer-Stack Temperature Scaling Amr Khalifa Michael C. Mozer Hanie Sedghi Behnam Neyshabur Ibrahim M. Alabdulmohsin 78 2 0 18 Nov 2022
Emergence of Concepts in DNNs? Tim Räz 21 0 0 11 Nov 2022
Reinforcement Learning in an Adaptable Chess Environment for Detecting Human-understandable Concepts Patrik Hammersborg Inga Strümke 17 5 0 10 Nov 2022
COPEN: Probing Conceptual Knowledge in Pre-trained Language Models Hao Peng Xiaozhi Wang Shengding Hu Hailong Jin Lei Hou Juanzi Li Zhiyuan Liu Qun Liu 18 22 0 08 Nov 2022
A Law of Data Separation in Deep Learning Hangfeng He Weijie J. Su OOD 24 36 0 31 Oct 2022
Probing for targeted syntactic knowledge through grammatical error detection Christopher Davis Christopher Bryant Andrew Caines Marek Rei P. Buttery 22 3 0 28 Oct 2022
The Curious Case of Benign Memorization Sotiris Anagnostidis Gregor Bachmann Lorenzo Noci Thomas Hofmann AAML 49 8 0 25 Oct 2022