Understanding intermediate layers using linear classifier probes

5 October 2016

Papers citing "Understanding intermediate layers using linear classifier probes"

50 / 187 papers shown

Title
Revealing economic facts: LLMs know more than they say Marcus Buckmann Quynh Anh Nguyen Edward Hill 28 0 0 13 May 2025
Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation Volodymyr Havrylov Haiwen Huang Dan Zhang Andreas Geiger 128 0 0 04 May 2025
Demystifying optimized prompts in language models Rimon Melamed Lucas H. McCabe H. H. Huang 39 0 0 04 May 2025
Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality Pramook Khungurn Sukit Seripanitkarn Phonphrm Thawatdamrongkit Supasorn Suwajanakorn DiffM 77 0 0 30 Apr 2025
Investigating task-specific prompts and sparse autoencoders for activation monitoring Henk Tillman Dan Mossing LLMSV 50 0 0 28 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control Hannah Cyberey David E. Evans LLMSV 76 0 0 23 Apr 2025
V $^2$ R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations Zhiyuan Fan Yumeng Wang Sandeep Polisetty Yi Ren Fung 50 0 0 23 Apr 2025
Decoding Vision Transformers: the Diffusion Steering Lens Ryota Takatsuki Sonia Joseph Ippei Fujisawa Ryota Kanai DiffM 30 0 0 18 Apr 2025
Interpreting the Linear Structure of Vision-language Model Embedding Spaces Isabel Papadimitriou Huangyuan Su Thomas Fel Naomi Saphra Sham Kakade Stephanie Gil VLM 52 0 0 16 Apr 2025
Parameterized Synthetic Text Generation with SimpleStories Lennart Finke Thomas Dooms Thomas Dooms Mat Allen Emerald Zhang Juan Diego Rodriguez Noa Nabeshima Thomas Marshall Dan Braun SyDa 32 0 0 12 Apr 2025
Among Us: A Sandbox for Measuring and Detecting Agentic Deception Satvik Golechha Adrià Garriga-Alonso LLMAG 52 2 0 05 Apr 2025
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models Zhanke Zhou Zhaocheng Zhu Xuan Li Mikhail Galkin Xiao Feng Sanmi Koyejo Jian Tang Bo Han LRM 56 0 0 28 Mar 2025
ASIDE: Architectural Separation of Instructions and Data in Language Models Egor Zverev Evgenii Kortukov Alexander Panfilov Soroush Tabesh Alexandra Volkova Sebastian Lapuschkin Wojciech Samek Christoph H. Lampert AAML 54 1 0 13 Mar 2025
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model Qiyuan Deng X. Bai Kehai Chen Yaowei Wang Liqiang Nie Min Zhang OffRL 66 0 0 13 Mar 2025
A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification Basudha Pal Siyuan Huang 56 0 0 09 Mar 2025
Statistical Deficiency for Task Inclusion Estimation Loïc Fosse Frédéric Béchet Benoit Favre Géraldine Damnati Gwénolé Lecorvé Maxime Darrin Philippe Formont Pablo Piantanida 139 0 0 07 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation Jonathan Jacobi Gal Niv LRM ReLM 60 0 0 03 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models Junsol Kim James Evans Aaron Schein 77 2 0 03 Mar 2025
Disentangling Visual Transformers: Patch-level Interpretability for Image Classification Guillaume Jeanneret Loïc Simon F. Jurie ViT 49 0 0 24 Feb 2025
Towards Active Participant Centric Vertical Federated Learning: Some Representations May Be All You Need Jon Irureta Jon Imaz Aizea Lojo Javier Fernandez-Marques Marco González Iñigo Perona FedML 90 1 0 20 Feb 2025
Bayesian Comparisons Between Representations Heiko H. Schütt FAtt 179 0 0 20 Feb 2025
Language Models Can Predict Their Own Behavior Dhananjay Ashok Jonathan May ReLM AI4TS LRM 63 0 0 18 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis Ge Lei Samuel J. Cooper KELM 49 0 0 15 Feb 2025
Superpose Singular Features for Model Merging Haiquan Qiu You Wu Quanming Yao MoMe 48 0 0 15 Feb 2025
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach J. Yang Dapeng Chen Yajing Sun Rongjun Li Zhiyong Feng Wei Peng 51 5 0 19 Jan 2025
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning Eitan Wagner Nitay Alon J. Barnby Omri Abend LRM 85 2 0 18 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael I. Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 128 1 0 16 Dec 2024
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations Huaizhi Ge Yiming Li Qifan Wang Yongfeng Zhang Ruixiang Tang AAML SILM 86 0 0 19 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 40 5 0 07 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks Nathalie Maria Kirch Constantin Weisser Severin Field Helen Yannakoudakis Stephen Casper 39 2 0 02 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling Emanuele Marconato Sébastien Lachapelle Sebastian Weichwald Luigi Gresele 69 3 0 30 Oct 2024
Bridging the Gap between Expert and Language Models: Concept-guided Chess Commentary Generation and Evaluation Jaechang Kim Jinmin Goh Inseok Hwang Jaewoong Cho Jungseul Ok ELM 28 1 0 28 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 59 9 0 18 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 51 4 0 18 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization Phillip Guo Aaquib Syed Abhay Sheshadri Aidan Ewart Gintare Karolina Dziugaite KELM MU 33 5 0 16 Oct 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models Kushal Tatariya Vladimir Araujo Thomas Bauwens Miryam de Lhoneux VLM 33 0 0 15 Oct 2024
Inference and Verbalization Functions During In-Context Learning Junyi Tao Xiaoyin Chen Nelson F. Liu LRM ReLM 26 0 0 12 Oct 2024
Temporal Reasoning Transfer from Text to Video Lei Li Yuanxin Liu Linli Yao Peiyuan Zhang Chenxin An Lean Wang Xu Sun Lingpeng Kong Qi Liu LRM 48 7 0 08 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting David X. Wu A. Sahai 68 6 0 06 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 59 2 0 30 Sep 2024
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis Daoyang Li Mingyu Jin Qingcheng Zeng Mengnan Du 63 2 0 22 Sep 2024
Self-Contrastive Forward-Forward Algorithm Xing Chen Dongshu Liu Jérémie Laydevant Julie Grollier 36 2 0 17 Sep 2024
Joint Estimation and Prediction of City-wide Delivery Demand: A Large Language Model Empowered Graph-based Learning Approach Tong Nie Junlin He Yuewen Mei Guoyang Qin Guilong Li Jian Sun Wei Ma 37 3 0 30 Aug 2024
LLMs' morphological analyses of complex FST-generated Finnish words Anssi Moisio Mathias Creutz M. Kurimo 49 1 0 11 Jul 2024
Does ChatGPT Have a Mind? Simon Goldstein B. Levinstein AI4MH LRM 34 5 0 27 Jun 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models Matteo Bortoletto Constantin Ruhdorfer Lei Shi Andreas Bulling AI4MH LRM 46 5 0 25 Jun 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs Jannik Kossen Jiatong Han Muhammed Razzak Lisa Schut Shreshth A. Malik Yarin Gal HILM 60 34 0 22 Jun 2024
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers Manuel Mondal Ljiljana Dolamic Gérôme Bovet Philippe Cudré-Mauroux Julien Audiffren 40 2 0 21 Jun 2024
Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell Taiming Lu Muhan Gao Kuai Yu Adam Byerly Daniel Khashabi 49 12 0 20 Jun 2024
On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models Sree Harsha Tanneru Dan Ley Chirag Agarwal Himabindu Lakkaraju LRM 31 4 0 15 Jun 2024