Emergent Linear Representations in World Models of Self-Supervised Sequence Models

2 September 2023

Papers citing "Emergent Linear Representations in World Models of Self-Supervised Sequence Models"

44 / 44 papers shown

Title
Contextures: Representations from Contexts Runtian Zhai Kai Yang Che-Ping Tsai Burak Varici Zico Kolter Pradeep Ravikumar 119 0 0 02 May 2025
Improving Reasoning Performance in Large Language Models via Representation Engineering Bertram Højer Oliver Jarvis Stefan Heinrich LRM 83 1 0 28 Apr 2025
The Geometry of Self-Verification in a Task-Specific Reasoning Model Andrew Lee Lihao Sun Chris Wendler Fernanda Viégas Martin Wattenberg LRM 34 0 0 19 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models Patrick Queiroz Da Silva Hari Sethuraman Dheeraj Rajagopal Hannaneh Hajishirzi Sachin Kumar LLMSV 29 1 0 06 Apr 2025
Shared Global and Local Geometry of Language Model Embeddings Andrew Lee Melanie Weber F. Viégas Martin Wattenberg FedML 79 3 0 27 Mar 2025
I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? Yuhang Liu Dong Gong Erdun Gao Zhen Zhang Zhen Zhang Biwei Huang Anton van den Hengel Javen Qinfeng Shi Javen Qinfeng Shi 157 0 0 12 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models Thomas Winninger Boussad Addad Katarzyna Kapusta AAML 68 0 0 08 Mar 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems Richard Ren Arunim Agarwal Mantas Mazeika Cristina Menghini Robert Vacareanu ... Matias Geralnik Adam Khoja Dean Lee Summer Yue Dan Hendrycks HILM ALM 90 0 0 05 Mar 2025
Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation Jonathan Jacobi Gal Niv LRM ReLM 60 0 0 03 Mar 2025
Linear Representations of Political Perspective Emerge in Large Language Models Junsol Kim James Evans Aaron Schein 77 2 0 03 Mar 2025
ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation Weilong Dong Xinwei Wu Renren Jin Shaoyang Xu Deyi Xiong 65 7 0 31 Dec 2024
ICLR: In-Context Learning of Representations Core Francisco Park Andrew Lee Ekdeep Singh Lubana Yongyi Yang Maya Okawa Kento Nishi Martin Wattenberg Hidenori Tanaka AIFin 120 3 0 29 Dec 2024
How Do Artificial Intelligences Think? The Three Mathematico-Cognitive Factors of Categorical Segmentation Operated by Synthetic Neurons Michael Pichat William Pogrund Armanush Gasparian Paloma Pichat Samuel Demarchi Michael Veillet-Guillem 42 3 0 26 Dec 2024
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models Konstantin Donhauser Kristina Ulicna Gemma Elyse Moran Aditya Ravuri Kian Kenyon-Dean Cian Eastwood Jason Hartford 81 0 0 20 Dec 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling Emanuele Marconato Sébastien Lachapelle Sebastian Weichwald Luigi Gresele 69 3 0 30 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders Joshua Engels Logan Riggs Max Tegmark LLMSV 62 10 0 18 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 51 5 0 18 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering Alessandro Stolfo Vidhisha Balachandran Safoora Yousefi Eric Horvitz Besmira Nushi LLMSV 62 17 0 15 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 52 7 0 10 Oct 2024
Provable Weak-to-Strong Generalization via Benign Overfitting David X. Wu A. Sahai 73 6 0 06 Oct 2024
Attention layers provably solve single-location regression P. Marion Raphael Berthier Gérard Biau Claire Boyer 140 2 0 02 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 59 2 0 30 Sep 2024
Understanding Generative AI Content with Embedding Models Max Vargas Reilly Cannon A. Engel Anand D. Sarwate Tony Chiang 52 3 0 19 Aug 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models Daking Rai Yilun Zhou Shi Feng Abulhair Saparov Ziyu Yao 82 19 0 02 Jul 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models Matteo Bortoletto Constantin Ruhdorfer Lei Shi Andreas Bulling AI4MH LRM 46 5 0 25 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models Kiho Park Yo Joong Choe Yibo Jiang Victor Veitch 50 26 0 03 Jun 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 42 7 0 31 May 2024
Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models Wenshan Wu Shaoguang Mao Yadong Zhang Yan Xia Li Dong Lei Cui Furu Wei LRM 56 20 0 04 Apr 2024
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models Asma Ghandeharioun Avi Caciularu Adam Pearce Lucas Dixon Mor Geva 34 87 0 11 Jan 2024
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks Rahul Ramesh Ekdeep Singh Lubana Mikail Khona Robert P. Dick Hidenori Tanaka CoGe 36 6 0 21 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 75 7 0 07 Nov 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets Samuel Marks Max Tegmark HILM 102 169 0 10 Oct 2023
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers Anna Langedijk Hosein Mohebbi Gabriele Sarti Willem H. Zuidema Jaap Jumelet 32 10 0 05 Oct 2023
Language Models Represent Space and Time Wes Gurnee Max Tegmark 47 142 0 03 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 36 97 0 27 Sep 2023
Finding Neurons in a Haystack: Case Studies with Sparse Probing Wes Gurnee Neel Nanda Matthew Pauly Katherine Harvey Dmitrii Troitskii Dimitris Bertsimas MILM 160 188 0 02 May 2023
Formal Semantic Geometry over Transformer-based Variational AutoEncoder Yingji Zhang Danilo S. Carvalho Ian Pratt-Hartmann André Freitas 26 4 0 12 Oct 2022
Polysemanticity and Capacity in Neural Networks Adam Scherlis Kshitij Sachan Adam Jermyn Joe Benton Buck Shlegeris MILM 135 25 0 04 Oct 2022
Linearly Mapping from Image to Text Space Jack Merullo Louis Castricato Carsten Eickhoff Ellie Pavlick VLM 167 104 0 30 Sep 2022
Toy Models of Superposition Nelson Elhage Tristan Hume Catherine Olsson Nicholas Schiefer T. Henighan ... Sam McCandlish Jared Kaplan Dario Amodei Martin Wattenberg C. Olah AAML MILM 125 318 0 21 Sep 2022
Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence Frederik Pahde Maximilian Dreyer Leander Weber Moritz Weckbecker Christopher J. Anders Thomas Wiegand Wojciech Samek Sebastian Lapuschkin 60 7 0 07 Feb 2022
Probing Classifiers: Promises, Shortcomings, and Advances Yonatan Belinkov 226 405 0 24 Feb 2021
What you can cram into a single vector: Probing sentence embeddings for linguistic properties Alexis Conneau Germán Kruszewski Guillaume Lample Loïc Barrault Marco Baroni 201 882 0 03 May 2018
Efficient Estimation of Word Representations in Vector Space Tomáš Mikolov Kai Chen G. Corrado J. Dean 3DV 266 31,267 0 16 Jan 2013