The Linear Representation Hypothesis and the Geometry of Large Language Models

7 November 2023

Papers citing "The Linear Representation Hypothesis and the Geometry of Large Language Models"

50 / 128 papers shown

| Title | Authors | Tags | Counts | Date |
| --- | --- | --- | --- | --- |
| A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering | Daniel Scalena, Elisabetta Fersini, Malvina Nissim | LLMSV | 78 0 0 | 27 Nov 2024 |
| Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models | Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan, Neel Nanda | | 82 12 0 | 21 Nov 2024 |
| JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen | | 57 3 0 | 17 Nov 2024 |
| Towards Utilising a Range of Neural Activations for Comprehending Representational Associations | Laura O'Mahony, Nikola S. Nikolov, David JP O'Sullivan | | 33 0 0 | 15 Nov 2024 |
| Towards Unifying Interpretability and Control: Evaluation via Intervention | Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju | | 42 5 0 | 07 Nov 2024 |
| Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control | Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye | | 29 0 0 | 04 Nov 2024 |
| Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning | Md Rifat Arefin, G. Subbaraj, Nicolas Angelard-Gontier, Yann LeCun, Irina Rish, Ravid Shwartz-Ziv, C. Pal | LRM | 173 0 0 | 04 Nov 2024 |
| What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Nathalie Maria Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper | | 39 2 0 | 02 Nov 2024 |
| ResiDual Transformer Alignment with Spectral Decomposition | Lorenzo Basile, Valentino Maiorca, Luca Bortolussi, Emanuele Rodolà, Francesco Locatello | | 48 1 0 | 31 Oct 2024 |
| All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling | Emanuele Marconato, Sébastien Lachapelle, Sebastian Weichwald, Luigi Gresele | | 69 3 0 | 30 Oct 2024 |
| Cross-Entropy Is All You Need To Invert the Data Generating Process | Patrik Reizinger, Alice Bizeul, Attila Juhos, Julia E. Vogt, Randall Balestriero, Wieland Brendel, David Klindt | SSL OOD BDL DRL | 102 3 0 | 29 Oct 2024 |
| Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups | Davide Ghilardi, Federico Belotti, Marco Molinari | | 40 2 0 | 28 Oct 2024 |
| Fine-Tuning Pre-trained Language Models for Robust Causal Representation Learning | Jialin Yu, Yuxiang Zhou, Yulan He, Nevin L. Zhang, Ricardo Silva | | 36 0 0 | 18 Oct 2024 |
| Decomposing The Dark Matter of Sparse Autoencoders | Joshua Engels, Logan Riggs, Max Tegmark | LLMSV | 65 10 0 | 18 Oct 2024 |
| Do LLMs "know" internally when they follow instructions? | Juyeon Heo, Christina Heinze-Deml, Oussama Elachqar, Shirley Ren, Udhay Nallasamy, Andy Miller, Kwan Ho Ryan Chan, Jaya Narain | | 51 5 0 | 18 Oct 2024 |
| Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors | Weixuan Wang, J. Yang, Wei Peng | LLMSV | 28 3 0 | 16 Oct 2024 |
| Improving Instruction-Following in Language Models through Activation Steering | Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi | LLMSV | 62 17 0 | 15 Oct 2024 |
| Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization | Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin | | 99 16 0 | 11 Oct 2024 |
| Efficient Dictionary Learning with Switch Sparse Autoencoders | Anish Mudide, Joshua Engels, Eric J. Michaud, Max Tegmark, Christian Schroeder de Witt | | 23 7 0 | 10 Oct 2024 |
| The Geometry of Concepts: Sparse Autoencoder Feature Structure | Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark | | 55 7 0 | 10 Oct 2024 |
| CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text | Jun Hirako, Ryohei Sasano, Koichi Takeda | | 39 2 0 | 06 Oct 2024 |
| Understanding Reasoning in Chain-of-Thought from the Hopfieldian View | Lijie Hu, Liang Liu, Shu Yang, Xin Chen, Zhen Tan, Muhammad Asif Ali, Mengdi Li, Di Wang | LRM | 46 1 0 | 04 Oct 2024 |
| An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation | Ahmed Abdulaal, Hugo Fry, Nina Montaña-Brown, Ayodeji Ijishakin, Jack Gao, Stephanie L. Hyland, Daniel C. Alexander, Daniel Coelho De Castro | MedIm | 39 8 0 | 04 Oct 2024 |
| Towards a Law of Iterated Expectations for Heuristic Estimators | Paul Christiano, Jacob Hilton, Andrea Lincoln, Eric Neyman, Mark Xu | | 16 0 0 | 02 Oct 2024 |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien | LLMSV | 34 4 0 | 02 Oct 2024 |
| Sparse Attention Decomposition Applied to Circuit Tracing | Gabriel Franco, Mark Crovella | | 36 0 0 | 01 Oct 2024 |
| Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution | Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du | | 59 2 0 | 30 Sep 2024 |
| Robust LLM safeguarding via refusal feature adversarial training | L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda | AAML | 62 10 0 | 30 Sep 2024 |
| A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders | David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, Joseph Bloom | | 23 20 0 | 22 Sep 2024 |
| Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles | Kulin Shah, Nishanth Dikkala, Xin Wang, Rina Panigrahy | ELM ReLM LRM | 37 9 0 | 16 Sep 2024 |
| Residual Stream Analysis with Multi-Layer SAEs | Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison | | 31 3 0 | 06 Sep 2024 |
| Programming Refusal with Conditional Activation Steering | Bruce W. Lee, Inkit Padhi, K. Ramamurthy, Erik Miehling, Pierre L. Dognin, Manish Nagireddy, Amit Dhurandhar | LLMSV | 105 14 0 | 06 Sep 2024 |
| Attention Heads of Large Language Models: A Survey | Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Mingchuan Yang, Bo Tang, Zhiyu Li | LRM | 58 22 0 | 05 Sep 2024 |
| The representation landscape of few-shot learning and fine-tuning in large language models | Diego Doimo, Alessandro Serra, A. Ansuini, Alberto Cazzaniga | | 96 4 0 | 05 Sep 2024 |
| The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability | Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, ..., Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov | CML | 52 18 0 | 02 Aug 2024 |
| Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models | Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, C. M. Verdun, David Bau, Samuel Marks | | 38 26 0 | 31 Jul 2024 |
| Analyzing the Generalization and Reliability of Steering Vectors | Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adrià Garriga-Alonso, Robert Kirk | LLMSV | 84 17 0 | 17 Jul 2024 |
| Compositional Structures in Neural Embedding and Interaction Decompositions | Matthew Trager, Alessandro Achille, Pramuditha Perera, L. Zancato, Stefano Soatto | CoGe | 37 0 0 | 12 Jul 2024 |
| Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space | Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, Hidenori Tanaka | | 62 7 0 | 27 Jun 2024 |
| Transformer Normalisation Layers and the Independence of Semantic Subspaces | S. Menary, Samuel Kaski, Andre Freitas | | 44 2 0 | 25 Jun 2024 |
| Towards a Science Exocortex | Kevin G. Yager | | 80 0 0 | 24 Jun 2024 |
| Who's asking? User personas and the mechanics of latent misalignment | Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon | LLMSV | 44 7 0 | 17 Jun 2024 |
| Refusal in Language Models Is Mediated by a Single Direction | Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda | | 50 136 0 | 17 Jun 2024 |
| Breaking the Attention Bottleneck | Kalle Hilsenbek | | 89 0 0 | 16 Jun 2024 |
| Talking Heads: Understanding Inter-layer Communication in Transformer Language Models | Jack Merullo, Carsten Eickhoff, Ellie Pavlick | | 58 13 0 | 13 Jun 2024 |
| Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets | Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng-Wei Zhang, Wenqiang Lei | | 44 2 0 | 12 Jun 2024 |
| PaCE: Parsimonious Concept Engineering for Large Language Models | Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, D. Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal | CVBM | 42 7 0 | 06 Jun 2024 |
| Feature contamination: Neural networks learn uncorrelated features and fail to generalize | Tianren Zhang, Chujie Zhao, Guanyu Chen, Yizhou Jiang, Feng Chen | OOD MLT OODD | 77 3 0 | 05 Jun 2024 |
| The Geometry of Categorical and Hierarchical Concepts in Large Language Models | Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch | | 50 27 0 | 03 Jun 2024 |
| Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass | Ethan Shen, Alan Fan, Sarah M Pratt, Jae Sung Park, Matthew Wallingford, Sham Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati | | 47 2 0 | 28 May 2024 |