ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.03658
  4. Cited By
The Linear Representation Hypothesis and the Geometry of Large Language
  Models

The Linear Representation Hypothesis and the Geometry of Large Language Models

7 November 2023
Kiho Park
Yo Joong Choe
Victor Veitch
    LLMSV
    MILM
ArXivPDFHTML

Papers citing "The Linear Representation Hypothesis and the Geometry of Large Language Models"

50 / 128 papers shown
Title
A gentle push funziona benissimo: making instructed models in Italian
  via contrastive activation steering
A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering
Daniel Scalena
Elisabetta Fersini
Malvina Nissim
LLMSV
78
0
0
27 Nov 2024
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
82
12
0
21 Nov 2024
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
Zeqing He
Zhibo Wang
Zhixuan Chu
Huiyu Xu
Rui Zheng
Kui Ren
Chun Chen
57
3
0
17 Nov 2024
Towards Utilising a Range of Neural Activations for Comprehending
  Representational Associations
Towards Utilising a Range of Neural Activations for Comprehending Representational Associations
Laura O'Mahony
Nikola S. Nikolov
David JP O'Sullivan
33
0
0
15 Nov 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
42
5
0
07 Nov 2024
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse
  Activation Control
Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control
Yuxin Xiao
Chaoqun Wan
Yonggang Zhang
Wenxiao Wang
Binbin Lin
Xiaofei He
Xu Shen
Jieping Ye
29
0
0
04 Nov 2024
Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Md Rifat Arefin
G. Subbaraj
Nicolas Angelard-Gontier
Yann LeCun
Irina Rish
Ravid Shwartz-Ziv
C. Pal
LRM
173
0
0
04 Nov 2024
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Maria Kirch
Constantin Weisser
Severin Field
Helen Yannakoudakis
Stephen Casper
39
2
0
02 Nov 2024
ResiDual Transformer Alignment with Spectral Decomposition
ResiDual Transformer Alignment with Spectral Decomposition
Lorenzo Basile
Valentino Maiorca
Luca Bortolussi
Emanuele Rodolà
Francesco Locatello
48
1
0
31 Oct 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
69
3
0
30 Oct 2024
Cross-Entropy Is All You Need To Invert the Data Generating Process
Cross-Entropy Is All You Need To Invert the Data Generating Process
Patrik Reizinger
Alice Bizeul
Attila Juhos
Julia E. Vogt
Randall Balestriero
Wieland Brendel
David Klindt
SSL
OOD
BDL
DRL
102
3
0
29 Oct 2024
Efficient Training of Sparse Autoencoders for Large Language Models via
  Layer Groups
Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups
Davide Ghilardi
Federico Belotti
Marco Molinari
40
2
0
28 Oct 2024
Fine-Tuning Pre-trained Language Models for Robust Causal Representation
  Learning
Fine-Tuning Pre-trained Language Models for Robust Causal Representation Learning
Jialin Yu
Yuxiang Zhou
Yulan He
Nevin L. Zhang
Ricardo Silva
36
0
0
18 Oct 2024
Decomposing The Dark Matter of Sparse Autoencoders
Decomposing The Dark Matter of Sparse Autoencoders
Joshua Engels
Logan Riggs
Max Tegmark
LLMSV
65
10
0
18 Oct 2024
Do LLMs "know" internally when they follow instructions?
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
51
5
0
18 Oct 2024
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang
J. Yang
Wei Peng
LLMSV
28
3
0
16 Oct 2024
Improving Instruction-Following in Language Models through Activation Steering
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo
Vidhisha Balachandran
Safoora Yousefi
Eric Horvitz
Besmira Nushi
LLMSV
62
17
0
15 Oct 2024
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin
Sadhika Malladi
Adithya Bhaskar
Danqi Chen
Sanjeev Arora
Boris Hanin
99
16
0
11 Oct 2024
Efficient Dictionary Learning with Switch Sparse Autoencoders
Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide
Joshua Engels
Eric J. Michaud
Max Tegmark
Christian Schroeder de Witt
23
7
0
10 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
55
7
0
10 Oct 2024
CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text
CiMaTe: Citation Count Prediction Effectively Leveraging the Main Text
Jun Hirako
Ryohei Sasano
Koichi Takeda
39
2
0
06 Oct 2024
Understanding Reasoning in Chain-of-Thought from the Hopfieldian View
Understanding Reasoning in Chain-of-Thought from the Hopfieldian View
Lijie Hu
Liang Liu
Shu Yang
Xin Chen
Zhen Tan
Muhammad Asif Ali
Mengdi Li
Di Wang
LRM
46
1
0
04 Oct 2024
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable
  Radiology Report Generation
An X-Ray Is Worth 15 Features: Sparse Autoencoders for Interpretable Radiology Report Generation
Ahmed Abdulaal
Hugo Fry
Nina Montaña-Brown
Ayodeji Ijishakin
Jack Gao
Stephanie L. Hyland
Daniel C. Alexander
Daniel Coelho De Castro
MedIm
39
8
0
04 Oct 2024
Towards a Law of Iterated Expectations for Heuristic Estimators
Towards a Law of Iterated Expectations for Heuristic Estimators
Paul Christiano
Jacob Hilton
Andrea Lincoln
Eric Neyman
Mark Xu
16
0
0
02 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language
  Models
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee
Shaona Ghosh
Traian Rebedea
Christopher Parisien
LLMSV
34
4
0
02 Oct 2024
Sparse Attention Decomposition Applied to Circuit Tracing
Sparse Attention Decomposition Applied to Circuit Tracing
Gabriel Franco
Mark Crovella
36
0
0
01 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
Robust LLM safeguarding via refusal feature adversarial training
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
62
10
0
30 Sep 2024
A is for Absorption: Studying Feature Splitting and Absorption in Sparse
  Autoencoders
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin
James Wilken-Smith
Tomáš Dulka
Hardik Bhatnagar
Joseph Bloom
23
20
0
22 Sep 2024
Causal Language Modeling Can Elicit Search and Reasoning Capabilities on
  Logic Puzzles
Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles
Kulin Shah
Nishanth Dikkala
Xin Wang
Rina Panigrahy
ELM
ReLM
LRM
37
9
0
16 Sep 2024
Residual Stream Analysis with Multi-Layer SAEs
Residual Stream Analysis with Multi-Layer SAEs
Tim Lawson
Lucy Farnik
Conor Houghton
Laurence Aitchison
31
3
0
06 Sep 2024
Programming Refusal with Conditional Activation Steering
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
K. Ramamurthy
Erik Miehling
Pierre L. Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
105
14
0
06 Sep 2024
Attention Heads of Large Language Models: A Survey
Attention Heads of Large Language Models: A Survey
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Zhiyu Li
Zhiyu Li
LRM
58
22
0
05 Sep 2024
The representation landscape of few-shot learning and fine-tuning in
  large language models
The representation landscape of few-shot learning and fine-tuning in large language models
Diego Doimo
Alessandro Serra
A. Ansuini
Alberto Cazzaniga
96
4
0
05 Sep 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical
  Grounding of Causal Interpretability
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
52
18
0
02 Aug 2024
Measuring Progress in Dictionary Learning for Language Model
  Interpretability with Board Game Models
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen
Benjamin Wright
Can Rager
Rico Angell
Jannik Brinkmann
Logan Smith
C. M. Verdun
David Bau
Samuel Marks
38
26
0
31 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
17
0
17 Jul 2024
Compositional Structures in Neural Embedding and Interaction
  Decompositions
Compositional Structures in Neural Embedding and Interaction Decompositions
Matthew Trager
Alessandro Achille
Pramuditha Perera
L. Zancato
Stefano Soatto
CoGe
37
0
0
12 Jul 2024
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept
  Space
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
Core Francisco Park
Maya Okawa
Andrew Lee
Ekdeep Singh Lubana
Hidenori Tanaka
62
7
0
27 Jun 2024
Transformer Normalisation Layers and the Independence of Semantic
  Subspaces
Transformer Normalisation Layers and the Independence of Semantic Subspaces
S. Menary
Samuel Kaski
Andre Freitas
44
2
0
25 Jun 2024
Towards a Science Exocortex
Towards a Science Exocortex
Kevin G. Yager
80
0
0
24 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
44
7
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
50
136
0
17 Jun 2024
Breaking the Attention Bottleneck
Breaking the Attention Bottleneck
Kalle Hilsenbek
89
0
0
16 Jun 2024
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Jack Merullo
Carsten Eickhoff
Ellie Pavlick
58
13
0
13 Jun 2024
Legend: Leveraging Representation Engineering to Annotate Safety Margin
  for Preference Datasets
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng
Bowen Qin
Chen Huang
Youcheng Huang
Zheng-Wei Zhang
Wenqiang Lei
44
2
0
12 Jun 2024
PaCE: Parsimonious Concept Engineering for Large Language Models
PaCE: Parsimonious Concept Engineering for Large Language Models
Jinqi Luo
Tianjiao Ding
Kwan Ho Ryan Chan
D. Thaker
Aditya Chattopadhyay
Chris Callison-Burch
René Vidal
CVBM
42
7
0
06 Jun 2024
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Tianren Zhang
Chujie Zhao
Guanyu Chen
Yizhou Jiang
Feng Chen
OOD
MLT
OODD
77
3
0
05 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
50
27
0
03 Jun 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive
  Inference Pass
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Ethan Shen
Alan Fan
Sarah M Pratt
Jae Sung Park
Matthew Wallingford
Sham Kakade
Ari Holtzman
Ranjay Krishna
Ali Farhadi
Aditya Kusupati
47
2
0
28 May 2024
Previous
123
Next