Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

29 March 2021
Zeyu Yun, Yubei Chen, Bruno A. Olshausen, Yann LeCun
arXiv:2103.15949

Papers citing "Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors"

22 / 22 papers shown

1. Are Sparse Autoencoders Useful for Java Function Bug Detection?
   Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso · 15 May 2025
2. UNet with Axial Transformer: A Neural Weather Model for Precipitation Nowcasting
   Maitreya Sonawane, Sumit Mamtani · 28 Apr 2025
3. Understanding the Repeat Curse in Large Language Models from a Feature Perspective
   Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang · 19 Apr 2025
4. The Complexity of Learning Sparse Superposed Features with Feedback
   Akash Kumar · 08 Feb 2025
5. Out-of-distribution generalization via composition: a lens through induction heads in Transformers
   Jiajun Song, Zhuoyan Xu, Yiqiao Zhong · 31 Dec 2024
6. Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning
   John Wu, David Wu, Jimeng Sun · 31 Oct 2024
7. Focus On This, Not That! Steering LLMs With Adaptive Feature Specification
   Tom A. Lamb, Adam Davies, Alasdair Paren, Philip Torr, Francesco Pinto · 30 Oct 2024
8. Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
   Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini · KELM, LLMSV · 21 Oct 2024
9. The Geometry of Concepts: Sparse Autoencoder Feature Structure
   Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, Max Tegmark · 10 Oct 2024
10. Residual Stream Analysis with Multi-Layer SAEs
    Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison · 06 Sep 2024
11. Understanding Generative AI Content with Embedding Models
    Max Vargas, Reilly Cannon, A. Engel, Anand D. Sarwate, Tony Chiang · 19 Aug 2024
12. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
    Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao · 02 Jul 2024
13. Codebook Features: Sparse and Discrete Interpretability for Neural Networks
    Alex Tamkin, Mohammad Taufeeque, Noah D. Goodman · 26 Oct 2023
14. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
    Fred Zhang, Neel Nanda · LLMSV · 27 Sep 2023
15. Sparse Autoencoders Find Highly Interpretable Features in Language Models
    Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey · MILM · 15 Sep 2023
16. Explaining black box text modules in natural language with language models
    Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin-Xia Yu, Jianfeng Gao · MILM · 17 May 2023
17. Minimalistic Unsupervised Learning with the Sparse Manifold Transform
    Yubei Chen, Zeyu Yun, Yi Ma, Bruno A. Olshausen, Yann LeCun · 30 Sep 2022
18. Interpreting Embedding Spaces by Conceptualization
    Adi Simhi, Shaul Markovitch · 22 Aug 2022
19. How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
    Timothee Mickus, Denis Paperno, Mathieu Constant · 07 Jun 2022
20. Explainable Patterns for Distinction and Prediction of Moral Judgement on Reddit
    Ion Stagkos Efstathiadis, Guilherme Paulino-Passos, Francesca Toni · 26 Jan 2022
21. Translation Error Detection as Rationale Extraction
    M. Fomicheva, Lucia Specia, Nikolaos Aletras · 27 Aug 2021
22. Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
    Hila Chefer, Shir Gur, Lior Wolf · ViT · 29 Mar 2021