Toy Models of Superposition (arXiv:2209.10652)
21 September 2022
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, Shauna Kravec, Zac Hatfield-Dodds, R. Lasenby, Dawn Drain, Carol Chen, Roger C. Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah
AAML · MILM
Papers citing "Toy Models of Superposition" (32 / 82 papers shown)
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
LLMSV · 71 · 7 · 0 · 26 May 2024
Securing the Future of GenAI: Policy and Technology
Mihai Christodorescu, Craven, S. Feizi, Neil Zhenqiang Gong, Mia Hoffmann, ..., Jessica Newman, Emelia Probasco, Yanjun Qi, Khawaja Shams, Turek
SILM · 52 · 3 · 0 · 21 May 2024
When LLMs Meet Cybersecurity: A Systematic Literature Review
Jie Zhang, Haoyu Bu, Hui Wen, Yu Chen, Lun Li, Hongsong Zhu
45 · 36 · 0 · 06 May 2024
KAN: Kolmogorov-Arnold Networks
Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, Max Tegmark
98 · 475 · 0 · 30 Apr 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
46 · 115 · 0 · 28 Mar 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu, Zhining Zhang, Yizhou Wang
MILM · LRM · 50 · 8 · 0 · 28 Feb 2024
Carrying over algorithm in transformers
J. Kruthoff
24 · 0 · 0 · 15 Jan 2024
In-Context Reinforcement Learning for Variable Action Spaces
Viacheslav Sinii, Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Sergey Kolesnikov
24 · 14 · 0 · 20 Dec 2023
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit
21 · 2 · 0 · 14 Dec 2023
FlexModel: A Framework for Interpretability of Distributed Large Language Models
Matthew Choi, Muhammad Adil Asif, John Willes, David Emerson
AI4CE · ALM · 27 · 1 · 0 · 05 Dec 2023
Identifying Linear Relational Concepts in Large Language Models
David Chanin, Anthony Hunter, Oana-Maria Camburu
LLMSV · KELM · 23 · 4 · 0 · 15 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori, Thomas Serre, Ellie Pavlick
75 · 7 · 0 · 07 Nov 2023
Identifying Interpretable Visual Features in Artificial and Biological Neural Systems
David A. Klindt, Sophia Sanborn, Francisco Acosta, Frédéric Poitevin, Nina Miolane
MILM · FAtt · 44 · 7 · 0 · 17 Oct 2023
Language Models Represent Space and Time
Wes Gurnee, Max Tegmark
47 · 142 · 0 · 03 Oct 2023
Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey
MILM · 33 · 335 · 0 · 15 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang
34 · 20 · 0 · 27 Aug 2023
Identifying Interpretable Subspaces in Image Representations
Neha Kalibhat, S. Bhardwaj, Bayan Bruss, Hamed Firooz, Maziar Sanjabi, S. Feizi
FAtt · 42 · 26 · 0 · 20 Jul 2023
Uncovering Unique Concept Vectors through Latent Space Decomposition
Mara Graziani, Laura Mahony, An-phi Nguyen, Henning Muller, Vincent Andrearczyk
43 · 4 · 0 · 13 Jul 2023
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu, Eric Gan, Max Tegmark
26 · 36 · 0 · 04 May 2023
Redundancy and Concept Analysis for Code-trained Language Models
Arushi Sharma, Zefu Hu, Christopher Quinn, Ali Jannesari
73 · 1 · 0 · 01 May 2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
MILM · 28 · 2 · 0 · 22 Apr 2023
Visual DNA: Representing and Comparing Images using Distributions of Neuron Activations
Benjamin Ramtoula, Matthew Gadd, Paul Newman, D. Martini
28 · 10 · 0 · 20 Apr 2023
20 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor V. Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt
22 · 193 · 0 · 14 Mar 2023
Tracr: Compiled Transformers as a Laboratory for Interpretability
David Lindner, János Kramár, Sebastian Farquhar, Matthew Rahtz, Tom McGrath, Vladimir Mikulik
29 · 72 · 0 · 12 Jan 2023
Circumventing interpretability: How to defeat mind-readers
Lee D. Sharkey
35 · 3 · 0 · 21 Dec 2022
Schrödinger's Bat: Diffusion Models Sometimes Generate Polysemous Words in Superposition
Jennifer C. White, Ryan Cotterell
DiffM · 38 · 5 · 0 · 23 Nov 2022
Interpreting Neural Networks through the Polytope Lens
Sid Black, Lee D. Sharkey, Léo Grinsztajn, Eric Winsor, Daniel A. Braun, ..., Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy
FAtt · MILM · 31 · 22 · 0 · 22 Nov 2022
CRAFT: Concept Recursive Activation FacTorization for Explainability
Thomas Fel, Agustin Picard, Louis Bethune, Thibaut Boissin, David Vigouroux, Julien Colin, Rémi Cadène, Thomas Serre
19 · 102 · 0 · 17 Nov 2022
Engineering Monosemanticity in Toy Models
Adam Jermyn, Nicholas Schiefer, Evan Hubinger
MILM · 25 · 9 · 0 · 16 Nov 2022
Polysemanticity and Capacity in Neural Networks
Adam Scherlis, Kshitij Sachan, Adam Jermyn, Joe Benton, Buck Shlegeris
MILM · 135 · 25 · 0 · 04 Oct 2022
Measuring Self-Supervised Representation Quality for Downstream Classification using Discriminative Features
Neha Kalibhat, Kanika Narang, Hamed Firooz, Maziar Sanjabi, S. Feizi
SSL · 38 · 7 · 0 · 03 Mar 2022
Navigating Neural Space: Revisiting Concept Activation Vectors to Overcome Directional Divergence
Frederik Pahde, Maximilian Dreyer, Leander Weber, Moritz Weckbecker, Christopher J. Anders, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
60 · 7 · 0 · 07 Feb 2022