Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods from Neuroscience
arXiv:2408.12664 · 22 August 2024
Zhonghao He, Jascha Achterberg, Katie Collins, Kevin K. Nejad, Danyal Akarca, Yinzhu Yang, Wes Gurnee, Ilia Sucholutsky, Yuhan Tang, Rebeca Ianov, George Ogden, Chloe Li, Kai J. Sandbrink, Stephen Casper, Anna Ivanova, Grace W. Lindsay
AI4CE
Papers citing "Multilevel Interpretability of Artificial Neural Networks: Leveraging Framework and Methods from Neuroscience" (50 of 82 papers shown):
- BeHonest: Benchmarking Honesty in Large Language Models (19 Jun 2024). Steffi Chern, Zhulin Hu, Yuqing Yang, Ethan Chern, Yuan Guo, Jiahe Jin, Binjie Wang, Pengfei Liu. Tags: HILM, ALM. Metrics: 100 · 3 · 0.
- Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience (03 Jun 2024). Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig. Metrics: 73 · 6 · 0.
- How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning (28 Feb 2024). Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty. Tags: LRM. Metrics: 66 · 25 · 0.
- Towards Uncovering How Large Language Model Works: An Explainability Perspective (16 Feb 2024). Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Jundong Li. Metrics: 69 · 11 · 0.
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity (03 Jan 2024). Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea. Metrics: 93 · 117 · 0.
- Efficient Large Language Models: A Survey (06 Dec 2023). Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, ..., Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang. Tags: LM&MA. Metrics: 35 · 130 · 0.
- Show Your Work with Confidence: Confidence Bands for Tuning Curves (16 Nov 2023). Nicholas Lourie, Kyunghyun Cho, He He. Metrics: 28 · 2 · 0.
- How do Language Models Bind Entities in Context? (26 Oct 2023). Jiahai Feng, Jacob Steinhardt. Metrics: 84 · 39 · 0.
- Towards Understanding Sycophancy in Language Models (20 Oct 2023). Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez. Metrics: 284 · 226 · 0.
- Getting aligned on representational alignment (18 Oct 2023). Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, ..., Thomas Unterthiner, Andrew Kyle Lampinen, Klaus-Robert Müller, M. Toneva, Thomas Griffiths. Metrics: 98 · 88 · 0.
- Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks (11 Oct 2023). Ziming Liu, Mikail Khona, Ila R. Fiete, Max Tegmark. Metrics: 69 · 12 · 0.
- Copy Suppression: Comprehensively Understanding an Attention Head (06 Oct 2023). Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda. Tags: MILM. Metrics: 49 · 45 · 0.
- Language Models Represent Space and Time (03 Oct 2023). Wes Gurnee, Max Tegmark. Metrics: 104 · 156 · 0.
- Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (27 Sep 2023). Fred Zhang, Neel Nanda. Tags: LLMSV. Metrics: 169 · 108 · 0.
- Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve (24 Sep 2023). R. Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, Thomas Griffiths. Metrics: 46 · 153 · 0.
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (15 Sep 2023). Hoagy Cunningham, Aidan Ewart, Logan Riggs, R. Huben, Lee Sharkey. Tags: MILM. Metrics: 90 · 412 · 0.
- Neurons in Large Language Models: Dead, N-gram, Positional (09 Sep 2023). Elena Voita, Javier Ferrando, Christoforos Nalmpantis. Tags: MILM. Metrics: 119 · 54 · 0.
- Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf (09 Sep 2023). Yuzhuang Xu, Shuo Wang, Peng Li, Ziyue Wang, Xiaolong Wang, Weidong Liu, Yang Liu. Tags: LLMAG. Metrics: 40 · 202 · 0.
- AI Deception: A Survey of Examples, Risks, and Potential Solutions (28 Aug 2023). Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks. Metrics: 60 · 153 · 0.
- Deception Abilities Emerged in Large Language Models (31 Jul 2023). Thilo Hagendorff. Tags: LLMAG. Metrics: 54 · 83 · 0.
- The Hydra Effect: Emergent Self-repair in Language Model Computations (28 Jul 2023). Tom McGrath, Matthew Rahtz, János Kramár, Vladimir Mikulik, Shane Legg. Tags: MILM, LRM. Metrics: 38 · 72 · 0.
- Overthinking the Truth: Understanding how Language Models Process False Demonstrations (18 Jul 2023). Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt. Metrics: 63 · 59 · 0.
- Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (18 Jul 2023). Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, G. Irving, Rohin Shah, Vladimir Mikulik. Metrics: 82 · 113 · 0.
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought (22 Jun 2023). L. Wong, Gabriel Grand, Alexander K. Lew, Noah D. Goodman, Vikash K. Mansinghka, Jacob Andreas, J. Tenenbaum. Tags: LRM, AI4CE. Metrics: 32 · 106 · 0.
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (06 Jun 2023). Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg. Tags: KELM, HILM. Metrics: 85 · 548 · 0.
- Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors (29 May 2023). Paul S. Scotti, Atmadeep Banerjee, J. Goode, Stepan Shabalin, A. Nguyen, ..., Nathalie Verlinde, Elad Yundler, David Weisberg, K. A. Norman, Tanishq Mathew Abraham. Tags: DiffM. Metrics: 73 · 118 · 0.
- Model evaluation for extreme risks (24 May 2023). Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, ..., Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe. Tags: ELM. Metrics: 73 · 159 · 0.
- How Language Model Hallucinations Can Snowball (22 May 2023). Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith. Tags: HILM, LRM. Metrics: 113 · 274 · 0.
- Scaling laws for language encoding models in fMRI (19 May 2023). Richard Antonello, Aditya R. Vaidya, Alexander G. Huth. Tags: MedIm. Metrics: 62 · 64 · 0.
- Finding Neurons in a Haystack: Case Studies with Sparse Probing (02 May 2023). Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas. Tags: MILM. Metrics: 185 · 211 · 0.
- GPT-4 Technical Report (15 Mar 2023). OpenAI (Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, ..., Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph). Tags: LLMAG, MLLM. Metrics: 1.2K · 14,179 · 0.
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations (05 Mar 2023). Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman. Tags: CML. Metrics: 96 · 107 · 0.
- Progress measures for grokking via mechanistic interpretability (12 Jan 2023). Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt. Metrics: 71 · 431 · 0.
- Transformers learn in-context by gradient descent (15 Dec 2022). J. Oswald, Eyvind Niklasson, E. Randazzo, João Sacramento, A. Mordvintsev, A. Zhmoginov, Max Vladymyrov. Tags: MLT. Metrics: 91 · 487 · 0.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (01 Nov 2022). Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt. Metrics: 292 · 549 · 0.
- Omnigrok: Grokking Beyond Algorithmic Data (03 Oct 2022). Ziming Liu, Eric J. Michaud, Max Tegmark. Metrics: 83 · 82 · 0.
- In-context Learning and Induction Heads (24 Sep 2022). Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova Dassarma, ..., Tom B. Brown, Jack Clark, Jared Kaplan, Sam McCandlish, C. Olah. Metrics: 305 · 510 · 0.
- Toy Models of Superposition (21 Sep 2022). Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, T. Henighan, ..., Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, C. Olah. Tags: AAML, MILM. Metrics: 172 · 363 · 0.
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks (27 Jul 2022). Tilman Räuker, A. Ho, Stephen Casper, Dylan Hadfield-Menell. Tags: AAML, AI4CE. Metrics: 74 · 132 · 0.
- Single-phase deep learning in cortico-cortical networks (23 Jun 2022). Will Greedy, He Zhu, Joe Pemberton, J. Mellor, Rui Ponte Costa. Metrics: 41 · 37 · 0.
- Towards Understanding Grokking: An Effective Theory of Representation Learning (20 May 2022). Ziming Liu, O. Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams. Tags: AI4CE. Metrics: 72 · 152 · 0.
- When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes (20 Apr 2022). Mycal Tucker, Tiwalayo Eisape, Peng Qian, R. Levy, J. Shah. Tags: MILM. Metrics: 38 · 12 · 0.
- Quantifying Memorization Across Neural Language Models (15 Feb 2022). Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, Chiyuan Zhang. Tags: PILM. Metrics: 100 · 614 · 0.
- Locating and Editing Factual Associations in GPT (10 Feb 2022). Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov. Tags: KELM. Metrics: 215 · 1,344 · 0.
- Survey of Hallucination in Natural Language Generation (08 Feb 2022). Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, D. Su, ..., Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung. Tags: HILM, LRM. Metrics: 189 · 2,356 · 0.
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (06 Jan 2022). Alethea Power, Yuri Burda, Harrison Edwards, Igor Babuschkin, Vedant Misra. Metrics: 73 · 354 · 0.
- An Explanation of In-context Learning as Implicit Bayesian Inference (03 Nov 2021). Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma. Tags: ReLM, BDL, VPVLM, LRM. Metrics: 177 · 746 · 0.
- Causal Abstractions of Neural Networks (06 Jun 2021). Atticus Geiger, Hanson Lu, Thomas Icard, Christopher Potts. Tags: NAI, CML. Metrics: 66 · 241 · 0.
- Examining the Inductive Bias of Neural Language Models with Artificial Languages (02 Jun 2021). Jennifer C. White, Ryan Cotterell. Metrics: 57 · 44 · 0.
- Are Convolutional Neural Networks or Transformers more like human vision? (15 May 2021). Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas Griffiths. Tags: ViT, FaML. Metrics: 54 · 185 · 0.