A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i Kola Ayonrinde Louis Jaburi MILM 86 1 0 01 May 2025
ReSi: A Comprehensive Benchmark for Representational Similarity Measures Max Klabunde Tassilo Wald Tobias Schumacher Klaus H. Maier-Hein Markus Strohmaier Adriana Iamnitchi AI4TS VLM 76 5 0 13 Mar 2025
Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment Harrish Thasarathan Julian Forsyth Thomas Fel M. Kowal Konstantinos G. Derpanis 111 7 0 06 Feb 2025
We're Different, We're the Same: Creative Homogeneity Across LLMs Emily Wenger Yoed Kenett 91 3 0 31 Jan 2025
Dimensions underlying the representational alignment of deep neural networks with humans F. Mahner Lukas Muttenthaler Umut Güçlü M. Hebart 48 4 0 28 Jan 2025
Measuring Error Alignment for Decision-Making Systems Binxia Xu Antonis Bikakis Daniel Onah A. Vlachidis Luke Dickens 41 0 0 03 Jan 2025
Differentiable Optimization of Similarity Scores Between Models and Brains Nathan Cloos Moufan Li Markus Siegel S. Brincat Earl K. Miller Guangyu Robert Yang Christopher J. Cueva 45 6 0 31 Dec 2024
Quantifying Knowledge Distillation Using Partial Information Decomposition Pasan Dissanayake Faisal Hamman Barproda Halder Ilia Sucholutsky Qiuyi Zhang Sanghamitra Dutta 36 0 0 12 Nov 2024
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models Michael Lan Philip H. S. Torr Austin Meek Ashkan Khakzar David M. Krueger Fazl Barez 43 10 0 09 Oct 2024
Emergence of a High-Dimensional Abstraction Phase in Language Transformers Emily Cheng Diego Doimo Corentin Kervadec Iuri Macocco Jade Yu A. Laio Marco Baroni 112 11 0 24 May 2024
Learned feature representations are biased by complexity, learning order, position, and more Andrew Kyle Lampinen Stephanie C. Y. Chan Katherine Hermann AI4CE FaML SSL OOD 34 6 0 09 May 2024
Learning with Language-Guided State Abstractions Andi Peng Ilia Sucholutsky Belinda Z. Li T. Sumers Thomas L. Griffiths Jacob Andreas Julie A. Shah LM&Ro 49 13 0 28 Feb 2024
Similarity of Neural Network Models: A Survey of Functional and Representational Measures Max Klabunde Tobias Schumacher M. Strohmaier Florian Lemmerich 52 64 0 10 May 2023
Human Uncertainty in Concept-Based AI Systems Katherine M. Collins Matthew Barker M. Zarlenga Naveen Raman Umang Bhatt M. Jamnik Ilia Sucholutsky Adrian Weller Krishnamurthy Dvijotham 66 39 0 22 Mar 2023
Analyzing Diffusion as Serial Reproduction Raja Marjieh Ilia Sucholutsky Thomas A. Langlois Nori Jacoby Thomas L. Griffiths DiffM 33 4 0 29 Sep 2022
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trębacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 227 502 0 28 Sep 2022
Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off M. Zarlenga Pietro Barbiero Gabriele Ciravegna G. Marra Francesco Giannini ... F. Precioso S. Melacci Adrian Weller Pietro Liò M. Jamnik 79 52 0 19 Sep 2022
The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks Lukas Huber Robert Geirhos Felix Wichmann 54 16 0 20 May 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 313 11,953 0 04 Mar 2022
Passive Attention in Artificial Neural Networks Predicts Human Visual Selectivity Thomas A. Langlois H. C. Zhao Erin Grant Ishita Dasgupta Thomas L. Griffiths Nori Jacoby 47 15 0 14 Jul 2021
Zero-Shot Text-to-Image Generation Aditya A. Ramesh Mikhail Pavlov Gabriel Goh Scott Gray Chelsea Voss Alec Radford Mark Chen Ilya Sutskever VLM 255 4,781 0 24 Feb 2021
On the surprising similarities between supervised and self-supervised models Robert Geirhos Kantharaju Narayanappa Benjamin Mitzkus Matthias Bethge Felix Wichmann Wieland Brendel OOD SSL DRL 74 46 0 16 Oct 2020
On Completeness-aware Concept-Based Explanations in Deep Neural Networks Chih-Kuan Yeh Been Kim Sercan Ö. Arik Chun-Liang Li Tomas Pfister Pradeep Ravikumar FAtt 122 297 0 17 Oct 2019
Fine-Tuning Language Models from Human Preferences Daniel M. Ziegler Nisan Stiennon Jeff Wu Tom B. Brown Alec Radford Dario Amodei Paul Christiano G. Irving ALM 280 1,595 0 18 Sep 2019
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks Chelsea Finn Pieter Abbeel Sergey Levine OOD 338 11,684 0 09 Mar 2017
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles Balaji Lakshminarayanan Alexander Pritzel Charles Blundell UQCV BDL 276 5,661 0 05 Dec 2016