Decomposing and Editing Predictions by Modeling Model Computation

17 April 2024

Harshay Shah

Papers citing "Decomposing and Editing Predictions by Modeling Model Computation"

Learning to Attribute with Attention. Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry. 18 Apr 2025.
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution. Shichang Zhang, Tessa Han, Usha Bhalla, Hima Lakkaraju. 17 Feb 2025.
Jet Expansions of Residual Computation. Yihong Chen, Xiangxiang Xu, Yao Lu, Pontus Stenetorp, Luca Franceschi. 08 Oct 2024.
CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept. YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu. 08 Oct 2024.
Optimal ablation for interpretability. Maximilian Li, Lucas Janson. 16 Sep 2024.
ContextCite: Attributing Model Generation to Context. Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry. 01 Sep 2024.
When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models. Ting-Yun Chang, Jesse Thomason, Robin Jia. 19 Jun 2024.
Interpreting the Second-Order Effects of Neurons in CLIP. Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt. 06 Jun 2024.
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP. S. Balasubramanian, Samyadeep Basu, S. Feizi. 03 Jun 2024.
Learned feature representations are biased by complexity, learning order, position, and more. Andrew Kyle Lampinen, Stephanie C. Y. Chan, Katherine Hermann. 09 May 2024.