pyvene: A Library for Understanding and Improving PyTorch Models via
Interventions

12 March 2024

Jing Huang

Zheng Wang

Noah D. Goodman

Christopher D. Manning

Christopher Potts

Papers citing "pyvene: A Library for Understanding and Improving PyTorch Models via Interventions"

Title
MIB: A Mechanistic Interpretability Benchmark Aaron Mueller Atticus Geiger Sarah Wiegreffe Dana Arad Iván Arcuschin ... Alessandro Stolfo Martin Tutek Amir Zur David Bau Yonatan Belinkov 43 1 0 17 Apr 2025
Enhancing Hallucination Detection through Noise Injection Litian Liu Reza Pourreza Sunny Panchal Apratim Bhattacharyya Yao Qin Roland Memisevic HILM 75 2 0 06 Feb 2025
Controllable Context Sensitivity and the Knob Behind It Julian Minder Kevin Du Niklas Stoehr Giovanni Monea Chris Wendler Robert West Ryan Cotterell KELM 49 3 0 11 Nov 2024
Personality Alignment of Large Language Models Minjun Zhu Linyi Yang Yue Zhang ALM 64 5 0 21 Aug 2024
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data Daniel D. Johnson 36 3 0 01 Aug 2024
Monitoring Latent World States in Language Models with Propositional Probes Jiahai Feng Stuart Russell Jacob Steinhardt HILM 43 6 0 27 Jun 2024
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 75 7 0 07 Nov 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language Models Mor Geva Jasmijn Bastings Katja Filippova Amir Globerson KELM 191 261 0 28 Apr 2023
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Atticus Geiger Zhengxuan Wu Christopher Potts Thomas F. Icard Noah D. Goodman CML 75 98 0 05 Mar 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 212 494 0 01 Nov 2022