ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2404.15255
  4. Cited By
How to use and interpret activation patching

How to use and interpret activation patching

23 April 2024
Stefan Heimersheim
Neel Nanda
ArXivPDFHTML

Papers citing "How to use and interpret activation patching"

36 / 36 papers shown
Title
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen
Jiaying Zhu
Xinyu Yang
Wenya Wang
LRM
9
0
0
15 May 2025
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation
Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation
Chiara Manna
Afra Alishahi
Frédéric Blain
Eva Vanmassenhove
24
0
0
13 May 2025
Towards Quantifying Commonsense Reasoning with Mechanistic Insights
Towards Quantifying Commonsense Reasoning with Mechanistic Insights
Abhinav Joshi
A. Ahmad
Divyaksh Shukla
Ashutosh Modi
ReLM
LRM
36
0
0
14 Apr 2025
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
Qi Liu
Jiaxin Mao
Ji-Rong Wen
LRM
29
0
0
10 Apr 2025
Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI
Mechanistic Interpretability of Fine-Tuned Vision Transformers on Distorted Images: Decoding Attention Head Behavior for Transparent and Trustworthy AI
Nooshin Bahador
50
1
0
24 Mar 2025
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
Murong Yue
Ziyu Yao
SILM
AAML
56
0
0
18 Mar 2025
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
Philip Quirke
Clement Neo
Abir Harrasse
Dhruv Nathawani
Amir Abdullah
44
0
0
17 Mar 2025
(How) Do Language Models Track State?
Belinda Z. Li
Zifan Carl Guo
Jacob Andreas
LRM
46
0
0
04 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
102
0
0
24 Feb 2025
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Hiba Ahsan
Arnab Sen Sharma
Silvio Amir
David Bau
Byron C. Wallace
88
0
0
20 Feb 2025
Exploring Translation Mechanism of Large Language Models
Exploring Translation Mechanism of Large Language Models
Hongbin Zhang
Kehai Chen
Xuefeng Bai
Xiucheng Li
Yang Xiang
Min Zhang
64
1
0
17 Feb 2025
Designing Role Vectors to Improve LLM Inference Behaviour
Designing Role Vectors to Improve LLM Inference Behaviour
Daniele Potertì
Andrea Seveso
Fabio Mercorio
LLMSV
49
0
0
17 Feb 2025
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Transformer Dynamics: A neuroscientific approach to interpretability of large language models
Jesseba Fernando
Grigori Guitchounts
AI4CE
36
0
0
17 Feb 2025
Mechanistic Interpretability of Emotion Inference in Large Language Models
Mechanistic Interpretability of Emotion Inference in Large Language Models
Ala Nekouvaght Tak
Amin Banayeeanzade
Anahita Bolourani
Mina Kian
Robin Jia
Jonathan Gratch
54
0
0
08 Feb 2025
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis
It's Not Just a Phase: On Investigating Phase Transitions in Deep Learning-based Side-channel Analysis
Sengim Karayalçin
Marina Krček
Stjepan Picek
AAML
75
0
0
01 Feb 2025
Representation in large language models
Cameron C. Yetman
41
1
0
03 Jan 2025
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando
Oscar Obeso
Senthooran Rajamanoharan
Neel Nanda
82
10
0
21 Nov 2024
How Transformers Solve Propositional Logic Problems: A Mechanistic
  Analysis
How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis
Guan Zhe Hong
Nishanth Dikkala
Enming Luo
Cyrus Rashtchian
Xin Wang
Rina Panigrahy
OffRL
LRM
NAI
36
0
0
06 Nov 2024
Unlearning-based Neural Interpretations
Unlearning-based Neural Interpretations
Ching Lam Choi
Alexandre Duplessis
Serge Belongie
FAtt
44
0
0
10 Oct 2024
How Language Models Prioritize Contextual Grammatical Cues?
How Language Models Prioritize Contextual Grammatical Cues?
Hamidreza Amirzadeh
A. Alishahi
Hosein Mohebbi
21
0
0
04 Oct 2024
Racing Thoughts: Explaining Contextualization Errors in Large Language Models
Racing Thoughts: Explaining Contextualization Errors in Large Language Models
Michael A. Lepori
Michael Mozer
Asma Ghandeharioun
LRM
85
1
0
02 Oct 2024
Optimal ablation for interpretability
Optimal ablation for interpretability
Maximilian Li
Lucas Janson
FAtt
49
2
0
16 Sep 2024
Attention Heads of Large Language Models: A Survey
Attention Heads of Large Language Models: A Survey
Zifan Zheng
Yezhaohui Wang
Yuxin Huang
Shichao Song
Mingchuan Yang
Bo Tang
Feiyu Xiong
Zhiyu Li
LRM
58
21
0
05 Sep 2024
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models
A Mechanistic Interpretation of Syllogistic Reasoning in Auto-Regressive Language Models
Geonhee Kim
Marco Valentino
André Freitas
LRM
AI4CE
28
7
0
16 Aug 2024
The Mechanics of Conceptual Interpretation in GPT Models: Interpretative
  Insights
The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights
Nura Aljaafari
Danilo S. Carvalho
André Freitas
KELM
32
0
0
05 Aug 2024
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An
  Axiomatic Approach
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
41
1
0
18 Jul 2024
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft
  Agent
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys
George Adamopoulos
Mehrab Hamidi
Stephanie Milani
Mohammad Reza Samsami
Artem Zholus
Sonia Joseph
Blake A. Richards
Irina Rish
Özgür Simsek
42
2
0
16 Jul 2024
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for
  Interpreting Neural Networks
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller
CML
30
10
0
05 Jul 2024
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai
Yilun Zhou
Shi Feng
Abulhair Saparov
Ziyu Yao
82
19
0
02 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
35
17
0
25 Jun 2024
Transcoders Find Interpretable LLM Feature Circuits
Transcoders Find Interpretable LLM Feature Circuits
Jacob Dunefsky
Philippe Chlenski
Neel Nanda
27
23
0
17 Jun 2024
Controlling Large Language Model Agents with Entropic Activation
  Steering
Controlling Large Language Model Agents with Entropic Activation Steering
Nate Rahn
P. DÓro
Marc G. Bellemare
LLMSV
30
6
0
01 Jun 2024
Exploring and steering the moral compass of Large Language Models
Exploring and steering the moral compass of Large Language Models
Alejandro Tlaie
LLMSV
32
3
0
27 May 2024
How does GPT-2 compute greater-than?: Interpreting mathematical
  abilities in a pre-trained language model
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Michael Hanna
Ollie Liu
Alexandre Variengien
LRM
189
120
0
30 Apr 2023
Dissecting Recall of Factual Associations in Auto-Regressive Language
  Models
Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva
Jasmijn Bastings
Katja Filippova
Amir Globerson
KELM
191
261
0
28 Apr 2023
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
212
496
0
01 Nov 2022
1