
Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Main: 6 pages, 6 figures, 2 tables; Bibliography: 2 pages
Abstract

Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct a differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves a significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
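
The kind of differential attention-entropy analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's released code: it assumes GPT-2 as the "GPT-like" backbone via the Hugging Face transformers library, and the fine-tuned checkpoint name ("path/to/finetuned-gpt2-summarization") is a placeholder.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def head_entropies(model, input_ids):
    """Mean attention entropy per (layer, head), averaged over query positions."""
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    per_layer = []
    for layer_attn in out.attentions:           # each: (batch, heads, seq, seq)
        p = layer_attn.clamp_min(1e-12)
        h = -(p * p.log()).sum(dim=-1)          # entropy of each query's attention distribution
        per_layer.append(h.mean(dim=(0, 2)))    # average over batch and positions -> (heads,)
    return torch.stack(per_layer)               # (layers, heads)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tuned = GPT2LMHeadModel.from_pretrained("path/to/finetuned-gpt2-summarization").eval()

ids = tokenizer("Document to summarize ...", return_tensors="pt").input_ids
delta = head_entropies(tuned, ids) - head_entropies(base, ids)

# Heads whose entropy dropped after fine-tuning are candidates for the
# "summarization circuit" (the paper reports roughly 62% of heads decrease).
sharpened = delta < 0
print(f"{sharpened.float().mean().item():.0%} of heads show decreased entropy")
print("Sharpened heads per layer:", sharpened.sum(dim=1).tolist())

In practice the entropy deltas would be aggregated over a corpus of summarization inputs rather than a single prompt, and the layers with the largest shifts would then be the targets for the restricted LoRA adaptation the abstract describes.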

@article{mishra2025_2505.17073,
  title={Mechanistic Interpretability of GPT-like Models on Summarization Tasks},
  author={Anurag Mishra},
  journal={arXiv preprint arXiv:2505.17073},
  year={2025}
}