Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation
Understanding and interpreting the internal representations of large language models (LLMs) remains an open challenge. Patchscopes introduced a method for probing internal activations by patching them into new prompts, allowing the model to explain its own hidden representations in natural language. We introduce Superscopes, a technique that systematically amplifies superposed features in multilayer perceptron (MLP) outputs and hidden states before patching them into new contexts. Inspired by the "features as directions" perspective and the Classifier-Free Guidance (CFG) approach from diffusion models, Superscopes amplifies weak but meaningful features, enabling the interpretation of internal representations that previous methods failed to explain, all without requiring additional training. This approach provides new insights into how LLMs build context and represent complex concepts, further advancing mechanistic interpretability.
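To make the amplify-then-patch idea concrete, the sketch below illustrates one plausible reading of the abstract: capture a hidden state from a source prompt, scale it by an amplification factor before patching (here treated as simple scalar scaling of the activation vector, an assumption based on the "features as directions" framing), and inject it into a Patchscopes-style inspection prompt. The model name, layer index, placeholder prompt, positions, and the factor alpha are illustrative choices, not the paper's exact setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM with accessible hidden states; GPT-2 used for brevity.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

source_prompt = "The Eiffel Tower is located in"
target_prompt = "X X X means"            # illustrative Patchscopes-style inspection prompt
layer, src_pos, tgt_pos = 6, -1, 0        # layer/positions are illustrative
alpha = 4.0                               # amplification factor (alpha > 1 strengthens the feature)

# 1) Capture the hidden state at (layer, src_pos) from the source prompt.
#    hidden_states[layer + 1] is the output of transformer block `layer` (index 0 is embeddings).
with torch.no_grad():
    src = tok(source_prompt, return_tensors="pt")
    hs = model(**src, output_hidden_states=True).hidden_states[layer + 1][0, src_pos]

# 2) Amplify the activation vector before patching.
patched_vec = alpha * hs

# 3) Patch the amplified vector into the target prompt at the same layer via a forward hook.
def patch_hook(module, inputs, output):
    h = output[0] if isinstance(output, tuple) else output
    h[0, tgt_pos] = patched_vec           # overwrite the residual stream at the placeholder position
    return output

block = model.transformer.h[layer]        # assumption: GPT-2 module path; differs per architecture
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    tgt = tok(target_prompt, return_tensors="pt")
    logits = model(**tgt).logits
handle.remove()

# Inspect the model's next-token prediction under the amplified, patched representation.
print(tok.decode(logits[0, -1].argmax()))

In practice one would sweep alpha and compare the patched model's continuations against the unamplified baseline; the abstract's claim is that larger amplification surfaces weak superposed features that plain patching leaves uninterpretable.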
@article{jacobi2025_2503.02078,
  title={Superscopes: Amplifying Internal Feature Representations for Language Model Interpretation},
  author={Jonathan Jacobi and Gal Niv},
  journal={arXiv preprint arXiv:2503.02078},
  year={2025}
}