Towards Combinatorial Interpretability of Neural Computation

10 April 2025

Micah Adler, Dan Alistarh, Nir Shavit

Main: 43 pages · 30 figures · Bibliography: 5 pages · 4 tables

Abstract

We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable.
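As a rough illustration of what a sign-based reading of weights might look like, the sketch below takes a small hand-written weight matrix, reduces each weight to its sign, and lists the neurons on which pairs of inputs share a positive weight. Everything here is an assumption made for illustration: the matrix `W`, the threshold `eps`, and the helper names `sign_pattern`, `positive_code`, and `shared_codes` are hypothetical and are not taken from the paper, and the procedure is a toy, not the authors' method.

```python
from itertools import combinations

# Hypothetical first-layer weights (not from the paper):
# rows are hidden neurons, columns are inputs x0..x3.
W = [
    [0.9, 1.1, -0.4, -0.6],    # neuron 0
    [1.0, 0.8, 0.7, 1.2],      # neuron 1
    [-0.5, -0.3, 1.0, 0.9],    # neuron 2
    [0.02, -0.01, 0.0, 0.03],  # neuron 3: weights close to zero
]


def sign_pattern(weights, eps=0.1):
    """Map each weight to +1, -1, or 0 (when |w| <= eps, an assumed cutoff)."""
    return [
        [0 if abs(w) <= eps else (1 if w > 0 else -1) for w in row]
        for row in weights
    ]


def positive_code(signs, input_idx):
    """Set of neurons that receive a positive weight from the given input."""
    return frozenset(n for n, row in enumerate(signs) if row[input_idx] == 1)


def shared_codes(signs):
    """For every pair of inputs, the neurons on which both are positive."""
    n_inputs = len(signs[0])
    codes = [positive_code(signs, i) for i in range(n_inputs)]
    return {
        (i, j): codes[i] & codes[j]
        for i, j in combinations(range(n_inputs), 2)
    }


signs = sign_pattern(W)
shared = shared_codes(signs)

# Inputs 0 and 1 share one set of neurons, inputs 2 and 3 share another.
print(sorted(shared[(0, 1)]))
print(sorted(shared[(2, 3)]))

# A neuron that belongs to both shared sets responds to both input pairs.
overlap = shared[(0, 1)] & shared[(2, 3)]
print(sorted(overlap))
```

In this toy matrix, neuron 1 belongs to the shared set of inputs 0 and 1 and also to the shared set of inputs 2 and 3, which is the kind of neuron sharing the abstract associates with polysemanticity and interference between channels.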


@article{adler2025_2504.08842,
  title={Towards Combinatorial Interpretability of Neural Computation},
  author={Micah Adler and Dan Alistarh and Nir Shavit},
  journal={arXiv preprint arXiv:2504.08842},
  year={2025}
}
