
Attention Mechanism, Max-Affine Partition, and Universal Approximation

Abstract

We establish the universal approximation capability of single-layer, single-head self- and cross-attention mechanisms with minimal attached structures. Our key insight is to interpret single-head attention as an input-domain partition mechanism that assigns distinct values to subregions. This allows us to engineer the attention weights so that this assignment imitates the target function. Building on this, we prove that a single self-attention layer, preceded by a sum of linear transformations, can approximate any continuous function on a compact domain under the $L_\infty$-norm. Furthermore, we extend this construction to approximate any Lebesgue-integrable function under the $L_p$-norm for $1 \leq p < \infty$. Lastly, we extend our techniques and show, for the first time, that single-head cross-attention achieves the same universal approximation guarantees.
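The partition idea in the abstract can be illustrated with a small numerical sketch: if the attention scores are affine in the input, $s_i(x) = a_i^\top x + b_i$, then a sharp softmax concentrates its weight on the index attaining $\max_i s_i(x)$, so the attention output approximately returns the value assigned to the max-affine cell containing $x$. The snippet below is a minimal one-dimensional illustration of this selection behavior, not the paper's actual construction; the function name `attention_partition`, the sharpness parameter `beta`, and the nearest-center choice of slopes and offsets are all assumptions made for the example.

```python
import numpy as np

def attention_partition(x, keys_a, keys_b, values, beta=5000.0):
    # Single-head attention read as a max-affine partition (illustrative sketch):
    # scores are affine in the input, score_i = a_i * x + b_i, and a sharp
    # softmax (large beta) concentrates on the piece attaining the maximum,
    # so the output is roughly the value assigned to the cell containing x.
    scores = keys_a * x + keys_b
    w = np.exp(beta * (scores - scores.max()))  # numerically stable softmax
    w /= w.sum()
    return w @ values

# Toy example (assumed setup, not from the paper): piecewise-constant
# approximation of f(x) = sin(2*pi*x) on [0, 1].  Choosing a_i = c_i and
# b_i = -c_i^2 / 2 makes piece i dominate exactly when x is closest to the
# grid center c_i, i.e. a nearest-center (max-affine) partition of [0, 1].
centers = np.linspace(0.0, 1.0, 64)
keys_a, keys_b = centers, -0.5 * centers**2
values = np.sin(2 * np.pi * centers)  # value assigned to each cell

for x in [0.1, 0.3, 0.5, 0.7, 0.9]:
    approx = attention_partition(x, keys_a, keys_b, values)
    print(f"x={x:.1f}  attention={approx:+.3f}  target={np.sin(2*np.pi*x):+.3f}")
```

Refining the grid (and sharpening `beta`) shrinks the gap between the attention output and the target, which mirrors, in this simplified setting, how the paper's partition argument yields approximation under the $L_\infty$-norm.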

@article{liu2025_2504.19901,
  title={Attention Mechanism, Max-Affine Partition, and Universal Approximation},
  author={Hude Liu and Jerry Yao-Chieh Hu and Zhao Song and Han Liu},
  journal={arXiv preprint arXiv:2504.19901},
  year={2025}
}