Curse of High Dimensionality Issue in Transformer for Long-context Modeling

28 May 2025
Shuhai Zhang
Zeng You
Yaofo Chen
Zhiquan Wen
Qianyue Wang
Zhijie Qiu
Yuanqing Li
Mingkui Tan
Abstract

Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at this https URL.
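The abstract describes reducing redundancy by keeping the few tokens that matter most and aggregating the less important ones during attention. The sketch below illustrates that general idea only; it is not the authors' DGA implementation. The scoring rule, the top-k cutoff, the number of groups, the mean-pooled group representatives, and the grouped_attention helper are all illustrative assumptions.

# Minimal sketch (not the paper's DGA): keep the top-k most relevant key/value
# tokens and collapse the remaining tokens into a few mean-pooled group
# representatives before computing attention over the reduced set.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(q, K, V, top_k=4, n_groups=2):
    """q: (d,), K and V: (n, d). Returns an attention output of shape (d,)."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)            # relevance of each token to the query
    order = np.argsort(-scores)
    keep, rest = order[:top_k], order[top_k:]

    # Aggregate the less important tokens into n_groups representatives.
    K_agg, V_agg = [], []
    for chunk in np.array_split(rest, n_groups):
        if len(chunk) > 0:
            K_agg.append(K[chunk].mean(axis=0))
            V_agg.append(V[chunk].mean(axis=0))
    K_red = np.vstack([K[keep]] + K_agg)    # reduced key set
    V_red = np.vstack([V[keep]] + V_agg)    # matching value set

    w = softmax(K_red @ q / np.sqrt(d))     # attention over the reduced set
    return w @ V_red

# Usage: attention over 64 tokens is reduced to top_k + n_groups = 6 entries.
rng = np.random.default_rng(0)
q = rng.normal(size=16)
K, V = rng.normal(size=(64, 16)), rng.normal(size=(64, 16))
print(grouped_attention(q, K, V).shape)     # (16,)

With this kind of scheme, the cost of the softmax and value aggregation scales with the reduced set size rather than the full context length, which is the source of the computational savings the abstract points to.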

@article{zhang2025_2505.22107,
  title={Curse of High Dimensionality Issue in Transformer for Long-context Modeling},
  author={Shuhai Zhang and Zeng You and Yaofo Chen and Zhiquan Wen and Qianyue Wang and Zhijie Qiu and Yuanqing Li and Mingkui Tan},
  journal={arXiv preprint arXiv:2505.22107},
  year={2025}
}