Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from O(|V|^2) to O(|V|) for |V| visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an α-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while substantially reducing computational costs. Code, data, and models will be publicly available.
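To make the decomposition concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the three components named in the abstract: a diagonalized visual-to-visual branch, causal textual self-attention, textual-to-visual cross-attention, and a merge of the visual and textual context. The function name d_attn, the argument layout, and the scalar convex combination used as a stand-in for the paper's α-weighting are all assumptions for illustration; the positional "debiasing" is emulated only by keeping this branch position-free.

# Hypothetical single-head sketch of the decomposed-attention idea; PyTorch.
import math
import torch

def d_attn(q_txt, k_txt, v_txt, q_vis, k_vis, v_vis, alpha):
    """q_*/k_*/v_*: (T, d) textual and (V, d) visual projections for one head.
    alpha: scalar in [0, 1] weighting visual vs. textual context for text tokens
    (a simplified stand-in for the paper's α-weighting)."""
    d = q_txt.shape[-1]
    scale = 1.0 / math.sqrt(d)

    # 1) Visual-to-visual attention is diagonalized: each visual token attends
    #    only to itself, so this branch costs O(|V|) rather than O(|V|^2).
    out_vis = v_vis  # softmax over a single (self) key is the identity

    # 2) Standard causal textual-to-textual self-attention.
    T = q_txt.shape[0]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    att_tt = (q_txt @ k_txt.T) * scale
    att_tt = att_tt.masked_fill(causal, float("-inf"))
    out_tt = att_tt.softmax(dim=-1) @ v_txt

    # 3) Textual-to-visual cross-attention; kept position-free here as a rough
    #    proxy for debiased positional encodings.
    att_tv = (q_txt @ k_vis.T) * scale
    out_tv = att_tv.softmax(dim=-1) @ v_vis

    # 4) Merge visual and textual context for the text tokens.
    out_txt = alpha * out_tv + (1.0 - alpha) * out_tt
    return out_vis, out_txt

# Usage with random embeddings: 196 visual tokens, 32 text tokens, head dim 64.
if __name__ == "__main__":
    d, V, T = 64, 196, 32
    txt = [torch.randn(T, d) for _ in range(3)]
    vis = [torch.randn(V, d) for _ in range(3)]
    out_vis, out_txt = d_attn(*txt, *vis, alpha=0.5)
    print(out_vis.shape, out_txt.shape)  # (196, 64) and (32, 64)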
@article{kuo2025_2502.01906,
  title   = {Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models},
  author  = {Chia-Wen Kuo and Sijie Zhu and Fan Chen and Xiaohui Shen and Longyin Wen},
  journal = {arXiv preprint arXiv:2502.01906},
  year    = {2025}
}