Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models

Abstract

Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM). However, these inputs are inherently different: visual inputs are multi-dimensional and contextually rich, often pre-encoded by models like CLIP, while textual inputs lack this structure. In this paper, we propose Decomposed Attention (D-Attn), a novel method that processes visual and textual embeddings differently by decomposing the 1-D causal self-attention in LVLMs. After the attention decomposition, D-Attn diagonalizes visual-to-visual self-attention, reducing computation from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual embeddings without compromising performance. Moreover, D-Attn debiases positional encodings in textual-to-visual cross-attention, further enhancing visual understanding. Finally, we introduce an $\alpha$-weighting strategy to merge visual and textual information, maximally preserving the pre-trained LLM's capabilities with minimal modifications. Extensive experiments and rigorous analyses validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while substantially reducing computational costs. Code, data, and models will be publicly available.

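The abstract only describes the decomposition at a high level. Below is a minimal single-head PyTorch sketch of how such a scheme could look: diagonalized visual-to-visual attention, causal text-to-text self-attention, text-to-visual cross-attention without positional offsets on the visual keys, and an alpha-weighted merge. The class name `DecomposedAttention`, the shared Q/K/V projections, the fixed scalar `alpha`, and the exact merge rule are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the decomposed-attention idea from the abstract (assumptions noted above).
import math
import torch
import torch.nn as nn


class DecomposedAttention(nn.Module):
    """Single-head sketch over concatenated [visual | text] tokens:
    (1) diagonalized visual-to-visual attention,
    (2) causal text-to-text self-attention,
    (3) text-to-visual cross-attention,
    with (2) and (3) merged by an alpha weight."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.alpha = alpha  # weight on visual vs. textual information (assumed fixed here)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, |V|, dim) visual embeddings; txt: (B, |T|, dim) text embeddings
        k_v, v_v = self.k(vis), self.v(vis)
        q_t, k_t, v_t = self.q(txt), self.k(txt), self.v(txt)

        # (1) Diagonalized visual-to-visual attention: each visual token attends
        # only to itself, so cost is O(|V|) instead of O(|V|^2). Softmax over a
        # single (diagonal) key reduces to the identity, i.e. the value itself.
        vis_out = v_v

        # (2) Causal text-to-text self-attention.
        t_len = txt.size(1)
        causal = torch.triu(torch.ones(t_len, t_len, device=txt.device), diagonal=1).bool()
        attn_tt = (q_t @ k_t.transpose(-2, -1)) * self.scale
        attn_tt = attn_tt.masked_fill(causal, float("-inf")).softmax(dim=-1)
        txt_self = attn_tt @ v_t

        # (3) Text-to-visual cross-attention. Positional "debiasing" is sketched
        # here by simply adding no positional offsets to the visual keys.
        attn_tv = ((q_t @ k_v.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        txt_cross = attn_tv @ v_v

        # Alpha-weighted merge of visual and textual information for text tokens.
        txt_out = self.alpha * txt_cross + (1.0 - self.alpha) * txt_self
        return torch.cat([vis_out, txt_out], dim=1)


if __name__ == "__main__":
    B, V, T, D = 2, 16, 8, 64
    layer = DecomposedAttention(D)
    out = layer(torch.randn(B, V, D), torch.randn(B, T, D))
    print(out.shape)  # torch.Size([2, 24, 64])
```

In this sketch the quadratic visual-to-visual term disappears entirely, which is where the stated $\mathcal{O}(|V|^2) \rightarrow \mathcal{O}(|V|)$ saving comes from; how the paper handles multi-head attention, RoPE debiasing, and the learned form of the $\alpha$ weighting is not specified in the abstract.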
@article{kuo2025_2502.01906,
  title={Rethinking Homogeneity of Vision and Text Tokens in Large Vision-and-Language Models},
  author={Chia-Wen Kuo and Sijie Zhu and Fan Chen and Xiaohui Shen and Longyin Wen},
  journal={arXiv preprint arXiv:2502.01906},
  year={2025}
}