VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decodes tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at this https URL.
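To make the frame-wise chunk attention idea concrete, the following is a minimal, hypothetical sketch of how such an attention mask could be constructed: visual tokens attend only within their own frame chunk, while text tokens attend bidirectionally over the full sequence. The function name, token layout, and the choice to let visual tokens attend to text are illustrative assumptions, not VidLaDA's actual implementation.

```python
# Hypothetical sketch of a frame-wise chunk attention mask, assuming a layout of
# `num_frames` frames of `tokens_per_frame` visual tokens followed by `num_text`
# text tokens. All names and layout choices here are illustrative assumptions.
import torch

def build_framewise_chunk_mask(num_frames: int, tokens_per_frame: int, num_text: int) -> torch.Tensor:
    """Return a boolean attention mask where True means 'may attend'."""
    num_visual = num_frames * tokens_per_frame
    total = num_visual + num_text
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Visual tokens: restrict attention to tokens of the same frame (chunk),
    # pruning cross-frame visual redundancy.
    for f in range(num_frames):
        start, end = f * tokens_per_frame, (f + 1) * tokens_per_frame
        mask[start:end, start:end] = True

    # Text tokens: fully bidirectional over the whole sequence, so they can
    # aggregate global spatiotemporal context.
    mask[num_visual:, :] = True
    # Visual tokens may also attend to text tokens (an illustrative choice).
    mask[:num_visual, num_visual:] = True
    return mask

if __name__ == "__main__":
    m = build_framewise_chunk_mask(num_frames=4, tokens_per_frame=16, num_text=8)
    print(m.shape)  # torch.Size([72, 72])
```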