VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Zhihao He
Tieyuan Chen
Kangyu Wang
Ziran Qin
Yang Shao
Chaofan Gan
Shijie Li
Zuxuan Wu
Weiyao Lin
Main: 10 pages · 11 figures · Bibliography: 1 page · 17 tables · Appendix: 18 pages
Abstract

Standard autoregressive Video LLMs are constrained by causal masking, which hinders global spatiotemporal modeling and limits understanding efficiency. We propose VidLaDA, a Video LLM built on a diffusion language model that uses bidirectional attention to capture global spatiotemporal dependencies. To address the inference bottleneck of diffusion decoding over massive numbers of video tokens, we introduce MARS-Cache, a framework that accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, pruning redundant computation while preserving global connectivity through anchor tokens. Extensive experiments show that VidLaDA outperforms diffusion-based baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over a 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at this https URL.
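The abstract only names the components of MARS-Cache; the sketch below is a minimal, illustrative construction of a frame-wise chunk attention mask with globally connected anchor tokens. The token layout, anchor selection (first tokens of each frame), and mask semantics are assumptions, not the paper's actual implementation, and the asynchronous cache-refresh schedule is not shown.

```python
# Minimal sketch of frame-wise chunk attention with global anchor tokens.
# Assumptions: one anchor per frame, anchors taken as the first token(s)
# of each frame; the real MARS-Cache design may differ.
import torch

def frame_chunk_attention_mask(num_frames: int,
                               tokens_per_frame: int,
                               anchors_per_frame: int = 1) -> torch.Tensor:
    """Boolean mask over video tokens (True = attention allowed).

    Regular tokens attend only within their own frame chunk, pruning
    cross-frame redundancy; anchor tokens attend to, and are attended
    by, all tokens, preserving global connectivity.
    """
    n = num_frames * tokens_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Block-diagonal part: tokens within the same frame see each other.
    for f in range(num_frames):
        s, e = f * tokens_per_frame, (f + 1) * tokens_per_frame
        mask[s:e, s:e] = True

    # Anchor tokens (here: the first few tokens of each frame) are global.
    anchor_idx = torch.cat([
        torch.arange(f * tokens_per_frame,
                     f * tokens_per_frame + anchors_per_frame)
        for f in range(num_frames)
    ])
    mask[anchor_idx, :] = True   # anchors attend everywhere
    mask[:, anchor_idx] = True   # all tokens attend to anchors
    return mask

# Example: 4 frames x 8 tokens each, 1 anchor per frame.
m = frame_chunk_attention_mask(4, 8)
print(m.shape, m.float().mean().item())  # fraction of allowed attention pairs
```

Such a mask keeps attention cost roughly linear in the number of frames for the non-anchor tokens, while the small set of anchors carries cross-frame information, which matches the abstract's stated goal of pruning redundancy without losing global context.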
