xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

21 October 2024

Papers citing "xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs"

2 / 2 papers shown

Title
VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding Zongxia Li Xiyang Wu Guangyao Shi Yubin Qin Hongyang Du Tianyi Zhou Dinesh Manocha Jordan Lee Boyd-Graber MLLM 57 0 0 02 May 2025
ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding Yi-Xing Peng Q. Yang Yu-Ming Tang Shenghao Fu Kun-Yu Lin Xihan Wei Wei-Shi Zheng 45 0 0 25 Apr 2025