2306.02858
Cited By
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
HuggingFace (19 upvotes)
Papers citing
"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"
50 / 669 papers shown
Title
CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding
Yuefei Chen
Jiang Liu
Xiaodong Lin
Ruixiang Tang
LRM
76
0
0
25 Nov 2025
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
Zhantao Gong
Liaoyuan Fan
Qing Guo
Xun Xu
Xulei Yang
Shijie Li
16
0
0
24 Nov 2025
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
Shaobo Wang
Tianle Niu
Runkang Yang
Deshan Liu
Xu He
Zichen Wen
Conghui He
Xuming Hu
Linfeng Zhang
VGen
126
0
0
24 Nov 2025
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
James Y. Huang
Sheng Zhang
Qianchu Liu
Guanghui Qin
Tinghui Zhu
Tristan Naumann
Muhao Chen
Hoifung Poon
VLM
LRM
65
0
0
24 Nov 2025
AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
C. Koutlis
Symeon Papadopoulos
48
0
0
24 Nov 2025
VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning
Boyu Chen
Zikang Wang
Zhengrong Yue
Kainan Yan
Chenyun Yu
...
Yafei Wen
Xiaoxin Chen
Yang Liu
Peng Li
Yali Wang
LLMAG
156
0
0
24 Nov 2025
Unboxing the Black Box: Mechanistic Interpretability for Algorithmic Understanding of Neural Networks
Bianka Kowalska
Halina Kwaśnicka
71
0
0
24 Nov 2025
EventBench: Towards Comprehensive Benchmarking of Event-based MLLMs
Shaoyu Liu
Jianing Li
Guanghui Zhao
Y. Zhang
Xiangyang Ji
9
0
0
23 Nov 2025
ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access
Timing Yang
Sucheng Ren
Alan Yuille
Feng Wang
VGen
38
0
0
23 Nov 2025
ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering
Yuxiang Nie
Han Wang
Yongjie Ye
Haiyang Yu
Weitao Jia
...
Zehui Dai
Jiacong Wang
Dingkang Yang
An-Lan Wang
Can Huang
ELM
52
0
0
23 Nov 2025
Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning
Xiaohong Liu
Xiufeng Song
Huayu Zheng
Lei Bai
Xiaoming Liu
Guangtao Zhai
DiffM
36
0
0
22 Nov 2025
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM
Chiori Hori
Yoshiki Masuyama
Siddarth Jain
Radu Corcodel
Devesh K. Jha
Diego Romeres
Jonathan Le Roux
16
0
0
21 Nov 2025
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Jiaye Qian
Ge Zheng
Yuchen Zhu
Sibei Yang
MLLM
132
1
0
21 Nov 2025
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Kaichen Zhang
Keming Wu
Zuhao Yang
Kairui Hu
Bin Wang
Ziwei Liu
X. Li
Lidong Bing
OffRL
ReLM
LRM
VLM
36
3
0
20 Nov 2025
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Urjitkumar Patel
Fang-Chun Yeh
Chinmay Gondhalekar
148
0
0
19 Nov 2025
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu
Di Fu
Jiaxing Zhang
Gong Yu
Jiayu Zheng
Xiaoling Hu
Dongdi Zhao
Feiyang Li
Chao Chen
Yong Cao
33
0
0
19 Nov 2025
TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition
Wen Yin
Siyu Zhan
Cencen Liu
Xin Hu
Guiduo Duan
Xiurui Xie
Yuan-Fang Li
Tao He
101
0
0
19 Nov 2025
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Keda Tao
Kele Shao
Bohan Yu
Weiqiang Wang
Jian Liu
Huan Wang
VLM
148
0
0
18 Nov 2025
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
An Yu
Weiheng Lu
Jian Li
Zhenfei Zhang
Yunhang Shen
Felix X.-F. Ye
Ming-Ching Chang
53
0
0
18 Nov 2025
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
Wenxin Zhu
Andong Chen
Yuchen Song
Kehai Chen
Conghui Zhu
Ziyan Chen
Tiejun Zhao
LRM
305
0
0
17 Nov 2025
Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound
Dengming Zhang
W. You
Jingxiong Li
Weishen Lin
Wenda Shi
Xue Zhao
H. Zuo
Junxian Wu
Lingyun Sun
VLM
20
0
0
15 Nov 2025
Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets
Huy M. Le
Dat Nguyen
Phuc Binh Nguyen
Gia-Bao Le-Tran
Phu Truong Thien
...
T. Nguyen
H. Ngo
T. Nguyen
Binh T. Nguyen
Monojit Choudhury
24
0
0
15 Nov 2025
CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models
Jingyao Li
Jingyun Wang
Molin Tan
Haochen Wang
Cilin Yan
Likun Shi
Jiayin Cai
Xiaolong Jiang
Yao Hu
VLM
LRM
100
0
0
15 Nov 2025
Striking the Right Balance between Compute and Copy: Improving LLM Inferencing Under Speculative Decoding
Arun Ramachandran
Ramaswamy Govindarajan
M. Annavaram
Prakash Raghavendra
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
24
0
0
15 Nov 2025
A Structure-Agnostic Co-Tuning Framework for LLMs and SLMs in Cloud-Edge Systems
Yuze Liu
Yunhan Wang
Tiehua Zhang
Zhishu Shen
Cheng Peng
Libing Wu
Feng Xia
Jiong Jin
52
0
0
12 Nov 2025
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression
Yilong Chen
Xiang Bai
Zhibin Wang
Chengyu Bai
Yuhan Dai
Ming Lu
Shanghang Zhang
81
0
0
10 Nov 2025
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
Ying Cheng
Y. Lin
Min-Hung Chen
Fu-En Yang
S. Lai
87
0
0
10 Nov 2025
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
Tianhao Peng
Haochen Wang
Yuanxing Zhang
Zekun Wang
Zili Wang
...
Wei Ji
Pengfei Wan
Wenhao Huang
Zhaoxiang Zhang
Jiaheng Liu
ELM
216
0
0
10 Nov 2025
NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models
Kyuho Lee
Euntae Kim
Jinwoo Choi
Buru Chang
HILM
39
0
0
09 Nov 2025
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Aakriti Agrawal
Gouthaman KV
R. Aralikatti
Gauri Jagatap
Jiaxin Yuan
Vijay Kamarshi
Andrea Fanelli
Furong Huang
VLM
64
0
0
07 Nov 2025
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang
J. Yang
Pinzhi Huang
Ellis L Brown
Zihao Yang
...
Daohan Lu
Rob Fergus
Yann LeCun
Li Fei-Fei
Saining Xie
60
6
0
06 Nov 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
72
0
0
31 Oct 2025
Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh
Erfan Bagheri Soula
Omid Daliran
Simon Gottschalk
Mohsen Fayyaz
28
0
0
29 Oct 2025
Positional Preservation Embedding for Multimodal Large Language Models
Mouxiao Huang
Borui Jiang
Dehua Zheng
Hailin Hu
Kai Han
Xinghao Chen
VLM
173
0
0
27 Oct 2025
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Baoqi Pei
Yifei Huang
Jilan Xu
Yuping He
Guo Chen
Fei Wu
Yu Qiao
Jiangmiao Pang
EgoV
LRM
114
4
0
27 Oct 2025
MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
Anisha Saha
Varsha Suresh
Timothy Hospedales
Vera Demberg
LRM
45
0
0
27 Oct 2025
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan
W. Zhang
Xin Li
Shihao Wang
Kehan Li
Wentong Li
Jun Xiao
Lei Zhang
Beng Chin Ooi
ObjD
242
0
0
27 Oct 2025
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Wenlong Li
Yifei Xu
Yuan Rao
Zhenhua Wang
Shuiguang Deng
76
0
0
26 Oct 2025
MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection
Shengtian Yang
Yue Feng
Yingshi Liu
Jingrou Zhang
Jie Qin
OffRL
72
0
0
24 Oct 2025
Towards Fine-Grained Human Motion Video Captioning
Guorui Song
Guocun Wang
Zhe Huang
Jing Lin
Xuefei Zhe
Jian Li
Haoqian Wang
24
0
0
24 Oct 2025
Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
Minseok Kang
M. Lee
Minjung Kim
Donghyeong Kim
Sangyoun Lee
64
0
0
23 Oct 2025
HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models
Zelin Peng
Zhengqin Xu
Qingyang Liu
Xiaokang Yang
Wei Shen
93
0
0
23 Oct 2025
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
Jiahao Meng
X. Li
Haochen Wang
Yue Tan
Tao Zhang
...
Yunhai Tong
Anran Wang
Zhiyang Teng
Y. Wang
Z. Wang
VGen
LRM
236
3
0
23 Oct 2025
AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
Jiayu Zhang
Qilang Ye
Shuo Ye
Xun Lin
Zihan Song
Zitong Yu
56
0
0
21 Oct 2025
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
Zhaoyang Han
Qihan Lin
Hao Liang
Bowen Chen
Zhou Liu
Wentao Zhang
VLM
91
0
0
20 Oct 2025
HouseTour: A Virtual Real Estate A(I)gent
Ata Çelen
Marc Pollefeys
Daniel Barath
Iro Armeni
VGen
145
1
0
20 Oct 2025
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Shraman Pramanick
E. Mavroudi
Yale Song
Rama Chellappa
Lorenzo Torresani
Triantafyllos Afouras
92
0
0
19 Oct 2025
EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning
Haoran Sun
Chen Cai
Huiping Zhuang
Kong Aik Lee
Lap-Pui Chau
Yi Wang
76
0
0
18 Oct 2025
RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
Kunyu Peng
Di Wen
Jia Fu
Jiamin Wu
Kailun Yang
...
Yufan Chen
Yuqian Fu
D. Paudel
Luc Van Gool
Rainer Stiefelhagen
77
0
0
18 Oct 2025
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
...
Andrew Tao
Song Han
Jan Kautz
Hongxu Yin
Pavlo Molchanov
98
3
0
17 Oct 2025