2306.02858
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
Papers citing
"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"
50 / 669 papers shown
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Yuatyong Chaichana
Pittawat Taveekitworachai
Warit Sirichotedumrong
Potsawee Manakul
Kunat Pipatanakul
AuLLM
96
0
0
17 Oct 2025
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Jinglei Zhang
Yuanfan Guo
Rolandos Alexandros Potamias
Jiankang Deng
Hang Xu
Chao Ma
LRM
59
2
0
16 Oct 2025
Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
Deyue Zhang
Dongdong Yang
Junjie Mu
Quancheng Zou
Zonghao Ying
Wenzhuo Xu
Zhao Liu
Xuan Wang
X. Zhang
80
0
0
16 Oct 2025
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Natan Bagrov
Eugene Khvedchenia
Borys Tymchenko
Shay Aharon
Lior Kadoch
...
Yonatan Geifman
Ran Zilberstein
Tuomas Rintamaki
Matthieu Le
Andrew Tao
VLM
84
1
0
16 Oct 2025
NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
Run Luo
Xiaobo Xia
Lu Wang
Longze Chen
Renke Shan
Jing Luo
Min Yang
Tat-Seng Chua
VGen
152
4
0
15 Oct 2025
Self-Augmented Visual Contrastive Decoding
Eun Woo Im
M. K. Ali
Vivek Gupta
89
0
0
15 Oct 2025
K-frames: Scene-Driven Any-k Keyframe Selection for long video understanding
Yifeng Yao
Yike Yun
Jing Wang
Huishuai Zhang
Dongyan Zhao
Ke Tian
Zhihao Wang
Minghui Qiu
Tao Wang
CLIP
VGen
44
1
0
14 Oct 2025
VideoLucy: Deep Memory Backtracking for Long Video Understanding
Jialong Zuo
Yongtai Deng
Lingdong Kong
J. Yang
Rui Jin
Y. Zhang
Nong Sang
Liang Pan
Ziwei Liu
Changxin Gao
57
0
0
14 Oct 2025
Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning
Tanner Muturi
Blessing Agyei Kyem
Joshua Kofi Asamoah
Neema Jakisa Owor
Richard Dyzinela
Andrews Danyo
Y. Adu-Gyamfi
Armstrong Aboah
LRM
80
3
0
13 Oct 2025
Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis
Blessing Agyei Kyem
Neema Jakisa Owor
Andrews Danyo
Joshua Kofi Asamoah
Eugene Kofi Okrah Denteh
Tanner Muturi
Anthony Dontoh
Y. Adu-Gyamfi
Armstrong Aboah
61
2
0
13 Oct 2025
A Survey on Agentic Multimodal Large Language Models
Huanjin Yao
Ruifei Zhang
Jiaxing Huang
Jingyi Zhang
Yibo Wang
...
Ruolin Zhu
Yongcheng Jing
Shunyu Liu
Guanbin Li
Dacheng Tao
LM&Ro
AIFin
AI4TS
LRM
AI4CE
149
3
0
13 Oct 2025
When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance
Jinjin Cao
Zhiyang Chen
Zijun Wang
Liyuan Ma
Weijian Luo
Guojun Qi
VLM
93
0
0
12 Oct 2025
RO-Bench: Large-scale robustness evaluation of MLLMs with text-driven counterfactual videos
Zixi Yang
Jiapeng Li
Muxi Diao
Yinuo Jing
Kongming Liang
AAML
VGen
56
0
0
10 Oct 2025
MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai
Sen Yang
Boqiang Duan
Wankou Yang
Jingdong Wang
VOS
183
0
0
10 Oct 2025
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
Y. Huang
Yizhou Wang
Yun Fu
VLM
34
0
0
09 Oct 2025
Improving Temporal Understanding Logic Consistency in Video-Language Models via Attention Enhancement
Chengzhi Li
Heyan Huang
Ping Jian
Zhen Yang
Yaning Tian
60
0
0
09 Oct 2025
Addressing the ID-Matching Challenge in Long Video Captioning
Zhantao Yang
Huangji Wang
Ruili Feng
Han Zhang
Yuting Hu
Shangwen Zhu
Junyan Li
Yu Liu
Fan Cheng
68
0
0
08 Oct 2025
Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
Ruyang Liu
Shangkun Sun
Haoran Tang
Ge Li
Wei-Nan Gao
VGen
VLM
56
3
0
07 Oct 2025
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
Deheng Zhang
Yuqian Fu
Runyi Yang
Yang Miao
Tianwen Qian
...
Ajad Chhatkuli
Xuanjing Huang
Yu-Gang Jiang
Luc Van Gool
D. Paudel
EgoV
165
2
0
07 Oct 2025
From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning
Li Zeqiao
Wang Yijing
Wang Haoyu
Li Zheng
Li Peng
Liu Wenfei
Zuo Zhiqiang
84
0
0
07 Oct 2025
When Thinking Drifts: Evidential Grounding for Robust Video Reasoning
M. Luo
Zihui Xue
Alex Dimakis
Kristen Grauman
VGen
LRM
204
3
0
07 Oct 2025
From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding
Shih-Yao Lin
Sibendu Paul
Caren Chen
106
1
0
07 Oct 2025
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang
Jing Bi
Pinxin Liu
Zhenyu Pan
Mingqian Feng
...
Zeliang Zhang
Daiki Shimada
Han Liu
Jiebo Luo
Chenliang Xu
MLLM
OffRL
VLM
LRM
462
7
0
06 Oct 2025
Zephyrus: An Agentic Framework for Weather Science
Sumanth Varambally
Marshall Fisher
Jas Thakker
Yiwei Chen
Zhirui Xia
...
Salva Rühling Cachay
Taylor Berg-Kirkpatrick
Duncan Watson-Parris
Yi-An Ma
Rose Yu
LLMAG
88
1
0
05 Oct 2025
Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
C. Wang
Donglin Bai
Yifan Yang
Xiao Jin
Anlan Zhang
...
Jingdong Sun
Chong Luo
Ting Cao
Lili Qiu
Suman Banerjee
184
1
0
05 Oct 2025
FrameOracle: Learning What to See and How Much to See in Videos
Chaoyu Li
Tianzhi Li
Fei Tao
Zhenyu Zhao
Ziqian Wu
Maozheng Zhao
Juntong Song
Cheng Niu
Pooyan Fazli
VLM
44
0
0
04 Oct 2025
AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Xian Zhang
Zexi Wu
Zinuo Li
Hongming Xu
Luqi Gong
F. Boussaïd
Naoufel Werghi
Mohammed Bennamoun
VGen
60
0
0
03 Oct 2025
Augmenting LLMs for General Time Series Understanding and Prediction
Felix Parker
Nimeesha Chan
Chi Zhang
Kimia Ghobadi
AI4TS
80
0
0
01 Oct 2025
V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs
Zhengpeng Shi
Hengli Li
Yanpeng Zhao
Jianqun Zhou
Yuxuan Wang
Qinrong Cui
Wei Bi
Songchun Zhu
Bo Zhao
Zilong Zheng
VLM
58
0
0
30 Sep 2025
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi
Jacopo Staiano
Antonio Liotta
VLM
47
0
0
30 Sep 2025
Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey
Yuntao Shou
Tao Meng
Wei Ai
Keqin Li
LRM
102
3
0
29 Sep 2025
Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA
Jianxin Liang
Tan Yue
Yuxuan Wang
Yueqian Wang
Zhihan Yin
Huishuai Zhang
Dongyan Zhao
56
0
0
29 Sep 2025
Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents
J. Li
Kun-Juan Wei
Zhe Xu
Zibo Su
Xu Yang
Cheng Deng
58
0
0
29 Sep 2025
UniVid: The Open-Source Unified Video Model
Jiabin Luo
Junhui Lin
Zeyu Zhang
Biao Wu
Meng Fang
Ling-Hao Chen
Hao Tang
VGen
178
5
0
29 Sep 2025
NeMo: Needle in a Montage for Video-Language Understanding
Zi-Yuan Hu
Shuo Liang
Duo Zheng
Yanyang Li
Yeyao Tao
...
Jianguang Yu
Jing-ling Huang
Meng Fang
Yin Li
Liwei Wang
105
1
0
29 Sep 2025
VideoScore2: Think before You Score in Generative Video Evaluation
Xuan He
Dongfu Jiang
Ping Nie
Minghao Liu
Z. L. Jiang
...
Qunshu Lin
Yuanxing Zhang
Ge Zhang
Wenhao Huang
Wenhu Chen
EGVM
VGen
LRM
1.2K
3
0
26 Sep 2025
Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
Zeyu Wang
Baiyu Chen
Kun Yan
Hongjing Piao
Hao Xue
Flora D. Salim
Yuanchun Shi
Yuntao Wang
64
0
0
26 Sep 2025
HyCoVAD: A Hybrid SSL-LLM Model for Complex Video Anomaly Detection
Mohammad Mahdi Hemmatyar
Mahdi Jafari
Mohammad Amin Yousefi
Mohammad Reza Nemati
Mobin Azadani
Hamid Reza Rastad
Amirmohammad Akbari
76
0
0
26 Sep 2025
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Abdul Waheed
Zhen Wu
Dareen Alharthi
Seungone Kim
Bhiksha Raj
ELM
60
0
0
25 Sep 2025
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Ziang Yan
Xinhao Li
Yinan He
Zhengrong Yue
Xiangyu Zeng
Yali Wang
Yu Qiao
Limin Wang
Yi Wang
MLLM
VLM
LRM
177
7
0
25 Sep 2025
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Manyi Yao
Bingbing Zhuang
Sparsh Garg
Amit Roy-Chowdhury
Christian Shelton
Manmohan Chandraker
Abhishek Aich
LRM
147
0
0
23 Sep 2025
COLT: Enhancing Video Large Language Models with Continual Tool Usage
Yuyang Liu
Xinyuan Shi
Xiaodan Liang
KELM
CLL
165
0
0
23 Sep 2025
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Ye Liu
Zongyang Ma
Junfu Pu
Zhongang Qi
Yang Wu
Mingyu Ding
Chang Wen Chen
MLLM
ObjD
LRM
195
0
0
22 Sep 2025
Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
Chenglin Li
Feng Han
Feng Tao
Ruilin Li
Qianglong Chen
Jingqi Tong
Yin Zhang
Jiaqi Wang
LRM
125
0
0
22 Sep 2025
Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media
Zihan Ding
Junlong Chen
Per Ola Kristensson
Junxiao Shen
Xinyi Wang
VGen
96
0
0
20 Sep 2025
Automated Procedural Analysis via Video-Language Models for AI-assisted Nursing Skills Assessment
Shen Chang
Dennis Liu
Renran Tian
Kristen L. Swartzell
Stacie L. Klingler
Amy M. Nagle
Nan Kong
40
0
0
20 Sep 2025
Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Taesoo Kim
Yongsik Jo
Hyunmin Song
Taehwan Kim
40
0
0
18 Sep 2025
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
Nisarg A. Shah
Amir Ziai
Chaitanya Ekanadham
Vishal M. Patel
VGen
CoGe
ELM
93
0
0
17 Sep 2025
Dense Video Understanding with Gated Residual Tokenization
Haichao Zhang
Wenhao Chai
Shwai He
Ang Li
Yun Fu
VGen
98
0
0
17 Sep 2025
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
Zhihao He
Tianyao He
Yun Xu
Yun Xu
Huabin Liu
Chaofan Gan
Gui Zou
W. Lin
92
1
0
16 Sep 2025