2306.02858
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
Papers citing
"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"
50 / 702 papers shown
Title
Leveraging Modality Tags for Enhanced Cross-Modal Video Retrieval
A. Fragomeni
Dima Damen
Michael Wray
35
0
0
02 Apr 2025
Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo
Shuailei Ma
Shijie Ma
Xiaoyi Bao
Chen-Wei Xie
Kecheng Zheng
Tingyu Weng
Siyang Sun
Yun Zheng
Wei Zou
MLLM
AuLLM
65
2
0
02 Apr 2025
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
46
1
0
02 Apr 2025
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Min Shi
Shihao Wang
Chieh-Yun Chen
Jitesh Jain
Kai Wang
Junjun Xiong
Guilin Liu
Zhiding Yu
Humphrey Shi
40
2
0
02 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
Jun Liu
Peijie Wang
Jing Tao
Zhou Su
63
1
0
01 Apr 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
Junyu Xie
Tengda Han
Max Bain
Arsha Nagrani
Eshika Khandelwal
Gül Varol
Weidi Xie
Andrew Zisserman
DiffM
VGen
63
0
0
01 Apr 2025
Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning
Yubo Zhang
Pedro Botelho
Trevor Gordon
Gil Zussman
I. Kadota
55
0
0
31 Mar 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Lucas Ventura
Antoine Yang
Cordelia Schmid
Gül Varol
41
0
0
31 Mar 2025
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
Qi Wu
Quanlong Zheng
Yanhao Zhang
Junlin Xie
Jinguo Luo
...
Peng Liu
Qingsong Xie
Ru Zhen
Haonan Lu
Zhenyu Yang
VLM
62
0
0
31 Mar 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
56
0
0
29 Mar 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
31
0
0
28 Mar 2025
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
57
0
0
28 Mar 2025
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Erika Mori
Yue Qiu
Hirokatsu Kataoka
Y. Aoki
55
0
0
27 Mar 2025
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVM
VGen
82
1
0
27 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
Fahad Shahbaz Khan
204
0
0
27 Mar 2025
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Bohan Liu
Chenjia Li
Lalithkumar Seenivasan
Mathias Unberath
VOS
83
2
0
27 Mar 2025
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Xiao Guo
Xiufeng Song
Yue Zhang
Xiaohong Liu
X. Liu
63
1
0
26 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
77
0
0
26 Mar 2025
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Haitong Liu
Kuofeng Gao
Yang Bai
Jinmin Li
Jinxiao Shan
Tao Dai
Shu-Tao Xia
AAML
65
1
0
26 Mar 2025
Towards Online Multi-Modal Social Interaction Understanding
Xuzhao Li
Shijian Deng
Bolin Lai
Weiguo Pian
James M. Rehg
Yapeng Tian
46
0
0
25 Mar 2025
ACVUBench: Audio-Centric Video Understanding Benchmark
Yuqing Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Yong Li
P. Li
Yifan Jiang
W. Li
Z. Ma
Chao Zhang
AuLLM
CoGe
73
0
0
25 Mar 2025
Video-T1: Test-Time Scaling for Video Generation
F. Liu
Hanyang Wang
Yimo Cai
Kaiyan Zhang
Xiaohang Zhan
Yueqi Duan
DiffM
VGen
89
3
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
70
1
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
80
1
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Xia Hu
Bo Yuan
VLM
64
0
0
24 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yuqing Yang
Afshin Dehghan
59
2
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
100
1
0
24 Mar 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova
Arsha Nagrani
Bernt Schiele
Hilde Kuehne
Christian Rupprecht
53
0
0
24 Mar 2025
Can Text-to-Video Generation help Video-Language Alignment?
Luca Zanella
Massimiliano Mancini
Willi Menapace
Sergey Tulyakov
Yiming Wang
Elisa Ricci
DiffM
VGen
65
0
0
24 Mar 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
59
0
0
22 Mar 2025
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
Yuchen Sun
Shanhui Zhao
Tao Yu
Hao Wen
Samith Va
Mengwei Xu
Yuanchun Li
Chongyang Zhang
LLMAG
67
0
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
48
0
0
22 Mar 2025
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Yiming Zhao
Y. Zeng
Yukun Qi
Yi Liu
Lin Yen-Chen
Zehui Chen
Xikun Bao
Jie Zhao
Feng Zhao
VLM
58
2
0
22 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
Han Wang
Kai Hu
Liangcai Gao
185
0
0
20 Mar 2025
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
Xuanming Cui
Jaiminkumar Ashokbhai Bhoi
Chionh Wei Peng
Adriel Kuek
Ser-Nam Lim
48
0
0
20 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
Haoyu Wang
VLM
MQ
91
0
0
20 Mar 2025
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
50
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
64
1
0
20 Mar 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
63
0
0
19 Mar 2025
VITED: Video Temporal Evidence Distillation
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
219
0
0
17 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Yong-Jin Liu
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
147
0
0
17 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
59
0
0
17 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
49
1
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
Binghui Wang
Xin Wu
Qingbin Liu
65
0
0
17 Mar 2025
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Haoqi Yuan
Yu Bai
Yuhui Fu
Bohan Zhou
Yicheng Feng
Xinrun Xu
Yi Zhan
Börje F. Karlsson
Zongqing Lu
LM&Ro
90
0
0
16 Mar 2025
Multi Activity Sequence Alignment via Implicit Clustering
Taein Kwon
Zador Pataki
Mahdi Rad
Marc Pollefeys
HAI
AI4TS
65
0
0
16 Mar 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yansen Wang
Shengqiong Wu
Yujie Zhang
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
95
11
0
16 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
233
0
0
16 Mar 2025
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
Xirui Zhou
Lianlei Shan
Xiaolin Gui
66
0
0
14 Mar 2025
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
Leqi Shen
Tao He
Guoqiang Gong
Fan Yang
Yuhui Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
50
0
0
14 Mar 2025