2306.02858
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
MLLM
Papers citing
"Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"
50 / 875 papers shown
Title
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu
Yuhao Zhang
Xiang Wang
Benyou Wang
Qiang Liu
Haoyang Li
LM&MA
ELM
AuLLM
641
2
0
17 Oct 2024
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng
Yun Xing
Zesen Cheng
Yang Zhou
Hang Zhang
Xin Li
Deli Zhao
Shijian Lu
Chunyan Miao
Lidong Bing
248
25
0
16 Oct 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Neural Information Processing Systems (NeurIPS), 2024
Yiwei Guo
Shaobin Zhuang
Kunchang Li
Yu Qiao
Yali Wang
VLM
CLIP
325
5
0
16 Oct 2024
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu
Linchao Zhu
Yi Yang
316
11
0
16 Oct 2024
OMCAT: Omni Context Aware Transformer
Arushi Goel
Karan Sapra
Matthieu Le
Rafael Valle
Andrew Tao
Bryan Catanzaro
MLLM
VLM
180
2
0
15 Oct 2024
It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Asian Conference on Computer Vision (ACCV), 2024
Toby Perrett
Tengda Han
Dima Damen
Andrew Zisserman
176
3
0
15 Oct 2024
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
149
4
0
15 Oct 2024
Character-aware audio-visual subtitling in context
Asian Conference on Computer Vision (ACCV), 2024
Jaesung Huh
Andrew Zisserman
244
0
0
14 Oct 2024
When Does Perceptual Alignment Benefit Vision Representations?
Neural Information Processing Systems (NeurIPS), 2024
Shobhita Sundaram
Stephanie Fu
Lukas Muttenthaler
Netanel Y. Tamir
Lucy Chai
Simon Kornblith
Trevor Darrell
Phillip Isola
227
12
1
14 Oct 2024
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
Kai Han
Jianyuan Guo
Yehui Tang
W. He
Enhua Wu
Yunhe Wang
MLLM
VLM
161
16
0
14 Oct 2024
Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
Juseong Jin
Chang Wook Jeong
190
8
0
13 Oct 2024
Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Sungkyung Kim
Adam Lee
Junyoung Park
Andrew Chung
Jusang Oh
Jay-Yoon Lee
84
9
0
12 Oct 2024
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yingqiang Gao
Lukas Fischer
Alexa Lintner
Sarah Ebling
166
4
0
11 Oct 2024
G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora
N. N.
Aman Tambi
Sandeep S. Zachariah
Souvik Chakraborty
Rohan Paul
LM&Ro
131
0
0
10 Oct 2024
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
International Conference on Learning Representations (ICLR), 2024
Qingni Wang
Tiantian Geng
Zhiyuan Wang
Teng Wang
Bo Fu
Feng Zheng
357
13
0
10 Oct 2024
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Changli Tang
Yixuan Li
Yudong Yang
Jimin Zhuang
Guangzhi Sun
Wei Li
Tianhao Shen
Chao Zhang
181
6
0
09 Oct 2024
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye
Haotian Zhang
Erik Daxberger
Lin Chen
Zongyu Lin
...
Haoxuan You
Dan Xu
Zhe Gan
Jiasen Lu
Yinfei Yang
EgoV
MLLM
243
17
0
09 Oct 2024
Temporal Reasoning Transfer from Text to Video
International Conference on Learning Representations (ICLR), 2024
Lei Li
Yuanxin Liu
Linli Yao
Peiyuan Zhang
Chenxin An
Lean Wang
Xu Sun
Dianbo Sui
Qi Liu
LRM
143
20
0
08 Oct 2024
Grounding is All You Need? Dual Temporal Grounding for Video Dialog
You Qin
Wei Ji
Xinze Lan
Hao Fei
Xun Yang
Dan Guo
Roger Zimmermann
Lizi Liao
VGen
217
2
0
08 Oct 2024
Enhancing Temporal Modeling of Video LLMs via Time Gating
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zi-Yuan Hu
Yiwu Zhong
Shijia Huang
Michael R. Lyu
Liwei Wang
VLM
148
6
0
08 Oct 2024
Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
ACM Multimedia (MM), 2024
Soyeon Caren Han
Feiqi Cao
Josiah Poon
Roberto Navigli
MLLM
VLM
92
5
0
08 Oct 2024
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
International Conference on Learning Representations (ICLR), 2024
Yongxin Guo
Jingyu Liu
Mingda Li
Xiaoying Tang
Qingbin Liu
213
41
0
08 Oct 2024
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta
Shreyas Verma
Ujjwala Anantheswaran
Kevin Scaria
Mihir Parmar
Swaroop Mishra
Chitta Baral
ReLM
LRM
193
18
0
06 Oct 2024
Realizing Video Summarization from the Path of Language-based Semantic Understanding
Kuan-Chen Mu
Zhi-Yi Chin
Wei-Chen Chiu
109
0
0
06 Oct 2024
StoryNavi: On-Demand Narrative-Driven Reconstruction of Video Play With Generative AI
Alston Lantian Xu
Tianwei Ma
Tianmeng Liu
Can Liu
Alvaro Cassinelli
VGen
117
0
0
04 Oct 2024
Self-Powered LLM Modality Expansion for Large Speech-Text Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Tengfei Yu
Xuebo Liu
Zhiyi Hou
Liang Ding
Dacheng Tao
Min Zhang
134
5
0
04 Oct 2024
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai
Enxin Song
Y. Du
Chenlin Meng
Vashisht Madhavan
Omer Bar-Tal
Jeng-Neng Hwang
Saining Xie
Christopher D. Manning
3DV
517
87
0
04 Oct 2024
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang
Zhiyang Xu
Yu Cheng
Shizhe Diao
Jiuxiang Gu
Yixin Cao
Qifan Wang
Weifeng Ge
Lifu Huang
198
50
0
04 Oct 2024
Frame-Voyager: Learning to Query Frames for Video Large Language Models
International Conference on Learning Representations (ICLR), 2024
Sicheng Yu
Chengkai Jin
Huanyu Wang
Zhenghao Chen
Sheng Jin
...
Zhenbang Sun
Bingni Zhang
Jiawei Wu
Hao Zhang
Qianru Sun
271
34
0
04 Oct 2024
Visual Prompting in LLMs for Enhancing Emotion Recognition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Qixuan Zhang
Zhifeng Wang
Dylan Zhang
Wenjia Niu
Sabrina Caldwell
Tom Gedeon
Yang Liu
Zhenyue Qin
100
6
0
03 Oct 2024
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang
Jinming Wu
W. Li
Bo Li
Zejun Ma
Ziwei Liu
Chunyuan Li
SyDa
VGen
404
248
0
03 Oct 2024
UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Hasnat Md Abdullah
Tian Liu
Kangda Wei
Shu Kong
Ruihong Huang
219
5
0
02 Oct 2024
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jiapeng Wang
Chengyu Wang
Kunzhe Huang
Jun Huang
Lianwen Jin
CLIP
VLM
214
20
0
01 Oct 2024
Efficient Driving Behavior Narration and Reasoning on Edge Device Using Large Language Models
IEEE Transactions on Vehicular Technology (IEEE Trans. Veh. Technol.), 2024
Yizhou Huang
Yihua Cheng
Kezhi Wang
LRM
116
3
0
30 Sep 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Neural Information Processing Systems (NeurIPS), 2024
Zechen Bai
Tong He
Haiyang Mei
Pichao Wang
Ziteng Gao
Joya Chen
Lei Liu
Zheng Zhang
Mike Zheng Shou
VLM
VOS
MLLM
191
67
0
29 Sep 2024
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Xiao Wang
Yue Yu
Zijia Lin
Fuzheng Zhang
Di Zhang
Liqiang Nie
VGen
143
5
0
29 Sep 2024
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou
Tianze Luo
Guiyang Xie
Victor Zhang
...
Guangcong Wang
Juanyang Chen
Zhuochen Wang
Hansheng Zhang
Huaijian Zhang
VLM
253
17
0
27 Sep 2024
EgoLM: Multi-Modal Language Model of Egocentric Motions
Computer Vision and Pattern Recognition (CVPR), 2024
Fangzhou Hong
Vladimir Guzov
Hyo Jin Kim
Yuting Ye
Richard Newcombe
Ziwei Liu
Lingni Ma
142
10
0
26 Sep 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Ye Liu
Zongyang Ma
Chen Ma
Yang Wu
Ying Shan
Chang Wen Chen
199
47
0
26 Sep 2024
LLM4Brain: Training a Large Language Model for Brain Video Understanding
Ruizhe Zheng
Lichao Sun
113
2
0
26 Sep 2024
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning
Neural Information Processing Systems (NeurIPS), 2024
Yun Xu
Huabin Liu
Tianyao He
Yihang Chen
Chaofan Gan
...
Cheng Zhong
Yang Zhang
Yingxue Wang
Hui Lin
Weiyao Lin
VGen
CML
307
19
0
26 Sep 2024
EAGLE: Egocentric AGgregated Language-video Engine
ACM Multimedia (MM), 2024
Jing Bi
Yunlong Tang
Luchuan Song
Ali Vosoughi
Nguyen Nguyen
Chenliang Xu
178
15
0
26 Sep 2024
EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models
Jiacheng Zhang
Yang Jiao
Shaoxiang Chen
Yue Yu
Yu-Gang Jiang
146
1
0
25 Sep 2024
EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Jiacheng Zhang
Yang Jiao
Shaoxiang Chen
Na Zhao
Zhiyu Tan
Hao Li
Yue Yu
MLLM
448
38
0
25 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
305
9
0
23 Sep 2024
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Yan Shu
Peitian Zhang
Zheng Liu
Minghao Qin
Yueze Wang
Tiejun Huang
Bo Zhao
VLM
331
125
0
22 Sep 2024
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Yuzhang Shang
Bingxin Xu
Weitai Kang
Mu Cai
Yuheng Li
Zehao Wen
Zhen Dong
Kurt Keutzer
Yong Jae Lee
Yan Yan
212
11
0
19 Sep 2024
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang
Renrui Zhang
Ziyu Guo
Yanmin Wu
Jiayi Lei
...
Guanglu Song
Peng Gao
Yu Liu
Chunyuan Li
Hongsheng Li
MLLM
247
38
0
19 Sep 2024
From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian
Zuyi Zhou
Dizhan Xue
Bing Wang
Changsheng Xu
LRM
365
5
0
19 Sep 2024
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Bo-Kai Ruan
Hao-Tang Tsui
Yung-Hui Li
Hong-Han Shuai
LM&Ro
367
14
0
15 Sep 2024