ResearchTrend.AI

© 2025 ResearchTrend.AI, All rights reserved.

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang, Xin Li, Lidong Bing
MLLM
arXiv (abs) · PDF · HTML · HuggingFace (19 upvotes)

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 875 papers shown
Roadmap towards Superhuman Speech Understanding using Large Language Models
Fan Bu, Yuhao Zhang, Xiang Wang, Benyou Wang, Qiang Liu, Haoyang Li
LM&MA, ELM, AuLLM
641 · 2 · 0 · 17 Oct 2024

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
248 · 25 · 0 · 16 Oct 2024

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Neural Information Processing Systems (NeurIPS), 2024
Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang
VLM, CLIP
325 · 5 · 0 · 16 Oct 2024

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
316 · 11 · 0 · 16 Oct 2024

OMCAT: Omni Context Aware Transformer
Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
MLLM, VLM
180 · 2 · 0 · 15 Oct 2024

It's Just Another Day: Unique Video Captioning by Discriminative Prompting
Asian Conference on Computer Vision (ACCV), 2024
Toby Perrett, Tengda Han, Dima Damen, Andrew Zisserman
176 · 3 · 0 · 15 Oct 2024

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
VLM
149 · 4 · 0 · 15 Oct 2024

Character-aware audio-visual subtitling in context
Asian Conference on Computer Vision (ACCV), 2024
Jaesung Huh, Andrew Zisserman
244 · 0 · 0 · 14 Oct 2024

When Does Perceptual Alignment Benefit Vision Representations?
Neural Information Processing Systems (NeurIPS), 2024
Shobhita Sundaram, Stephanie Fu, Lukas Muttenthaler, Netanel Y. Tamir, Lucy Chai, Simon Kornblith, Trevor Darrell, Phillip Isola
227 · 12 · 1 · 14 Oct 2024

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
Kai Han, Jianyuan Guo, Yehui Tang, W. He, Enhua Wu, Yunhe Wang
MLLM, VLM
161 · 16 · 0 · 14 Oct 2024

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
Juseong Jin, Chang Wook Jeong
190 · 8 · 0 · 13 Oct 2024

Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Sungkyung Kim, Adam Lee, Junyoung Park, Andrew Chung, Jusang Oh, Jay-Yoon Lee
84 · 9 · 0 · 12 Oct 2024

Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yingqiang Gao, Lukas Fischer, Alexa Lintner, Sarah Ebling
166 · 4 · 0 · 11 Oct 2024

G$^{2}$TR: Generalized Grounded Temporal Reasoning for Robot Instruction Following by Combining Large Pre-trained Models
Riya Arora, N. N., Aman Tambi, Sandeep S. Zachariah, Souvik Chakraborty, Rohan Paul
LM&Ro
131 · 0 · 0 · 10 Oct 2024

Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models
International Conference on Learning Representations (ICLR), 2024
Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng
357 · 13 · 0 · 10 Oct 2024

Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Tianhao Shen, Chao Zhang
181 · 6 · 0 · 09 Oct 2024

MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, ..., Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
EgoV, MLLM
243 · 17 · 0 · 09 Oct 2024

Temporal Reasoning Transfer from Text to Video
International Conference on Learning Representations (ICLR), 2024
Lei Li, Yuanxin Liu, Linli Yao, Peiyuan Zhang, Chenxin An, Lean Wang, Xu Sun, Dianbo Sui, Qi Liu
LRM
143 · 20 · 0 · 08 Oct 2024

Grounding is All You Need? Dual Temporal Grounding for Video Dialog
You Qin, Wei Ji, Xinze Lan, Hao Fei, Xun Yang, Dan Guo, Roger Zimmermann, Lizi Liao
VGen
217 · 2 · 0 · 08 Oct 2024

Enhancing Temporal Modeling of Video LLMs via Time Gating
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Zi-Yuan Hu, Yiwu Zhong, Shijia Huang, Michael R. Lyu, Liwei Wang
VLM
148 · 6 · 0 · 08 Oct 2024

Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond
ACM Multimedia (MM), 2024
Soyeon Caren Han, Feiqi Cao, Josiah Poon, Roberto Navigli
MLLM, VLM
92 · 5 · 0 · 08 Oct 2024

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
International Conference on Learning Representations (ICLR), 2024
Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xiaoying Tang
213 · 41 · 0 · 08 Oct 2024

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
Himanshu Gupta, Shreyas Verma, Ujjwala Anantheswaran, Kevin Scaria, Mihir Parmar, Swaroop Mishra, Chitta Baral
ReLM, LRM
193 · 18 · 0 · 06 Oct 2024

Realizing Video Summarization from the Path of Language-based Semantic Understanding
Kuan-Chen Mu, Zhi-Yi Chin, Wei-Chen Chiu
109 · 0 · 0 · 06 Oct 2024

StoryNavi: On-Demand Narrative-Driven Reconstruction of Video Play With Generative AI
Alston Lantian Xu, Tianwei Ma, Tianmeng Liu, Can Liu, Alvaro Cassinelli
VGen
117 · 0 · 0 · 04 Oct 2024

Self-Powered LLM Modality Expansion for Large Speech-Text Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang
134 · 5 · 0 · 04 Oct 2024

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
International Conference on Learning Representations (ICLR), 2024
Wenhao Chai, Enxin Song, Y. Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning
3DV
517 · 87 · 0 · 04 Oct 2024

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Jiuxiang Gu, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang
198 · 50 · 0 · 04 Oct 2024

Frame-Voyager: Learning to Query Frames for Video Large Language Models
International Conference on Learning Representations (ICLR), 2024
Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, ..., Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, Qianru Sun
271 · 34 · 0 · 04 Oct 2024

Visual Prompting in LLMs for Enhancing Emotion Recognition
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Qixuan Zhang, Zhifeng Wang, Dylan Zhang, Wenjia Niu, Sabrina Caldwell, Tom Gedeon, Yang Liu, Zhenyue Qin
100 · 6 · 0 · 03 Oct 2024

LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, W. Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li
SyDa, VGen
404 · 248 · 0 · 03 Oct 2024

UAL-Bench: The First Comprehensive Unusual Activity Localization Benchmark
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Hasnat Md Abdullah, Tian Liu, Kangda Wei, Shu Kong, Ruihong Huang
219 · 5 · 0 · 02 Oct 2024

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin
CLIP, VLM
214 · 20 · 0 · 01 Oct 2024

Efficient Driving Behavior Narration and Reasoning on Edge Device Using Large Language Models
IEEE Transactions on Vehicular Technology (IEEE Trans. Veh. Technol.), 2024
Yizhou Huang, Yihua Cheng, Kezhi Wang
LRM
116 · 3 · 0 · 30 Sep 2024

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Neural Information Processing Systems (NeurIPS), 2024
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou
VLM, VOS, MLLM
191 · 67 · 0 · 29 Sep 2024

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Xiao Wang, Yue Yu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie
VGen
143 · 5 · 0 · 29 Sep 2024

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
Heqing Zou, Tianze Luo, Guiyang Xie, Victor Zhang, ..., Guangcong Wang, Juanyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang
VLM
253 · 17 · 0 · 27 Sep 2024

EgoLM: Multi-Modal Language Model of Egocentric Motions
Computer Vision and Pattern Recognition (CVPR), 2024
Fangzhou Hong, Vladimir Guzov, Hyo Jin Kim, Yuting Ye, Richard Newcombe, Ziwei Liu, Lingni Ma
142 · 10 · 0 · 26 Sep 2024

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Ye Liu, Zongyang Ma, Chen Ma, Yang Wu, Ying Shan, Chang Wen Chen
199 · 47 · 0 · 26 Sep 2024

LLM4Brain: Training a Large Language Model for Brain Video Understanding
Ruizhe Zheng, Lichao Sun
113 · 2 · 0 · 26 Sep 2024

MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning
Neural Information Processing Systems (NeurIPS), 2024
Yun Xu, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, ..., Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, Weiyao Lin
VGen, CML
307 · 19 · 0 · 26 Sep 2024

EAGLE: Egocentric AGgregated Language-video Engine
ACM Multimedia (MM), 2024
Jing Bi, Yunlong Tang, Luchuan Song, Ali Vosoughi, Nguyen Nguyen, Chenliang Xu
178 · 15 · 0 · 26 Sep 2024

EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models
Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Yue Yu, Yu-Gang Jiang
146 · 1 · 0 · 25 Sep 2024

EventHallusion: Diagnosing Event Hallucinations in Video LLMs
Jiacheng Zhang, Yang Jiao, Shaoxiang Chen, Na Zhao, Zhiyu Tan, Hao Li, Yue Yu
MLLM
448 · 38 · 0 · 25 Sep 2024

Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang, Yuwei Zhou, Bin Huang, Hong Chen, Wenwu Zhu
DiffM
305 · 9 · 0 · 23 Sep 2024

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Computer Vision and Pattern Recognition (CVPR), 2024
Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Yueze Wang, Tiejun Huang, Bo Zhao
VLM
331 · 125 · 0 · 22 Sep 2024

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
212 · 11 · 0 · 19 Sep 2024

MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, ..., Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
MLLM
247 · 38 · 0 · 19 Sep 2024

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models
Shengsheng Qian, Zuyi Zhou, Dizhan Xue, Bing Wang, Changsheng Xu
LRM
365 · 5 · 0 · 19 Sep 2024

Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Bo-Kai Ruan, Hao-Tang Tsui, Yung-Hui Li, Hong-Han Shuai
LM&Ro
367 · 14 · 0 · 15 Sep 2024