Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 669 papers shown
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos
Yichi Zhang
Xin Luna Dong
Mohammad Kachuee
Andrea Madotto
Anuj Kumar
Babak Damavandi
J. Chai
Seungwhan Moon
206
2
0
06 Jun 2025
Technical Report for Egocentric Mistake Detection for the HoloAssist Challenge
Constantin Patsch
Marsil Zakour
Yuankai Wu
Eckehard G. Steinbach
117
1
0
06 Jun 2025
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Emmanouil Zaranis
António Farinhas
Saul Santos
Beatriz Canaverde
Miguel Moura Ramos
...
Raffaella Bernardi
Raquel Fernández
Sandro Pezzelle
Vlad Niculae
Andre F. T. Martins
187
3
0
06 Jun 2025
Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models
Hugues Thomas
Chen Chen
Jian Zhang
145
0
0
06 Jun 2025
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs
Lidong Lu
Guo Chen
Ruoyao Xiao
Yicheng Liu
Tong Lu
VLM, LRM
247
6
0
05 Jun 2025
Towards LLM-Centric Multimodal Fusion: A Survey on Integration Strategies and Techniques
Jisu An
Junseok Lee
Jeoungeun Lee
Yongseok Son
332
1
0
05 Jun 2025
Track Any Anomalous Object: A Granular Video Anomaly Detection Pipeline
Computer Vision and Pattern Recognition (CVPR), 2025
Yuzhi Huang
Chenxin Li
H. Zhang
Zixu Lin
Yunlong Lin
...
Xinyu Liu
Jiechao Gao
Yue Huang
Xinghao Ding
Yixuan Yuan
203
2
0
05 Jun 2025
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
Daeun Lee
Jaehong Yoon
Jaemin Cho
Mohit Bansal
LRM
240
2
0
04 Jun 2025
Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision
Computer Vision and Pattern Recognition (CVPR), 2025
Tomoya Yoshida
Shuhei Kurita
Taichi Nishimura
Shinsuke Mori
236
1
0
04 Jun 2025
Video Anomaly Detection with Semantics-Aware Information Bottleneck
Juntong Li
Lingwei Dang
Yukun Su
Yun Hao
Qingxin Xiao
Yongwei Nie
Qingyao Wu
179
1
0
03 Jun 2025
Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao
Lingyu Li
Jiaxin Song
Chiyu Chen
Zhenqi He
...
Xin Wang
Tianle Gu
Jie Li
Yan Teng
Yingchun Wang
LRM
222
0
0
03 Jun 2025
Is Extending Modality The Right Path Towards Omni-Modality?
Tinghui Zhu
Kai Zhang
Muhao Chen
Eric Fosler-Lussier
VLM
210
3
0
02 Jun 2025
Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
Chunhui Zhang
Z. Ouyang
Kwonjoon Lee
Nakul Agarwal
Sean Dae Houlihan
Soroush Vosoughi
Shao-Yuan Lo
LRM
157
3
0
02 Jun 2025
Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review
Yuchen Fang
Hao Miao
Yuxuan Liang
Liwei Deng
Yue Cui
...
Yan Zhao
T. Pedersen
Christian S. Jensen
Xiaofang Zhou
Kai Zheng
AI4TS, AI4CE
196
5
0
02 Jun 2025
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Youngmin Kim
Jiwan Chung
Jisoo Kim
Sunghyun Lee
Sangkyu Lee
Junhyeok Kim
Cheoljong Yang
Youngjae Yu
VGen
88
2
0
01 Jun 2025
SiLVR: A Simple Language-based Video Reasoning Framework
Ce Zhang
Yan-Bo Lin
Ziyang Wang
Mohit Bansal
Gedas Bertasius
LRM
130
5
0
30 May 2025
Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT
Zhuobai Dong
Junchao Yi
Ziyuan Zheng
Haochen Han
Xiangxi Zheng
Alex Jinpeng Wang
Fangming Liu
Linjie Li
ReLM, LRM
137
1
0
30 May 2025
DisTime: Distribution-based Time Representation for Video Large Language Models
Yingsen Zeng
Zepeng Huang
Yujie Zhong
Chengjian Feng
Jie Hu
Lin Ma
Yang Liu
VGen
214
1
0
30 May 2025
Period-LLM: Extending the Periodic Capability of Multimodal Large Language Model
Computer Vision and Pattern Recognition (CVPR), 2025
Yuting Zhang
Hao Lu
Qingyong Hu
Yin Wang
Kaishen Yuan
Xin Liu
Kaishun Wu
MLLM, LRM
183
4
0
30 May 2025
Leveraging Auxiliary Information in Text-to-Video Retrieval: A Review
A. Fragomeni
Dima Damen
Michael Wray
171
0
0
29 May 2025
Tell me Habibi, is it Real or Fake?
Kartik Kuckreja
Parul Gupta
Injy Hamed
Thamar Solorio
Muhammad Haris Khan
Abhinav Dhall
228
5
0
28 May 2025
Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics
Yinjie Zhao
Heng Zhao
Bihan Wen
Yew-Soon Ong
Joey Tianyi Zhou
VGen
103
1
0
28 May 2025
Fostering Video Reasoning via Next-Event Prediction
Haonan Wang
Hongfu Liu
Xiangyan Liu
C. Du
Kenji Kawaguchi
Ye Wang
Tianyu Pang
AI4TS, LRM
172
2
0
28 May 2025
HoliTom: Holistic Token Merging for Fast Video Large Language Models
Kele Shao
Keda Tao
Can Qin
Haoxuan You
Yang Sui
Huan Wang
VLM
457
11
0
27 May 2025
HuMoCon: Concept Discovery for Human Motion Understanding
Computer Vision and Pattern Recognition (CVPR), 2025
Qihang Fang
Chengcheng Tang
Bugra Tekin
Shugao Ma
Yanchao Yang
142
1
0
27 May 2025
The Role of Video Generation in Enhancing Data-Limited Action Understanding
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Wei Li
Dezhao Luo
Dongbao Yang
Zhenhang Li
Weiping Wang
Yu Zhou
DiffM, VGen
517
0
0
26 May 2025
TDVE-Assessor: Benchmarking and Evaluating the Quality of Text-Driven Video Editing with LMMs
Juntong Wang
Jiarui Wang
Huiyu Duan
Guangtao Zhai
Xiongkuo Min
144
6
0
26 May 2025
Multi-modal brain encoding models for multi-modal stimuli
International Conference on Learning Representations (ICLR), 2025
R. Mamidi
Khushbu Pahwa
Mounika Marreddy
Maneesh Singh
Subba Reddy Oota
Bapi S. Raju
120
8
0
26 May 2025
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Pooneh Mousavi
Yingzhi Wang
Mirco Ravanelli
Cem Subakan
AuLLM
293
0
0
26 May 2025
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought
Chao Huang
Benfeng Wang
Jie Wen
Chengliang Liu
Wei Wang
Li Shen
Xiaochun Cao
LRM
231
5
0
26 May 2025
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization
Yunxin Li
Xinyu Chen
Zitao Li
Zhenyu Liu
L. Wang
Tong Lu
Baotian Hu
Min Zhang
OffRL, LRM
339
7
0
25 May 2025
RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models
Yuqi Liu
Qin Jin
Tianyuan Qu
Xuan Liu
Yang Du
Bei Yu
Jiaya Jia
349
0
0
25 May 2025
Multimodal Conversation Structure Understanding
Kent K. Chang
Mackenzie Cramer
Anna Ho
Ti Ti Nguyen
Yilin Yuan
David Bamman
223
1
0
23 May 2025
From Evaluation to Defense: Advancing Safety in Video Large Language Models
Yiwei Sun
Peiqi Jiang
Chuanbin Liu
Luohao Lin
Zhiying Lu
Hongtao Xie
165
1
0
22 May 2025
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
Vignesh Gopinathan
Urs Zimmermann
Michael Arnold
Matthias Rottmann
154
0
0
22 May 2025
LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junlong Tong
Jinlan Fu
Zixuan Lin
Yingqi Fan
Anhao Zhao
Hui Su
Xiaoyu Shen
309
2
0
22 May 2025
ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Tony Montes
Fernando Lozano
233
2
0
21 May 2025
RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Subrata Biswas
Mohammad Nur Hossain Khan
Bashima Islam
294
2
0
21 May 2025
Clapper: Compact Learning and Video Representation in VLMs
Lingyu Kong
Hongzhi Zhang
Jingyuan Zhang
Jianzhao Huang
Kunze Li
Qi Wang
Fuzheng Zhang
VLM
153
0
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
322
3
0
20 May 2025
Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu
Yiyu Wang
Junpeng Ma
Linfeng Zhang
VLM
119
8
0
20 May 2025
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Wentao Ma
Weiming Ren
Yiming Jia
Zhuofeng Li
Ping Nie
Ge Zhang
Wenhu Chen
209
5
0
20 May 2025
A Challenge to Build Neuro-Symbolic Video Agents
Sahil Shah
Harsh Goel
Sai Shankar Narasimhan
Minkyu Choi
S P Sharan
Oguzhan Akcin
Sandeep Chinchali
AI4TS
183
1
0
20 May 2025
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
Thong Nguyen
Zhiyuan Hu
Xu Lin
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
264
1
0
19 May 2025
Large Language Models and Their Applications in Roadway Safety and Mobility Enhancement: A Comprehensive Review
Muhammad Monjurul Karim
Yan Shi
Shucheng Zhang
Bingzhang Wang
Mehrdad Nasri
Yinhai Wang
131
6
0
19 May 2025
Visuospatial Cognitive Assistant
Qi Feng
LM&Ro
269
2
0
18 May 2025
Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts
Qi Feng
LRM
222
5
0
18 May 2025
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Xuannan Liu
Zekun Li
Xue Sun
Peipei Li
Shuhan Xia
Xing Cui
Huaibo Huang
Xi Yang
Ran He
EGVM, AAML
218
7
0
17 May 2025
ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
Hao Gu
Jiangyan Yi
Chenglong Wang
Jianhua Tao
Zheng Lian
Jiayi He
Yong Ren
Yujie Chen
Zhengqi Wen
240
0
0
16 May 2025
Sage Deer: A Super-Aligned Driving Generalist Is Your Copilot
Hao Lu
Jiaqi Tang
Jiyao Wang
Yaojie Lu
Xu Cao
...
Bin Huang
Dengbo He
Shuiguang Deng
Hao Chen
Ying-Cong Chen
250
1
0
15 May 2025