ResearchTrend.AI


Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang, Xin Li, Lidong Bing
MLLM
arXiv: 2306.02858 · PDF · HTML · HuggingFace (19 upvotes)

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 669 papers shown
Multimodal Video Emotion Recognition with Reliable Reasoning Priors
Zhepeng Wang, Yingjian Zhu, Guanghao Dong, Hongzhu Yi, F. Chen, Xinming Wang, Jun Xie
29 Jul 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, ..., Xiaojing Zhang, Yangyu Tao, Han Hu, Di Wang, Mingyu Ding
28 Jul 2025
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
27 Jul 2025
Object-centric Video Question Answering with Visual Grounding and Referring
Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, Stratis Gavves
MLLM, VOS
25 Jul 2025
LMM-Det: Make Large Multimodal Models Excel in Object Detection
Jincheng Li, Chunyu Xie, Ji Ao, Dawei Leng, Yuhui Yin
MLLM, ObjD, VLM
24 Jul 2025
SV3.3B: A Sports Video Understanding Model for Action Recognition
Sai Varun Kodathala, Yashwanth Reddy Vutukoori, Rakesh Vunnam
23 Jul 2025
HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLMs
Zhaolin Cai, Fan Li, Ziwei Zheng, Yanjun Qin
23 Jul 2025
QuMAB: Query-based Multi-Annotator Behavior Modeling with Reliability under Sparse Labels
Liyun Zhang, Zheng Lian, Hong Liu, Takanori Takebe, Yuta Nakashima
23 Jul 2025
MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing
Shreelekha Revankar, Utkarsh Mall, Cheng Perng Phoo, Kavita Bala, Bharath Hariharan
22 Jul 2025
Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models
Mohamad Ballout, Serwan Jassim, Elia Bruni
22 Jul 2025
Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models
Tz-Ying Wu, Tahani Trigui, S. N. Sridhar, Anand Bodas, Subarna Tripathi
22 Jul 2025
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent
Jiaao Li, Kaiyuan Li, Chen Gao, Yong Li, Xinlei Chen
21 Jul 2025
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
Xiaoyi Bao, Chenwei Xie, Hao Tang, Tingyu Weng, Xiaofeng Wang, Yun Zheng, Xingang Wang
VGen
21 Jul 2025
Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
Ce Zhang, Yale Song, Ruta Desai, Michael L. Iuzzolino, Joseph Tighe, Gedas Bertasius, Satwik Kottur
20 Jul 2025
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding
Yuanhan Zhang, Yunice Chew, Yuhao Dong, Aria Leo, Bo Hu, Yu Qiao
ELM
20 Jul 2025
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Junxiao Shen
15 Jul 2025
MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models
Qiyan Zhao, Xiaofeng Zhang, Yiheng Li, Yun Xing, Xiaosong Yuan, Feilong Tang, Sinan Fan, Xuhang Chen, Xuyao Zhang, Dahan Wang
12 Jul 2025
Cross-Modal Dual-Causal Learning for Long-Term Action Recognition
Xu Shaowu, Jia Xibin, Gao Junyu, Sun Qianmei, Chang Jing, Fan Chao
09 Jul 2025
Spatio-Temporal LLM: Reasoning about Environments and Actions
Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt, Alex Schwing
LRM
07 Jul 2025
Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
Yifan Jiang, Yibo Xue, Yukun Kang, Pin Zheng, Jian Peng, Feiran Wu, Changliang Xu
DiffM, VGen
05 Jul 2025
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian-Chun Ye, Gaoang Wang
03 Jul 2025
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges
Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
VLM
02 Jul 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyu Yao, Xuxin Cheng, Zhiqi Huang, Lei Li
01 Jul 2025
MotionGPT3: Human Motion as a Second Modality
Bingfan Zhu, Biao Jiang, S. Wang, Bin Wang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
30 Jun 2025
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam, Vincent Tao Hu, Bjorn Ommer
VLM
28 Jun 2025
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan
27 Jun 2025
Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
LRM
27 Jun 2025
Universal Video Temporal Grounding with Generative Multi-modal Large Language Models
Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
VLM
23 Jun 2025
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, ..., N. Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren
22 Jun 2025
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu
MLLM, VLM
20 Jun 2025
PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang
ViT
19 Jun 2025
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
18 Jun 2025
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
Xiaoqi Wang, Yi Wang, Lap-Pui Chau
17 Jun 2025
MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models
Geewook Kim, Minjoon Seo
16 Jun 2025
Action Dubber: Timing Audible Actions via Inflectional Flow
Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He
16 Jun 2025
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong-Jin Liu, Xiangtai Li, Dacheng Tao
VGen, VLM
16 Jun 2025
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, ..., Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz
13 Jun 2025
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen
13 Jun 2025
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes
International Conference on Learning Representations (ICLR), 2025
Tony Alex, S. Ahmed, A. Mustafa, Muhammad Awais, Philip J. B. Jackson
13 Jun 2025
Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, Shanghang Zhang
VLM
12 Jun 2025
CreatiPoster: Towards Editable and Controllable Multi-Layer Graphic Design Generation
Zhao Zhang, Yutao Cheng, Dexiang Hong, Maoke Yang, Gonglei Shi, Lei Ma, H. Zhang, Jie Shao, Xinglong Wu
DiffM
12 Jun 2025
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Q. Garrido, Russell Howes, ..., Sarath Chandar, Franziska Meier, Yann LeCun, Michael G. Rabbat, Nicolas Ballas
11 Jun 2025
VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation
Hyeongcheol Park, Jiyoung Seo, MinHyuk Jang, Hogun Park, Ha Dam Baek, Gyusam Chang, Hyeonsoo Im, Sangpil Kim
11 Jun 2025
TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision
Ayush Gupta, A. Roy, Rama Chellappa, Nathaniel D. Bastian, Alvaro Velasquez, Susmit Jha
11 Jun 2025
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
VOS
09 Jun 2025
Uncertainty-o: One Model-agnostic Framework for Unveiling Uncertainty in Large Multimodal Models
Ruiyang Zhang, Hu Zhang, Hao Fei, Zhedong Zheng
UQCV
09 Jun 2025
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
EgoV, VLM
09 Jun 2025
Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
Liangliang You, Junchi Yao, Shu Yang, Guimin Hu, Lijie Hu, Di Wang
MLLM
08 Jun 2025
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
Sanjoy Chowdhury, Mohamed Elmoghany, Yohan Abeysinghe, Mahmoud Ahmed, Sayan Nag, Salman Khan, Mohamed Elhoseiny, Dinesh Manocha
08 Jun 2025
How Important are Videos for Training Video LLMs?
George Lydakis, Alexander Hermans, A. Athar, Daan de Geus, Bastian Leibe
VLM
07 Jun 2025