Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 875 papers shown
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
344
4
0
29 Mar 2025
Understanding Co-speech Gestures in-the-wild
Sindhu B. Hegde
KR Prajwal
Taein Kwon
Andrew Zisserman
SLR
315
1
0
28 Mar 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
163
1
0
28 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
Fahad Shahbaz Khan
832
3
0
27 Mar 2025
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Bohan Liu
Chenjia Li
Lalithkumar Seenivasan
Mathias Unberath
VOS
355
13
0
27 Mar 2025
Vision-to-Music Generation: A Survey
Zhaokai Wang
Chenxi Bao
Le Zhuo
Jingrui Han
Yang Yue
Yihong Tang
Victor Shea-Jay Huang
Yue Liao
EGVM VGen
298
3
0
27 Mar 2025
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Erika Mori
Yue Qiu
Hirokatsu Kataoka
Y. Aoki
196
0
0
27 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
286
5
0
26 Mar 2025
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
Computer Vision and Pattern Recognition (CVPR), 2025
Xiao Guo
Xiufeng Song
Yue Zhang
Xiaohong Liu
Xuyang Liu
288
20
0
26 Mar 2025
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Computer Vision and Pattern Recognition (CVPR), 2025
Haitong Liu
Kuofeng Gao
Yang Bai
Jinmin Li
Jinxiao Shan
Tao Dai
Shu-Tao Xia
AAML
251
4
0
26 Mar 2025
Towards Online Multi-Modal Social Interaction Understanding
Xuzhao Li
Shijian Deng
Bolin Lai
Weiguo Pian
James M. Rehg
Yapeng Tian
335
4
0
25 Mar 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
Yue Yang
Jimin Zhuang
Guangzhi Sun
Changli Tang
Yongqian Li
P. Li
Yifan Jiang
W. Li
Tianhao Shen
Chao Zhang
AuLLM CoGe
302
5
0
25 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen VLM
454
29
0
24 Mar 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Computer Vision and Pattern Recognition (CVPR), 2025
Nina Shvetsova
Arsha Nagrani
Bernt Schiele
Hilde Kuehne
Christian Rupprecht
188
1
0
24 Mar 2025
Video-T1: Test-Time Scaling for Video Generation
Fan Liu
Hanyang Wang
Yimo Cai
Kaiyan Zhang
Xiaohang Zhan
Yueqi Duan
DiffM VGen
334
15
0
24 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yue Yang
Afshin Dehghan
346
13
0
24 Mar 2025
Can Text-to-Video Generation help Video-Language Alignment?
Computer Vision and Pattern Recognition (CVPR), 2025
Luca Zanella
Goran Frehse
Willi Menapace
Sergey Tulyakov
Yiming Wang
Elisa Ricci
DiffM VGen
246
1
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
281
4
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
278
3
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Computer Vision and Pattern Recognition (CVPR), 2025
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Helen Zhou
Bo Yuan
VLM
323
18
0
24 Mar 2025
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Yiming Zhao
Y. Zeng
Yukun Qi
Yi Liu
Lin Yen-Chen
Zehui Chen
Xikun Bao
Jie Zhao
Feng Zhao
VLM
270
3
0
22 Mar 2025
RDTF: Resource-efficient Dual-mask Training Framework for Multi-frame Animated Sticker Generation
Zhiqiang Yuan
Ting Zhang
Ying Deng
Jiapei Zhang
Yeshuang Zhu
Zexi Jia
Jie Zhou
Jinchao Zhang
VGen
166
1
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
214
1
0
22 Mar 2025
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration
Computer Vision and Pattern Recognition (CVPR), 2025
Yuchen Sun
Shanhui Zhao
Tao Yu
Hao Wen
Samith Va
Mengwei Xu
Yan Liang
Chongyang Zhang
LLMAG
232
9
0
22 Mar 2025
Agentic Keyframe Search for Video Question Answering
Sunqi Fan
Meng-Hao Guo
Shuojin Yang
165
3
0
20 Mar 2025
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2025
Zhihang Liu
Chen-Wei Xie
Nianzu Yang
Liming Zhao
Longxiang Tang
Yun Zheng
Chuanbin Liu
Hongtao Xie
VLM
195
11
0
20 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
Haoyu Wang
VLM MQ
279
7
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Computer Vision and Pattern Recognition (CVPR), 2025
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
234
7
0
20 Mar 2025
What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?
Xuanming Cui
Jaiminkumar Ashokbhai Bhoi
Chionh Wei Peng
Adriel Kuek
Ser-Nam Lim
222
0
0
20 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Han Wang
Kai Hu
Liangcai Gao
531
1
0
20 Mar 2025
A Review on Large Language Models for Visual Analytics
Navya Sonal Agarwal
Sanjay Kumar Sonbhadra
274
5
0
19 Mar 2025
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Weiyu Guo
Ziyang Chen
Shaoguang Wang
Jianxiang He
Yijie Xu
Jinhui Ye
Ying Sun
Hui Xiong
279
15
0
17 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Wenshu Fan
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro LRM
821
30
0
17 Mar 2025
VITED: Video Temporal Evidence Distillation
Computer Vision and Pattern Recognition (CVPR), 2025
Yujie Lu
Yale Song
William Yang Wang
Lorenzo Torresani
Tushar Nagarajan
918
2
0
17 Mar 2025
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Computer Vision and Pattern Recognition (CVPR), 2025
Henghui Du
Guangyao Li
Chang Zhou
Chunjie Zhang
Alan Zhao
D. Hu
172
10
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Computer Vision and Pattern Recognition (CVPR), 2025
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
Binghai Wang
Xin Wu
Qingbin Liu
209
3
0
17 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
856
10
0
16 Mar 2025
Multi Activity Sequence Alignment via Implicit Clustering
Taein Kwon
Zador Pataki
Mahdi Rad
Marc Pollefeys
HAI AI4TS
205
0
0
16 Mar 2025
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Haoqi Yuan
Yu Bai
Yuhui Fu
Bohan Zhou
Yicheng Feng
Weishuai Zeng
Yi Zhan
Börje F. Karlsson
Zongqing Lu
LM&Ro
382
9
0
16 Mar 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Longji Xu
Shengqiong Wu
Yujiao Shi
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
457
98
0
16 Mar 2025
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
Leqi Shen
Tao He
Guoqiang Gong
Fan Yang
Yuhui Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
133
3
0
14 Mar 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
Ali Vosoughi
Chong Chen
Chenliang Xu
LRM
186
15
0
14 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Lei Ma
Mamba
214
17
0
14 Mar 2025
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
Xirui Zhou
Lianlei Shan
Xiaolin Gui
160
14
0
14 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Yuxiao Chen
DiffM VGen
224
2
0
13 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Xiping Hu
Yang Liu
Ziwei Sun
Longji Xu
VLM
1.1K
14
0
13 Mar 2025
Towards Graph Foundation Models: A Transferability Perspective
Longji Xu
Wenqi Fan
Suhang Wang
Yao Ma
207
5
0
13 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
Jing Zhang
Haoyang Li
Yiran Chen
VLM
423
4
0
13 Mar 2025
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
Henglyu Liu
Andong Chen
Kehai Chen
X. Bai
M. Zhong
Yuan Qiu
Min Zhang
205
1
0
13 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
243
3
0
12 Mar 2025