Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 669 papers shown
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
Haodi Ma
Vyom Pathak
Daisy Zhe Wang
65
0
0
15 Sep 2025
CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei-Hsin Yeh
Yu-An Su
Chih-Ning Chen
Yi-Hsueh Lin
Calvin Ku
Wen-Hsin Chiu
Min-Chun Hu
Lun-Wei Ku
48
0
0
15 Sep 2025
GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration
Wan Xu
Feng Zhu
Yihan Zeng
Yuanfan Guo
Ming-Yu Liu
Hang Xu
W. Zuo
32
0
0
14 Sep 2025
Video Understanding by Design: How Datasets Shape Architectures and Insights
Lei Wang
Piotr Koniusz
Yongsheng Gao
3DV, VGen, AI4TS
181
0
0
11 Sep 2025
Harnessing Object Grounding for Time-Sensitive Video Understanding
Tz-Ying Wu
S. N. Sridhar
Subarna Tripathi
65
0
0
08 Sep 2025
LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
Qianrui Zhou
Hua Xu
Yifan Wang
Xinzhi Dong
Hanlei Zhang
48
0
0
01 Sep 2025
Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models
Hyunjong Ok
Jaeho Lee
40
0
0
01 Sep 2025
OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
Junzhe Chen
Tianshu Zhang
Shiyu Huang
Yuwei Niu
Chao Sun
Rongzhou Zhang
G. Zhou
Lijie Wen
Xuming Hu
MLLM
132
0
0
31 Aug 2025
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
Xiangchen Wang
Jinrui Zhang
Teng Wang
Haigang Zhang
Feng Zheng
63
0
0
31 Aug 2025
SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Zhen Chen
Xingjian Luo
Kun Yuan
J. Wu
Danny Tat Ming Chan
Nassir Navab
Hongbin Liu
Zhen Lei
Jiebo Luo
128
1
0
30 Aug 2025
UItron: Foundational GUI Agent with Advanced Perception and Planning
Zhixiong Zeng
Jing Huang
Liming Zheng
Wenkang Han
Yufeng Zhong
Lei Chen
Longrong Yang
Yingjie Chu
Yuzhi He
Lin Ma
LLMAG
121
4
0
29 Aug 2025
DriveQA: Passing the Driving Knowledge Test
Maolin Wei
Wanzhou Liu
Eshed Ohn-Bar
ELM
74
1
0
29 Aug 2025
SUMMA: A Multimodal Large Language Model for Advertisement Summarization
Weitao Jia
Shuo Yin
Zhoufutu Wen
Han Wang
Zehui Dai
Kun Zhang
Zhenyu Li
Tao Zeng
Xiaohui Lv
72
0
0
28 Aug 2025
CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
Nannan Zhu
Yonghao Dong
T. Wang
Xueqian Li
Shengjun Deng
...
Tiantian Geng
Guo Niu
Hanyan Huang
Xiongfei Yao
Shuaiwei Jiao
LRM
136
2
0
27 Aug 2025
MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
Zhiting Gao
Dan Song
Diqiong Jiang
Chao Xue
An-an Liu
VGen
108
0
0
27 Aug 2025
VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference
Pengfei Jiang
Hanjun Li
Linglan Zhao
Fei Chao
Ke Yan
Shouhong Ding
Rongrong Ji
56
2
0
25 Aug 2025
Language-Guided Temporal Token Pruning for Efficient VideoLLM Processing
Yogesh Kumar
VLM
46
0
0
25 Aug 2025
Beyond Play and Pause: Turning GPT-4o Spatial Weakness into a Strength for In-Depth Interactive Video Learning
Sajad Goudarzi
Samaneh Zamanifard
16
0
0
23 Aug 2025
Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick
Saransh Sharma
Abhik Jana
Pawan Goyal
116
1
0
22 Aug 2025
Aligning Moments in Time using Video Queries
Yogesh Kumar
Uday Agarwal
Manish Gupta
Anand Mishra
163
1
0
21 Aug 2025
When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
Pengcheng Fang
Yuxia Chen
Rui Guo
VGen
48
0
0
21 Aug 2025
An Empirical Study on How Video-LLMs Answer Video Questions
Chenhui Gou
Ziyu Ma
Zicheng Duan
Haoyu He
Feng Chen
Akide Liu
Bohan Zhuang
Jianfei Cai
H. Rezatofighi
92
1
0
21 Aug 2025
Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting
Gyusam Chang
Tuan-Anh Vu
Vivek Alumootil
Harris Song
Deanna Pham
Sangpil Kim
M. Khalid Jawed
3DGS
77
1
0
20 Aug 2025
NoteIt: A System Converting Instructional Videos to Interactable Notes Through Multimodal Video Understanding
ACM Symposium on User Interface Software and Technology (UIST), 2025
Running Zhao
Zhihan Jiang
Xinchen Zhang
Chirui Chang
Handi Chen
Weipeng Deng
Luyao Jin
Xiaojuan Qi
Xun Qian
Edith C.H. Ngai
60
0
0
20 Aug 2025
RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang
Yuqian Yuan
Yunxuan Mao
Kehan Li
Jiangpin Liu
Zhikai Wang
Xin Li
F. Wang
Deli Zhao
VGen, LM&Ro
112
3
0
19 Aug 2025
ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving
Can Cui
Yupeng Zhou
Juntong Peng
Sung-Yeon Park
Zichong Yang
Prashanth Sankaranarayanan
Jiaru Zhang
Ruqi Zhang
Ziran Wang
48
2
0
18 Aug 2025
Region-Level Context-Aware Multimodal Understanding
Hongliang Wei
Xianqi Zhang
Xingtao Wang
Xiaopeng Fan
Debin Zhao
VLM
101
0
0
17 Aug 2025
UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting
Sehyuk Park
S. Han
Eduard Hovy
AI4TS
52
0
0
16 Aug 2025
Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
Wenbin An
Jiahao Nie
Yaqiang Wu
Feng Tian
Shijian Lu
Q. Zheng
MLLM
102
0
0
14 Aug 2025
JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics
Simindokht Jahangard
Mehrzad Mohammadi
Yi Shen
Zhixi Cai
Hamid Rezatofighi
165
1
0
14 Aug 2025
Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning
Wenlong Liang
Rui Zhou
Yang Ma
Bing Zhang
Songlin Li
Yijia Liao
Ping Kuang
LM&Ro, 3DV, AI4CE
96
5
0
14 Aug 2025
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
Junyan Ye
Shihong Deng
Zihao Wang
Leqi Zhu
Zhenghao Hu
...
Zhiyuan Yan
Jinghua Yu
Jiaming Song
Conghui He
Weijia Li
VLM
140
22
0
13 Aug 2025
Episodic Memory Representation for Long-form Video Understanding
Yun Wang
Long Zhang
Jingren Liu
Jiaqi Yan
Zhanjie Zhang
Jiahao Zheng
Xun Yang
Dapeng Wu
Xiangyu Chen
Xuelong Li
76
3
0
13 Aug 2025
KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Ming-Jun Nie
Chunwei Wang
Hang Xu
Li Zhang
VGen
53
0
0
12 Aug 2025
TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding
Jin-Seop Lee
SungJoon Lee
Jaehan Ahn
YunSeok Choi
Jee-Hyong Lee
VLM
58
2
0
11 Aug 2025
MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Fan Zhang
Minghan Li
Chong Deng
Xue Yang
Zheng Lian
...
Xian Wu
Kun Wang
Xiangang Li
Jieping Ye
Pheng-Ann Heng
AI4MH
84
3
0
11 Aug 2025
AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning
Siminfar Samakoush Galougah
Rishie Raj
Sanjoy Chowdhury
Sayan Nag
Ramani Duraiswami
104
1
0
10 Aug 2025
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Jianxiang He
Shaoguang Wang
Weiyu Guo
Yijie Xu
Ziyang Chen
81
0
0
09 Aug 2025
eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos
Xuecheng Wu
Dingkang Yang
Danlei Huang
Xinyi Yin
Yifan Wang
...
Liangyu Fu
Yang Liu
Junxiao Xue
Hadi Amirpour
Wei Zhou
117
1
0
09 Aug 2025
Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen
Jiahui Liu
Ruidi Fan
Yanwei Li
Chirui Chang
Shizhen Zhao
W. Fok
Xiaojuan Qi
Yik-Chung Wu
91
0
0
08 Aug 2025
ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Mohammad Zia Ur Rehman
Anukriti Bhatnagar
Omkar Kabde
Shubhi Bansal
Nagendra Kumar
76
6
0
07 Aug 2025
A Survey on Video Temporal Grounding with Multimodal Large Language Model
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Yue Yu
Wei Liu
Y. Liu
Meng-yang Liu
Liqiang Nie
Zhouchen Lin
C. Chen
AI4TS, VLM, LRM
109
5
0
07 Aug 2025
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
Changho Choi
Youngwoo Shin
Gyojin Han
Dong-Jae Lee
Junmo Kim
96
0
0
07 Aug 2025
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering
Yiran Meng
Junhong Ye
Wei Zhou
Guanghui Yue
Xudong Mao
Ruomei Wang
Baoquan Zhao
54
0
0
05 Aug 2025
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Haolin Yang
Feilong Tang
Linxiao Zhao
Xiang An
Ming Hu
...
Yifan Lu
Xiaofeng Zhang
Abdalla Swikir
Junjun He
Zongyuan Ge
187
1
0
03 Aug 2025
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
Shuangkang Fang
I-Chao Shen
Yufeng Wang
Yi-Hsuan Tsai
Y. Yang
Shuchang Zhou
Wenrui Ding
Takeo Igarashi
M. Yang
AI4CE
116
2
0
02 Aug 2025
Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning
Zhiyuan Han
Beier Zhu
Yanlong Xu
Peipei Song
Xun Yang
98
3
0
02 Aug 2025
SGCap: Decoding Semantic Group for Zero-shot Video Captioning
Zeyu Pan
Ping Li
Wenxiao Wang
VLM
74
0
0
02 Aug 2025
Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi
Mohamed Ilyas Lakhal
Ozge Mercanoglu Sincan
Richard Bowden
SLR
142
0
0
31 Jul 2025
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Kaining Ying
Henghui Ding
Guangquan Jie
Yu Jiang
VOS
225
5
0
30 Jul 2025