ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.02858
  4. Cited By
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video
  Understanding

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM
ArXivPDFHTML

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 702 papers shown
Title
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
A. Vosoughi
Chong Chen
Chenliang Xu
LRM
76
3
0
14 Mar 2025
LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs
Leqi Shen
Tao He
Guoqiang Gong
Fan Yang
Yuhui Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
50
0
0
14 Mar 2025
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen
Zhengrong Yue
Siran Chen
Zehua Wang
Yang Liu
Ziwei Sun
Yansen Wang
VLM
240
0
0
13 Mar 2025
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation
Henglyu Liu
Andong Chen
Kehai Chen
X. Bai
M. Zhong
Yuan Qiu
Min Zhang
45
0
0
13 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
Jingyang Zhang
Yiming Li
Yiran Chen
VLM
79
2
0
13 Mar 2025
Towards Graph Foundation Models: A Transferability Perspective
Yansen Wang
Wenqi Fan
Suhang Wang
Yao Ma
43
1
0
13 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Yuxiao Chen
DiffM
VGen
53
0
0
13 Mar 2025
Generative Frame Sampler for Long Video Understanding
Linli Yao
Haoning Wu
Kun Ouyang
Yuyao Zhang
Caiming Xiong
Bei Chen
Xu Sun
Junnan Li
VLM
VGen
58
0
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
234
1
0
12 Mar 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin
Zhiyu Tan
Mengping Yang
Xiaomeng Yang
Hao Li
90
0
0
12 Mar 2025
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Haoyu Zhang
Qiaohui Chu
Meng Liu
Yunxiao Wang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Yaowei Wang
Liqiang Nie
EgoV
75
0
0
12 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
63
0
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
210
0
0
12 Mar 2025
EgoBlind: Towards Egocentric Visual Assistance for the Blind People
Junbin Xiao
Nanxin Huang
Hao Qiu
Zhulin Tao
Xun Yang
Richang Hong
Ming Wang
Angela Yao
EgoV
VLM
70
0
0
11 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
214
1
0
11 Mar 2025
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments
Dongping Li
Tielong Cai
Tianci Tang
Wenhao Chai
Katherine Rose Driggs-Campbell
Gaoang Wang
LM&Ro
71
0
0
11 Mar 2025
Linguistic Knowledge Transfer Learning for Speech Enhancement
Kuo-Hsuan Hung
Xugang Lu
Szu-Wei Fu
Huan-Hsin Tseng
Hsin-Yi Lin
Chii-Wann Lin
Yu Tsao
VLM
70
0
0
10 Mar 2025
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen
Xufang Luo
Dongsheng Li
OffRL
LRM
75
3
0
10 Mar 2025
V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Guiwei Zhang
Tianyu Zhang
Mohan Zhou
Yalong Bai
Biye Li
66
0
0
10 Mar 2025
KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
Xiaoming Shi
Zeming Liu
Chenkai Zhang
Yiming Lei
Haitao Leng
...
Qingjie Liu
Wanxiang Che
Shaoguo Liu
Size Li
Yunhong Wang
59
1
0
10 Mar 2025
GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Xiang Lan
Feng Wu
Kai He
Qinghao Zhao
Shenda Hong
Mengling Feng
AI4TS
66
3
0
08 Mar 2025
A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
Wenzhuo Du
G. Wang
Guancheng Chen
Hang Zhao
Xiaochen Li
Jian Gao
244
0
0
08 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
216
0
0
08 Mar 2025
Is Your Video Language Model a Reliable Judge?
M. Liu
Wensheng Zhang
67
2
0
07 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
Hairu Wang
Can Huang
B. Li
VLM
63
2
0
06 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
49
2
0
05 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
71
2
0
05 Mar 2025
MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality
Shuaike Li
Kai Zhang
Qiang Liu
Enhong Chen
KELM
83
1
0
04 Mar 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou
Ke Mei
Yu Lu
Tianyi Wang
Fengyun Rao
94
2
0
03 Mar 2025
Towards Improved Text-Aligned Codebook Learning: Multi-Hierarchical Codebook-Text Alignment with Long Text
Guotao Liang
Baoquan Zhang
Zhiyuan Wen
Junteng Zhao
Yunming Ye
Kola Ye
Yao He
57
0
0
03 Mar 2025
Learning to Generate Long-term Future Narrations Describing Activities of Daily Living
Ramanathan Rajendiran
Debaditya Roy
Basura Fernando
VGen
46
0
0
03 Mar 2025
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos
Kangda Wei
Zhengyu Zhou
Bingqing Wang
Jun Araki
Lukas Lange
Ruihong Huang
Z. Feng
37
0
0
28 Feb 2025
Adaptive Keyframe Sampling for Long Video Understanding
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
85
0
0
28 Feb 2025
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
Yuhao Li
Mirana Claire Angel
Salman Khan
Yu Zhu
Jinqiu Sun
Yanning Zhang
Fahad Shahbaz Khan
VGen
51
0
0
27 Feb 2025
VideoA11y: Method and Dataset for Accessible Video Description
VideoA11y: Method and Dataset for Accessible Video Description
Chaoyu Li
Sid Padmanabhuni
Maryam Cheema
H. Seifi
Pooyan Fazli
VGen
70
0
0
27 Feb 2025
Omni-SILA: Towards Omni-scene Driven Visual Sentiment Identifying, Locating and Attributing in Videos
Jiamin Luo
Jingjing Wang
Junxiao Ma
Yujie Jin
Shoushan Li
Guodong Zhou
38
0
0
26 Feb 2025
Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM
Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM
Junxiao Ma
Jingjing Wang
Jiamin Luo
Peiying Yu
Guodong Zhou
41
0
0
26 Feb 2025
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
Wei Zhao
Pengxiang Ding
Hao Fei
Zhefei Gong
Shuanghao Bai
Han Zhao
Donglin Wang
93
6
0
24 Feb 2025
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
Gengyuan Zhang
Mingcong Ding
Tong Liu
Yao Zhang
Volker Tresp
84
1
0
24 Feb 2025
Introducing Visual Perception Token into Multimodal Large Language Model
Introducing Visual Perception Token into Multimodal Large Language Model
Runpeng Yu
Xinyin Ma
Xinchao Wang
MLLM
LRM
83
0
0
24 Feb 2025
OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering
OmniQuery: Contextually Augmenting Captured Multimodal Memory to Enable Personal Question Answering
Jiahao Nick Li
Zhuohao Jerry Zhang
Zhang
67
1
0
24 Feb 2025
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Hantao Lou
Changye Li
Yalan Qin
Yaodong Yang
53
1
0
22 Feb 2025
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Bo-Kai Ruan
Hao-Tang Tsui
Yung-Hui Li
Hong-Han Shuai
LM&Ro
86
5
0
20 Feb 2025
Can Hallucination Correction Improve Video-Language Alignment?
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
64
0
0
20 Feb 2025
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
Caihua Liu
Xu Li
Wenjing Xue
Wei Tang
Xia Feng
56
0
0
20 Feb 2025
Pretrained Image-Text Models are Secretly Video Captioners
Pretrained Image-Text Models are Secretly Video Captioners
Chunhui Zhang
Yiren Jian
Z. Ouyang
Soroush Vosoughi
VLM
95
4
0
20 Feb 2025
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning
Rui Zhao
Qirui Yuan
Jinyu Li
Haofeng Hu
Yun Li
Chengyuan Zheng
Fei Gao
LRM
52
4
0
19 Feb 2025
Megrez-Omni Technical Report
Boxun Li
Yadong Li
Zehan Li
Congyi Liu
Weilin Liu
...
Dong Zhou
Yueqing Zhuang
Shengen Yan
Guohao Dai
Yansen Wang
51
0
0
19 Feb 2025
Magma: A Foundation Model for Multimodal AI Agents
Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang
Reuben Tan
Qianhui Wu
Ruijie Zheng
Baolin Peng
...
Seonghyeon Ye
Joel Jang
Yuquan Deng
Lars Liden
Jianfeng Gao
VLM
AI4TS
122
9
0
18 Feb 2025
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding
Weikang Qiu
Zheng Huang
Haoyu Hu
Aosong Feng
Yujun Yan
Rex Ying
47
0
0
18 Feb 2025
Previous
123456...131415
Next