Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM
arXiv: 2306.02858

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 703 papers shown
GUI Action Narrator: Where and When Did That Action Take Place?
Qinchen Wu
Difei Gao
Kevin Qinghong Lin
Zhuoyu Wu
Xiangwu Guo
Peiran Li
Weichen Zhang
Hengxu Wang
Mike Zheng Shou
45
3
0
19 Jun 2024
DrVideo: Document Retrieval Based Long Video Understanding
Ziyu Ma
Chenhui Gou
Hengcan Shi
Bin Sun
Shutao Li
Hamid Rezatofighi
Jianfei Cai
VLM
36
13
0
18 Jun 2024
Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM
Huaxin Zhang
Xiaohao Xu
Xiang Wang
Jialong Zuo
Chuchu Han
Xiaonan Huang
Changxin Gao
Yuehuan Wang
Nong Sang
69
17
0
18 Jun 2024
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Xubing Ye
Yukang Gan
Xiaoke Huang
Yixiao Ge
Yansong Tang
MLLM
VLM
45
23
0
18 Jun 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen
Zhaoyang Lv
Shiwei Wu
Kevin Qinghong Lin
Chenan Song
Difei Gao
Jia-Wei Liu
Ziteng Gao
Dongxing Mao
Mike Zheng Shou
MLLM
MoMe
57
50
0
17 Jun 2024
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
Dantong Niu
Yuvan Sharma
Giscard Biamby
Jerome Quenum
Yutong Bai
Baifeng Shi
Trevor Darrell
Roei Herzig
LM&Ro
VLM
50
24
0
17 Jun 2024
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li
Xinyu Chen
Baotian Hu
Longyue Wang
Haoyuan Shi
Min-Ling Zhang
MLLM
LRM
63
26
0
17 Jun 2024
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo
Yibo Yan
Boren Hu
Yutao Yue
Xuming Hu
LRM
MLLM
45
7
0
17 Jun 2024
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Zebang Cheng
Zhi-Qi Cheng
Jun-Yan He
Jingdong Sun
Kai Wang
Yuxiang Lin
Zheng Lian
Xiaojiang Peng
Alexander G. Hauptmann
MLLM
40
31
0
17 Jun 2024
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Yujie Lu
Dongfu Jiang
Wenhu Chen
William Yang Wang
Yejin Choi
Bill Yuchen Lin
VLM
58
26
0
16 Jun 2024
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies
Hung-Ting Su
Chun-Tong Chao
Ya-Ching Hsu
Xudong Lin
Yulei Niu
Hung-Yi Lee
Winston H. Hsu
LRM
41
1
0
16 Jun 2024
Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR
Minghan Wang
Yuxia Wang
Thuy-Trang Vu
Ehsan Shareghi
Gholamreza Haffari
37
0
0
16 Jun 2024
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding
Yiqi Wu
Xiaodan Hu
Ziming Fu
Siling Zhou
Jiangong Li
MLLM
40
10
0
14 Jun 2024
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
Rohit K Bharadwaj
Hanan Gani
Muzammal Naseer
Fahad Shahbaz Khan
Salman Khan
70
3
0
14 Jun 2024
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar
Oggi Rudovic
Sameer Dharur
Florian Pesce
G. Krishna
Aswin Sivaraman
Jack Berkowitz
Ahmed Hussen Abdelaziz
Saurabh N. Adya
Ahmed H. Tewfik
VLM
60
0
0
13 Jun 2024
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad A Khan
VLM
MLLM
42
51
0
13 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM
LRM
49
1
0
13 Jun 2024
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zijia Zhao
Haoyu Lu
Yuqi Huo
Yifan Du
Tongtian Yue
Longteng Guo
Bingning Wang
Weipeng Chen
Jing Liu
49
2
0
13 Jun 2024
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He
Weixi Feng
Kaizhi Zheng
Yujie Lu
Wanrong Zhu
...
Zhengyuan Yang
Kevin Lin
William Yang Wang
Lijuan Wang
Xin Eric Wang
VGen
LRM
51
12
0
12 Jun 2024
Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models
Chun-Yi Kuan
Wei-Ping Huang
Hung-yi Lee
AuLLM
31
7
0
12 Jun 2024
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang
Yiqin Wang
Yansong Tang
Yong-Jin Liu
Jiashi Feng
Jifeng Dai
Xiaojie Jin
52
38
0
12 Jun 2024
Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
Shimin Chen
Yitian Yuan
Shaoxiang Chen
Zequn Jie
Lin Ma
VLM
35
3
0
12 Jun 2024
TRINS: Towards Multimodal Language Models that Can Read
Ruiyi Zhang
Yanzhe Zhang
Jian Chen
Yufan Zhou
Jiuxiang Gu
Changyou Chen
Tong Sun
VLM
39
6
0
10 Jun 2024
Multimodal Contextualized Semantic Parsing from Speech
Jordan Voas
Raymond Mooney
David Harwath
56
0
0
10 Jun 2024
iMotion-LLM: Motion Prediction Instruction Tuning
Abdulwahab Felemban
Eslam Mohamed Bakr
Xiaoqian Shen
Jian Ding
Abduallah A. Mohamed
Mohamed Elhoseiny
60
1
0
10 Jun 2024
Vript: A Video Is Worth Thousands of Words
Dongjie Yang
Suyuan Huang
Chengqiang Lu
Xiaodong Han
Haoxin Zhang
Yan Gao
Yao Hu
Hai Zhao
VGen
80
24
0
10 Jun 2024
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models
Xi Li
Yusen Zhang
Renze Lou
Chen Wu
Jiaqi Wang
LRM
AAML
45
12
0
10 Jun 2024
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
Thong Nguyen
Yi Bin
Junbin Xiao
Leigang Qu
Yicong Li
Jay Zhangjie Wu
Cong-Duy Nguyen
See-Kiong Ng
Luu Anh Tuan
VLM
64
10
1
09 Jun 2024
VP-LLM: Text-Driven 3D Volume Completion with Large Language Models through Patchification
Jianmeng Liu
Yichen Liu
Yuyao Zhang
Zeyuan Meng
Yu-Wing Tai
Chi-Keung Tang
49
0
0
08 Jun 2024
Seeing the Unseen: Visual Metaphor Captioning for Videos
Abisek Rajakumar Kalarani
Pushpak Bhattacharyya
Sumit Shekhar
VLM
32
1
0
07 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
83
145
0
06 Jun 2024
VideoTetris: Towards Compositional Text-to-Video Generation
Ye Tian
Ling Yang
Haotian Yang
Yuan Gao
Yufan Deng
...
Zhaochen Yu
Xin Tao
Pengfei Wan
Di Zhang
Bin Cui
DiffM
VGen
95
17
0
06 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Yu Guo
VGen
104
16
0
06 Jun 2024
AD-H: Autonomous Driving with Hierarchical Agents
Zaibin Zhang
Shiyu Tang
Yuanhang Zhang
Talas Fu
Yifan Wang
Yang Liu
Dong Wang
Jing Shao
Lijun Wang
H. Lu
54
3
0
05 Jun 2024
Evaluation of data inconsistency for multi-modal sentiment analysis
Yufei Wang
Mengyue Wu
46
1
0
05 Jun 2024
Multi-layer Learnable Attention Mask for Multimodal Tasks
Wayner Barrios
SouYoung Jin
39
0
0
04 Jun 2024
From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models
Xiaofeng Zhang
Chen Shen
Xiaosong Yuan
Shaotian Yan
Liang Xie
Wenxiao Wang
Chaochen Gu
Hao Tang
Jieping Ye
54
2
0
04 Jun 2024
Multimodal Reasoning with Multimodal Knowledge Graph
Junlin Lee
Yequan Wang
Jing Li
Min Zhang
46
15
0
04 Jun 2024
Towards Practical Single-shot Motion Synthesis
Konstantinos Roditakis
Spyridon Thermos
N. Zioulis
VGen
48
0
0
03 Jun 2024
Artemis: Towards Referential Understanding in Complex Videos
Jihao Qiu
Yuan Zhang
Xi Tang
Lingxi Xie
Tianren Ma
Pengyu Yan
David Doermann
Qixiang Ye
Yunjie Tian
VLM
VGen
57
8
0
01 Jun 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yongdong Luo
Lei Li
Shuhuai Ren
...
Tong Xu
Xiawu Zheng
Enhong Chen
Rongrong Ji
Xing Sun
VLM
MLLM
50
308
0
31 May 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
57
33
0
30 May 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
49
14
0
30 May 2024
Temporal Grounding of Activities using Multimodal Large Language Models
Young Chol Song
51
0
0
30 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
51
32
0
29 May 2024
Matryoshka Query Transformer for Large Vision-Language Models
Wenbo Hu
Zi-Yi Dou
Liunian Harold Li
Amita Kamath
Nanyun Peng
Kai-Wei Chang
MLLM
41
8
0
29 May 2024
MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model
Ziqi Ren
Jie Li
Xuetong Xue
Xin Li
Fan Yang
Zhicheng Jiao
Xinbo Gao
46
3
0
29 May 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
61
58
0
29 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
45
15
0
28 May 2024
Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
Kevin Dela Rosa
21
5
0
27 May 2024