
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, Lidong Bing
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 · 5 June 2023 · MLLM
ArXiv (abs) · PDF · HTML · HuggingFace (19 upvotes)

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 of 875 citing papers shown.
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Kun Ouyang, Liqiang Jing, Xuemeng Song, Meng Liu, Yupeng Hu, Liqiang Nie
IEEE Transactions on Multimedia (IEEE TMM), 2024 · 06 Feb 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, ..., Yuliang Liu, Chen Zhang, Yang Song, Kun Gai, Yadong Mu
International Conference on Machine Learning (ICML), 2024 · 05 Feb 2024 · VGen
Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives
Sheng Luo, Wei Chen, Wanxin Tian, Rui Liu, Luanxuan Hou, ..., Ling Shao, Yi Yang, Bojun Gao, Qun Li, Guobin Wu
IEEE Transactions on Intelligent Vehicles (TIV), 2024 · 05 Feb 2024
A Survey for Foundation Models in Autonomous Driving
Haoxiang Gao, Yaqian Li, Kaiwen Long, Ming Yang, Yiqing Shen
02 Feb 2024 · VLM, LRM
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
Yuanhuiyi Lyu, Xueye Zheng, Lin Wang
31 Jan 2024 · DiffM
A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming
Pengyuan Zhou, Lin Wang, Zhi Liu, Yanbin Hao, Pan Hui, Sasu Tarkoma, J. Kangasharju
30 Jan 2024 · VGen
Towards 3D Molecule-Text Interpretation in Language Models
Changhao Nai, Zhiyuan Liu, Yancheng Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, Qi Tian
International Conference on Learning Representations (ICLR), 2024 · 25 Jan 2024 · AI4CE
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, Dong Yu
Annual Meeting of the Association for Computational Linguistics (ACL), 2024 · 24 Jan 2024 · OffRL, LRM
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, ..., Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
19 Jan 2024 · LRM, VLM
Large Language Models are Efficient Learners of Noise-Robust Speech Recognition
Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Chao Zhang, Pin-Yu Chen, Ensiong Chng
19 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun Dong, Xiaohua Xuan
19 Jan 2024 · MLLM, LLMAG
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models
Li Sun, Liuan Wang, Jun Sun, Takayuki Okatani
18 Jan 2024 · MLLM
On the Audio Hallucinations in Large Audio-Video Language Models
Taichi Nishimura, Shota Nakada, Masayoshi Kondo
18 Jan 2024 · VLM
MMToM-QA: Multimodal Theory of Mind Question Answering
Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, T. Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu
Annual Meeting of the Association for Computational Linguistics (ACL), 2024 · 16 Jan 2024
Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, ..., Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
15 Jan 2024 · VGen
GroundingGPT: Language Enhanced Multi-modal Grounding Model
Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, ..., Junting Pan, Zefeng Li, Van Tu Vu, Zhida Huang, Tao Wang
Annual Meeting of the Association for Computational Linguistics (ACL), 2024 · 11 Jan 2024
Video Anomaly Detection and Explanation via Large Language Models
Hui Lv, Qianru Sun
11 Jan 2024
SonicVisionLM: Playing Sound with Vision Language Models
Zhifeng Xie, Shengye Yu, Qile He, Mengtian Li
Computer Vision and Pattern Recognition (CVPR), 2024 · 09 Jan 2024 · VLM, VGen
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
Yueqian Wang, Yuxuan Wang, Kai Chen, Dongyan Zhao
AAAI Conference on Artificial Intelligence (AAAI), 2024 · 08 Jan 2024
LightHouse: A Survey of AGI Hallucination
Feng Wang
08 Jan 2024 · LRM, HILM, VLM
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić, Sergi Caelles, Michael Tschannen
03 Jan 2024 · LRM, VLM
Detours for Navigating Instructional Videos
Kumar Ashutosh, Zihui Xue, Tushar Nagarajan, Kristen Grauman
Computer Vision and Pattern Recognition (CVPR), 2024 · 03 Jan 2024
Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models
Xinpeng Ding, Jinahua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li
Computer Vision and Pattern Recognition (CVPR), 2024 · 02 Jan 2024
E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie
International Symposium on Chinese Spoken Language Processing (ISCSLP), 2023 · 31 Dec 2023 · AuLLM
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Hong-ping Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023 · 30 Dec 2023
Video Understanding with Large Language Models: A Survey
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, ..., Feng Zheng, Jianguo Zhang, Chenliang Xu, Jiebo Luo, Chenliang Xu
29 Dec 2023 · VLM
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, Aniruddha Kembhavi
28 Dec 2023 · VLM, MLLM
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
Houlun Chen, Xin Wang, Hong Chen, Zihan Song, Jia Jia, Wenwu Zhu
28 Dec 2023 · LRM
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Shijian Lu
27 Dec 2023
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar, Yongqin Xian, A. Tonioni, Andrew Zisserman, Federico Tombari
19 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh
15 Dec 2023
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu
06 Dec 2023
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming-Jun Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, Li Zhang
06 Dec 2023 · LRM
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang, Ruiyi Zhang, Haoliang Wang, Uttaran Bhattacharya, Yun Fu, Gang Wu
04 Dec 2023 · MLLM
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
Computer Vision and Pattern Recognition (CVPR), 2023 · 04 Dec 2023 · VLM, MLLM
ChatPose: Chatting about 3D Human Pose
Yao Feng, Jing Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black
Computer Vision and Pattern Recognition (CVPR), 2023 · 30 Nov 2023 · 3DH
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
30 Nov 2023 · VLM, MLLM
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
Fukun Yin, Xin Chen, C. Zhang, Biao Jiang, Zibo Zhao, Jiayuan Fan, Gang Yu, Taihao Li, Tao Chen
IEEE Transactions on Multimedia (IEEE TMM), 2023 · 29 Nov 2023
M²Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation
Yatian Wang, Rongyu Zhang, Zhengkai Jiang, Yijiang Liu, Ziyi Lin, Renrui Zhang
29 Nov 2023 · MLLM
A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models
Hongru Wang, Lingzhi Wang, Yiming Du, Liang Chen, Jing Zhou, Yufei Wang, Kam-Fai Wong
28 Nov 2023 · LRM
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
Xiuyuan Chen, Yuan Lin, Yuchen Zhang, Weiran Huang
European Conference on Computer Vision (ECCV), 2023 · 25 Nov 2023 · ELM, MLLM
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li-ming Yuan
16 Nov 2023 · VLM, MLLM
GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models
Serwan Jassim, Mario S. Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, Elia Bruni
International Joint Conference on Artificial Intelligence (IJCAI), 2023 · 15 Nov 2023 · ELM
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Jinjin Xu, Liwu Xu, Yuzhe Yang, Xiang Li, Fanyi Wang, Yanchun Xie, Yi-Jie Huang, Yaqian Li
09 Nov 2023 · MoE, MLLM, VLM
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
08 Nov 2023 · VLM, MLLM
LLM4Drive: A Survey of Large Language Models for Autonomous Driving
Zhenjie Yang, Xiaosong Jia, Guoying Gu, Junchi Yan
02 Nov 2023 · ELM
Large Language Models are Temporal and Causal Reasoners for Video Question Answering
Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 · 24 Oct 2023 · LRM
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jiné Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong
13 Oct 2023 · MLLM, VLM
Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
Haoyu Zhang, Meng Liu, Yaowei Wang, Da Cao, Weili Guan, Liqiang Nie
11 Oct 2023
FireAct: Toward Language Agent Fine-tuning
Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao
09 Oct 2023 · ALM, LLMAG