ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.02858
  4. Cited By
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video
  Understanding
v1v2v3v4 (latest)

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM
ArXiv (abs)PDFHTMLHuggingFace (19 upvotes)

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 669 papers shown
Title
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
Minbin Huang
Yanxin Long
Xinchi Deng
Ruihang Chu
Jiangfeng Xiong
Xiaodan Liang
Hong Cheng
Qinglin Lu
Wei Liu
MLLMEGVM
278
19
0
13 Mar 2024
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages
Michael Andersland
61
0
0
11 Mar 2024
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
Xijia Tao
Shuai Zhong
Lei Li
Qi Liu
Lingpeng Kong
301
44
0
05 Mar 2024
DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
Zhende Song
Chenchen Wang
Jiamu Sheng
C. Zhang
Gang Yu
Jiayuan Fan
Tao Chen
VGen
348
21
0
03 Mar 2024
Evaluating Large Language Models as Virtual Annotators for Time-series
  Physical Sensing Data
Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data
Aritra Hota
S. Chatterjee
Sandip Chakraborty
327
20
0
02 Mar 2024
PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large
  Multimodal Models
PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
Dingkun Guo
Yuqi Xiang
Shuqi Zhao
Xinghao Zhu
Masayoshi Tomizuka
Mingyu Ding
Wei Zhan
182
14
0
26 Feb 2024
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Yao Mu
Junting Chen
Qinglong Zhang
Shoufa Chen
Qiaojun Yu
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Mingyu Ding
Ping Luo
212
44
0
25 Feb 2024
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced
  Safety Alignment
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Zhenghao Hu
Jiazhao Li
Yiquan Li
Xiangyu Qi
Junjie Hu
Yixuan Li
P. McDaniel
Muhao Chen
Bo Li
Chaowei Xiao
AAMLSILM
291
28
0
22 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Vasu Sharma
VLM
391
63
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
321
62
0
20 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRMVLM
272
110
0
19 Feb 2024
LVCHAT: Facilitating Long Video Comprehension
LVCHAT: Facilitating Long Video Comprehension
Yu Wang
Zeyuan Zhang
Julian McAuley
Zexue He
VLM
125
6
0
19 Feb 2024
Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
Yuqing Liu
Yu Wang
Lichao Sun
Philip S. Yu
170
16
0
13 Feb 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
591
128
0
13 Feb 2024
It's Never Too Late: Fusing Acoustic Information into Large Language
  Models for Automatic Speech Recognition
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
Chen Chen
Ruizhe Li
Yuchen Hu
Sabato Marco Siniscalchi
Pin-Yu Chen
Ensiong Chng
Chao-Han Huck Yang
179
32
0
08 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
368
13
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Jiaming Song
Yu Qiao
Shiyang Feng
MLLM
417
135
0
08 Feb 2024
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Sentiment-enhanced Graph-based Sarcasm Explanation in DialogueIEEE transactions on multimedia (IEEE TMM), 2024
Kun Ouyang
Liqiang Jing
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
383
7
0
06 Feb 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
MM-LLMs: Recent Advances in MultiModal Large Language ModelsAnnual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRLLRM
336
315
0
24 Jan 2024
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
  Reasoning over Image Sequences
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Xiyao Wang
Yuhang Zhou
Xiaoyu Liu
Hongjin Lu
Yuancheng Xu
...
Taixi Lu
Gedas Bertasius
Mohit Bansal
Huaxiu Yao
Furong Huang
LRMVLM
285
94
0
19 Jan 2024
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang
Weixin Luo
Qianyu Chen
Haonan Mai
Jindi Guo
Sixun Dong
Xiaohua Xuan
MLLMLLMAG
293
36
0
19 Jan 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
559
151
0
29 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A
  Survey
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
150
30
0
27 Dec 2023
Text-Conditioned Resampler For Long Form Video Understanding
Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar
Yongqin Xian
A. Tonioni
Andrew Zisserman
Federico Tombari
252
21
0
19 Dec 2023
SMILE: Multimodal Dataset for Understanding Laughter in Video with
  Language Models
SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
Lee Hyun
Kim Sung-Bin
Seungju Han
Youngjae Yu
Tae-Hyun Oh
288
21
0
15 Dec 2023
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction
  Tuning
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
Bolin Lai
Xiaoliang Dai
Lawrence Chen
Guan Pang
James M. Rehg
Miao Liu
241
21
0
06 Dec 2023
Reason2Drive: Towards Interpretable and Chain-based Reasoning for
  Autonomous Driving
Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving
Ming-Jun Nie
Renyuan Peng
Chunwei Wang
Xinyue Cai
Jianhua Han
Hang Xu
Li Zhang
LRM
229
99
0
06 Dec 2023
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
MLLM
215
18
0
04 Dec 2023
TimeChat: A Time-sensitive Multimodal Large Language Model for Long
  Video Understanding
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video UnderstandingComputer Vision and Pattern Recognition (CVPR), 2023
Shuhuai Ren
Linli Yao
Shicheng Li
Xu Sun
Lu Hou
VLMMLLM
276
326
0
04 Dec 2023
ChatPose: Chatting about 3D Human Pose
ChatPose: Chatting about 3D Human PoseComputer Vision and Pattern Recognition (CVPR), 2023
Yao Feng
Jing Lin
Sai Kumar Dwivedi
Yu Sun
Priyanka Patel
Michael J. Black
3DH
211
62
0
30 Nov 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
  representations to LLMs and Emergent Cross-modal Reasoning
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLMMLLM
233
68
0
30 Nov 2023
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language ModelIEEE transactions on multimedia (IEEE TMM), 2023
Fukun Yin
Xin Chen
C. Zhang
Biao Jiang
Zibo Zhao
Jiayuan Fan
Gang Yu
Taihao Li
Tao Chen
333
38
0
29 Nov 2023
M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation
M2^{2}2Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation
Yatian Wang
Rongyu Zhang
Zhengkai Jiang
Yijiang Liu
Ziyi Lin
Renrui Zhang
MLLM
202
2
0
29 Nov 2023
A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models
A Survey of the Evolution of Language Model-Based Dialogue Systems: Data, Task and Models
Hongru Wang
Lingzhi Wang
Yiming Du
Liang Chen
Jing Zhou
Yufei Wang
Kam-Fai Wong
LRM
354
23
0
28 Nov 2023
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision
  Language Models in Open-Ended Video Question Answering
AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question AnsweringEuropean Conference on Computer Vision (ECCV), 2023
Xiuyuan Chen
Yuan Lin
Yuchen Zhang
Weiran Huang
ELMMLLM
247
37
0
25 Nov 2023
Video-LLaVA: Learning United Visual Representation by Alignment Before
  Projection
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin
Yang Ye
Bin Zhu
Jiaxi Cui
Munan Ning
Peng Jin
Li-ming Yuan
VLMMLLM
955
1,088
0
16 Nov 2023
GRASP: A novel benchmark for evaluating language GRounding And Situated
  Physics understanding in multimodal language models
GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language modelsInternational Joint Conference on Artificial Intelligence (IJCAI), 2023
Serwan Jassim
Mario S. Holubar
Annika Richter
Cornelius Wolff
Xenia Ohmer
Elia Bruni
ELM
222
21
0
15 Nov 2023
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model
Jinjin Xu
Liwu Xu
Yuzhe Yang
Xiang Li
Fanyi Wang
Yanchun Xie
Yi-Jie Huang
Yaqian Li
MoEMLLMVLM
336
24
0
09 Nov 2023
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Zhen Yang
Yingxue Zhang
Fandong Meng
Jie Zhou
VLMMLLM
147
3
0
08 Nov 2023
LLM4Drive: A Survey of Large Language Models for Autonomous Driving
LLM4Drive: A Survey of Large Language Models for Autonomous Driving
Zhenjie Yang
Xiaosong Jia
Guoying Gu
Junchi Yan
ELM
478
157
0
02 Nov 2023
Large Language Models are Temporal and Causal Reasoners for Video
  Question Answering
Large Language Models are Temporal and Causal Reasoners for Video Question AnsweringConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Dohwan Ko
Ji Soo Lee
Wooyoung Kang
Byungseok Roh
Hyunwoo J. Kim
LRM
262
53
0
24 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
  Models
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jiné Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLMVLM
222
63
0
13 Oct 2023
Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog
Haoyu Zhang
Meng Liu
Yaowei Wang
Da Cao
Weili Guan
Liqiang Nie
286
1
0
11 Oct 2023
FireAct: Toward Language Agent Fine-tuning
FireAct: Toward Language Agent Fine-tuning
Baian Chen
Chang Shu
Ehsan Shareghi
Nigel Collier
Karthik Narasimhan
Shunyu Yao
ALMLLMAG
308
150
0
09 Oct 2023
Fine-grained Audio-Visual Joint Representations for Multimodal Large
  Language Models
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Chao Zhang
180
14
0
09 Oct 2023
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
Hao Sha
Yao Mu
Yuxuan Jiang
Li Chen
Chenfeng Xu
Ping Luo
Shengbo Eben Li
Masayoshi Tomizuka
Wei Zhan
Mingyu Ding
529
211
0
04 Oct 2023
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language ModelConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Avamarie Brueggeman
Andrea Madotto
Mohammad Kachuee
Tushar Nagarajan
Matt Smith
...
Peyman Heidari
Yue Liu
Kavya Srinet
Babak Damavandi
Anuj Kumar
MLLM
238
108
0
27 Sep 2023
Knowledge-Guided Short-Context Action Anticipation in Human-Centric
  Videos
Knowledge-Guided Short-Context Action Anticipation in Human-Centric Videos
Sarthak Bhagat
Simon Stepputtis
Joseph Campbell
Katia Sycara
170
4
0
12 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize
  Content And Behavior
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And BehaviorInternational Conference on Learning Representations (ICLR), 2023
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
267
10
0
01 Sep 2023
FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo
  Embeddings
FashionLOGO: Prompting Multimodal Large Language Models for Fashion Logo EmbeddingsInternational Conference on Information and Knowledge Management (CIKM), 2023
Yulin Su
Min Yang
Minghui Qiu
Jing Wang
Tao Wang
VLM
156
2
0
17 Aug 2023
Previous
123...121314
Next