Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 875 papers shown
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
177
78
0
27 Mar 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao
Shengju Qian
Han Xiao
Guanglu Song
Zhuofan Zong
Letian Wang
Yu Liu
Jiaming Song
VGen LRM MLLM
241
193
0
25 Mar 2024
Elysium: Exploring Object-level Perception in Videos via MLLM
Hang Wang
Yanjie Wang
Yongjie Ye
Yuxiang Nie
Can Huang
MLLM
274
37
0
25 Mar 2024
Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition
IEEE Transactions on Multimedia (IEEE TMM), 2024
Shijian Deng
Erin E. Kosloski
Siddhi Patel
Zeke A. Barnett
Yiyang Nan
...
William T. Doan
Matthew Wang
Harsh Singh
P. Rollins
Yapeng Tian
182
10
0
22 Mar 2024
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang
Mu Cai
Bingxin Xu
Yong Jae Lee
Yan Yan
VLM
380
201
0
22 Mar 2024
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Renrui Zhang
Dongzhi Jiang
Yichi Zhang
Haokun Lin
Ziyu Guo
...
Aojun Zhou
Pan Lu
Kai-Wei Chang
Shiyang Feng
Jiaming Song
180
433
0
21 Mar 2024
FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs
Jinmin Li
Kuofeng Gao
Yang Bai
Jingyun Zhang
Shu-Tao Xia
Yisen Wang
AAML
212
12
0
20 Mar 2024
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Zhengqing Yuan
Ruoxi Chen
Zhaoxu Li
Haolong Jia
Lifang He
Chi Wang
Lichao Sun
VGen
241
42
0
20 Mar 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke
Zhixi Cai
Simindokht Jahangard
Weiqing Wang
P. D. Haghighi
Hamid Rezatofighi
LRM
192
20
0
19 Mar 2024
RelationVLM: Making Large Vision-Language Models Understand Visual Relations
Zhipeng Huang
Zhizheng Zhang
Zheng-Jun Zha
Yan Lu
Baining Guo
VLM
108
6
0
19 Mar 2024
Contextual AD Narration with Interleaved Multimodal Sequence
Computer Vision and Pattern Recognition (CVPR), 2024
Hanlin Wang
Zhan Tong
Kecheng Zheng
Yujun Shen
Limin Wang
VGen
370
7
0
19 Mar 2024
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM LLMAG
255
140
0
18 Mar 2024
Towards Neuro-Symbolic Video Understanding
Minkyu Choi
Harsh Goel
Mohammad Omama
Yunhao Yang
Sahil Shah
Sandeep Chinchali
141
19
0
16 Mar 2024
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang
Xiaojun Meng
Jianxin Liang
Yuxuan Wang
Qun Liu
Dongyan Zhao
172
56
0
15 Mar 2024
NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation
IEEE Robotics and Automation Letters (RA-L), 2024
Ran Xu
Yan Shen
Xiaoqi Li
Kai Cheng
Hao Dong
LM&Ro
140
15
0
13 Mar 2024
Knowledge Conflicts for LLMs: A Survey
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Rongwu Xu
Zehan Qi
Zhijiang Guo
Cunxiang Wang
Hongru Wang
Yue Zhang
Wei Xu
586
192
0
13 Mar 2024
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation
Minbin Huang
Yanxin Long
Xinchi Deng
Ruihang Chu
Jiangfeng Xiong
Xiaodan Liang
Hong Cheng
Qinglin Lu
Wei Liu
MLLM EGVM
278
19
0
13 Mar 2024
TutoAI: A Cross-domain Framework for AI-assisted Mixed-media Tutorial Creation on Physical Tasks
International Conference on Human Factors in Computing Systems (CHI), 2024
Yuexi Chen
Vlad I. Morariu
Anh Truong
Zhicheng Liu
DiffM VGen
182
9
0
12 Mar 2024
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages
Michael Andersland
61
0
0
11 Mar 2024
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
European Conference on Computer Vision (ECCV), 2024
Qilang Ye
Zitong Yu
Rui Shao
Xinyu Xie
Juil Sock
Simeng Qin
MLLM
259
44
0
07 Mar 2024
Multimodal Large Language Models to Support Real-World Fact-Checking
Fauzan Farooqui
Yova Kementchedjhieva
Preslav Nakov
Iryna Gurevych
LRM
268
23
0
06 Mar 2024
Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges
Bosheng Ding
Chengwei Qin
Ruochen Zhao
Tianze Luo
Xinze Li
Guizhen Chen
Wenhan Xia
Junjie Hu
Anh Tuan Luu
Shafiq Joty
354
35
0
05 Mar 2024
ImgTrojan: Jailbreaking Vision-Language Models with ONE Image
Xijia Tao
Shuai Zhong
Lei Li
Qi Liu
Lingpeng Kong
313
44
0
05 Mar 2024
DreamFrame: Enhancing Video Understanding via Automatically Generated QA and Style-Consistent Keyframes
Zhende Song
Chenchen Wang
Jiamu Sheng
C. Zhang
Gang Yu
Jiayuan Fan
Tao Chen
VGen
348
21
0
03 Mar 2024
Evaluating Large Language Models as Virtual Annotators for Time-series Physical Sensing Data
Aritra Hota
S. Chatterjee
Sandip Chakraborty
335
20
0
02 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
334
316
0
29 Feb 2024
Navigating Hallucinations for Reasoning of Unintentional Activities
Shresth Grover
Vibhav Vineet
Yogesh S Rawat
LRM
252
2
0
29 Feb 2024
OSCaR: Object State Captioning and State Change Representation
Nguyen Nguyen
Jing Bi
Ali Vosoughi
Yapeng Tian
Pooyan Fazli
Chenliang Xu
456
14
0
27 Feb 2024
PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
Dingkun Guo
Yuqi Xiang
Shuqi Zhao
Xinghao Zhu
Masayoshi Tomizuka
Mingyu Ding
Wei Zhan
182
14
0
26 Feb 2024
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Yao Mu
Junting Chen
Qinglong Zhang
Shoufa Chen
Qiaojun Yu
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Mingyu Ding
Ping Luo
212
44
0
25 Feb 2024
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Zhenghao Hu
Jiazhao Li
Yiquan Li
Xiangyu Qi
Junjie Hu
Yixuan Li
P. McDaniel
Muhao Chen
Bo Li
Chaowei Xiao
AAML SILM
299
28
0
22 Feb 2024
Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
Jinyi Liu
Yifu Yuan
Jianye Hao
Fei Ni
Lingzhi Fu
Yibin Chen
Yan Zheng
LM&Ro
705
10
0
22 Feb 2024
Slot-VLM: SlowFast Slots for Video-Language Modeling
Jiaqi Xu
Cuiling Lan
Wenxuan Xie
Xuejin Chen
Yan Lu
MLLM VLM
103
10
0
20 Feb 2024
Model Composition for Multimodal Large Language Models
Chi Chen
Yiyang Du
Zheng Fang
Ziyue Wang
Ziyue Wang
...
Ming Yan
Ji Zhang
Fei Huang
Maosong Sun
Yang Liu
MoMe
128
7
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
321
62
0
20 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Vasu Sharma
VLM
423
64
0
20 Feb 2024
The Revolution of Multimodal Large Language Models: A Survey
Davide Caffagni
Federico Cocchi
Luca Barsellotti
Nicholas Moratelli
Sara Sarto
Lorenzo Baraldi
Lorenzo Baraldi
Marcella Cornia
Rita Cucchiara
LRM VLM
272
113
0
19 Feb 2024
LVCHAT: Facilitating Long Video Comprehension
Yu Wang
Zeyuan Zhang
Julian McAuley
Zexue He
VLM
125
6
0
19 Feb 2024
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models
Didi Zhu
Zhongyi Sun
Zexi Li
Zhenyuan Zhang
Ke Yan
Shouhong Ding
Kun Kuang
Chao Wu
CLL KELM MoMe
190
41
0
19 Feb 2024
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
Xinbei Ma
Zhuosheng Zhang
Hai Zhao
LLMAG
235
48
0
19 Feb 2024
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Long Qian
Juncheng Billy Li
Yu-hao Wu
Yaobo Ye
Hao Fei
Tat-Seng Chua
Yueting Zhuang
Siliang Tang
MLLM LRM
288
90
0
18 Feb 2024
RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
Jianhao Yuan
Shuyang Sun
Daniel Omeiza
Bo Zhao
Paul Newman
Lars Kunze
Matthew Gadd
LRM
268
73
0
16 Feb 2024
Rec-GPT4V: Multimodal Recommendation with Large Vision-Language Models
Yuqing Liu
Yu Wang
Lichao Sun
Philip S. Yu
170
16
0
13 Feb 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
599
129
0
13 Feb 2024
Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets
Israel Abebe Azime
A. Tonja
Tadesse Destaw Belay
Mitiku Yohannes Fuge
A. Wassie
Eyasu Shiferaw Jada
Yonas Chanie
W. Sewunetie
Seid Muhie Yimam
204
7
0
12 Feb 2024
Unsupervised Sign Language Translation and Generation
Zhengsheng Guo
Zhiwei He
Wenxiang Jiao
Xing Wang
Rui Wang
Kehai Chen
Zhaopeng Tu
Yong-mei Xu
Min Zhang
179
4
0
12 Feb 2024
It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition
Chen Chen
Ruizhe Li
Yuchen Hu
Sabato Marco Siniscalchi
Pin-Yu Chen
Ensiong Chng
Chao-Han Huck Yang
179
32
0
08 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Jiaming Song
Yu Qiao
Shiyang Feng
MLLM
417
135
0
08 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
388
13
0
08 Feb 2024
RA-Rec: An Efficient ID Representation Alignment Framework for LLM-based Recommendation
Xiaohan Yu
Li Zhang
Xin Zhao
Yue Wang
Zhongrui Ma
145
14
0
07 Feb 2024