ResearchTrend.AI
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

5 June 2023
Hang Zhang
Xin Li
Lidong Bing
    MLLM

Papers citing "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding"

50 / 703 papers shown
Title
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
Jianxin Liang
Xiaojun Meng
Yueqian Wang
Chang Liu
Qun Liu
Dongyan Zhao
42
5
0
21 Jul 2024
Audio-visual training for improved grounding in video-text LLMs
Shivprasad Sagare
Hemachandran S
Kinshuk Sarabhai
Prashant Ullegaddi
SA Rajeshkumar
32
0
0
21 Jul 2024
A Comprehensive Review of Few-shot Action Recognition
Yuyang Wanyan
Xiaoshan Yang
Weiming Dong
Changsheng Xu
VLM
80
3
0
20 Jul 2024
On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan
Yen-Chun Chen
Mengchen Liu
Lu Yuan
Leonid Sigal
50
5
0
19 Jul 2024
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Mingchen Zhuge
Jian Ding
Deyao Zhu
Jürgen Schmidhuber
Mohamed Elhoseiny
VLM
30
18
0
17 Jul 2024
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
Jie Yang
Xuesong Niu
Nan Jiang
Ruimao Zhang
Siyuan Huang
35
9
0
17 Jul 2024
VISA: Reasoning Video Object Segmentation via Large Language Models
Cilin Yan
Haochen Wang
Shilin Yan
Xiaolong Jiang
Yao Hu
Guoliang Kang
Weidi Xie
E. Gavves
LRM
VLM
VOS
50
28
0
16 Jul 2024
Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Yaoting Wang
Peiwen Sun
Yuanchao Li
Honggang Zhang
Di Hu
51
5
0
15 Jul 2024
Follow the Rules: Reasoning for Video Anomaly Detection with Large Language Models
Yuchen Yang
Kwonjoon Lee
Behzad Dariush
Yinzhi Cao
Shao-Yuan Lo
LRM
49
12
0
14 Jul 2024
Refusing Safe Prompts for Multi-modal Large Language Models
Zedian Shao
Hongbin Liu
Yuepeng Hu
Neil Zhenqiang Gong
MLLM
LRM
46
1
0
12 Jul 2024
Bora: Biomedical Generalist Video Generation Model
Weixiang Sun
Xiaocao You
Ruizhe Zheng
Zhengqing Yuan
Xiang Li
Lifang He
Quanzheng Li
Lichao Sun
VGen
MedIm
30
8
0
12 Jul 2024
Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection
Xingyu Peng
Yan Bai
Chen Gao
Lirong Yang
Fei Xia
Beipeng Mu
Xiaofei Wang
Si Liu
ObjD
48
3
0
12 Jul 2024
GOFA: A Generative One-For-All Model for Joint Graph Language Modeling
Lecheng Kong
Jiarui Feng
Hao Liu
Chengsong Huang
Jiaxin Huang
Yixin Chen
Muhan Zhang
AI4CE
77
8
0
12 Jul 2024
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang
Xinpeng Ding
Chunwei Wang
J. N. Han
Yulong Liu
Hengshuang Zhao
Hang Xu
Lu Hou
Wei Zhang
Xiaodan Liang
VLM
31
8
0
11 Jul 2024
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Minghui Wu
Chenxu Zhao
Anyang Su
Donglin Di
Tianyu Fu
...
Min He
Ya Gao
Meng Ma
Kun Yan
Ping Wang
35
0
0
11 Jul 2024
AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition
Zheng Lian
Haiyang Sun
Guoying Zhao
Jiangyan Yi
Bin Liu
Jianhua Tao
60
2
0
10 Jul 2024
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Wanggui He
Siming Fu
Mushui Liu
Xierui Wang
Wenyi Xiao
...
Zhelun Yu
Haoyuan Li
Ziwei Huang
Leilei Gan
Hao Jiang
DiffM
29
23
0
10 Jul 2024
Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task
Yiran Yang
Jinchao Zhang
Ying Deng
Jie Zhou
DiffM
36
0
0
09 Jul 2024
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Orr Zohar
Xiaohan Wang
Yonatan Bitton
Idan Szpektor
Serena Yeung-Levy
VLM
LRM
63
8
0
08 Jul 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serrà
50
2
0
08 Jul 2024
Multimodal Language Models for Domain-Specific Procedural Video Summarization
Nafisa Hussain
50
0
0
07 Jul 2024
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Pan Zhang
Xiaoyi Dong
Yuhang Zang
Yuhang Cao
Rui Qian
...
Kai Chen
Jifeng Dai
Yu Qiao
Dahua Lin
Jiaqi Wang
47
100
0
03 Jul 2024
KeyVideoLLM: Towards Large-scale Video Keyframe Selection
Hao Liang
Jiapeng Li
Tianyi Bai
Xijie Huang
Linzhuang Sun
Zhengren Wang
Conghui He
Bin Cui
Chong Chen
Wentao Zhang
VGen
34
7
0
03 Jul 2024
Video Watermarking: Safeguarding Your Video from (Unauthorized) Annotations by Video-based LLMs
Jinmin Li
Kuofeng Gao
Yang Bai
Jingyun Zhang
Shu-Tao Xia
50
4
0
02 Jul 2024
FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
Haodong Chen
Haojian Huang
Junhao Dong
Mingzhe Zheng
Dian Shao
50
16
0
02 Jul 2024
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury
Sayan Nag
Subhrajyoti Dasgupta
Jun Chen
Mohamed Elhoseiny
Ruohan Gao
Dinesh Manocha
VLM
MLLM
49
9
0
01 Jul 2024
Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
Ran Tian
Boyi Li
Xinshuo Weng
Yuxiao Chen
Edward Schmerling
Yue Wang
Boris Ivanovic
Marco Pavone
60
14
0
01 Jul 2024
Tarsier: Recipes for Training and Evaluating Large Video Description Models
Jiawei Wang
Liping Yuan
Yuchen Zhang
49
52
0
30 Jun 2024
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Sukmin Yun
Haokun Lin
Rusiru Thushara
Mohammad Qazim Bhat
Yongxin Wang
...
Timothy Baldwin
Zhengzhong Liu
Eric P. Xing
Xiaodan Liang
Zhiqiang Shen
54
10
0
28 Jun 2024
InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
Kirolos Ataallah
Chenhui Gou
Eslam Abdelrahman
Khushbu Pahwa
Jian Ding
Mohamed Elhoseiny
VLM
43
8
0
28 Jun 2024
ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen
Yu-Chien Liao
Hsi-Che Lin
Yu-Chu Yu
Yen-Chun Chen
Yu-Chiang Frank Wang
37
10
0
27 Jun 2024
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Tao Zhang
Xiangtai Li
Hao Fei
Haobo Yuan
Shengqiong Wu
Shunping Ji
Chen Change Loy
Shuicheng Yan
LRM
MLLM
VLM
54
49
0
27 Jun 2024
From Efficient Multimodal Models to World Models: A Survey
Xinji Mai
Zeng Tao
Junxiong Lin
Haoran Wang
Yang Chang
Yanlan Kang
Yan Wang
Wenqiang Zhang
45
5
0
27 Jun 2024
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
Hao Fei
Shengqiong Wu
Meishan Zhang
Min Zhang
Tat-Seng Chua
Shuicheng Yan
AI4TS
47
40
0
27 Jun 2024
MatchTime: Towards Automatic Soccer Game Commentary Generation
Jiayuan Rao
Haoning Wu
Chang-rui Liu
Yanfeng Wang
Weidi Xie
45
7
0
26 Jun 2024
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
Jiafeng Liang
Shixin Jiang
Zekun Wang
Haojie Pan
Zerui Chen
Zheng Chu
Ming Liu
Ruiji Fu
Zhongyuan Wang
Bing Qin
34
2
0
26 Jun 2024
MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Weitong Cai
Jiabo Huang
Shaogang Gong
Hailin Jin
Yang Liu
44
0
0
25 Jun 2024
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Xiangyu Zhao
Xiangtai Li
Haodong Duan
Haian Huang
Yining Li
Kai Chen
Hua Yang
VLM
MLLM
50
10
0
25 Jun 2024
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
Hao Yang
Lizhen Qu
Ehsan Shareghi
Gholamreza Haffari
36
0
0
25 Jun 2024
Zero-Shot Long-Form Video Understanding through Screenplay
Yongliang Wu
Bozheng Li
Jiawang Cao
Wenbo Zhu
Yi Lu
...
Chuyun Xie
Haolin Zheng
Ziyue Su
Jay Wu
Xu Yang
48
4
0
25 Jun 2024
Unlocking Continual Learning Abilities in Language Models
Wenyu Du
Shuang Cheng
Tongxu Luo
Zihan Qiu
Zeyu Huang
Ka Chun Cheung
Reynold Cheng
Jie Fu
KELM
CLL
56
7
0
25 Jun 2024
Long Context Transfer from Language to Vision
Peiyuan Zhang
Kaichen Zhang
Bo Li
Guangtao Zeng
Jingkang Yang
Yuanhan Zhang
Ziyue Wang
Haoran Tan
Chunyuan Li
Ziwei Liu
VLM
72
146
0
24 Jun 2024
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Lu Zhang
Tiancheng Zhao
Heting Ying
Yibo Ma
Kyusong Lee
LLMAG
38
9
0
24 Jun 2024
QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds
Ye Wang
Yuting Mei
Sipeng Zheng
Qin Jin
LRM
47
2
0
24 Jun 2024
EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
Zhiyu Tan
Xiaomeng Yang
Luozheng Qin
Mengping Yang
Cheng Zhang
Hao Li
57
7
0
24 Jun 2024
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Yuxuan Wang
Yueqian Wang
Dongyan Zhao
Cihang Xie
Zilong Zheng
MLLM
VLM
59
26
0
24 Jun 2024
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Guangzhi Sun
Wenyi Yu
Changli Tang
Xianzhao Chen
Tian Tan
Wei Li
Lu Lu
Zejun Ma
Yuxuan Wang
Chao Zhang
45
23
0
22 Jun 2024
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Xuan He
Dongfu Jiang
Ge Zhang
Max W.F. Ku
Achint Soni
...
Yaswanth Narsupalli
Rongqi Fan
Zhiheng Lyu
Yuchen Lin
Wenhu Chen
EGVM
VGen
ALM
58
42
0
21 Jun 2024
Towards Event-oriented Long Video Understanding
Yifan Du
Kun Zhou
Yuqi Huo
Yifan Li
Wayne Xin Zhao
Haoyu Lu
Zijia Zhao
Bingning Wang
Weipeng Chen
Ji-Rong Wen
VLM
43
14
0
20 Jun 2024
Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
Zhawnen Chen
Tianchun Wang
Yizhou Wang
Michal Kosinski
Xiang Zhang
Yun Fu
Sheng Li
LRM
34
2
0
19 Jun 2024