Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.05424
Cited By
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
8 June 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models"
50 / 462 papers shown
Title
Vript: A Video Is Worth Thousands of Words
Dongjie Yang
Suyuan Huang
Chengqiang Lu
Xiaodong Han
Haoxin Zhang
Yan Gao
Yao Hu
Hai Zhao
VGen
80
24
0
10 Jun 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Lin Chen
Xilin Wei
Jinsong Li
Xiaoyi Dong
Pan Zhang
...
Li Yuan
Yu Qiao
Dahua Lin
Feng Zhao
Jiaqi Wang
83
145
0
06 Jun 2024
VideoTetris: Towards Compositional Text-to-Video Generation
Ye Tian
Ling Yang
Haotian Yang
Yuan Gao
Yufan Deng
...
Zhaochen Yu
Xin Tao
Pengfei Wan
Di Zhang
Bin Cui
DiffM
VGen
95
17
0
06 Jun 2024
DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
Yidong Huang
Jacob Sansom
Ziqiao Ma
Felix Gervits
Joyce Chai
44
17
0
05 Jun 2024
From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models
Xiaofeng Zhang
Chen Shen
Xiaosong Yuan
Shaotian Yan
Liang Xie
Wenxiao Wang
Chaochen Gu
Hao Tang
Jieping Ye
54
2
0
04 Jun 2024
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Keqi Deng
Guangzhi Sun
Phil Woodland
VLM
44
4
0
01 Jun 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu
Yuhan Dai
Yondong Luo
Lei Li
Shuhuai Ren
...
Tong Xu
Xiawu Zheng
Enhong Chen
Rongrong Ji
Xing Sun
VLM
MLLM
50
308
0
31 May 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Ling-Hao Chen
Shunlin Lu
Ailing Zeng
Hao Zhang
Benyou Wang
Ruimao Zhang
Lei Zhang
57
33
0
30 May 2024
X-VILA: Cross-Modality Alignment for Large Language Model
Hanrong Ye
De-An Huang
Yao Lu
Zhiding Yu
Ming-Yu Liu
...
Jan Kautz
Song Han
Dan Xu
Pavlo Molchanov
Hongxu Yin
MLLM
VLM
51
32
0
29 May 2024
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang
Shoubin Yu
Elias Stengel-Eskin
Jaehong Yoon
Feng Cheng
Gedas Bertasius
Mohit Bansal
61
58
0
29 May 2024
The Evolution of Multimodal Model Architectures
S. Wadekar
Abhishek Chaurasia
Aman Chadha
Eugenio Culurciello
45
15
0
28 May 2024
Matryoshka Multimodal Models
Mu Cai
Jianwei Yang
Jianfeng Gao
Yong Jae Lee
VLM
60
28
0
27 May 2024
Hawk: Learning to Understand Open-World Video Anomalies
Jiaqi Tang
Hao Lu
Ruizheng Wu
Xiaogang Xu
Ke Ma
Cheng Fang
Bin Guo
Jiangbo Lu
Qifeng Chen
Ying-Cong Chen
VLM
45
9
0
27 May 2024
Streaming Long Video Understanding with Large Language Models
Rui Qian
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Shuangrui Ding
Dahua Lin
Jiaqi Wang
VLM
44
41
0
25 May 2024
Dense Connector for MLLMs
Huanjin Yao
Wenhao Wu
Taojiannan Yang
Yuxin Song
Mengxi Zhang
Haocheng Feng
Yifan Sun
Zhiheng Li
Wanli Ouyang
Jingdong Wang
MLLM
VLM
42
18
0
22 May 2024
Distilling Implicit Multimodal Knowledge into Large Language Models for Zero-Resource Dialogue Generation
Bo Zhang
Hui Ma
Jian Ding
Jian Wang 00021
Bo Xu
Hongfei Lin
VLM
43
1
0
16 May 2024
CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal
Khalid Saifullah
Ronen Basri
David Jacobs
Gowthami Somepalli
Tom Goldstein
48
40
0
14 May 2024
FreeVA: Offline MLLM as Training-Free Video Assistant
Wenhao Wu
VLM
OffRL
40
20
0
13 May 2024
DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS
Qingyang Li
Yihang Zhang
Zhidong Jia
Yannan Hu
Lei Zhang
Jianrong Zhang
Yongming Xu
Yong Cui
Xinggong Zhang
Xinggong Zhang
AI4CE
42
7
0
13 May 2024
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
Muhammad Uzair Khattak
Muhammad Ferjad Naeem
Jameel Hassan
Muzammal Naseer
Federico Tombari
Fahad Shahbaz Khan
Salman Khan
LRM
ELM
47
10
0
06 May 2024
WorldQA: Multimodal World Knowledge in Videos through Long-Chain Reasoning
Yuanhan Zhang
Kaichen Zhang
Bo Li
Fanyi Pu
Christopher Arif Setiadharma
Jingkang Yang
Ziwei Liu
VGen
52
8
0
06 May 2024
Octopi: Object Property Reasoning with Large Tactile-Language Models
Samson Yu
Kelvin Lin
Anxing Xiao
Jiafei Duan
Harold Soh
LRM
41
27
0
05 May 2024
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
Enxin Song
Wenhao Chai
Tianbo Ye
Lei Li
Xi Li
Gaoang Wang
VLM
MLLM
39
30
0
26 Apr 2024
MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition
Zheng Lian
Haiyang Sun
Guoying Zhao
Zhuofan Wen
Siyuan Zhang
...
Bin Liu
Min Zhang
Guoying Zhao
Björn W. Schuller
Jianhua Tao
VLM
46
11
0
26 Apr 2024
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Lin Xu
Yilin Zhao
Daquan Zhou
Zhijie Lin
See Kiong Ng
Jiashi Feng
MLLM
VLM
38
162
0
25 Apr 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Bohao Li
Yuying Ge
Yi Chen
Yixiao Ge
Ruimao Zhang
Ying Shan
VLM
45
42
0
25 Apr 2024
Pegasus-v1 Technical Report
Raehyuk Jung
Hyojun Go
Jaehyuk Yi
Jiho Jang
Daniel Kim
...
Maninder Saini
Meredith Sanders
Soyoung Lee
Sue Kim
Travis Couture
MLLM
VLM
29
5
0
23 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
54
20
0
22 Apr 2024
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua
Yunlong Tang
Chenliang Xu
Jiebo Luo
VGen
68
25
0
18 Apr 2024
From Image to Video, what do we need in multimodal LLMs?
Suyuan Huang
Haoxin Zhang
Yan Gao
Honggu Chen
Yan Gao
Yao Hu
Zengchang Qin
VLM
47
8
0
18 Apr 2024
The 8th AI City Challenge
Shuo Wang
D. Anastasiu
Zhenghang Tang
Ming-Ching Chang
Yue Yao
...
Xunlei Wu
S. Pusegaonkar
Yizhou Wang
Sujit Biswas
Rama Chellappa
58
32
0
15 Apr 2024
Audio Dialogues: Dialogues dataset for audio and music understanding
Arushi Goel
Zhifeng Kong
Rafael Valle
Bryan Catanzaro
AuLLM
42
4
0
11 Apr 2024
Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe
Satya Narayan Shukla
Omid Poursaeed
Michael S. Ryoo
Tsung-Yu Lin
LRM
54
26
0
11 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
83
89
0
08 Apr 2024
X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
Jan Held
Hani Itani
A. Cioppa
Silvio Giancola
Guohao Li
Marc Van Droogenbroeck
48
17
0
07 Apr 2024
Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model
Zhonghan Zhao
Ke Ma
Wenhao Chai
Xuan Wang
Kewei Chen
Dongxu Guo
Yanting Zhang
Hongwei Wang
Gaoang Wang
50
16
0
06 Apr 2024
JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
Simindokht Jahangard
Zhixi Cai
Shiki Wen
Hamid Rezatofighi
39
6
0
06 Apr 2024
Koala: Key frame-conditioned long video-LLM
Reuben Tan
Ximeng Sun
Ping Hu
Jui-hsien Wang
Hanieh Deilamsalehy
Bryan A. Plummer
Bryan C. Russell
Kate Saenko
40
36
0
05 Apr 2024
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Kirolos Ataallah
Xiaoqian Shen
Eslam Abdelrahman
Essam Sleiman
Deyao Zhu
Jian Ding
Mohamed Elhoseiny
VLM
47
67
0
04 Apr 2024
LongVLM: Efficient Long Video Understanding via Large Language Models
Yuetian Weng
Mingfei Han
Haoyu He
Xiaojun Chang
Bohan Zhuang
VLM
68
57
0
04 Apr 2024
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward
Ruohong Zhang
Liangke Gui
Zhiqing Sun
Yihao Feng
Keyang Xu
...
Di Fu
Chunyuan Li
Alexander G. Hauptmann
Yonatan Bisk
Yiming Yang
MLLM
58
62
0
01 Apr 2024
ST-LLM: Large Language Models Are Effective Temporal Learners
Ruyang Liu
Chen Li
Haoran Tang
Yixiao Ge
Ying Shan
Ge Li
54
70
0
30 Mar 2024
LITA: Language Instructed Temporal-Localization Assistant
De-An Huang
Shijia Liao
Subhashree Radhakrishnan
Hongxu Yin
Pavlo Molchanov
Zhiding Yu
Jan Kautz
VLM
50
49
0
27 Mar 2024
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
Wonkyun Kim
Changin Choi
Wonseok Lee
Wonjong Rhee
VLM
52
51
0
27 Mar 2024
Elysium: Exploring Object-level Perception in Videos via MLLM
Hang Wang
Yanjie Wang
Yongjie Ye
Yuxiang Nie
Can Huang
MLLM
42
19
0
25 Mar 2024
AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue
Yunlong Tang
Daiki Shimada
Jing Bi
Chenliang Xu
VGen
42
10
0
24 Mar 2024
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding
Yi Wang
Kunchang Li
Xinhao Li
Jiashuo Yu
Yinan He
...
Hongjie Zhang
Yifei Huang
Yu Qiao
Yali Wang
Limin Wang
44
50
0
22 Mar 2024
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
Ahmad A Mahmood
Ashmal Vayani
Muzammal Naseer
Salman Khan
Fahad Shahbaz Khan
LRM
58
7
0
21 Mar 2024
FMM-Attack: A Flow-based Multi-modal Adversarial Attack on Video-based LLMs
Jinmin Li
Kuofeng Gao
Yang Bai
Jingyun Zhang
Shu-Tao Xia
Yisen Wang
AAML
38
8
0
20 Mar 2024
RelationVLM: Making Large Vision-Language Models Understand Visual Relations
Zhipeng Huang
Zhizheng Zhang
Zheng-Jun Zha
Yan Lu
Baining Guo
VLM
44
3
0
19 Mar 2024
Previous
1
2
3
...
10
6
7
8
9
Next