Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2306.05424
Cited By
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
8 June 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models"
50 / 462 papers shown
Title
RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan
Yunfan Ye
Yuanjing Luo
Qian Wan
Fang Liu
Zhiping Cai
VLM
72
1
0
11 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
229
1
0
11 Mar 2025
LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Hanyu Zhou
Gim Hee Lee
47
0
0
10 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Yue Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
228
0
0
08 Mar 2025
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Mohsen Gholami
Mohammad Akbari
Kevin Cannons
Yong Zhang
65
0
0
07 Mar 2025
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Haiyang Yu
Jinghui Lu
Yanjie Wang
Yang Li
Han Wang
Can Huang
B. Li
VLM
63
2
0
06 Mar 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
Wei Li
Bing Hu
Rui Shao
Leyang Shen
Liqiang Nie
52
2
0
05 Mar 2025
EgoLife: Towards Egocentric Life Assistant
Jingkang Yang
Shuai Liu
Hongming Guo
Yuhao Dong
X. Zhang
...
Joerg Widmer
Francesco Gringoli
Lei Yang
Bo Li
Ziwei Liu
EgoV
71
2
0
05 Mar 2025
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar
Gursimran Singh
Mohammad Akbari
Yong Zhang
VLM
79
0
0
04 Mar 2025
Advancing vision-language models in front-end development via data synthesis
Tong Ge
Yashu Liu
Jieping Ye
Tianyi Li
Chao Wang
78
0
0
03 Mar 2025
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
Zitang Zhou
Ke Mei
Yu Lu
Tianyi Wang
Fengyun Rao
94
2
0
03 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
76
4
0
01 Mar 2025
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang
Jihao Qiu
Lingxi Xie
Yunjie Tian
Jianbin Jiao
Qixiang Ye
90
0
0
28 Feb 2025
VideoA11y: Method and Dataset for Accessible Video Description
Chaoyu Li
Sid Padmanabhuni
Maryam Cheema
H. Seifi
Pooyan Fazli
VGen
70
0
0
27 Feb 2025
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
Yuhao Li
Mirana Claire Angel
Salman Khan
Yu Zhu
Jinqiu Sun
Yanning Zhang
Fahad Shahbaz Khan
VGen
51
0
0
27 Feb 2025
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs
Gengyuan Zhang
Mingcong Ding
Tong Liu
Yao Zhang
Volker Tresp
86
1
0
24 Feb 2025
Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning
Yuqi Pang
Bowen Yang
Haoqin Tu
Yun Cao
Zeyu Zhang
LRM
MLLM
66
0
0
17 Feb 2025
Unhackable Temporal Rewarding for Scalable Video MLLMs
En Yu
Kangheng Lin
Liang Zhao
Yana Wei
Zining Zhu
...
Jianjian Sun
Zheng Ge
Xinsong Zhang
Jingyu Wang
Wenbing Tao
69
4
0
17 Feb 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang
Yihan Hu
Zemin Du
Dizhan Xue
Shengsheng Qian
Jiahong Wu
Fan Yang
W. Dong
Changsheng Xu
47
5
0
15 Feb 2025
CoS: Chain-of-Shot Prompting for Long Video Understanding
Jian Hu
Zixu Cheng
Chenyang Si
Wei Li
Shaogang Gong
57
4
0
10 Feb 2025
Deciphering Functions of Neurons in Vision-Language Models
Jiaqi Xu
Cuiling Lan
Xuejin Chen
Yan Lu
VLM
107
0
0
10 Feb 2025
Survey on AI-Generated Media Detection: From Non-MLLM to MLLM
Yueying Zou
Peipei Li
Zekun Li
Huaibo Huang
Xing Cui
Xuannan Liu
Chenghanyu Zhang
Ran He
DeLMO
132
3
0
07 Feb 2025
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin
Emily Ruoyu Liu
Joyce Chuyi Chen
Ines Dormoy
Jinyoung Kim
Samar Khanna
Zhuo Zheng
Stefano Ermon
MLLM
VLM
60
6
0
28 Jan 2025
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
Jiaxing Zhao
Q. Yang
Yixing Peng
Detao Bai
Shimin Yao
...
Xiang Chen
Shenghao Fu
Weixuan chen
Xihan Wei
Liefeng Bo
VGen
AuLLM
63
5
0
28 Jan 2025
Baichuan-Omni-1.5 Technical Report
Yadong Li
Jiaheng Liu
Tao Zhang
Tao Zhang
Tian Jin
...
Jianhua Xu
Haoze Sun
Mingan Lin
Zenan Zhou
Xin Wu
AuLLM
83
14
0
28 Jan 2025
ENTER: Event Based Interpretable Reasoning for VideoQA
Hammad A. Ayyubi
Junzhang Liu
Ali Asgarov
Zaber Ibn Abdul Hakim
Najibul Haque Sarker
...
Md. Atabuzzaman
Xudong Lin
Naveen Reddy Dyava
Shih-Fu Chang
Chris Thomas
NAI
170
2
0
24 Jan 2025
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan
Seowung Leem
Kyle B. See
Joshua K. Wong
Shaoting Zhang
R. Fang
AI4CE
LM&MA
VLM
105
20
0
17 Jan 2025
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Sitong Gong
Yunzhi Zhuge
Lu Zhang
Zheng Yang
Pingping Zhang
Huchuan Lu
46
0
0
15 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
165
2
0
14 Jan 2025
TimeLogic: A Temporal Logic Benchmark for Video QA
S. Swetha
Hilde Kuehne
Mubarak Shah
52
1
0
13 Jan 2025
MS-Temba : Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha
Monish Soundar Raj
Pu Wang
Ahmed Helmy
Srijan Das
Mamba
66
3
0
10 Jan 2025
AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning
Muhammad Awais
Ali Husain Salem Abdulla Alharthi
Amandeep Kumar
Hisham Cholakkal
Rao Muhammad Anwer
VLM
65
3
0
10 Jan 2025
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Hao Fei
Shengqiong Wu
Wei Ji
Hao Zhang
Hao Fei
Mong Li Lee
Wynne Hsu
LRM
VGen
63
68
0
08 Jan 2025
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang
Qingkai Fang
Zhe Yang
Yang Feng
MLLM
VLM
79
28
0
07 Jan 2025
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Wenyi Hong
Yean Cheng
Zheng Yang
Weihan Wang
Lefan Wang
Xiaotao Gu
Shiyu Huang
Yuxiao Dong
J. Tang
CoGe
VLM
75
4
0
06 Jan 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
91
12
0
06 Jan 2025
Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Rui Liu
Hongyu Yuan
Hong Li
43
0
0
03 Jan 2025
MLVU: Benchmarking Multi-task Long Video Understanding
Yueze Wang
Yan Shu
Bo Zhao
Boya Wu
Junjie Zhou
...
Xi Yang
Y. Xiong
Bo Zhang
Tiejun Huang
Zheng Liu
VLM
63
33
0
03 Jan 2025
Image-based Multimodal Models as Intruders: Transferable Multimodal Attacks on Video-based MLLMs
Linhao Huang
Xue Jiang
Zhiqiang Wang
Wentao Mo
Xi Xiao
Bo Han
Yongjie Yin
Feng Zheng
AAML
64
2
0
02 Jan 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
88
7
0
02 Jan 2025
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan
Hang Zhang
Wentong Li
Zesen Cheng
Boqiang Zhang
...
Deli Zhao
Wenqiao Zhang
Yueting Zhuang
Jianke Zhu
Lidong Bing
80
5
0
31 Dec 2024
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei
Shengqiong Wu
Hao Zhang
Tat-Seng Chua
Shuicheng Yan
71
39
0
31 Dec 2024
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
Xiao Wang
Qingyi Si
Jianlong Wu
Shiyu Zhu
Zheng Lin
Liqiang Nie
VLM
88
6
0
29 Dec 2024
When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
VLM
48
0
0
26 Dec 2024
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
Se Jin Park
Yeonju Kim
Hyeongseop Rha
Bella Godiva
Y. Ro
48
1
0
23 Dec 2024
VidCtx: Context-aware Video Question Answering with Image Models
Andreas Goulas
Vasileios Mezaris
Ioannis Patras
252
0
0
23 Dec 2024
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei
Yujie Zhong
Haoxian Tan
Yingsen Zeng
Yong Liu
Zheng Zhao
Yujiu Yang
MLLM
VLM
VOS
114
2
0
18 Dec 2024
HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Chen Bao
Jiarui Xu
Xiaolong Wang
Abhinav Gupta
Homanga Bharadhwaj
90
3
0
17 Dec 2024
LLMs are Also Effective Embedding Models: An In-depth Overview
Chongyang Tao
Tao Shen
Shen Gao
Junshuo Zhang
Zhen Li
Zhengwei Tao
Shuai Ma
93
7
0
17 Dec 2024
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Pan Zhang
Xiaoyi Dong
Yuhang Cao
Yuhang Zang
Rui Qian
...
Xinsong Zhang
K. Chen
Yu Qiao
Dahua Lin
Jiaqi Wang
KELM
86
12
0
12 Dec 2024
Previous
1
2
3
4
5
6
...
8
9
10
Next