ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.05424
  4. Cited By
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and
  Language Models

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

8 June 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
    MLLM
ArXivPDFHTML

Papers citing "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models"

50 / 462 papers shown
Title
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding
Junwen Pan
Rui Zhang
Xin Wan
Yuan Zhang
Ming Lu
Qi She
VLM
46
1
0
02 Apr 2025
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury
Hanan Gani
Nishit Anand
Sayan Nag
Ruohan Gao
Mohamed Elhoseiny
Salman Khan
Dinesh Manocha
LRM
59
0
0
29 Mar 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Yunhong Wang
Yansen Wang
Bo Chen
Tong Wu
Dongyan Zhao
Zilong Zheng
VLM
MLLM
73
1
0
29 Mar 2025
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
Erika Mori
Yue Qiu
Hirokatsu Kataoka
Y. Aoki
55
0
0
27 Mar 2025
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen
Bohan Liu
Chenjia Li
Lalithkumar Seenivasan
Mathias Unberath
VOS
83
2
0
27 Mar 2025
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model
Abdelrahman M. Shaker
Muhammad Maaz
Chenhui Gou
Hamid Rezatofighi
Salman Khan
Fahad Shahbaz Khan
225
0
0
27 Mar 2025
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding
Shuming Liu
Chen Zhao
Tianqi Xu
Bernard Ghanem
VLM
78
0
0
27 Mar 2025
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations
Haitong Liu
Kuofeng Gao
Yang Bai
Jinmin Li
Jinxiao Shan
Tao Dai
Shu-Tao Xia
AAML
67
1
0
26 Mar 2025
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
Yucheng Suo
Fan Ma
Linchao Zhu
T. Wang
Fengyun Rao
Yi Yang
LRM
79
0
0
26 Mar 2025
PAVE: Patching and Adapting Video Large Language Models
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu
Yiquan Li
Khoi Duc Nguyen
Yiwu Zhong
Yin Li
KELM
LRM
89
0
0
25 Mar 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Hongcheng Gao
Jiashu Qu
Jingyi Tang
Baolong Bi
Yi Liu
Hongyu Chen
Li Liang
Li Su
Qingming Huang
MLLM
VLM
LRM
88
5
0
25 Mar 2025
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models
Dohwan Ko
S. Kim
Yumin Suh
Vijay Kumar B.G
Minseo Yoon
Manmohan Chandraker
Hyunwoo J. Kim
LRM
44
0
0
25 Mar 2025
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs
Carlos Plou
Cesar Borja
Ruben Martinez-Cantin
Ana C. Murillo
63
0
0
25 Mar 2025
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
Mingze Xu
Mingfei Gao
Shiyu Li
Jiasen Lu
Zhe Gan
Zhengfeng Lai
Meng Cao
Kai Kang
Yue Yang
Afshin Dehghan
66
2
0
24 Mar 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova
Arsha Nagrani
Bernt Schiele
Hilde Kuehne
Christian Rupprecht
53
0
0
24 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
80
1
0
24 Mar 2025
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model
Cheng Yang
Yang Sui
Jinqi Xiao
Lingyi Huang
Yu Gong
...
Jinghua Yan
Y. Bai
P. Sadayappan
Xia Hu
Bo Yuan
VLM
64
0
0
24 Mar 2025
LLaVAction: evaluating and training multi-modal large language models for action recognition
LLaVAction: evaluating and training multi-modal large language models for action recognition
Shaokai Ye
Haozhe Qi
Alexander Mathis
Mackenzie W. Mathis
72
1
0
24 Mar 2025
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
Xiangrui Liu
Yan Shu
Zhengyang Liang
Ao Li
Yang Tian
Bo Zhao
VGen
VLM
100
1
0
24 Mar 2025
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
CountLLM: Towards Generalizable Repetitive Action Counting via Large Language Model
Ziyu Yao
Xuxin Cheng
Zhiqi Huang
Lei Li
62
0
0
22 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
50
0
0
22 Mar 2025
PVChat: Personalized Video Chat with One-Shot Learning
PVChat: Personalized Video Chat with One-Shot Learning
Yufei Shi
Weilong Yan
Gang Xu
Yumeng Li
Yong Li
Zechao Li
Fei Richard Yu
Ming Li
Si Yong Yeo
43
0
0
21 Mar 2025
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
Keda Tao
Haoxuan You
Yang Sui
Can Qin
Haoyu Wang
VLM
MQ
91
0
0
20 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
Hairu Wang
Kai Hu
Liangcai Gao
200
0
0
20 Mar 2025
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations
Kyungho Bae
Jinhyung Kim
Sihaeng Lee
Soonyoung Lee
G. Lee
Jinwoo Choi
66
1
0
20 Mar 2025
Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
Thanh-Son Nguyen
Hong Yang
Tzeh Yuan Neoh
Hao Zhang
Ee Yeo Keat
Basura Fernando
NAI
64
0
0
19 Mar 2025
Growing a Twig to Accelerate Large Vision-Language Models
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao
Mingyang Wang
Zhou Yu
Wenwen Pan
Yan Yang
Tao Wei
Han Zhang
Ning Mao
Wei Chen
Jun Yu
VLM
67
1
0
18 Mar 2025
Quantum EigenGame for excited state calculation
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
53
1
0
17 Mar 2025
Efficient Motion-Aware Video MLLM
Efficient Motion-Aware Video MLLM
Zijia Zhao
Yuqi Huo
Tongtian Yue
Longteng Guo
Haoyu Lu
Binghui Wang
Xin Wu
Qingbin Liu
65
0
0
17 Mar 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Yong-Jin Liu
Kevin Qinghong Lin
C. Chen
Mike Zheng Shou
LM&Ro
LRM
162
0
0
17 Mar 2025
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Haoqi Yuan
Yu Bai
Yuhui Fu
Bohan Zhou
Yicheng Feng
Xinrun Xu
Yi Zhan
Börje F. Karlsson
Zongqing Lu
LM&Ro
90
0
0
16 Mar 2025
VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
Yunze Liu
Peiran Wu
C. Liang
Junxiao Shen
Limin Wang
Li Yi
Mamba
59
0
0
16 Mar 2025
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
Peiran Wu
Yunze Liu
Chonghan Liu
Miao Liu
VGen
LRM
61
2
0
16 Mar 2025
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu
Longxiang Tang
Bohao Peng
Senqiao Yang
Bei Yu
Jiaya Jia
VLM
251
0
0
16 Mar 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yansen Wang
Shengqiong Wu
Yuyao Zhang
William Yang Wang
Ziwei Liu
Jiebo Luo
Hao Fei
LRM
95
11
0
16 Mar 2025
FastVID: Dynamic Density Pruning for Fast Video Large Language Models
Leqi Shen
Guoqiang Gong
Tao He
Yifeng Zhang
Pengzhang Liu
Sicheng Zhao
Guiguang Ding
VLM
74
0
0
14 Mar 2025
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity
Jing Bi
Junjia Guo
Susan Liang
Guangyu Sun
Luchuan Song
...
Jinxi He
Jiarui Wu
A. Vosoughi
Chong Chen
Chenliang Xu
LRM
76
3
0
14 Mar 2025
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren
Wentao Ma
Huan Yang
Cong Wei
Ge Zhang
Wenhu Chen
Mamba
65
4
0
14 Mar 2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
Long-Video Audio Synthesis with Multi-Agent Collaboration
Yehang Zhang
Xinli Xu
Xiaojie Xu
L. Liu
Yuxiao Chen
DiffM
VGen
53
0
0
13 Mar 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu
Jingwei Sun
Yueqian Lin
Jingyang Zhang
Ming Yin
Qinsi Wang
Jingyang Zhang
Yiming Li
Yiran Chen
VLM
79
2
0
13 Mar 2025
Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption
Luozheng Qin
Zhiyu Tan
Mengping Yang
Xiaomeng Yang
Hao Li
90
0
0
12 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
63
0
0
12 Mar 2025
DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski
Teodora Popordanoska
Matthew B. Blaschko
Tinne Tuytelaars
63
0
0
12 Mar 2025
Memory-enhanced Retrieval Augmentation for Long Video Understanding
Huaying Yuan
Zhengyang Liang
Minhao Qin
Hongjin Qian
Yan Shu
Zhicheng Dou
Zhicheng Dou
VOS
RALM
VLM
89
1
0
12 Mar 2025
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
Md. Mohaiminul Islam
Tushar Nagarajan
Huiyu Wang
Gedas Bertasius
Lorenzo Torresani
234
1
0
12 Mar 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin Qinghong Lin
Mike Zheng Shou
VGen
237
1
0
12 Mar 2025
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers
Ruanjun Li
Yuedong Tan
Yuanming Shi
Jiawei Shao
VLM
210
0
0
12 Mar 2025
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
Haoyu Zhang
Qiaohui Chu
Meng Liu
Yunxiao Wang
Bin Wen
Fan Yang
Tingting Gao
Di Zhang
Yaowei Wang
Liqiang Nie
EgoV
75
0
0
12 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Yogesh S Rawat
VLM
214
1
0
11 Mar 2025
RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding
Xichen Tan
Yunfan Ye
Yuanjing Luo
Qian Wan
Fang Liu
Zhiping Cai
VLM
72
1
0
11 Mar 2025
Previous
12345...8910
Next