ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.05424
  4. Cited By
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and
  Language Models

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

8 June 2023
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
    MLLM
ArXivPDFHTML

Papers citing "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models"

50 / 462 papers shown
Title
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Yue Fan
Xiaojian Ma
Rujie Wu
Yuntao Du
Jiaqi Li
Zhi Gao
Qing Li
VLM
LLMAG
51
57
0
18 Mar 2024
Towards Neuro-Symbolic Video Understanding
Towards Neuro-Symbolic Video Understanding
Minkyu Choi
Harsh Goel
Mohammad Omama
Yunhao Yang
Sahil Shah
Sandeep Chinchali
53
9
0
16 Mar 2024
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
Yueqian Wang
Xiaojun Meng
Jianxin Liang
Yuxuan Wang
Qun Liu
Dongyan Zhao
36
30
0
15 Mar 2024
Large Language Models and Causal Inference in Collaboration: A Survey
Large Language Models and Causal Inference in Collaboration: A Survey
Xiaoyu Liu
Paiheng Xu
Junda Wu
Jiaxin Yuan
Yifan Yang
...
Haoliang Wang
Tong Yu
Julian McAuley
Wei Ai
Furong Huang
ELM
LRM
80
34
0
14 Mar 2024
Hierarchical Auto-Organizing System for Open-Ended Multi-Agent
  Navigation
Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation
Zhonghan Zhao
Kewei Chen
Dongxu Guo
Wenhao Chai
Tianbo Ye
Yanting Zhang
Gaoang Wang
64
21
0
13 Mar 2024
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
  Acceleration for Large Vision-Language Models
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen
Haozhe Zhao
Tianyu Liu
Shuai Bai
Junyang Lin
Chang Zhou
Baobao Chang
MLLM
VLM
59
122
0
11 Mar 2024
GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual
  Affective Computing
GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing
Hao Lu
Xuesong Niu
Jiyao Wang
Yin Wang
Qingyong Hu
...
Dengbo He
Shuiguang Deng
Hao Chen
Ying-Cong Chen
Shiguang Shan
MLLM
56
12
0
09 Mar 2024
CAT: Enhancing Multimodal Large Language Model to Answer Questions in
  Dynamic Audio-Visual Scenarios
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Qilang Ye
Zitong Yu
Rui Shao
Xinyu Xie
Philip Torr
Xiaochun Cao
MLLM
56
24
0
07 Mar 2024
Embodied Understanding of Driving Scenarios
Embodied Understanding of Driving Scenarios
Yunsong Zhou
Linyan Huang
Qingwen Bu
Jia Zeng
Tianyu Li
Hang Qiu
Hongzi Zhu
Minyi Guo
Yu Qiao
Hongyang Li
LM&Ro
62
31
0
07 Mar 2024
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation
Maksim Kuprashevich
Grigorii Alekseenko
Irina Tolstykh
ELM
66
4
0
04 Mar 2024
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT
Yifang Xu
Yunzhuo Sun
Zien Xie
Benxiang Zhai
Sidan Du
56
6
0
04 Mar 2024
GPTSee: Enhancing Moment Retrieval and Highlight Detection via
  Description-Based Similarity Features
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features
Yunzhuo Sun
Yifang Xu
Zien Xie
Yukun Shu
Sidan Du
38
6
0
03 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
89
180
0
29 Feb 2024
Navigating Hallucinations for Reasoning of Unintentional Activities
Navigating Hallucinations for Reasoning of Unintentional Activities
Shresth Grover
Vibhav Vineet
Yogesh S Rawat
LRM
57
1
0
29 Feb 2024
MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning
  for Multimodal Video Captioning
MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning
Huiyu Xiong
Lanxiao Wang
Heqian Qiu
Taijin Zhao
Benliu Qiu
Hongliang Li
CLL
48
1
0
27 Feb 2024
PALO: A Polyglot Large Multimodal Model for 5B People
PALO: A Polyglot Large Multimodal Model for 5B People
Muhammad Maaz
H. Rasheed
Abdelrahman M. Shaker
Salman Khan
Hisham Cholakal
Rao M. Anwer
Timothy Baldwin
Michael Felsberg
Fahad S. Khan
VLM
LRM
85
13
0
22 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current
  Methodologies and Future Directions
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Aman Chadha
VLM
62
26
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
50
29
0
20 Feb 2024
Slot-VLM: SlowFast Slots for Video-Language Modeling
Slot-VLM: SlowFast Slots for Video-Language Modeling
Jiaqi Xu
Cuiling Lan
Wenxuan Xie
Xuejin Chen
Yan Lu
MLLM
VLM
35
7
0
20 Feb 2024
Momentor: Advancing Video Large Language Model with Fine-Grained
  Temporal Reasoning
Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning
Long Qian
Juncheng Billy Li
Yu-hao Wu
Yaobo Ye
Hao Fei
Tat-Seng Chua
Yueting Zhuang
Siliang Tang
MLLM
LRM
65
47
0
18 Feb 2024
Can Text-to-image Model Assist Multi-modal Learning for Visual
  Recognition with Visual Modality Missing?
Can Text-to-image Model Assist Multi-modal Learning for Visual Recognition with Visual Modality Missing?
Tiantian Feng
Daniel Yang
Digbalay Bose
Shrikanth Narayanan
45
5
0
14 Feb 2024
World Model on Million-Length Video And Language With Blockwise RingAttention
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu
Wilson Yan
Matei A. Zaharia
Pieter Abbeel
VGen
44
64
0
13 Feb 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Chris Liu
Renrui Zhang
Longtian Qiu
Siyuan Huang
Weifeng Lin
...
Hao Shao
Pan Lu
Hongsheng Li
Yu Qiao
Peng Gao
MLLM
130
110
0
08 Feb 2024
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Sentiment-enhanced Graph-based Sarcasm Explanation in Dialogue
Kun Ouyang
Liqiang Jing
Xuemeng Song
Meng Liu
Yupeng Hu
Liqiang Nie
99
3
0
06 Feb 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled
  Visual-Motional Tokenization
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
Yang Jin
Zhicheng Sun
Kun Xu
Kun Xu
Liwei Chen
...
Yuliang Liu
Di Zhang
Yang Song
Kun Gai
Yadong Mu
VGen
55
42
0
05 Feb 2024
User Intent Recognition and Satisfaction with Large Language Models: A
  User Study with ChatGPT
User Intent Recognition and Satisfaction with Large Language Models: A User Study with ChatGPT
Anna Bodonhelyi
Efe Bozkir
Shuo Yang
Enkelejda Kasneci
Gjergji Kasneci
ELM
AI4MH
38
16
0
03 Feb 2024
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal
  Image Generation
Image Anything: Towards Reasoning-coherent and Training-free Multi-modal Image Generation
Yuanhuiyi Lyu
Xueye Zheng
Lin Wang
DiffM
49
11
0
31 Jan 2024
A Survey on Generative AI and LLM for Video Generation, Understanding,
  and Streaming
A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming
Pengyuan Zhou
Lin Wang
Zhi Liu
Yanbin Hao
Pan Hui
Sasu Tarkoma
J. Kangasharju
VGen
51
27
0
30 Jan 2024
Towards A Better Metric for Text-to-Video Generation
Towards A Better Metric for Text-to-Video Generation
Jay Zhangjie Wu
Guian Fang
Haoning Wu
Xintao Wang
Yixiao Ge
...
Rui Zhao
Weisi Lin
Wynne Hsu
Ying Shan
Mike Zheng Shou
VGen
42
34
0
15 Jan 2024
Distilling Vision-Language Models on Millions of Videos
Distilling Vision-Language Models on Millions of Videos
Yue Zhao
Long Zhao
Xingyi Zhou
Jialin Wu
Chun-Te Chu
...
Hartwig Adam
Ting Liu
Boqing Gong
Philipp Krahenbuhl
Liangzhe Yuan
VLM
41
13
0
11 Jan 2024
Detours for Navigating Instructional Videos
Detours for Navigating Instructional Videos
Kumar Ashutosh
Zihui Xue
Tushar Nagarajan
Kristen Grauman
34
6
0
03 Jan 2024
Taking the Next Step with Generative Artificial Intelligence: The
  Transformative Role of Multimodal Large Language Models in Science Education
Taking the Next Step with Generative Artificial Intelligence: The Transformative Role of Multimodal Large Language Models in Science Education
Arne Bewersdorff
Christian Hartmann
Marie Hornberger
Kathrin Seßler
Maria Bannert
Enkelejda Kasneci
Gjergji Kasneci
Xiaoming Zhai
Claudia Nerdel
35
29
0
01 Jan 2024
Video Understanding with Large Language Models: A Survey
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Ping Luo
Jiebo Luo
Chenliang Xu
VLM
70
84
0
29 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
  Language, Audio, and Action
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
40
147
0
28 Dec 2023
Grounding-Prompter: Prompting LLM with Multimodal Information for
  Temporal Sentence Grounding in Long Videos
Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
Houlun Chen
Xin Wang
Hong Chen
Zihan Song
Jia Jia
Wenwu Zhu
LRM
49
10
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A
  Survey
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
44
22
0
27 Dec 2023
LLM4VG: Large Language Models Evaluation for Video Grounding
LLM4VG: Large Language Models Evaluation for Video Grounding
Wei Feng
Xin Wang
Hong Chen
Zeyang Zhang
Zihan Song
Yuwei Zhou
Wenwu Zhu
41
8
0
21 Dec 2023
VQA4CIR: Boosting Composed Image Retrieval with Visual Question
  Answering
VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering
Chun-Mei Feng
Yang Bai
Yaoyu Zhang
Zhen Li
Salman Khan
Wangmeng Zuo
Xinxing Xu
Rick Siow Mong Goh
Yong-Jin Liu
42
5
0
19 Dec 2023
Assessing GPT4-V on Structured Reasoning Tasks
Assessing GPT4-V on Structured Reasoning Tasks
Mukul Singh
J. Cambronero
Sumit Gulwani
Vu Le
Gust Verbruggen
LRM
43
10
0
13 Dec 2023
VILA: On Pre-training for Visual Language Models
VILA: On Pre-training for Visual Language Models
Ji Lin
Hongxu Yin
Ming-Yu Liu
Yao Lu
Pavlo Molchanov
Andrew Tao
Huizi Mao
Jan Kautz
Mohammad Shoeybi
Song Han
MLLM
VLM
43
360
0
12 Dec 2023
Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
Vista-LLaMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens
Fan Ma
Xiaojie Jin
Heng Wang
Yuchen Xian
Jiashi Feng
Yi Yang
29
47
0
12 Dec 2023
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for
  Human-Level Planning
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
Yi Chen
Yuying Ge
Yixiao Ge
Mingyu Ding
Bohao Li
Rui Wang
Rui-Lan Xu
Ying Shan
Xihui Liu
LLMAG
ELM
LRM
32
10
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
27
38
0
11 Dec 2023
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie
  Understanding
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding
Hongjie Zhang
Yi Liu
Lu Dong
Yifei Huang
Z. Ling
Yali Wang
Limin Wang
Yu Qiao
25
25
0
08 Dec 2023
Foundation Models for Weather and Climate Data Understanding: A
  Comprehensive Survey
Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey
Shengchao Chen
Guodong Long
Jing Jiang
Dikai Liu
Chengqi Zhang
SyDa
AI4CE
54
24
0
05 Dec 2023
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang
Ruiyi Zhang
Haoliang Wang
Uttaran Bhattacharya
Yun Fu
Gang Wu
MLLM
40
10
0
04 Dec 2023
TimeChat: A Time-sensitive Multimodal Large Language Model for Long
  Video Understanding
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Shuhuai Ren
Linli Yao
Shicheng Li
Xu Sun
Lu Hou
VLM
MLLM
31
175
0
04 Dec 2023
Zero-Shot Video Question Answering with Procedural Programs
Zero-Shot Video Question Answering with Procedural Programs
Rohan Choudhury
Koichiro Niinuma
Kris M. Kitani
László A. Jeni
27
23
0
01 Dec 2023
Dolphins: Multimodal Language Model for Driving
Dolphins: Multimodal Language Model for Driving
Yingzi Ma
Yulong Cao
Jiachen Sun
Marco Pavone
Chaowei Xiao
MLLM
43
51
0
01 Dec 2023
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware
  representations to LLMs and Emergent Cross-modal Reasoning
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Artemis Panagopoulou
Le Xue
Ning Yu
Junnan Li
Dongxu Li
Shafiq Joty
Ran Xu
Silvio Savarese
Caiming Xiong
Juan Carlos Niebles
VLM
MLLM
41
46
0
30 Nov 2023
Previous
123...10789
Next