ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2501.01428
  4. Cited By
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
v1v2v3v4 (latest)

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

2 January 2025
Zhangyang Qi
Zhixiong Zhang
Ye Fang
Jiaqi Wang
Hengshuang Zhao
ArXiv (abs)PDFHTML

Papers citing "GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models"

50 / 109 papers shown
Title
LISA: Reasoning Segmentation via Large Language Model
LISA: Reasoning Segmentation via Large Language Model
Xin Lai
Zhuotao Tian
Yukang Chen
Yanwei Li
Yuhui Yuan
Shu Liu
Jiaya Jia
LM&RoVLMMLLMLRM
173
463
0
01 Aug 2023
MovieChat: From Dense Token to Sparse Memory for Long Video
  Understanding
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song
Wenhao Chai
Guanhong Wang
Yucheng Zhang
Haoyang Zhou
...
Tianbo Ye
Yanting Zhang
Yang Lu
Lei Li
Gaoang Wang
VLMMLLM
169
300
0
31 Jul 2023
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding
Zehan Wang
Haifeng Huang
Yang Zhao
Lin Li
Xize Cheng
Yichen Zhu
Aoxiong Yin
Zhou Zhao
3DPC
82
22
0
25 Jul 2023
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu
Haodong Duan
Yuanhan Zhang
Yue Liu
Songyang Zhang
...
Jiaqi Wang
Conghui He
Ziwei Liu
Kai-xiang Chen
Dahua Lin
209
1,060
0
12 Jul 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng
Wenhui Wang
Li Dong
Y. Hao
Shaohan Huang
Shuming Ma
Furu Wei
MLLMObjDVLM
130
765
0
26 Jun 2023
OpenMask3D: Open-Vocabulary 3D Instance Segmentation
OpenMask3D: Open-Vocabulary 3D Instance Segmentation
Ayca Takmaz
Elisabetta Fedele
R. Sumner
Marc Pollefeys
F. Tombari
Francis Engelmann
ISegVLM
97
173
0
23 Jun 2023
Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene
  Scale and Realism Tradeoffs for ObjectGoal Navigation
Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation
Mukul Khanna
Yongsen Mao
Hanxiao Jiang
Sanjay Haresh
Brennan Schacklett
Dhruv Batra
Alexander Clegg
Eric Undersander
Angel X. Chang
Manolis Savva
3DV
123
78
0
20 Jun 2023
Valley: Video Assistant with Large Language model Enhanced abilitY
Valley: Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo
Ziwang Zhao
Min Yang
Junwei Dong
Da Li
Pengcheng Lu
Tao Wang
Linmei Hu
Ming-Hui Qiu
MLLM
150
209
0
12 Jun 2023
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and
  Language Models
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Muhammad Maaz
H. Rasheed
Salman Khan
Fahad Shahbaz Khan
MLLM
164
662
0
08 Jun 2023
SAM3D: Segment Anything in 3D Scenes
SAM3D: Segment Anything in 3D Scenes
Yu-nuo Yang
Xiaoyang Wu
Tongyao He
Hengshuang Zhao
Xihui Liu
3DPC
104
93
0
06 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video
  Understanding
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang
Xin Li
Lidong Bing
MLLM
226
1,068
0
05 Jun 2023
InstructBLIP: Towards General-purpose Vision-Language Models with
  Instruction Tuning
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai
Junnan Li
Dongxu Li
A. M. H. Tiong
Junqi Zhao
Weisheng Wang
Boyang Albert Li
Pascale Fung
Steven C. H. Hoi
MLLMVLM
276
2,105
0
11 May 2023
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large
  Language Models
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu
Jun Chen
Xiaoqian Shen
Xiang Li
Mohamed Elhoseiny
VLMMLLM
177
2,080
0
20 Apr 2023
Visual Instruction Tuning
Visual Instruction Tuning
Haotian Liu
Chunyuan Li
Qingyang Wu
Yong Jae Lee
SyDaVLMMLLM
587
4,950
0
17 Apr 2023
RegionPLC: Regional Point-Language Contrastive Learning for Open-World
  3D Scene Understanding
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
Jihan Yang
Runyu Ding
Weipeng Deng
Zhe Wang
Xiaojuan Qi
146
69
0
03 Apr 2023
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
  GPT and Prototype Guidance
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance
Zoey Guo
Yiwen Tang
Renrui Zhang
Dong Wang
Zhigang Wang
Bin Zhao
Xuelong Li
128
60
0
29 Mar 2023
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D
  Recognition
CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition
Deepti Hegde
Jeya Maria Jose Valanarasu
Vishal M. Patel
CLIP
122
68
0
20 Mar 2023
GPT-4 Technical Report
GPT-4 Technical Report
OpenAI OpenAI
OpenAI Josh Achiam
Steven Adler
Sandhini Agarwal
Lama Ahmad
...
Shengjia Zhao
Tianhao Zheng
Juntang Zhuang
William Zhuk
Barret Zoph
LLMAGMLLM
1.7K
14,870
0
15 Mar 2023
Language Is Not All You Need: Aligning Perception with Language Models
Language Is Not All You Need: Aligning Perception with Language Models
Shaohan Huang
Li Dong
Wenhui Wang
Y. Hao
Saksham Singhal
...
Johan Bjorck
Vishrav Chaudhary
Subhojit Som
Xia Song
Furu Wei
VLMLRMMLLM
146
567
0
27 Feb 2023
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image
  Encoders and Large Language Models
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li
Dongxu Li
Silvio Savarese
Steven C. H. Hoi
VLMMLLM
646
4,679
0
30 Jan 2023
End-to-End 3D Dense Captioning with Vote2Cap-DETR
End-to-End 3D Dense Captioning with Vote2Cap-DETR
Sijin Chen
Erik Cambria
Xin Chen
Yinjie Lei
Tao Chen
YU Gang
ViT
77
60
0
06 Jan 2023
Learning 3D Representations from 2D Pre-trained Models via
  Image-to-Point Masked Autoencoders
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
Renrui Zhang
Liuhui Wang
Yu Qiao
Peng Gao
Hongsheng Li
3DPC
96
137
0
13 Dec 2022
ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved
  Visio-Linguistic Models in 3D Scenes
ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes
Ahmed Abdelreheem
Kyle Olszewski
Hsin-Ying Lee
Peter Wonka
Panos Achlioptas
3DPC
111
28
0
12 Dec 2022
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
Runyu Ding
Jihan Yang
Chuhui Xue
Wenqing Zhang
Song Bai
Xiaojuan Qi
VLM
101
154
0
29 Nov 2022
OpenScene: 3D Scene Understanding with Open Vocabularies
OpenScene: 3D Scene Understanding with Open Vocabularies
Songyou Peng
Kyle Genova
ChiyuMaxJiang
Andrea Tagliasacchi
Marc Pollefeys
Thomas Funkhouser
3DPCVLM
147
370
0
28 Nov 2022
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for
  3D Visual Grounding
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
Eslam Mohamed Bakr
Yasmeen Alsaedy
Mohamed Elhoseiny
3DPC
92
44
0
25 Nov 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Shizhe Chen
Pierre-Louis Guhur
Makarand Tapaswi
Cordelia Schmid
Ivan Laptev
107
88
0
17 Nov 2022
LAION-5B: An open large-scale dataset for training next generation
  image-text models
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann
Romain Beaumont
Richard Vencu
Cade Gordon
Ross Wightman
...
Srivatsa Kundurthy
Katherine Crowson
Ludwig Schmidt
R. Kaczmarczyk
J. Jitsev
VLMMLLMCLIP
240
3,522
0
16 Oct 2022
SQA3D: Situated Question Answering in 3D Scenes
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma
Silong Yong
Zilong Zheng
Qing Li
Yitao Liang
Song-Chun Zhu
Siyuan Huang
LM&Ro
97
160
0
14 Oct 2022
Mask3D: Mask Transformer for 3D Semantic Instance Segmentation
Mask3D: Mask Transformer for 3D Semantic Instance Segmentation
Jonas Schult
Francis Engelmann
Alexander Hermans
Or Litany
Siyu Tang
Bastian Leibe
ISeg
136
182
0
06 Oct 2022
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual
  Grounding
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
Yanmin Wu
Xinhua Cheng
Renrui Zhang
Zesen Cheng
Jian Zhang
151
70
0
29 Sep 2022
Semantic Abstraction: Open-World 3D Scene Understanding from 2D
  Vision-Language Models
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
Huy Ha
Shuran Song
LM&RoVLM
114
106
0
23 Jul 2022
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
Matt Deitke
Eli VanderBilt
Alvaro Herrasti
Luca Weihs
Jordi Salvador
...
Winson Han
Eric Kolve
Ali Farhadi
Aniruddha Kembhavi
Roozbeh Mottaghi
LM&Ro
129
265
0
14 Jun 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLMVLM
431
3,621
0
29 Apr 2022
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive
  Selection
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
Jun-Bin Luo
Jiahui Fu
Xianghao Kong
Chen Gao
Haibing Ren
Hao Shen
Huaxia Xia
Si Liu
95
95
0
13 Apr 2022
Multi-View Transformer for 3D Visual Grounding
Multi-View Transformer for 3D Visual Grounding
Shijia Huang
Yilun Chen
Jiaya Jia
Liwei Wang
115
127
0
05 Apr 2022
SoftGroup for 3D Instance Segmentation on Point Clouds
SoftGroup for 3D Instance Segmentation on Point Clouds
Thang Vu
Kookhoi Kim
Tung M. Luu
Xuan Thanh Nguyen
Chang D. Yoo
3DPC
83
242
0
03 Mar 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified
  Vision-Language Understanding and Generation
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
Guosheng Lin
MLLMBDLVLMCLIP
586
4,444
0
28 Jan 2022
ScanQA: 3D Question Answering for Spatial Scene Understanding
ScanQA: 3D Question Answering for Spatial Scene Understanding
Daich Azuma
Taiki Miyanishi
Shuhei Kurita
M. Kawanabe
115
208
0
20 Dec 2021
Bottom Up Top Down Detection Transformers for Language Grounding in
  Images and Point Clouds
Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds
Ayush Jain
N. Gkanatsios
Ishita Mediratta
Katerina Fragkiadaki
ObjD
145
110
0
16 Dec 2021
3D Question Answering
3D Question Answering
Shuquan Ye
Dongdong Chen
Songfang Han
Jing Liao
ViT
94
49
0
15 Dec 2021
PointCLIP: Point Cloud Understanding by CLIP
PointCLIP: Point Cloud Understanding by CLIP
Renrui Zhang
Ziyu Guo
Wei Zhang
Kunchang Li
Xupeng Miao
Tengjiao Wang
Yu Qiao
Peng Gao
Hongsheng Li
VLM3DPC
271
453
0
04 Dec 2021
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene
  Understanding Using Mobile RGB-D Data
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Gilad Baruch
Zhuoyuan Chen
Afshin Dehghan
Tal Dimry
Yuri Feigin
...
Thomas Gebauer
Brandon Joffe
Daniel Kurz
Arik Schwartz
Elad Shulman
3DV3DPC
126
209
0
17 Nov 2021
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments
  for Embodied AI
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh Kumar Ramakrishnan
Aaron Gokaslan
Erik Wijmans
Oleksandr Maksymets
Alexander Clegg
...
Andrew Westbury
Angel X. Chang
Manolis Savva
Yili Zhao
Dhruv Batra
115
393
0
16 Sep 2021
An End-to-End Transformer Model for 3D Object Detection
An End-to-End Transformer Model for 3D Object Detection
Ishan Misra
Rohit Girdhar
Armand Joulin
3DPCViT
172
489
0
16 Sep 2021
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D
  Visual Grounding
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding
Dailan He
Yusheng Zhao
Junyu Luo
Tianrui Hui
Shaofei Huang
Aixi Zhang
Si Liu
ViT
81
95
0
05 Aug 2021
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
Zhengyuan Yang
Songyang Zhang
Liwei Wang
Jiebo Luo
3DPC
119
126
0
24 May 2021
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding
  on Point Clouds through Instance Multi-level Contextual Referring
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
Zhihao Yuan
Xu Yan
Yinghong Liao
Ruimao Zhang
Sheng Wang
Zhen Li
Shuguang Cui
129
135
0
01 Mar 2021
Learning Transferable Visual Models From Natural Language Supervision
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya A. Ramesh
Gabriel Goh
...
Amanda Askell
Pamela Mishkin
Jack Clark
Gretchen Krueger
Ilya Sutskever
CLIPVLM
1.1K
30,116
0
26 Feb 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize
  Long-Tail Visual Concepts
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
570
1,143
0
17 Feb 2021
Previous
123
Next