RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
arXiv:2411.16537 (v4, latest) · 25 November 2024
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu-Chuan Su, Stan Birchfield

Papers citing "RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics"

Showing 50 of 67 citing papers.
A Spatial Relationship Aware Dataset for Robotics
Peng Wang, Minh Huy Pham, Zhihao Guo, Wei Zhou
3DPC · 14 Jun 2025

AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making
Wenbo Li, Shiyi Wang, Yiteng Chen, Huiping Zhuang, Qingyao Wu
14 Jun 2025

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation
J. Huang, Xiaojian Ma, Xiongkun Linghu, Yue Fan, Junchao He, ..., Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
11 Jun 2025

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, L. Yi
LRM, VLM · 03 Jun 2025

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang
LRM · 22 May 2025

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds
Joel Currie, Gioele Migno, Enrico Piacenti, Maria Elena Giannaccini, Patric Bach, Davide De Tommaso, Agnieszka Wykowska
LM&Ro · 20 May 2025

GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, Rose Hendrix
19 May 2025

SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning
Yang Liu, Ming Ma, Xiaomin Yu, Pengxiang Ding, Han Zhao, Mingyang Sun, Siteng Huang, Donglin Wang
LRM · 18 May 2025

From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao
13 May 2025

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Phillip Y. Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, Minhyuk Sung
LRM · 24 Apr 2025

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, ..., Kai Kang, Marcin Eichner, Yue Yang, Afshin Dehghan, Peter Grasch
17 Mar 2025

An Egocentric Vision-Language Model based Portable Real-time Smart Assistant
Yuanmin Huang, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, ..., Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, Limin Wang
06 Mar 2025

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin-Yao Yu, Gholamreza Haffari, Yuan-Fang Li
LRM · 09 Nov 2024

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
Yue Zhang, Zhiyang Xu, Ying Shen, Parisa Kordjamshidi, Lifu Huang
04 Oct 2024

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
VLM, LRM · 01 Oct 2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, ..., Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
OSLM, VLM · 25 Sep 2024

Multi-modal Situated Reasoning in 3D Scenes
Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiaojian Ma, Baoxiong Jia, Siyuan Huang
04 Sep 2024

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
Wenlong Huang, Chen Wang, Yongqian Li, Ruohan Zhang, Li Fei-Fei
03 Sep 2024

Space3D-Bench: Spatial 3D Question Answering Benchmark
E. Szymańska, Mihai Dusmanu, J. Buurlage, Mahdi Rad, Marc Pollefeys
29 Aug 2024

SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya K. Ryali, ..., Nicolas Carion, Chao-Yuan Wu, Ross B. Girshick, Piotr Dollár, Christoph Feichtenhofer
VLM, MLLM · 01 Aug 2024

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon
19 Jun 2024

SpatialBot: Precise Spatial Understanding with Vision Language Models
Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao
VLM · 19 Jun 2024

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, Dieter Fox
LM&Ro · 15 Jun 2024

OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, ..., Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
LM&Ro, VLM · 13 Jun 2024

Situational Awareness Matters in 3D Vision Language Reasoning
Yunze Man, Liang-Yan Gui, Yu-Xiong Wang
11 Jun 2024

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei
09 Jun 2024

Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, ..., Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine
20 May 2024

BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
VLM, LRM, MLLM · 18 Apr 2024

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs
Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
LRM · 11 Apr 2024

CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models
Haoxu Huang, Fanqi Lin, Yingdong Hu, Shengjie Wang, Yang Gao
13 Mar 2024

PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, ..., Karol Hausman, N. Heess, Chelsea Finn, Sergey Levine, Brian Ichter
LM&Ro, LRM · 12 Feb 2024

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia
LRM, ReLM · 22 Jan 2024

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, ..., Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
LM&Ro · 26 Dec 2023

VILA: On Pre-training for Visual Language Models
Ji Lin, Hongxu Yin, Ming-Yu Liu, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
MLLM, VLM · 12 Dec 2023

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, ..., Yibo Liu, Wenhao Huang, Huan Sun, Yu-Chuan Su, Wenhu Chen
OSLM, ELM, VLM · 27 Nov 2023

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, Katsushi Ikeuchi
LM&Ro · 20 Nov 2023

An Embodied Generalist Agent in 3D World
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
LM&Ro · 18 Nov 2023

18 Nov 2023
What's "up" with vision-language models? Investigating their struggle
  with spatial reasoning
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRMCoGe
90
119
0
30 Oct 2023
cuRobo: Parallelized Collision-Free Minimum-Jerk Robot Motion Generation
Balakumar Sundaralingam, S. Hari, Adam Fishman, Caelan Reed Garrett, Karl Van Wyk, ..., Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan D. Ratliff, Dieter Fox
26 Oct 2023

Evaluating Spatial Understanding of Large Language Models
Yutaro Yamada, Yihan Bao, Andrew Kyle Lampinen, Jungo Kasai, Ilker Yildirim
LRM · 23 Oct 2023

Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, ..., Zhuo Xu, Zichen Jeff Cui, Zichen Zhang, Zipeng Fu, Zipeng Lin
LM&Ro · 13 Oct 2023

Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
VLM, MLLM · 05 Oct 2023

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi, Jana Kosecka
VLM · 18 Aug 2023

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, ..., Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich
LM&Ro, LRM · 28 Jul 2023

GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping
Chao Tang, Dehao Huang, Wenqiang Ge, Weiyu Liu, Kuanqi Cai
25 Jul 2023

MMBench: Is Your Multi-modal Model an All-around Player?
Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Yue Liu, Songyang Zhang, ..., Jiaqi Wang, Conghui He, Ziwei Liu, Kai-xiang Chen, Dahua Lin
12 Jul 2023

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, ..., Xiawu Zheng, Ke Li, Xing Sun, Zhenyu Qiu, Rongrong Ji
ELM, MLLM · 23 Jun 2023

MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
Xinyu Gong, S. Mohan, Naina Dhingra, Jean-Charles Bazin, Yilei Li, Zhangyang Wang, Rakesh Ranjan
EgoV · 12 May 2023

Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
SyDa, VLM, MLLM · 17 Apr 2023

SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, Siyuan Huang
LM&Ro · 14 Oct 2022