
arXiv:2310.19785 · Cited By
What's "up" with vision-language models? Investigating their struggle with spatial reasoning

30 October 2023
Amita Kamath, Jack Hessel, Kai-Wei Chang
LRM, CoGe

Papers citing "What's "up" with vision-language models? Investigating their struggle with spatial reasoning"

26 / 26 papers shown
Diffusion Classifiers Understand Compositionality, but Conditions Apply
Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
DiffM, CoGe · 0 citations · 23 May 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim, A. Piergiovanni, Ganesh Mallya, A. Angelova
CoGe · 0 citations · 04 Apr 2025
AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning
Yangzhe Kong, Daeun Song, Jing Liang, Dinesh Manocha, Ziyu Yao, Xuesu Xiao
LRM · 1 citation · 10 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li, Boyang Li
CoGe · 1 citation · 03 Mar 2025
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
S. Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
MLLM, VLM · 0 citations · 07 Jan 2025
Evaluating Vision-Language Models as Evaluators in Path Planning
Mohamed Aghzal, Xiang Yue, Erion Plaku, Ziyu Yao
LRM · 1 citation · 27 Nov 2024
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu-Chuan Su, Stan Birchfield
15 citations · 25 Nov 2024
Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models through Egocentric Instruction Tuning
Ji Hyeok Jung, Eun Tae Kim, S. Kim, Joo Ho Lee, Bumsoo Kim, Buru Chang
VLM · 2 citations · 24 Nov 2024
Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Antonia Wüst, Tim Nelson Tobiasch, Lukas Helff, Inga Ibs, Wolfgang Stammer, Devendra Singh Dhami, Constantin Rothkopf, Kristian Kersting
CoGe, ReLM, VLM, LRM · 2 citations · 25 Oct 2024
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma
13 citations · 22 Oct 2024
Locality Alignment Improves Vision-Language Models
Ian Covert, Tony Sun, James Zou, Tatsunori Hashimoto
VLM · 6 citations · 14 Oct 2024
MediConfusion: Can you trust your AI radiologist? Probing the reliability of multimodal medical foundation models
Mohammad Shahab Sepehri, Zalan Fabian, Maryam Soltanolkotabi, Mahdi Soltanolkotabi
MedIm · 5 citations · 23 Sep 2024
VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Zhongyu Wei
LRM, MLLM · 16 citations · 27 May 2024
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, ..., Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, R. Kaczmarczyk, J. Jitsev
VLM, MLLM, CLIP · 3,444 citations · 16 Oct 2022
When and why vision-language models behave like bags-of-words, and what to do about it?
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
VLM, CoGe · 392 citations · 04 Oct 2022
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations
Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin
VLM, CoGe, MLLM · 97 citations · 01 Jul 2022
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya A. Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
VLM, DiffM · 6,859 citations · 13 Apr 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross
CoGe · 424 citations · 07 Apr 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li, Dongxu Li, Caiming Xiong, Steven C.H. Hoi
MLLM, BDL, VLM, CLIP · 4,340 citations · 28 Jan 2022
FLAVA: A Foundational Language And Vision Alignment Model
Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
CLIP, VLM · 706 citations · 08 Dec 2021
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
Yan Zeng, Xinsong Zhang, Hang Li
VLM, CLIP · 305 citations · 16 Nov 2021
Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, Luke Zettlemoyer
LRM · 237 citations · 16 Apr 2021
A Corpus for Reasoning About Natural Language Grounded in Photographs
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, Yoav Artzi
LRM · 603 citations · 01 Nov 2018
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson, B. Hariharan, Laurens van der Maaten, Li Fei-Fei, C. L. Zitnick, Ross B. Girshick
CoGe · 2,374 citations · 20 Dec 2016
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, D. Summers-Stay, Dhruv Batra, Devi Parikh
CoGe · 3,235 citations · 02 Dec 2016
Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, G. Corrado, J. Dean
NAI, OCL · 33,521 citations · 16 Oct 2013