Task-oriented Robotic Manipulation with Vision Language Models

21 October 2024
Nurhan Bulus Guran
Hanchi Ren
Jingjing Deng
Xianghua Xie
Abstract

Vision Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. Accurately understanding spatial relationships remains a non-trivial challenge, yet one that is essential for effective robotic manipulation. In this work, we introduce a novel framework that integrates VLMs with a structured spatial reasoning pipeline to perform object manipulation based on high-level, task-oriented input. Central to our approach is the transformation of visual scenes into tree-structured representations that encode the spatial relations among objects. These trees are subsequently processed by a Large Language Model (LLM) to infer restructured configurations that determine how the objects should be organized for a given high-level task. To support our framework, we also present a new dataset containing manually annotated captions that describe spatial relations among objects, along with object-level attribute annotations such as fragility, mass, material, and transparency. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.
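The pipeline the abstract describes (visual scene, to a tree of spatial relations, to an LLM-inferred rearrangement) can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration assuming a simple object tree carrying the attribute annotations the dataset provides (fragility, mass, material, transparency); the names SceneNode, serialize, restructure, and query_llm are invented for this sketch and are not the paper's actual implementation.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceneNode:
    # One object in the scene. Children hang off the node via a spatial
    # relation (e.g. "on", "inside", "left of"), forming the scene tree.
    name: str
    fragility: str = "unknown"      # object-level attributes, as annotated
    mass: str = "unknown"           # in the paper's dataset
    material: str = "unknown"
    transparency: str = "unknown"
    relation: Optional[str] = None  # spatial relation to the parent node
    children: List["SceneNode"] = field(default_factory=list)

def serialize(node: SceneNode, depth: int = 0) -> str:
    # Flatten the scene tree into indented text an LLM can reason over.
    rel = f" ({node.relation} parent)" if node.relation else ""
    line = ("  " * depth + f"- {node.name}{rel}: fragility={node.fragility}, "
            f"mass={node.mass}, material={node.material}")
    return "\n".join([line] + [serialize(c, depth + 1) for c in node.children])

def query_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion backend here.
    raise NotImplementedError

def restructure(scene: SceneNode, task: str) -> str:
    # Ask the LLM for a reorganised tree suited to the high-level task.
    prompt = (f"Task: {task}\n"
              f"Current scene (tree of spatial relations):\n{serialize(scene)}\n"
              "Return the same tree, rearranged so the task can be "
              "performed safely.")
    return query_llm(prompt)

# Example scene: a glass cup resting on a wooden table.
table = SceneNode("table", material="wood", mass="heavy")
table.children.append(SceneNode("glass cup", fragility="high",
                                material="glass", transparency="high",
                                relation="on"))
print(serialize(table))

A call such as restructure(table, "clear the table for writing") would then hand the serialized tree to the LLM, which returns the restructured configuration for the robot to execute.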

@article{guran2025_2410.15863,
  title={Task-oriented Robotic Manipulation with Vision Language Models},
  author={Nurhan Bulus Guran and Hanchi Ren and Jingjing Deng and Xianghua Xie},
  journal={arXiv preprint arXiv:2410.15863},
  year={2025}
}