CapsDT: Diffusion-Transformer for Capsule Robot Manipulation

19 June 2025
Xiting He, Mingwu Su, Xinqi Jiang, Long Bai, Jiewen Lai, Hongliang Ren
Main: 6 pages, 3 figures, bibliography: 1 page
Abstract

Vision-Language-Action (VLA) models have emerged as a prominent research area, showing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly for endoscopy capsule robots that act within the digestive system, remains unexplored. Integrating VLA models into endoscopy robots enables more intuitive and efficient interaction between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT infers the corresponding robotic control signals to carry out endoscopy tasks. In addition, we develop a capsule endoscopy robot system, in which a capsule robot is controlled by a magnet held by a robotic arm, address four endoscopy tasks of varying difficulty, and create corresponding capsule robot datasets in a stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance across the different task difficulty levels and a 26.25% success rate in real-world manipulation on the simulator.
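
The abstract describes the architecture only at a high level: a diffusion transformer that conditions on interleaved visual and textual inputs and denoises robot control signals. The sketch below is a minimal, hypothetical rendering of such a DiT-style action-denoising policy, assuming precomputed vision and text embeddings and a standard epsilon-prediction objective; none of the class, parameter, or method names come from the paper.

import torch
import torch.nn as nn

class CapsDTSketch(nn.Module):
    """Hypothetical DiT-style action-denoising policy (names not from the paper)."""

    def __init__(self, action_dim=7, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, d_model)      # embed the noisy action chunk
        self.time_embed = nn.Sequential(                       # embed the diffusion timestep
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)  # cross-attends to conditioning
        self.head = nn.Linear(d_model, action_dim)             # predicts the added noise

    def forward(self, noisy_actions, t, vis_tokens, text_tokens):
        # noisy_actions: (B, horizon, action_dim); t: (B, 1) normalized diffusion step
        # vis_tokens: (B, N_v, d_model), text_tokens: (B, N_t, d_model), precomputed
        context = torch.cat([vis_tokens, text_tokens], dim=1)  # interleaved vision/text context
        x = self.action_proj(noisy_actions) + self.time_embed(t).unsqueeze(1)
        x = self.decoder(tgt=x, memory=context)
        return self.head(x)                                    # epsilon prediction

# One denoising step on dummy inputs.
model = CapsDTSketch()
eps = model(torch.randn(2, 16, 7), torch.rand(2, 1),
            torch.randn(2, 64, 256), torch.randn(2, 8, 256))
print(eps.shape)  # torch.Size([2, 16, 7])

At inference time, a standard DDPM/DDIM sampling loop would start from Gaussian noise over an action chunk and iteratively call such a network to denoise it into executable control commands for the magnet-holding arm.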

@article{he2025_2506.16263,
  title={CapsDT: Diffusion-Transformer for Capsule Robot Manipulation},
  author={Xiting He and Mingwu Su and Xinqi Jiang and Long Bai and Jiewen Lai and Hongliang Ren},
  journal={arXiv preprint arXiv:2506.16263},
  year={2025}
}