TongSIM: A General Platform for Simulating Intelligent Machines

Zhe Sun
Kunlun Wu
Chuanjian Fu
Zeming Song
Langyong Shi
Zihe Xue
Bohan Jing
Ying Yang
Xiaomeng Gao
Aijia Li
Tianyu Guo
Huiying Li
Xueyuan Yang
Rongkai Liu
Xinyi He
Yuxi Wang
Yue Li
Mingyuan Liu
Yujie Lu
Hongzhao Xie
Shiyun Zhao
Bo Dai
Wei Wang
Tao Yuan
Song-Chun Zhu
Yujia Peng
Zhenliang Zhang
Main: 19 pages, 14 figures, 8 tables; Appendix: 7 pages
Abstract

As artificial intelligence (AI) rapidly advances, especially in multimodal large language models (MLLMs), research focus is shifting from single-modality text processing to the more complex domains of multimodal and embodied AI. Embodied intelligence focuses on training agents within realistic simulated environments, leveraging physical interaction and action feedback rather than conventional labeled datasets. Yet most existing simulation platforms remain narrowly designed, each tailored to specific tasks. A versatile, general-purpose training environment that can support everything from low-level embodied navigation to high-level composite activities, such as multi-agent social simulation and human-AI collaboration, remains largely unavailable. To bridge this gap, we introduce TongSIM, a high-fidelity, general-purpose platform for training and evaluating embodied agents. TongSIM provides over 100 diverse, multi-room indoor scenarios as well as an open-ended, interaction-rich outdoor town simulation, ensuring broad applicability across research needs. Its comprehensive evaluation framework and benchmarks enable precise assessment of agent capabilities, including perception, cognition, decision-making, human-robot cooperation, and spatial and social reasoning. With features such as customized scenes, task-adaptive fidelity, diverse agent types, and dynamic environmental simulation, TongSIM delivers flexibility and scalability for researchers, serving as a unified platform that accelerates training, evaluation, and progress toward general embodied intelligence.
