HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

10 March 2025
Xingzu Zhan
Chen Xie
Haoran Sun
Xiaochun Mai
Abstract

Text-to-motion generation is a rapidly growing field at the intersection of multimodal learning and computer graphics, with flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, and they overlook subtle joint-level details in the spatial dimension. To address this, we propose HiSTF Mamba, a framework composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and a Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba uses parallel "Part-based + Whole-based" modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy to encode both short-term motion details and long-term dependencies. DSFM removes redundancy from the temporal features and extracts their complementary information, then fuses them with the spatial features to yield an expressive spatiotemporal representation. Experiments on the HumanML3D dataset show that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of about 33%. These results support the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.
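The relative FID reduction implied by the two scores reported in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `relative_decrease` and the variable names are assumptions introduced here, not anything from the paper, and the only inputs are the two FID values quoted above.

```python
def relative_decrease(baseline: float, improved: float) -> float:
    """Return the fractional reduction going from baseline to improved."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# FID values as quoted in the abstract (lower is better).
fid_before = 0.283
fid_after = 0.189

reduction = relative_decrease(fid_before, fid_after)
print(f"absolute drop: {fid_before - fid_after:.3f}")  # absolute drop: 0.094
print(f"relative drop: {reduction:.1%}")               # relative drop: 33.2%
```

The absolute drop of 0.094 divided by the baseline of 0.283 gives a relative reduction of roughly one third.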

@article{zhan2025_2503.06897,
  title={HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation},
  author={Xingzu Zhan and Chen Xie and Haoran Sun and Xiaochun Mai},
  journal={arXiv preprint arXiv:2503.06897},
  year={2025}
}