TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans

Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (GeForce RTX 3060). Extensive subjective and objective evaluations on the ZEGGS and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at this https URL.
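To illustrate the gesture-matching step described in the abstract, the sketch below shows one plausible way to retrieve atomic gesture clips from a precomputed library using nearest-neighbour search over embeddings. This is a minimal illustration under assumed interfaces, not the authors' implementation; the names AtomicActionLibrary, clip_embeddings, and speech_embedding are hypothetical.

    # Hypothetical sketch of embedding-based gesture retrieval (not the TRiMM codebase).
    import numpy as np

    class AtomicActionLibrary:
        """Stores fixed-size embeddings of atomic gesture clips and retrieves
        the closest clips for a speech embedding by cosine similarity."""

        def __init__(self, clip_ids, clip_embeddings):
            # clip_embeddings: (N, D) array, one row per atomic gesture clip
            self.clip_ids = list(clip_ids)
            emb = np.asarray(clip_embeddings, dtype=np.float32)
            # Pre-normalise rows so retrieval reduces to one matrix-vector product
            self.emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

        def match(self, speech_embedding, top_k=1):
            # speech_embedding: (D,) vector, e.g. from a cross-modal speech encoder
            q = np.asarray(speech_embedding, dtype=np.float32)
            q = q / np.linalg.norm(q)
            scores = self.emb @ q                     # cosine similarity to every clip
            best = np.argsort(scores)[::-1][:top_k]   # indices of the top-k clips
            return [(self.clip_ids[i], float(scores[i])) for i in best]

    # Usage: match one sliding-window speech embedding to its closest gesture clips.
    library = AtomicActionLibrary(["wave", "nod", "point"], np.random.rand(3, 64))
    print(library.match(np.random.rand(64), top_k=2))

In a real-time setting, the same lookup would run once per sliding window of the autoregressive model, keeping per-query cost linear in the library size.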
@article{guo2025_2506.01077,
  title   = {TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans},
  author  = {Yueqian Guo and Tianzhao Li and Xin Lyu and Jiehaolin Chen and Zhaohan Wang and Sirui Xiao and Yurun Chen and Yezi He and Helin Li and Fan Zhang},
  journal = {arXiv preprint arXiv:2506.01077},
  year    = {2025}
}