
FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation

Liuzhou Zhang
Zeyu Zhang
Biao Wu
Luyao Tang
Zirui Song
Hongyang He
Renda Han
Guangzhen Yao
Huacan Wang
Ronghao Chen
Xiuying Chen
Guan Huang
Zheng Zhu
Main: 6 Pages
4 Figures
Bibliography: 2 Pages
Abstract

Sign language plays a crucial role in bridging communication gaps for the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model built on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA applies trainable sparsity during both training and inference, ensuring consistent attention patterns and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: this https URL.
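The abstract does not include implementation details, but the sliding-tile locality idea behind T-STA can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical illustration, not the paper's code: the class name SlidingTileAttention, the tile and window parameters, and the learnable per-offset gate are all assumptions, with the gate standing in for the "trainable sparsity" that the abstract says is shared between training and inference.

```python
# Hypothetical sketch of sliding-tile attention with a trainable sparsity gate.
# Not the paper's T-STA implementation; structure and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlidingTileAttention(nn.Module):
    """Attention restricted to a sliding window of spatio-temporal tiles."""

    def __init__(self, dim: int, num_heads: int = 8, tile: int = 4, window: int = 1):
        super().__init__()
        self.num_heads = num_heads
        self.tile = tile          # tokens per tile along the flattened sequence
        self.window = window      # how many neighbouring tiles each tile may attend to
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Trainable gate over tile offsets: the model learns, during training, how to
        # weight neighbouring tiles, so the same sparsity pattern is used at inference.
        self.offset_gate = nn.Parameter(torch.zeros(2 * window + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by the tile size.
        b, n, d = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, h, d // h).transpose(1, 2)
        k = k.view(b, n, h, d // h).transpose(1, 2)
        v = v.view(b, n, h, d // h).transpose(1, 2)

        # Tile index of every token decides which query/key pairs may interact.
        tile_idx = torch.arange(n, device=x.device) // self.tile
        offsets = tile_idx[None, :] - tile_idx[:, None]   # (n, n) signed tile offsets
        inside = offsets.abs() <= self.window              # sliding-window locality

        # Additive bias: learned gate inside the window, -inf outside it.
        gate = self.offset_gate[offsets.clamp(-self.window, self.window) + self.window]
        bias = torch.where(inside, gate, torch.full_like(gate, float("-inf")))

        out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage sketch: 16 frames of 8x8 latent tokens flattened into one sequence,
# with one tile per frame so each frame attends only to adjacent frames.
if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 8 * 8, 256)
    attn = SlidingTileAttention(dim=256, num_heads=8, tile=64, window=1)
    print(attn(tokens).shape)  # torch.Size([2, 1024, 256])
```

In this sketch, grouping tokens into tiles and masking out tiles beyond a small window limits each query to a local spatio-temporal neighbourhood, which is the kind of locality the abstract credits for the reduced computational overhead; the per-offset gate illustrates how such sparsity could be made trainable rather than imposed only at inference time.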
