
FlashSign: Pose-Free Guidance for Efficient Sign Language Video Generation

Liuzhou Zhang
Zeyu Zhang
Biao Wu
Luyao Tang
Zirui Song
Hongyang He
Renda Han
Guangzhen Yao
Huacan Wang
Ronghao Chen
Xiuying Chen
Guan Huang
Zheng Zhu
Main: 6 Pages
4 Figures
Bibliography: 2 Pages
Abstract

Sign language plays a crucial role in bridging communication gaps for the deaf and hard-of-hearing communities. However, existing sign language video generation models often rely on complex intermediate representations, which limits their flexibility and efficiency. In this work, we propose a novel pose-free framework for real-time sign language video generation. Our method eliminates the need for intermediate pose representations by directly mapping natural language text to sign language videos using a diffusion-based approach. We introduce two key innovations: (1) a pose-free generative model built on a state-of-the-art diffusion backbone, which learns implicit text-to-gesture alignments without pose estimation, and (2) a Trainable Sliding Tile Attention (T-STA) mechanism that accelerates inference by exploiting spatio-temporal locality patterns. Unlike previous training-free sparsity approaches, T-STA applies trainable sparsity during both training and inference, ensuring consistent attention patterns and eliminating the train-test gap. This approach significantly reduces computational overhead while maintaining high generation quality, making real-time deployment feasible. Our method increases video generation speed by 3.07x without compromising video quality. Our contributions open new avenues for real-time, high-quality, pose-free sign language synthesis, with potential applications in inclusive communication tools for diverse communities. Code: this https URL.
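The abstract does not include implementation details, but the sliding-tile locality idea behind T-STA can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical illustration, not the paper's code: the class name SlidingTileAttention, the tile and window parameters, and the learnable per-offset gate are all assumptions, with the gate standing in for the "trainable sparsity" that the abstract says is shared between training and inference.

```python
# Hypothetical sketch of sliding-tile attention with a trainable sparsity gate.
# Not the paper's T-STA implementation; structure and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlidingTileAttention(nn.Module):
    """Attention restricted to a sliding window of spatio-temporal tiles."""

    def __init__(self, dim: int, num_heads: int = 8, tile: int = 4, window: int = 1):
        super().__init__()
        self.num_heads = num_heads
        self.tile = tile          # tokens per tile along the flattened sequence
        self.window = window      # how many neighbouring tiles each tile may attend to
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Trainable gate over tile offsets: the model learns, during training, how to
        # weight neighbouring tiles, so the same sparsity pattern is used at inference.
        self.offset_gate = nn.Parameter(torch.zeros(2 * window + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len is assumed divisible by the tile size.
        b, n, d = x.shape
        h = self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, h, d // h).transpose(1, 2)
        k = k.view(b, n, h, d // h).transpose(1, 2)
        v = v.view(b, n, h, d // h).transpose(1, 2)

        # Tile index of every token decides which query/key pairs may interact.
        tile_idx = torch.arange(n, device=x.device) // self.tile
        offsets = tile_idx[None, :] - tile_idx[:, None]   # (n, n) signed tile offsets
        inside = offsets.abs() <= self.window              # sliding-window locality

        # Additive bias: learned gate inside the window, -inf outside it.
        gate = self.offset_gate[offsets.clamp(-self.window, self.window) + self.window]
        bias = torch.where(inside, gate, torch.full_like(gate, float("-inf")))

        out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


# Usage sketch: 16 frames of 8x8 latent tokens flattened into one sequence,
# with one tile per frame so each frame attends only to adjacent frames.
if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 8 * 8, 256)
    attn = SlidingTileAttention(dim=256, num_heads=8, tile=64, window=1)
    print(attn(tokens).shape)  # torch.Size([2, 1024, 256])
```

In this sketch, grouping tokens into tiles and masking out tiles beyond a small window limits each query to a local spatio-temporal neighbourhood, which is the kind of locality the abstract credits for the reduced computational overhead; the per-offset gate illustrates how such sparsity could be made trainable rather than imposed only at inference time.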
