Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning
This paper addresses Visual Forced Alignment (VFA), the task of accurately synchronizing utterances with the corresponding lip movements without relying on audio cues. We propose a novel VFA approach that integrates a local context-aware feature extractor and employs multi-task learning to jointly refine global and local context features, enhancing sensitivity to subtle lip movements for precise word-level and phoneme-level alignment. Incorporating an improved Viterbi algorithm for post-processing, our method significantly reduces misalignments. Experimental results show that our approach outperforms existing methods, achieving a 6% accuracy improvement at the word level and a 27% improvement at the phoneme level on the LRS2 dataset. These improvements open new possibilities for applications such as automatic subtitling of TV shows or of user-generated content on platforms like TikTok and YouTube Shorts.
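To make the post-processing step concrete, the sketch below shows the standard Viterbi dynamic program for forced alignment: given per-frame phoneme log-probabilities (e.g. from a visual front-end over lip-region frames) and the known phoneme sequence, it finds the monotonic frame-to-phoneme assignment with maximum total score. This is only an illustration of the baseline algorithm that such post-processing builds on, not the paper's improved Viterbi variant; the function name viterbi_forced_align, the array shapes, and the toy data are hypothetical.

# Minimal, illustrative sketch of standard Viterbi forced alignment.
# NOT the paper's improved post-processing; names and shapes are hypothetical.
import numpy as np

def viterbi_forced_align(log_probs: np.ndarray, target: list[int]) -> list[int]:
    """Align T frames to a known phoneme sequence monotonically.

    log_probs: (T, C) per-frame log-probabilities over C phoneme classes.
    target:    ordered phoneme indices the utterance is known to contain.
    Returns a list of length T giving the target position of each frame.
    """
    T, _ = log_probs.shape
    S = len(target)
    NEG = -np.inf

    # dp[t, s] = best score of aligning frames 0..t with frame t on target position s.
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, target[0]]

    for t in range(1, T):
        for s in range(S):
            stay = dp[t - 1, s]                        # stay on the same phoneme
            move = dp[t - 1, s - 1] if s > 0 else NEG  # advance to the next phoneme
            if stay >= move:
                dp[t, s], back[t, s] = stay, s
            else:
                dp[t, s], back[t, s] = move, s - 1
            dp[t, s] += log_probs[t, target[s]]

    # Backtrace from the final frame, which must land on the last phoneme.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: 6 frames, 3 target phonemes; boundaries fall out of the dynamic program.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4))
posteriors = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
print(viterbi_forced_align(posteriors, target=[2, 0, 3]))

The paper's contribution lies in how the frame-level scores are produced (local context-aware features refined by multi-task learning) and in the improved transition handling of the Viterbi pass; the dynamic-programming skeleton above stays the same.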
@article{he2025_2503.03286,
  title={Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning},
  author={Yi He and Lei Yang and Shilin Wang},
  journal={arXiv preprint arXiv:2503.03286},
  year={2025}
}