EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model

11 April 2025
Renda Li
Xiaohua Qi
Qiang Ling
Jun Yu
Ziyi Chen
Peng Chang
Mei Han
Jing Xiao
Abstract

Audio-driven co-speech video generation typically involves two stages: speech-to-gesture and gesture-to-video. While significant advances have been made in speech-to-gesture generation, synthesizing natural expressions and gestures remains challenging in gesture-to-video systems. To improve generation quality, previous works adopted complex inputs and training strategies and required large datasets for pre-training, which limits their practicality. We propose a simple one-stage training method and a temporal inference method based on a diffusion model to synthesize realistic and continuous gesture videos without additional training of temporal modules. The entire model makes use of existing pre-trained weights, and only a few thousand frames of data per character are needed to complete fine-tuning. Built upon the video generator, we introduce a new audio-to-video pipeline to synthesize co-speech videos, using the 2D human skeleton as the intermediate motion representation. Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
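
The abstract describes a data flow of audio features mapped to a sequence of 2D skeleton poses, which then condition a diffusion-based video generator fine-tuned on a few thousand frames of a single character. The sketch below is only a minimal illustration of that flow under these assumptions; every function, class, and variable name is a hypothetical placeholder, not the authors' code or API.

# Hypothetical sketch of the two-stage co-speech pipeline described in the
# abstract: audio -> 2D skeleton motion -> diffusion-based video frames.
# All names here are illustrative placeholders, not the authors' actual code.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SkeletonPose:
    """2D human skeleton used as the intermediate motion representation."""
    joints: List[Tuple[float, float]]  # (x, y) coordinates per keypoint


def speech_to_gesture(audio_features: List[float]) -> List[SkeletonPose]:
    """Stage 1 (assumed): map audio features to a sequence of 2D skeleton poses,
    one pose per video frame."""
    # Placeholder: a real model would predict keypoints from the audio.
    return [SkeletonPose(joints=[(0.0, 0.0)]) for _ in audio_features]


def gesture_to_video(poses: List[SkeletonPose]) -> List[object]:
    """Stage 2 (assumed): a diffusion-based generator, fine-tuned on a few
    thousand frames of one character, renders frames conditioned on the
    skeleton sequence; temporal consistency comes from the inference
    procedure rather than a separately trained temporal module."""
    return [object() for _ in poses]  # placeholder frames


if __name__ == "__main__":
    audio = [0.1, 0.2, 0.3]  # stand-in for extracted audio features
    poses = speech_to_gesture(audio)
    frames = gesture_to_video(poses)
    print(f"Generated {len(frames)} frames from {len(poses)} poses")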

View on arXiv
@article{li2025_2504.08344,
  title={EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model},
  author={Renda Li and Xiaohua Qi and Qiang Ling and Jun Yu and Ziyi Chen and Peng Chang and Mei Han and Jing Xiao},
  journal={arXiv preprint arXiv:2504.08344},
  year={2025}
}