ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.00980
24
152

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

1 June 2023
Yanyu Li
Huan Wang
Qing Jin
Ju Hu
Pavlo Chemerys
Yun Fu
Yanzhi Wang
Sergey Tulyakov
Jian Ren
    VLM
ArXivPDFHTML
Abstract

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 222 seconds. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 888 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.51.51.5 with 505050 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

View on arXiv
Comments on this paper