PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

21 April 2023
Yanli Zhao
Andrew Gu
R. Varma
Liangchen Luo
Chien-chin Huang
Min Xu
Less Wright
Hamid Shojanazeri
Myle Ott
Sam Shleifer
Alban Desmaison
Can Balioglu
Pritam Damania
Bernard Nguyen
Geeta Chauhan
Y. Hao
Ajit Mathews
Shen Li
Abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
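The claim that sharding supports significantly larger models than Distributed Data Parallel can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function name `per_rank_state_bytes` is hypothetical, and the default of 16 bytes per parameter is an assumption (a commonly quoted figure for mixed-precision training with Adam). It counts only model state (parameters, gradients, optimizer state) and ignores activations, communication buffers, and the parameters that are temporarily gathered during computation.

```python
def per_rank_state_bytes(
    num_params: int,
    world_size: int,
    bytes_per_param: int = 16,  # assumption: params + grads + Adam state, mixed precision
    sharded: bool = True,
) -> int:
    """Rough bytes of model state each rank holds.

    sharded=False models full replication (as in Distributed Data Parallel);
    sharded=True models an even split of all model state across ranks.
    """
    total = num_params * bytes_per_param
    if not sharded:
        return total
    # Ceiling division: the last shard may be padded up to an even size.
    return -(-total // world_size)


if __name__ == "__main__":
    n, ranks = 1_000_000_000, 8  # hypothetical 1B-parameter model on 8 GPUs
    print("replicated:", per_rank_state_bytes(n, ranks, sharded=False), "bytes per rank")
    print("sharded:   ", per_rank_state_bytes(n, ranks, sharded=True), "bytes per rank")
```

Under these assumptions, per-rank model state shrinks in proportion to the number of ranks, which is why a sharded approach can fit models that full replication cannot. Real memory use will be higher than this estimate.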
