ResearchTrend.AI
Optimizing Large Model Training through Overlapped Activation Recomputation

13 June 2024
Ping Chen
Wenjie Zhang
Shuibing He
Weijian Chen
Siling Yang
Kexin Huang
Yanlong Yin
Xuan Zhan
Yingjie Gu
Zhuwei Peng
Yi Zheng
Zhefeng Wang
Gang Chen
Abstract

Large model training often uses recomputation to alleviate memory pressure and pipelining to exploit parallelism across data, tensors, and devices. However, existing recomputation approaches may incur high overhead when training real-world models, because they execute on demand on the critical training path. In this paper, we present Lynx, a new recomputation framework that reduces this overhead by overlapping recomputation with communication in training pipelines. To shrink the large search space of recomputation strategies, we propose a heuristic-based recomputation scheduling algorithm, built on the observation that large DNN models contain identical structures, so the same scheduling policy can be applied to all of them. Additionally, we propose a recomputation-aware model partitioning method that balances each stage's execution time for improved training throughput. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37×.
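To make the core idea concrete, the following is a minimal illustrative sketch (not Lynx's actual algorithm; the function name and cost model are hypothetical). It captures two points from the abstract: because a large model repeats one identical block structure, a per-block scheduling decision can be reused across all blocks, and recomputation that fits inside a communication window overlaps with it instead of adding to the critical path.

```python
def schedule_recomputation(n_blocks, recompute_cost, comm_window):
    """Greedy toy scheduler: assign each block's recomputation either to a
    communication window (overlapped, adds no critical-path time) or to
    the critical path (on-demand, as in conventional recomputation).

    n_blocks        -- number of identical model blocks in the stage
    recompute_cost  -- per-block recomputation time (identical blocks
                       imply an identical per-block cost)
    comm_window     -- total communication time available for overlap
    """
    overlapped, on_demand = [], []
    remaining = comm_window
    for block in range(n_blocks):
        if recompute_cost <= remaining:
            # Fits in the remaining communication bubble: hide it there.
            overlapped.append(block)
            remaining -= recompute_cost
        else:
            # No room left: this block recomputes on the critical path.
            on_demand.append(block)
    critical_path_overhead = len(on_demand) * recompute_cost
    return overlapped, on_demand, critical_path_overhead
```

For example, with 8 blocks, a per-block recomputation cost of 2.0, and a communication window of 10.0, five blocks overlap for free and only three add overhead (6.0 instead of 16.0). The real scheduling problem is harder (per-tensor choices, pipeline-stage interactions), which is why the paper resorts to a heuristic over the repeated structure rather than exhaustive search.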

@article{chen2025_2406.08756,
  title={Optimizing Large Model Training through Overlapped Activation Recomputation},
  author={Ping Chen and Wenjie Zhang and Shuibing He and Weijian Chen and Siling Yang and Kexin Huang and Yanlong Yin and Xuan Zhan and Yingjie Gu and Zhuwei Peng and Yi Zheng and Zhefeng Wang and Gang Chen},
  journal={arXiv preprint arXiv:2406.08756},
  year={2025}
}