Recipes for Pre-training LLMs with MXFP8
Precision scaling - using fewer bits to represent model parameters and related tensors during pre-training - has emerged as a compelling technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats in NVIDIA's latest Blackwell GPUs represent a major leap in enabling this precision scaling. These formats combine narrow floating-point data types with per-block scaling factors, offering a fine-grained approach to quantizing tensors. Although MX formats offer the promise of improved numeric stability compared to other reduced-precision representations, in practice they must be used carefully in order to successfully converge an LLM on a multi-trillion-token dataset. In this paper, we show that the rounding mode suggested in the OCP specification can lead to divergence when pre-training an LLM. We show that an improved rounding mode, which uses round-to-infinity to compute scaling factors, enables successful pre-training in MXFP8 for an 8B model on 15T tokens.
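
To make the rounding-mode distinction concrete, the sketch below (Python/NumPy, not the authors' implementation) quantizes a single 32-element block to E4M3 with a shared power-of-two scale, contrasting a floor-based scale exponent in the spirit of the OCP formula with a round-to-infinity (ceiling) variant. The function names, the example block contents, and the simplification of skipping element-level rounding and E8M0 exponent clamping are illustrative assumptions.

import numpy as np

BLOCK = 32           # number of elements that share one scale in the MX block layout
E4M3_MAX = 448.0     # largest finite E4M3 magnitude
E4M3_EMAX = 8        # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)

def floor_exponent(amax):
    # Floor-based rule in the spirit of the OCP formula: the scaled block
    # maximum lands in [256, 512), so values above 448 are clipped on cast.
    return int(np.floor(np.log2(amax))) - E4M3_EMAX

def round_to_infinity_exponent(amax):
    # Ceiling-based rule: guarantees amax / 2**exp <= 448, so the largest
    # element in a block never saturates.
    return int(np.ceil(np.log2(amax / E4M3_MAX)))

def quantize_block(block, exponent_fn):
    # Dequantized values after per-block scaling and clipping to the E4M3
    # range; real hardware would also round each value to the E4M3 grid.
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        return np.zeros_like(block)
    exp = exponent_fn(amax)
    scaled = np.clip(block / 2.0 ** exp, -E4M3_MAX, E4M3_MAX)
    return scaled * 2.0 ** exp

# A block whose largest magnitude (480) exceeds 448 after floor-based scaling.
rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK) * 10.0
x[0] = 480.0
for name, fn in [("floor-based scale", floor_exponent),
                 ("round-to-infinity scale", round_to_infinity_exponent)]:
    err = np.max(np.abs(quantize_block(x, fn) - x))
    print(f"{name:24s} max abs error from clipping = {err:.1f}")

In this sketch, the floor-based exponent lets the scaled block maximum exceed the largest E4M3 value and clips it, while rounding the scale exponent toward infinity keeps every element of the block in range.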
@article{mishra2025_2506.08027,
  title   = {Recipes for Pre-training LLMs with MXFP8},
  author  = {Asit Mishra and Dusan Stosic and Simon Layton},
  journal = {arXiv preprint arXiv:2506.08027},
  year    = {2025}
}