Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size

Pretraining large language models is a costly process. To make this process more efficient, several methods have been proposed to optimize model architecture/parametrization and hardware use. On the parametrization side, $\mu$P (Maximal Update Parametrization) parametrizes model weights and learning rates (LRs) in a way that makes hyperparameters (HPs) transferable with width (embedding dimension): HPs can be tuned for a small model and reused for larger models without additional tuning. While $\mu$P has shown impressive results in practice, recent empirical studies have reported conflicting observations when it is applied to LLMs. One limitation of the theory behind $\mu$P is that the input dimension (vocabulary size in LLMs) is considered fixed when taking the width to infinity. This is unrealistic, since vocabulary size is generally much larger than width in practice. In this work, we provide a theoretical analysis of the effect of vocabulary size on training dynamics, and subsequently show that as vocabulary size increases, the training dynamics \emph{interpolate between the $\mu$P regime and another regime that we call the Large Vocab (LV) Regime}, where the optimal scaling rules differ from those predicted by $\mu$P. Our analysis reveals that in the LV regime, the optimal embedding LR to hidden LR ratio should roughly scale as $\Theta(\sqrt{\text{width}})$, surprisingly close to the empirical findings previously reported in the literature, and different from the $\Theta(\text{width})$ ratio predicted by $\mu$P. We conduct several experiments to validate our theory, and pretrain a 1B model from scratch to show the benefit of our suggested scaling rule for the embedding LR.
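To make the suggested scaling rule concrete, here is a minimal PyTorch-style sketch of how one might assign per-group learning rates so that the embedding LR exceeds the hidden LR by a factor of $\sqrt{\text{width}}$. The helper name, the "embed" naming convention, and the base LR value are illustrative assumptions, not the paper's implementation; $\mu$P would instead use a factor of width.

```python
# Sketch only: split parameters into embedding vs. hidden groups and scale the
# embedding LR by sqrt(width) relative to the hidden LR (the LV-regime rule).
# The module-name test and the hyperparameter values below are assumptions.
import math
import torch

def build_param_groups(model, base_hidden_lr, width):
    embed_params, hidden_params = [], []
    for name, param in model.named_parameters():
        if "embed" in name:  # assumes embedding layers contain "embed" in their name
            embed_params.append(param)
        else:
            hidden_params.append(param)
    return [
        {"params": hidden_params, "lr": base_hidden_lr},
        {"params": embed_params, "lr": base_hidden_lr * math.sqrt(width)},
    ]

# Hypothetical usage with a model of width (embedding dimension) 1024:
# optimizer = torch.optim.AdamW(build_param_groups(model, base_hidden_lr=1e-3, width=1024))
```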
@article{hayou2025_2506.15025,
  title   = {Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size},
  author  = {Soufiane Hayou and Liyuan Liu},
  journal = {arXiv preprint arXiv:2506.15025},
  year    = {2025}
}