SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

4 February 2025
Loubna Ben Allal
Anton Lozhkov
Elie Bakouch
Gabriel Martín Blázquez
Guilherme Penedo
Lewis Tunstall
Andrés Marafioti
Hynek Kydlíček
Agustín Piqueres Lajarín
Vaibhav Srivastav
Joshua Lochner
Caleb Fahlgren
Xuan-Son Nguyen
Clémentine Fourrier
Ben Burtenshaw
Hugo Larcher
Haojun Zhao
Cyril Zakka
Mathieu Morlon
Colin Raffel
Leandro von Werra
Thomas Wolf
Abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.
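
To make the multi-stage mixing idea concrete, below is a minimal Python sketch of per-stage mixture sampling. The stage count, source names, and weights are hypothetical placeholders chosen for illustration; they are not the actual rates used to train SmolLM2.

# Illustrative sketch only: stage names and mixture weights are hypothetical,
# not the rates reported in the SmolLM2 paper.
import random

# Per-stage sampling weights over data sources. In the paper, the rates for
# each stage are manually refined based on evaluations of the previous stage.
STAGES = [
    {"web": 0.90, "code": 0.05, "math": 0.05},                     # early: mostly web text
    {"web": 0.75, "code": 0.15, "math": 0.10},                     # mid: more code and math
    {"web": 0.55, "code": 0.20, "math": 0.15, "instruct": 0.10},   # late: add instruction data
]

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick the data source for the next training document according to the mixture."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
for stage_idx, mixture in enumerate(STAGES):
    # A real pipeline would stream a fixed token budget per stage; here we
    # just draw a handful of documents to show the mixing mechanics.
    counts = {name: 0 for name in mixture}
    for _ in range(1000):
        counts[sample_source(mixture, rng)] += 1
    print(f"stage {stage_idx}: {counts}")

The sketch only shows the sampling mechanics; in the refinement loop described in the abstract, the weights for each subsequent stage would be adjusted based on how the previous stage's checkpoint performs.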

@article{allal2025_2502.02737,
  title={SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model},
  author={Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf},
  journal={arXiv preprint arXiv:2502.02737},
  year={2025}
}