Beyond Text Compression: Evaluating Tokenizers Across Scales

3 June 2025
Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
Main: 7 pages, 2 figures, 12 tables; Bibliography: 9 pages; Appendix: 3 pages
Abstract

The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our work offers a more efficient path to informed tokenizer selection in future language model development.
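To make the idea of a Zipf-inspired intrinsic metric concrete, here is a minimal illustrative sketch, not the paper's actual metric: it fits the exponent of a power law to the token frequency distribution a tokenizer produces on a held-out corpus, which is one simple statistic in the spirit the abstract describes. The `tokenize` callable and the example corpus are hypothetical placeholders.

from collections import Counter
import math

def zipf_exponent(texts, tokenize):
    """Estimate s in freq(rank) ~ rank^(-s) by a least-squares fit in log-log space."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))  # `tokenize` is any tokenizer's encode function
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # negated slope of the log-log fit gives the exponent s

# Hypothetical usage: compare two tokenizers on the same held-out text.
# corpus = ["held-out sentences in the target language", "..."]
# s_a = zipf_exponent(corpus, tokenizer_a.tokenize)
# s_b = zipf_exponent(corpus, tokenizer_b.tokenize)

Comparing such statistics across tokenizers on unseen-language text is the kind of intrinsic evaluation the paper argues can correlate with downstream performance better than text compression alone; the exact metrics and their combination are defined in the paper itself.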

@article{lotz2025_2506.03101,
  title={Beyond Text Compression: Evaluating Tokenizers Across Scales},
  author={Jonas F. Lotz and António V. Lopes and Stephan Peitz and Hendra Setiawan and Leonardo Emili},
  journal={arXiv preprint arXiv:2506.03101},
  year={2025}
}