Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

8 pages (main) + 6 pages (appendix) + 3 pages (bibliography), 2 figures, 14 tables
Abstract

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching-entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% compared to BPE (i.e., it produces about 10% fewer tokens, improving inference speed by roughly 10%) without degrading performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
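To make the two quantities in the abstract concrete, the sketch below computes token fertility (average tokens per whitespace-delimited word) and the standard branching entropy H(c) = -sum_x p(x|c) log2 p(x|c) of the next symbol after a context c. This is a minimal illustration, not Thunder-Tok's implementation: the tokenize callable, the character-level toy corpus, and all names here are placeholder assumptions, and the paper's actual seed-vocabulary construction and selection rule are not described in this abstract.

    from collections import Counter, defaultdict
    import math

    def fertility(tokenize, texts):
        # Fertility = total tokens / total whitespace-delimited words.
        # `tokenize` is any callable mapping a string to a token list
        # (a placeholder interface, not Thunder-Tok's API).
        n_tokens = sum(len(tokenize(t)) for t in texts)
        n_words = sum(len(t.split()) for t in texts)
        return n_tokens / n_words

    def branching_entropy(sequences, context_len=1):
        # H(c) = -sum_x p(x|c) log2 p(x|c), estimated from successor counts.
        # Low entropy after a candidate unit suggests extending the unit;
        # high entropy suggests a natural boundary. This is the generic
        # formulation, not necessarily the paper's exact selection criterion.
        successors = defaultdict(Counter)
        for seq in sequences:
            for i in range(len(seq) - context_len):
                ctx = tuple(seq[i:i + context_len])
                successors[ctx][seq[i + context_len]] += 1
        entropy = {}
        for ctx, counts in successors.items():
            total = sum(counts.values())
            entropy[ctx] = -sum((c / total) * math.log2(c / total)
                                for c in counts.values())
        return entropy

    # Toy usage: a character-level baseline has high fertility because every
    # syllable becomes a token; merging low-entropy continuations lowers it.
    texts = ["나는 학교에 간다", "나는 집에 간다"]
    char_tokenize = lambda s: [ch for ch in s if ch != " "]
    print(fertility(char_tokenize, texts))   # 13 tokens / 6 words ~ 2.17
    print(branching_entropy([char_tokenize(t) for t in texts]))

In this toy corpus, the context ('는',) is followed by two distinct characters (entropy of 1 bit, a plausible cut point), while ('에',) is always followed by '간' (entropy of 0, a candidate for merging into a longer token); selecting longer, low-entropy units is what drives the average token length up and fertility down.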

@article{cho2025_2506.15138,
  title={Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models},
  author={Gyeongje Cho and Yeonkyoun So and Chanwoo Park and Sangmin Lee and Sungmok Jung and Jaejin Lee},
  journal={arXiv preprint arXiv:2506.15138},
  year={2025}
}