Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

8 pages (main) + 6 pages (appendix) + 3 pages (bibliography), 2 figures, 14 tables
Abstract

This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching-entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% compared to BPE (i.e., it produces about 10% fewer tokens, improving inference speed by roughly 10%) without degrading performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
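To make the two quantities in the abstract concrete, the sketch below computes token fertility (average tokens per whitespace-delimited word) and the standard branching entropy H(c) = -sum_x p(x|c) log2 p(x|c) of the next symbol after a context c. This is a minimal illustration, not Thunder-Tok's implementation: the tokenize callable, the character-level toy corpus, and all names here are placeholder assumptions, and the paper's actual seed-vocabulary construction and selection rule are not described in this abstract.

    from collections import Counter, defaultdict
    import math

    def fertility(tokenize, texts):
        # Fertility = total tokens / total whitespace-delimited words.
        # `tokenize` is any callable mapping a string to a token list
        # (a placeholder interface, not Thunder-Tok's API).
        n_tokens = sum(len(tokenize(t)) for t in texts)
        n_words = sum(len(t.split()) for t in texts)
        return n_tokens / n_words

    def branching_entropy(sequences, context_len=1):
        # H(c) = -sum_x p(x|c) log2 p(x|c), estimated from successor counts.
        # Low entropy after a candidate unit suggests extending the unit;
        # high entropy suggests a natural boundary. This is the generic
        # formulation, not necessarily the paper's exact selection criterion.
        successors = defaultdict(Counter)
        for seq in sequences:
            for i in range(len(seq) - context_len):
                ctx = tuple(seq[i:i + context_len])
                successors[ctx][seq[i + context_len]] += 1
        entropy = {}
        for ctx, counts in successors.items():
            total = sum(counts.values())
            entropy[ctx] = -sum((c / total) * math.log2(c / total)
                                for c in counts.values())
        return entropy

    # Toy usage: a character-level baseline has high fertility because every
    # syllable becomes a token; merging low-entropy continuations lowers it.
    texts = ["나는 학교에 간다", "나는 집에 간다"]
    char_tokenize = lambda s: [ch for ch in s if ch != " "]
    print(fertility(char_tokenize, texts))   # 13 tokens / 6 words ~ 2.17
    print(branching_entropy([char_tokenize(t) for t in texts]))

In this toy corpus, the context ('는',) is followed by two distinct characters (entropy of 1 bit, a plausible cut point), while ('에',) is always followed by '간' (entropy of 0, a candidate for merging into a longer token); selecting longer, low-entropy units is what drives the average token length up and fertility down.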

@article{cho2025_2506.15138,
  title={Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models},
  author={Gyeongje Cho and Yeonkyoun So and Chanwoo Park and Sangmin Lee and Sungmok Jung and Jaejin Lee},
  journal={arXiv preprint arXiv:2506.15138},
  year={2025}
}