
Causal Estimation of Tokenisation Bias

Main: 9 pages
9 figures
Bibliography: 3 pages
Appendix: 4 pages
Abstract

Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including a subword (e.g., ⟨hello⟩) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., "hello"). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser's vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
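The estimation strategy the abstract describes lends itself to a short illustration. The following is a minimal sketch, in Python on synthetic data, of a regression discontinuity estimate: subwords are ranked, the first K enter the vocabulary, and the jump in log-probability at the cutoff estimates the causal effect of inclusion. The cutoff K, the bandwidth, and the data-generating process below are all illustrative assumptions, not the paper's actual setup.

# Regression discontinuity sketch (assumed setup, synthetic data):
# tokenisers rank candidate subwords and keep the first K, so subwords
# just above and just below the cutoff are comparable, and the jump in
# log-probability at the cutoff estimates the effect of inclusion.
import numpy as np

rng = np.random.default_rng(0)

K = 32_000          # hypothetical vocabulary cutoff
bandwidth = 2_000   # only use subwords ranked within K +/- bandwidth

# Synthetic data: subword ranks, and the log-probability a trained model
# assigns to each subword's characters. Ranks <= K were in the vocabulary.
rank = rng.integers(K - bandwidth, K + bandwidth, size=5_000)
in_vocab = (rank <= K).astype(float)
true_effect = 1.5   # jump at the cutoff, in log-probability
logprob = -5e-4 * rank + true_effect * in_vocab + rng.normal(0, 0.5, rank.size)

# Local linear regression with separate slopes on each side of the cutoff:
# logprob ~ b0 + tau*in_vocab + b1*(rank - K) + b2*in_vocab*(rank - K)
x = rank - K
X = np.column_stack([np.ones_like(x, dtype=float), in_vocab, x, in_vocab * x])
beta, *_ = np.linalg.lstsq(X, logprob, rcond=None)
tau = beta[1]       # estimated discontinuity at the cutoff
print(f"estimated inclusion effect: {tau:.3f} (true: {true_effect})")

Because subwords near the cutoff differ only in whether they made it into the vocabulary, the discontinuity tau can be read causally; an effect of tau in log-probability corresponds to a multiplicative factor of exp(tau) on the character-string's probability.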

@article{lesci2025_2506.03149,
  title={Causal Estimation of Tokenisation Bias},
  author={Pietro Lesci and Clara Meister and Thomas Hofmann and Andreas Vlachos and Tiago Pimentel},
  journal={arXiv preprint arXiv:2506.03149},
  year={2025}
}