The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models

20 May 2025
Adrian Cosma
Stefan Ruseti
Emilian Radoi
Mihai Dascalu
Author Contacts: ioan_adrian.cosma@upb.ro, stefan.ruseti@upb.ro, emilian.radoi@upb.ro, mihai.dascalu@upb.ro
Community: LRM
Main: 8 pages · 9 figures · 1 table · Bibliography: 3 pages
Abstract

Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.
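As a minimal illustration of the tokenization bottleneck described above, the Python sketch below tokenizes "strawberry" with a toy, hypothetical two-entry subword vocabulary (this is not the paper's released code or task suite). The model conditions only on opaque token IDs, while the character-level answer lives in the underlying string it never observes directly.

# Minimal sketch of the tokenization bottleneck.
# TOY_VOCAB is hypothetical, chosen only for illustration.
TOY_VOCAB = {"straw": 17, "berry": 42}

def tokenize(word):
    """Greedy longest-match tokenization over the toy vocabulary."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids

word = "strawberry"
print(tokenize(word))   # [17, 42]: the opaque IDs the model conditions on
print(word.count("r"))  # 3: the character-level answer hidden inside them

Because the model sees only the IDs, answering "how many r's are in strawberry?" requires it to have memorized the spelling of every token, which is the low-mutual-information setting the abstract describes.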

View on arXiv: https://arxiv.org/abs/2505.14172
@article{cosma2025_2505.14172,
  title={The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models},
  author={Adrian Cosma and Stefan Ruseti and Emilian Radoi and Mihai Dascalu},
  journal={arXiv preprint arXiv:2505.14172},
  year={2025}
}