Parameterized Synthetic Text Generation with SimpleStories

Main: 9 pages · Bibliography: 2 pages · Appendix: 5 pages · 8 figures · 5 tables
Abstract
We present SimpleStories, a large synthetic story dataset written in simple language, comprising 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we control story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we push the frontier on the fewest-parameter language model that outputs grammatical natural language.
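The parameterized prompting idea described in the abstract can be sketched as follows. This is a minimal illustration only: the parameter axes, template wording, and function names below are assumptions for exposition, not the paper's released prompt set or code.

import random

# Hypothetical parameter axes at different levels of abstraction
# (the actual SimpleStories axes are defined in the paper's released code).
THEMES = ["friendship", "courage", "curiosity"]      # semantic level
STYLES = ["fairy tale", "fable", "diary entry"]      # discourse level
GRAMMAR = ["simple sentences", "short dialogue"]     # syntactic level

PROMPT_TEMPLATE = (
    "Write a short story in simple {language} for young readers. "
    "Theme: {theme}. Style: {style}. Use {grammar}."
)

def sample_prompts(n: int, language: str = "English", seed: int = 0) -> list[str]:
    """Draw n prompts by sampling one value per parameter axis."""
    rng = random.Random(seed)
    return [
        PROMPT_TEMPLATE.format(
            language=language,
            theme=rng.choice(THEMES),
            style=rng.choice(STYLES),
            grammar=rng.choice(GRAMMAR),
        )
        for _ in range(n)
    ]

if __name__ == "__main__":
    for prompt in sample_prompts(3):
        print(prompt)

Sampling each axis independently is what yields combinatorial diversity: even a handful of values per axis produces many distinct prompt configurations, which is how such a scheme can induce syntactic and semantic variation at the scale of millions of samples.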
@article{finke2025_2504.09184,
  title={Parameterized Synthetic Text Generation with SimpleStories},
  author={Lennart Finke and Chandan Sreedhara and Thomas Dooms and Mat Allen and Emerald Zhang and Juan Diego Rodriguez and Noa Nabeshima and Thomas Marshall and Dan Braun},
  journal={arXiv preprint arXiv:2504.09184},
  year={2025}
}