Parameterized Synthetic Text Generation with SimpleStories

Main: 9 Pages
8 Figures
Bibliography: 2 Pages
5 Tables
Appendix: 5 Pages

Abstract

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
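As a rough illustration of what parameterizing prompts with features at several levels of abstraction could look like, the sketch below samples one value per feature and fills a prompt template. The feature names, their values, the template wording, and the helper functions are all hypothetical and chosen for illustration only; the abstract does not specify the actual features or prompts used for SimpleStories.

```python
import random

# Hypothetical feature pools (assumption): the abstract does not list the
# real features. They range from high-level (theme) to low-level (grammar).
FEATURES = {
    "theme": ["friendship", "courage", "curiosity"],
    "setting": ["a small village", "a forest", "a spaceship"],
    "style": ["a fable", "a bedtime story", "a diary entry"],
    "grammar": ["past tense", "present tense", "dialogue-heavy sentences"],
}

# Hypothetical prompt template (assumption), with one slot per feature.
TEMPLATE = (
    "Write a short story in simple language as {style}. "
    "It is set in {setting}, is about {theme}, and uses {grammar}."
)


def sample_parameters(rng):
    """Pick one value for every feature, independently and uniformly."""
    return {name: rng.choice(options) for name, options in FEATURES.items()}


def build_prompt(params):
    """Fill the template with one sampled parameter set."""
    return TEMPLATE.format(**params)


def generate_prompts(n, seed=0):
    """Return n prompts; a fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return [build_prompt(sample_parameters(rng)) for _ in range(n)]


if __name__ == "__main__":
    for prompt in generate_prompts(3, seed=1):
        print(prompt)
```

Sampling each feature independently is only one possible design; it spreads prompts across the combinations of feature values, which is one simple way to aim for variety while the template keeps the language constraint fixed.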

@article{finke2025_2504.09184,
  title={Parameterized Synthetic Text Generation with SimpleStories},
  author={Lennart Finke and Chandan Sreedhara and Thomas Dooms and Mat Allen and Emerald Zhang and Juan Diego Rodriguez and Noa Nabeshima and Thomas Marshall and Dan Braun},
  journal={arXiv preprint arXiv:2504.09184},
  year={2025}
}