Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

13 June 2025

Main: 9 pages; Bibliography: 3 pages; Appendix: 13 pages; 14 figures; 6 tables

Abstract

Language models are trained mainly on massive text data from the Internet, and understanding this data source has become increasingly important. Exact-match search engines enable searching large text corpora (counting string appearances and retrieving the enclosing documents), yet their high storage overhead hinders their application to Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes only 44% the size of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 46TB of Internet text in 50 days with a single 128-core CPU node (or 19 hours if using 75 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 40% in SQuAD), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.
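To make the counting operation concrete, the following is a minimal toy sketch of FM-index backward search. It is not the Infini-gram mini implementation: the function names `build_fm_index` and `count` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, so the sketch illustrates only how occurrences are counted, not how the index achieves its compression.

```python
def build_fm_index(text: str):
    """Build a toy, uncompressed FM-index (illustrative assumption only)."""
    assert "\0" not in text
    text += "\0"  # unique sentinel, smaller than every other character

    # Naive suffix array construction; fine for a toy, far too slow at scale.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    # first[c]: number of characters in the text strictly smaller than c.
    alphabet = sorted(set(text))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)

    return first, occ, len(bwt)


def count(index, pattern: str) -> int:
    """Count occurrences of a non-empty pattern via backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))
```

Each pattern character costs two table lookups, so the query time in this sketch depends on the pattern length rather than on the text length. A practical FM-index replaces the dense `occ` table with compressed rank structures, which is where the space savings come from.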

@article{xu2025_2506.12229,
  title={Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index},
  author={Hao Xu and Jiacheng Liu and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
  journal={arXiv preprint arXiv:2506.12229},
  year={2025}
}
