ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data

Autoformalization, the automatic translation of mathematical content from natural language into machine-verifiable formal languages, has seen significant progress driven by advances in large language models (LLMs). Nonetheless, a primary barrier to further improvement is the limited availability of parallel corpora that map informal mathematical text to its formal counterpart. To address this limitation, we propose ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), a novel data generation framework designed to produce large-scale, high-quality parallel corpora of theorem statements. Distinct from prior approaches, ATLAS begins with a concept repository, accelerates the improvement of the student model through expert iteration combined with knowledge distillation, and introduces two novel augmentation strategies that exploit the structural characteristics of formal languages. Running ATLAS for 10 iterations, we construct an undergraduate-level dataset comprising 117k theorem statements and develop the ATLAS Translator, which demonstrates statistically significant improvements over both the HERALD Translator and the Kimina-Autoformalizer across all benchmarks (two-sided t-test), achieving a new state of the art. The datasets, model, and code will be released to the public soon.
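
For concreteness, an entry in such a parallel corpus pairs an informal statement with its formal counterpart. The sketch below is a hypothetical illustration in Lean 4 with Mathlib (the abstract does not fix the target language, though the baselines it compares against are Lean translators); the augmentation shown is an assumed example of a transformation that exploits formal structure, not necessarily one of the paper's two strategies.

  import Mathlib

  -- Informal: "For any real numbers a and b, (a + b)^2 ≤ 2(a^2 + b^2)."
  theorem sq_add_le_two_mul (a b : ℝ) :
      (a + b) ^ 2 ≤ 2 * (a ^ 2 + b ^ 2) := by
    nlinarith [sq_nonneg (a - b)]

  -- Hypothetical structure-exploiting augmentation: generalize the
  -- concrete type ℝ to any linear ordered field, yielding a second
  -- statement (and paired informal text) for the corpus.
  theorem sq_add_le_two_mul' {K : Type*} [LinearOrderedField K] (a b : K) :
      (a + b) ^ 2 ≤ 2 * (a ^ 2 + b ^ 2) := by
    nlinarith [sq_nonneg (a - b)]

Because the formal language is machine-checkable, any such generated variant can be validated by the Lean compiler before it is admitted to the training corpus, which is what makes large-scale synthesis of this kind feasible.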
@article{liu2025_2502.05567,
  title   = {ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data},
  author  = {Xiaoyang Liu and Kangjie Bao and Jiashuo Zhang and Yunqi Liu and Yuntian Liu and Yu Chen and Yang Jiao and Tao Luo},
  journal = {arXiv preprint arXiv:2502.05567},
  year    = {2025}
}