We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to, or even greater than, those obtained with thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at this https URL.
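The core objective is straightforward: minimize the entropy of the model's output distribution, which sharpens its predictions without any labels. Below is a minimal numerical sketch of that objective, using a toy softmax distribution in place of an LLM's next-token distribution; the logits, learning rate, and step count are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy H(p) = -sum_i p_i log p_i (small eps avoids log 0).
    return -np.sum(p * np.log(p + 1e-12))

def entropy_min_step(z, lr=2.0):
    # One gradient-descent step on H with respect to the logits.
    # Analytic gradient: dH/dz_i = -p_i * (log p_i + H).
    p = softmax(z)
    H = entropy(p)
    grad = -p * (np.log(p + 1e-12) + H)
    return z - lr * grad

# Toy "next-token" logits (illustrative values).
z = np.array([1.0, 0.8, 0.5, 0.1])
h0 = entropy(softmax(z))

# Ten steps of entropy minimization, mirroring the 10-step setting above.
for _ in range(10):
    z = entropy_min_step(z)

h1 = entropy(softmax(z))  # entropy drops as mass concentrates on the argmax
```

Each step increases the logit of the currently most likely token (where log p_i + H > 0) and decreases the others, so the distribution sharpens toward its own prediction, with no labels or rewards involved.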
@article{gao2025_2505.20282,
  title   = {One-shot Entropy Minimization},
  author  = {Zitian Gao and Lynx Chen and Joey Zhou and Bryan Dai},
  journal = {arXiv preprint arXiv:2505.20282},
  year    = {2025}
}