Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at this https URL.
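To illustrate the group-relative idea at the heart of GRPO, here is a minimal sketch of the standard advantage computation: for each prompt, several completions are sampled and each completion's reward is normalized against the mean and standard deviation of its own group. This is a generic illustration, not the paper's released implementation; the function name grpo_advantages and the example reward values are hypothetical.

import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, group_size], one scalar reward per sampled completion.
    # Group-relative advantage: standardize each reward within its prompt's group.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, four sampled completions scored by a rule-based reward
# (1.0 if the final answer is correct, 0.0 otherwise) - hypothetical values.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # correct completions get positive advantage, incorrect ones negative

Because advantages are computed relative to the group rather than a learned value function, GRPO needs no separate critic model, which is part of what makes it attractive under the tight GPU and time budget described above.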
@article{dang2025_2503.16219,
  title={Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't},
  author={Quy-Anh Dang and Chris Ngo},
  journal={arXiv preprint arXiv:2503.16219},
  year={2025}
}