Reinforcement Learning for LLM Reasoning Under Memory Constraints

29 April 2025

Alan Lee

Harry Tong

OffRL

Main: 6 Pages

1 Figure

Bibliography: 1 Page

6 Tables

Appendix: 1 Page

Abstract

We explore reinforcement learning (RL) techniques to enhance reasoning within targeted problem spaces in large language models (LLMs) under memory and compute constraints. Our focus is on critic-free methods compatible with LoRA fine-tuning on a single 40GB GPU, a common limitation in academic settings. We introduce S-GRPO, a memory-efficient variant of Group Relative Policy Optimization (GRPO), and T-SPMO, a token-level prefix matching strategy for fine-grained credit assignment. Despite the limited resources, when used to fine-tune Qwen2-1.5B with LoRA, both methods significantly improve SVAMP benchmark accuracy, from 46% to above 70%. T-SPMO also excels in multi-digit multiplication tasks, underscoring the potential of RL fine-tuning under hardware constraints. Additionally, we find that our full-token GRPO baseline under LoRA fine-tuning did not improve performance over the base model on either task, suggesting that our memory-efficient methods may act as a form of regularization that stabilizes training when only a small subset of parameters is updated.
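
For readers unfamiliar with the term, "critic-free" refers to GRPO-style methods that replace a learned value function with a baseline computed from a group of completions sampled for the same prompt. The sketch below illustrates only that standard group-relative advantage and is not the authors' S-GRPO or T-SPMO. The function name, the binary reward scheme, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. No critic network is involved: the group mean is the baseline.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 completions for one prompt:
# reward 1.0 if the final answer is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion earns the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy gradient pushes probability toward the better samples in each group. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient.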

@article{lee2025_2504.20834,
  title={Token-Efficient RL for LLM Reasoning},
  author={Alan Lee and Harry Tong},
  journal={arXiv preprint arXiv:2504.20834},
  year={2025}
}
