
Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM

Abstract

We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-actions visited by optimal policies, and then using an RL algorithm to explore the environment and improve upon the policy suggested by the LLM. Our algorithm, LORO, can both converge to an optimal policy and achieve high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to 4× the cumulative reward of the pure RL baseline.
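
The sketch below (not the authors' LORO implementation) illustrates the warm-starting idea described above: an LLM-suggested policy is rolled out to fill a replay buffer with off-policy transitions before an RL algorithm takes over. The function `llm_policy` is a hypothetical stand-in; in practice it would prompt an LLM with a textual description of the state and parse the suggested action. A hand-coded heuristic is used here only so the sketch runs end-to-end on CartPole.

```python
# Minimal sketch of warm-starting an off-policy RL agent with LLM-suggested data.
# Assumptions: Gymnasium's CartPole-v1 and a placeholder `llm_policy`.

from collections import deque

import gymnasium as gym  # or `import gym` for older OpenAI Gym versions


def llm_policy(obs):
    # Hypothetical placeholder for an LLM-suggested policy: a real version would
    # describe the observation to an LLM and parse its answer into an action.
    _, _, angle, ang_vel = obs
    return 1 if angle + 0.5 * ang_vel > 0 else 0  # push toward the falling side


def collect_warm_start_data(env, num_episodes=20, buffer_size=10_000):
    """Fill a replay buffer with transitions generated by the LLM-suggested policy."""
    buffer = deque(maxlen=buffer_size)
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = llm_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            buffer.append((obs, action, reward, next_obs, done))
            obs = next_obs
    return buffer


if __name__ == "__main__":
    env = gym.make("CartPole-v1")
    warm_buffer = collect_warm_start_data(env)
    print(f"Warm-start buffer contains {len(warm_buffer)} transitions")
    # An off-policy RL algorithm (e.g., DQN) would now be initialized with this
    # buffer and continue exploring the environment to improve on the LLM policy.
```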

@article{duong2025_2505.10861,
  title={Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM},
  author={Thang Duong and Minglai Yang and Chicheng Zhang},
  journal={arXiv preprint arXiv:2505.10861},
  year={2025}
}