GoalLadder: Incremental Goal Discovery with Vision-Language Models

Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, GoalLadder, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses the feedback to rank potential goal states with an Elo-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for the abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, achieving an average final success rate of 95% compared to only 45% for the best competitor.
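To illustrate the two mechanisms the abstract describes, the sketch below shows a standard Elo update applied to pairwise VLM preferences over candidate goal states, and a reward defined as the negative distance to the top-ranked goal in an embedding space. All function and variable names (elo_update, goal_distance_reward, the toy goal labels) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch, assuming a standard Elo update (K-factor 32, 400-point scale).
# The paper states only that candidate goals are ranked with an Elo-based system
# from pairwise VLM comparisons and that the agent minimises the distance to the
# top-ranked goal in a learned embedding space; the details below are assumptions.

def elo_update(r_a: float, r_b: float, a_preferred: bool, k: float = 32.0):
    """Update two candidates' ratings after one pairwise VLM comparison."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_preferred else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def goal_distance_reward(state_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """Reward proxy: negative Euclidean distance to the top-ranked goal embedding."""
    return -float(np.linalg.norm(state_emb - goal_emb))

# Toy usage: three candidate goal states and a few (possibly noisy) VLM preferences.
ratings = {"g0": 1000.0, "g1": 1000.0, "g2": 1000.0}
comparisons = [("g1", "g0", True), ("g2", "g1", True), ("g2", "g0", True)]
for a, b, a_preferred in comparisons:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_preferred)
top_goal = max(ratings, key=ratings.get)  # the goal the agent is steered towards
```

Because each comparison only nudges ratings rather than overwriting them, an occasional incorrect VLM judgement shifts the ranking slightly instead of redefining the goal outright, which is the intuition behind using a rating system rather than trusting individual responses.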
@article{zakharov2025_2506.16396,
  title={GoalLadder: Incremental Goal Discovery with Vision-Language Models},
  author={Alexey Zakharov and Shimon Whiteson},
  journal={arXiv preprint arXiv:2506.16396},
  year={2025}
}