LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

Abstract

Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: this https URL; HuggingFace: this https URL).
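
To make the notion of "environment interaction for validating generated plans" concrete, below is a minimal sketch, not the authors' released harness, of replaying an LLM-generated action sequence in a BabyAI grid world through the Gymnasium/Minigrid API. The environment id, the action-name mapping, and the success criterion are illustrative assumptions; the official evaluation code lives in the linked repositories.

import gymnasium as gym
import minigrid  # importing minigrid registers the BabyAI-* environments

# Hypothetical mapping from low-level action tokens an LLM might emit
# to Minigrid's discrete action indices.
ACTIONS = {"left": 0, "right": 1, "forward": 2,
           "pickup": 3, "drop": 4, "toggle": 5, "done": 6}

def validate_plan(env_id: str, seed: int, plan: list[str]) -> bool:
    """Replay a sequence of low-level action names and report whether the mission succeeds."""
    env = gym.make(env_id)
    env.reset(seed=seed)  # the same seed reproduces the same procedurally generated episode
    terminated, reward = False, 0.0
    for name in plan:
        if name not in ACTIONS:
            env.close()
            return False  # unparseable action: reject the plan
        _, reward, terminated, truncated, _ = env.step(ACTIONS[name])
        if terminated or truncated:
            break
    env.close()
    # Minigrid assigns a positive terminal reward only when the mission is completed.
    return terminated and reward > 0

# Example usage: score a (toy) plan on a small BabyAI level.
print(validate_plan("BabyAI-GoToRedBall-v0", seed=0,
                    plan=["forward", "forward", "left", "forward"]))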

@article{choukrani2025_2505.12135,
  title   = {LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs},
  author  = {Omar Choukrani and Idriss Malek and Daniil Orel and Zhuohan Xie and Zangir Iklassov and Martin Takáč and Salem Lahlou},
  journal = {arXiv preprint arXiv:2505.12135},
  year    = {2025}
}