Meeseeks: An Iterative Benchmark Evaluating LLMs' Multi-Turn Instruction-Following Ability

The ability to follow instructions accurately is fundamental for Large Language Models (LLMs) to serve as reliable agents in real-world applications. While existing instruction-following benchmarks either cover only a single turn or introduce new requirements in each turn without allowing self-correction, Meeseeks simulates realistic human-LLM interactions through an iterative feedback process. This design enables models to self-correct based on the specific requirements they failed, better reflecting real-world end-user usage patterns. The benchmark implements a comprehensive evaluation system with 38 capability tags organized across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks provides valuable insights into their instruction-following capabilities in practical applications.
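To make the iterative feedback process concrete, below is a minimal Python sketch of one such evaluation loop: the model answers, per-requirement checks are run, and only the failed requirements are fed back so the model can self-correct in the next turn. All names here (evaluate_sample, run_model, the toy checks, the round cap) are hypothetical illustrations under assumptions about the setup, not the paper's actual implementation.

```python
from typing import Callable

MAX_ROUNDS = 3  # hypothetical cap on self-correction rounds


def evaluate_sample(
    instruction: str,
    requirements: dict[str, Callable[[str], bool]],
    run_model: Callable[[str], str],
    max_rounds: int = MAX_ROUNDS,
) -> dict:
    """Run up to `max_rounds` turns, feeding back only the failed requirements."""
    prompt = instruction
    history = []
    for turn in range(1, max_rounds + 1):
        response = run_model(prompt)
        failed = [name for name, check in requirements.items() if not check(response)]
        history.append({"turn": turn, "response": response, "failed": failed})
        if not failed:
            break
        # Build a corrective follow-up naming the specific unmet requirements,
        # so the model can attempt to self-correct in the next turn.
        prompt = (
            f"{instruction}\n\nYour previous answer did not satisfy: "
            + ", ".join(failed)
            + ". Please revise your answer."
        )
    return {"passed": not failed, "turns_used": turn, "history": history}


if __name__ == "__main__":
    # Toy example: require an all-uppercase answer of at most 20 characters.
    checks = {
        "uppercase": str.isupper,
        "under_20_chars": lambda s: len(s) <= 20,
    }
    calls = {"n": 0}

    def toy_model(prompt: str) -> str:
        # Stand-in "model" that only fixes itself after receiving feedback once.
        calls["n"] += 1
        return "hello world" if calls["n"] == 1 else "HELLO WORLD"

    print(evaluate_sample("Say hello in all caps, briefly.", checks, toy_model))
```

In this sketch the feedback names each unmet requirement explicitly; how Meeseeks actually phrases the failure feedback and how many correction rounds it allows are details specified in the paper, not here.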
View on arXiv: https://arxiv.org/abs/2504.21625

@article{wang2025_2504.21625,
  title   = {Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs' Multi-turn Instruction-Following Ability},
  author  = {Jiaming Wang and Yunke Zhao and Peng Ding and Jun Kuang and Zongyu Wang and Xuezhi Cao and Xunliang Cai},
  journal = {arXiv preprint arXiv:2504.21625},
  year    = {2025}
}