
Meeseeks: An Iterative Benchmark Evaluating LLMs' Multi-Turn Instruction-Following Ability

Main: 10 pages, 4 figures, 9 tables; Bibliography: 4 pages; Appendix: 3 pages
Abstract

The ability to follow instructions accurately is fundamental for Large Language Models (LLMs) to serve as reliable agents in real-world applications. Whereas existing instruction-following benchmarks are either single-turn or introduce new requirements in each turn without allowing self-correction, Meeseeks simulates realistic human-LLM interactions through an iterative feedback process. This design lets models self-correct based on specific requirement failures, better reflecting how users actually interact with LLMs. The benchmark implements a comprehensive evaluation system with 38 capability tags organized across three dimensions: Intent Recognition, Granular Content Validation, and Output Structure Validation. Through rigorous evaluation across LLMs, Meeseeks provides valuable insights into LLMs' instruction-following capabilities in practical applications.
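The iterative protocol the abstract describes (query the model, validate each granular requirement, feed the specific failures back, let the model retry) can be made concrete with a short sketch. The Python below is a minimal illustration under assumed interfaces, not the benchmark's actual code: the Requirement class, the model.generate call, the feedback wording, and max_rounds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Requirement:
    """One granular, checkable requirement (hypothetical representation)."""
    description: str
    check: Callable[[str], bool]  # returns True if the response satisfies it

def iterative_feedback_eval(model, instruction, requirements, max_rounds=3):
    """Run a Meeseeks-style loop: generate, validate every requirement,
    report the failures back to the model, and repeat until all pass
    or the round budget is exhausted."""
    messages = [{"role": "user", "content": instruction}]
    failures = list(requirements)
    for round_idx in range(max_rounds):
        response = model.generate(messages)  # assumed chat-style interface
        messages.append({"role": "assistant", "content": response})

        # Validate each requirement independently so feedback is specific.
        failures = [req for req in requirements if not req.check(response)]
        if not failures:
            return {"passed": True, "rounds": round_idx + 1}

        # Tell the model exactly which requirements failed so it can self-correct.
        feedback = "The response failed these requirements:\n" + "\n".join(
            f"- {req.description}" for req in failures
        )
        messages.append({"role": "user", "content": feedback})

    return {"passed": False, "rounds": max_rounds,
            "unmet": [req.description for req in failures]}
```

The key design point this sketch captures is that, unlike benchmarks which add new requirements each turn, every follow-up message here targets only the requirements the model already failed, so the loop measures self-correction rather than compounding task difficulty.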

@article{wang2025_2504.21625,
  title={Ask, Fail, Repeat: Meeseeks, an Iterative Feedback Benchmark for LLMs' Multi-turn Instruction-Following Ability},
  author={Jiaming Wang and Yunke Zhao and Peng Ding and Jun Kuang and Zongyu Wang and Xuezhi Cao and Xunliang Cai},
  journal={arXiv preprint arXiv:2504.21625},
  year={2025}
}
