Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling

Large language models (LLMs) have made significant advances in error handling. However, existing work handles errors passively, relying on prompts that contain explicit error-handling instructions; in real-world scenarios, such instructions are usually unavailable. This paper identifies the resulting challenge: how to conduct proactive error handling, i.e., detecting and handling errors without explicit instructions. To promote further research, this work introduces Mis-prompt, a new benchmark consisting of four evaluation tasks, an error-category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs' performance on the benchmark; the experimental results reveal that current LLMs perform poorly at proactive error handling, and that supervised fine-tuning (SFT) on error-handling instances improves their proactive error-handling capabilities. The dataset will be publicly available.
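
To make the proactive setting concrete, below is a minimal sketch of an evaluation loop over such a benchmark. The instance fields, the query_llm callable, and the judge callable are illustrative assumptions for exposition, not the paper's actual schema or evaluation protocol; the key point is that the proactive condition withholds the explicit error-handling instruction.

from dataclasses import dataclass

@dataclass
class ErrorHandlingInstance:
    context: str          # input containing a latent error
    error_category: str   # label from the error taxonomy (illustrative)
    reference: str        # expected error-handling response

def build_prompt(instance: ErrorHandlingInstance, proactive: bool) -> str:
    """Proactive setting omits the explicit error-handling instruction."""
    if proactive:
        return instance.context
    return (
        "The following input may contain an error. "
        "Identify and handle it.\n" + instance.context
    )

def evaluate(instances, query_llm, judge) -> float:
    """Fraction of instances handled correctly in the proactive
    (no-instruction) setting. query_llm and judge are caller-supplied."""
    correct = 0
    for inst in instances:
        response = query_llm(build_prompt(inst, proactive=True))
        if judge(response, inst.reference):
            correct += 1
    return correct / len(instances)

Running the same loop with proactive=False would give the passive baseline, so the gap between the two scores isolates how much a model depends on being explicitly told to handle errors.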
@article{zeng2025_2506.00064,
  title   = {Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling},
  author  = {Jiayi Zeng and Yizhe Feng and Mengliang He and Wenhui Lei and Wei Zhang and Zeming Liu and Xiaoming Shi and Aimin Zhou},
  journal = {arXiv preprint arXiv:2506.00064},
  year    = {2025}
}