LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is aimed at providing developers with meaningful insights into what improvements can be made to a given NLG system, and it consists of two main steps: open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 of cases and is capable of producing error-type reports resembling those composed by human annotators. Our code and data are publicly available at this https URL.
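
To make the two-step approach above concrete, here is a minimal Python sketch of the pipeline: step 1 prompts an LLM for an open-ended issue description per instance, and step 2 cumulatively groups the discovered issues. The `call_llm` helper, the prompt wording, and the exact matching criterion are illustrative assumptions, not the paper's actual prompts or algorithm.

```python
# Sketch of the two-step LLM-as-a-qualitative-judge pipeline.
# `call_llm` is a hypothetical helper (not from the paper) that sends a prompt
# to an LLM and returns its text response; prompts here are illustrative only.

from typing import Optional


def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (use the SDK of your choice)."""
    raise NotImplementedError


def analyze_instance(source: str, output: str) -> Optional[str]:
    """Step 1: open-ended per-instance issue analysis."""
    prompt = (
        "You are reviewing the output of an NLG system.\n"
        f"Input: {source}\nOutput: {output}\n"
        "Describe the main issue in the output in one sentence, "
        "or answer 'no issue' if the output is acceptable."
    )
    answer = call_llm(prompt).strip()
    return None if answer.lower().startswith("no issue") else answer


def cluster_issues(issues: list[str]) -> dict[str, list[str]]:
    """Step 2: cumulative clustering -- each new issue is either assigned to an
    existing cluster or starts a new one (an assumed reading of the paper's
    'intuitive cumulative algorithm')."""
    clusters: dict[str, list[str]] = {}
    for issue in issues:
        names = list(clusters)
        prompt = (
            "Existing issue types:\n"
            + "\n".join(f"- {n}" for n in names)
            + f"\nNew issue: {issue}\n"
            "Reply with the matching issue type verbatim, or 'new' if none matches."
        )
        choice = call_llm(prompt).strip() if names else "new"
        if choice in clusters:
            clusters[choice].append(issue)
        else:
            # Ask the LLM to name a new issue type for this cluster.
            name = call_llm(f"Give a short name for this issue type: {issue}").strip()
            clusters[name] = [issue]
    return clusters
```

Under these assumptions, a developer would run `analyze_instance` over an evaluation set, pass the non-empty issue descriptions to `cluster_issues`, and report each cluster name with its count as the qualitative error report.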
@article{chirkova2025_2506.09147,
  title={LLM-as-a-qualitative-judge: automating error analysis in natural language generation},
  author={Nadezhda Chirkova and Tunde Oluwaseyi Ajayi and Seth Aycock and Zain Muhammad Mujahid and Vladana Perlić and Ekaterina Borisova and Markarit Vartampetian},
  journal={arXiv preprint arXiv:2506.09147},
  year={2025}
}