LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is aimed at providing developers with meaningful insights into what improvements can be made to a given NLG system, and it consists of two main steps: open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 of cases and is capable of producing error-type reports resembling those composed by human annotators. Our code and data are publicly available at this https URL.
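
To make the two-step approach above concrete, here is a minimal Python sketch of the pipeline: step 1 prompts an LLM for an open-ended issue description per instance, and step 2 cumulatively groups the discovered issues. The `call_llm` helper, the prompt wording, and the exact matching criterion are illustrative assumptions, not the paper's actual prompts or algorithm.

```python
# Sketch of the two-step LLM-as-a-qualitative-judge pipeline.
# `call_llm` is a hypothetical helper (not from the paper) that sends a prompt
# to an LLM and returns its text response; prompts here are illustrative only.

from typing import Optional


def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (use the SDK of your choice)."""
    raise NotImplementedError


def analyze_instance(source: str, output: str) -> Optional[str]:
    """Step 1: open-ended per-instance issue analysis."""
    prompt = (
        "You are reviewing the output of an NLG system.\n"
        f"Input: {source}\nOutput: {output}\n"
        "Describe the main issue in the output in one sentence, "
        "or answer 'no issue' if the output is acceptable."
    )
    answer = call_llm(prompt).strip()
    return None if answer.lower().startswith("no issue") else answer


def cluster_issues(issues: list[str]) -> dict[str, list[str]]:
    """Step 2: cumulative clustering -- each new issue is either assigned to an
    existing cluster or starts a new one (an assumed reading of the paper's
    'intuitive cumulative algorithm')."""
    clusters: dict[str, list[str]] = {}
    for issue in issues:
        names = list(clusters)
        prompt = (
            "Existing issue types:\n"
            + "\n".join(f"- {n}" for n in names)
            + f"\nNew issue: {issue}\n"
            "Reply with the matching issue type verbatim, or 'new' if none matches."
        )
        choice = call_llm(prompt).strip() if names else "new"
        if choice in clusters:
            clusters[choice].append(issue)
        else:
            # Ask the LLM to name a new issue type for this cluster.
            name = call_llm(f"Give a short name for this issue type: {issue}").strip()
            clusters[name] = [issue]
    return clusters
```

Under these assumptions, a developer would run `analyze_instance` over an evaluation set, pass the non-empty issue descriptions to `cluster_issues`, and report each cluster name with its count as the qualitative error report.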
@article{chirkova2025_2506.09147,
  title={LLM-as-a-qualitative-judge: automating error analysis in natural language generation},
  author={Nadezhda Chirkova and Tunde Oluwaseyi Ajayi and Seth Aycock and Zain Muhammad Mujahid and Vladana Perlić and Ekaterina Borisova and Markarit Vartampetian},
  journal={arXiv preprint arXiv:2506.09147},
  year={2025}
}