KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to their free-form nature. Moreover, there is a lack of benchmark datasets for evaluating how well existing metrics capture correctness. To study a better metric for GenQA, we first create high-quality human judgments of correctness on two standard GenQA datasets. Using these human-evaluation datasets, we show that widely used n-gram similarity metrics correlate poorly with human judgments. To alleviate this problem, we propose a new metric for evaluating the correctness of GenQA answers. Specifically, our metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer captures the key meaning of the reference answer. The proposed metric shows a significantly higher correlation with human judgments than existing metrics across multiple datasets.
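To make the idea concrete, below is a minimal sketch of keyphrase-weighted answer scoring. It assumes per-token importance weights are already available (in the paper they come from a trained keyphrase-prediction model; here the `weights` dict is a stand-in), and the function name `weighted_f1` and the `default` fallback weight are hypothetical, not the paper's API.

```python
from collections import Counter


def weighted_f1(generated: str, reference: str, weights: dict,
                default: float = 0.1) -> float:
    """Weighted unigram F1: each matched token contributes its keyphrase
    weight rather than a flat count of 1, so answer-critical tokens
    dominate the score (a sketch, not the paper's exact formulation)."""
    gen = generated.lower().split()
    ref = reference.lower().split()

    def w(tok: str) -> float:
        # Fall back to a small default weight for tokens the (hypothetical)
        # keyphrase predictor did not mark as important.
        return weights.get(tok, default)

    # Clipped unigram overlap, as in standard n-gram precision.
    match = Counter(gen) & Counter(ref)
    match_mass = sum(w(tok) * count for tok, count in match.items())
    if match_mass == 0:
        return 0.0

    precision = match_mass / sum(w(tok) for tok in gen)
    recall = match_mass / sum(w(tok) for tok in ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights: content tokens get high weight, while function
# words like "the" or "on" fall back to the low default and barely
# affect the score.
weights = {"1969": 1.0, "moon": 0.9, "apollo": 0.8}
print(weighted_f1("they landed on the moon in 1969",
                  "apollo 11 first landed on the moon in 1969",
                  weights))
```

Under this weighting, a generated answer that misses only function words scores near 1.0, while one that drops a high-weight keyphrase is penalized heavily, which is the intuition behind the metric's improved correlation with human correctness judgments.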