KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to their free-form nature. Moreover, there is a lack of benchmark datasets for evaluating how well existing metrics capture correctness. To study a better metric for GenQA, we first create high-quality human judgments of correctness on two standard GenQA datasets. Using these human-evaluation datasets, we show that widely used n-gram similarity metrics correlate poorly with human judgments. To alleviate this problem, we propose a new metric for evaluating the correctness of GenQA answers. Specifically, our metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer captures the key meaning of the reference answer. The proposed metric shows a significantly higher correlation with human judgments than existing metrics across multiple datasets.
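To make the idea concrete, below is a minimal sketch of keyphrase-weighted answer scoring. It assumes per-token importance weights are already available (in the paper they come from a trained keyphrase-prediction model; here the `weights` dict is a stand-in), and the function name `weighted_f1` and the `default` fallback weight are hypothetical, not the paper's API.

```python
from collections import Counter


def weighted_f1(generated: str, reference: str, weights: dict,
                default: float = 0.1) -> float:
    """Weighted unigram F1: each matched token contributes its keyphrase
    weight rather than a flat count of 1, so answer-critical tokens
    dominate the score (a sketch, not the paper's exact formulation)."""
    gen = generated.lower().split()
    ref = reference.lower().split()

    def w(tok: str) -> float:
        # Fall back to a small default weight for tokens the (hypothetical)
        # keyphrase predictor did not mark as important.
        return weights.get(tok, default)

    # Clipped unigram overlap, as in standard n-gram precision.
    match = Counter(gen) & Counter(ref)
    match_mass = sum(w(tok) * count for tok, count in match.items())
    if match_mass == 0:
        return 0.0

    precision = match_mass / sum(w(tok) for tok in gen)
    recall = match_mass / sum(w(tok) for tok in ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights: content tokens get high weight, while function
# words like "the" or "on" fall back to the low default and barely
# affect the score.
weights = {"1969": 1.0, "moon": 0.9, "apollo": 0.8}
print(weighted_f1("they landed on the moon in 1969",
                  "apollo 11 first landed on the moon in 1969",
                  weights))
```

Under this weighting, a generated answer that misses only function words scores near 1.0, while one that drops a high-weight keyphrase is penalized heavily, which is the intuition behind the metric's improved correlation with human correctness judgments.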