deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
Michel Galley
Chris Brockett
Alessandro Sordoni
Yangfeng Ji
Michael Auli
Chris Quirk
Margaret Mitchell
Jianfeng Gao
W. Dolan

Abstract
We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
View on arXivComments on this paper