ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2004.02709
35
84

Evaluating Models' Local Decision Boundaries via Contrast Sets

6 April 2020
Matt Gardner
Yoav Artzi
Victoria Basmova
Jonathan Berant
Ben Bogin
Sihao Chen
Pradeep Dasigi
Dheeru Dua
Yanai Elazar
Ananth Gottumukkala
Nitish Gupta
Hannaneh Hajishirzi
Gabriel Ilharco
Daniel Khashabi
Kevin Lin
Jiangming Liu
Nelson F. Liu
Phoebe Mulcaire
Qiang Ning
Sameer Singh
Noah A. Smith
Sanjay Subramanian
Reut Tsarfaty
Eric Wallace
Ally Zhang
Ben Zhou
    ELM
ArXivPDFHTML
Abstract

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets---up to 25\% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

View on arXiv
Comments on this paper