It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance

15 May 2023

Arjun Subramonian

Su Lin Blodgett

Papers citing "It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance"

Title | Authors | Tags | Counts | Date
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez Llorca | ELM | 139 1 0 | 10 Feb 2025
WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case | Vagrant Gautam, Julius Steuer, Eileen Bingert, Ray Johns, Anne Lauscher, Dietrich Klakow | - | 48 3 0 | 09 Sep 2024
Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark | Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens | ALM, LLMAG | 29 69 0 | 24 May 2023
Stop Measuring Calibration When Humans Disagree | Joris Baan, Wilker Aziz, Barbara Plank, Raquel Fernández | - | 24 53 0 | 28 Oct 2022
ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics | Chantal Amrhein, Nikita Moghe, Liane Guillou | ELM | 34 22 0 | 27 Oct 2022
Quantifying Social Biases Using Templates is Unreliable | P. Seshadri, Pouya Pezeshkpour, Sameer Singh | - | 51 33 0 | 09 Oct 2022
State-of-the-art generalisation research in NLP: A taxonomy and review | Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, ..., Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin | - | 114 93 0 | 06 Oct 2022
Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization | Shiyue Zhang, David Wan, Joey Tianyi Zhou | HILM | 52 27 0 | 08 Sep 2022
Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications | Kaitlyn Zhou, Su Lin Blodgett, Adam Trischler, Hal Daumé, Kaheer Suleman, Alexandra Olteanu | ELM | 99 26 0 | 13 May 2022
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics | Sebastian Gehrmann, Tosin P. Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei-ping Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou | VLM | 254 285 0 | 02 Feb 2021
Hypothesis Only Baselines in Natural Language Inference | Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme | - | 190 576 0 | 02 May 2018
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding | Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman | ELM | 297 6,959 0 | 20 Apr 2018
Text Summarization Techniques: A Brief Survey | M. Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, S. Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, K. Kochut | CVBM | 52 513 0 | 07 Jul 2017