ResearchTrend.AI
Targeting the Benchmark: On Methodology in Current Natural Language Processing Research
David Schlangen
arXiv:2007.04792, 7 July 2020
Papers citing "Targeting the Benchmark: On Methodology in Current Natural Language Processing Research" (16 papers):

  • Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez Llorca. 10 Feb 2025.
  • Position: Key Claims in LLM Research Have a Long Tail of Footnotes. Anna Rogers, A. Luccioni. 14 Aug 2023.
  • Weisfeiler and Leman Go Measurement Modeling: Probing the Validity of the WL Test. Arjun Subramonian, Adina Williams, Maximilian Nickel, Yizhou Sun, Levent Sagun. 11 Jul 2023.
  • Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples. P. Sadler, David Schlangen. 24 May 2023.
  • Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark. Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens. 24 May 2023.
  • PaLM 2 Technical Report. Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, ..., Ce Zheng, Wei Zhou, Denny Zhou, Slav Petrov, Yonghui Wu. 17 May 2023.
  • Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy. David Schlangen. 14 Apr 2023.
  • Right the docs: Characterising voice dataset documentation practices used in machine learning. Kathy Reid, Elizabeth T. Williams. 19 Mar 2023.
  • The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. Barbara Plank. 04 Nov 2022.
  • StyLEx: Explaining Style Using Human Lexical Annotations. Shirley Anugrah Hayati, Kyumin Park, Dheeraj Rajagopal, Lyle Ungar, Dongyeop Kang. 14 Oct 2022.
  • Underspecification in Scene Description-to-Depiction Tasks. Ben Hutchinson, Jason Baldridge, Vinodkumar Prabhakaran. 11 Oct 2022.
  • Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks. Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon. 10 Oct 2022.
  • Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR. Nina Markl, S. McNulty. 25 Feb 2022.
  • Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. Bernard Koch, Emily L. Denton, A. Hanna, J. Foster. 03 Dec 2021.
  • AI and the Everything in the Whole Wide World Benchmark. Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily L. Denton, A. Hanna. 26 Nov 2021.
  • GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. 20 Apr 2018.