ResearchTrend.AI
Targeting the Benchmark: On Methodology in Current Natural Language Processing Research
David Schlangen
arXiv:2007.04792, 7 July 2020
Papers citing "Targeting the Benchmark: On Methodology in Current Natural Language Processing Research" (16 papers):

  • Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation. Maria Eriksson, Erasmo Purificato, Arman Noroozian, Joao Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez Llorca. 10 Feb 2025.
  • Position: Key Claims in LLM Research Have a Long Tail of Footnotes. Anna Rogers, A. Luccioni. 14 Aug 2023.
  • Weisfeiler and Leman Go Measurement Modeling: Probing the Validity of the WL Test. Arjun Subramonian, Adina Williams, Maximilian Nickel, Yizhou Sun, Levent Sagun. 11 Jul 2023.
  • Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples. P. Sadler, David Schlangen. 24 May 2023.
  • Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark. Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens. 24 May 2023.
  • PaLM 2 Technical Report. Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, ..., Ce Zheng, Wei Zhou, Denny Zhou, Slav Petrov, Yonghui Wu. 17 May 2023.
  • Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy. David Schlangen. 14 Apr 2023.
  • Right the docs: Characterising voice dataset documentation practices used in machine learning. Kathy Reid, Elizabeth T. Williams. 19 Mar 2023.
  • The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. Barbara Plank. 04 Nov 2022.
  • StyLEx: Explaining Style Using Human Lexical Annotations. Shirley Anugrah Hayati, Kyumin Park, Dheeraj Rajagopal, Lyle Ungar, Dongyeop Kang. 14 Oct 2022.
  • Underspecification in Scene Description-to-Depiction Tasks. Ben Hutchinson, Jason Baldridge, Vinodkumar Prabhakaran. 11 Oct 2022.
  • Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks. Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon. 10 Oct 2022.
  • Language technology practitioners as language managers: arbitrating data bias and predictive bias in ASR. Nina Markl, S. McNulty. 25 Feb 2022.
  • Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. Bernard Koch, Emily L. Denton, A. Hanna, J. Foster. 03 Dec 2021.
  • AI and the Everything in the Whole Wide World Benchmark. Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily L. Denton, A. Hanna. 26 Nov 2021.
  • GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. 20 Apr 2018.