Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2203.04592
Cited By
Mapping global dynamics of benchmark creation and saturation in artificial intelligence
9 March 2022
Simon Ott
A. Barbosa-Silva
Kathrin Blagec
J. Brauner
Matthias Samwald
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Mapping global dynamics of benchmark creation and saturation in artificial intelligence"
24 / 24 papers shown
Title
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
60
0
0
01 May 2025
Auditing the Ethical Logic of Generative AI Models
W. Russell Neuman
Chad Coleman
Ali Dasdan
Safinah Ali
Manan Shah
ELM
LRM
72
1
0
24 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
55
2
0
21 Apr 2025
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel
Amelia F. Hardy
Chandler Smith
Max Lamparth
Malcolm Hardy
Mykel J. Kochenderfer
ELM
81
17
0
20 Nov 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
76
1
0
26 Oct 2024
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
Esteban Garces Arias
Hannah Blocher
Julian Rodemann
Meimingwei Li
Christian Heumann
Matthias Aßenmacher
25
1
0
24 Oct 2024
Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development
Andrew Katz
Gabriella Coloyan Fleming
Joyce Main
31
4
0
28 Sep 2024
Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon
Ari Holtzman
Peter West
William Yang Wang
Naomi Saphra
39
10
0
22 Jul 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao
A. A. Bangash
F. Côgo
Bram Adams
Ahmed E. Hassan
59
1
0
04 Jul 2024
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
Beck Labash
August Rosedale
Alex Reents
Lucas Negritto
Colin Wiel
KELM
22
9
0
24 Jun 2024
Statistical Multicriteria Benchmarking via the GSD-Front
Christoph Jansen
G. Schollmeyer
Julian Rodemann
Hannah Blocher
Thomas Augustin
41
4
0
06 Jun 2024
Philosophy of Cognitive Science in the Age of Deep Learning
Raphaël Millière
AI4CE
NAI
40
3
0
07 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward
Raphael Milliere
Cameron Buckner
LRM
66
13
0
06 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
Guanhua Zhang
Moritz Hardt
42
7
0
02 May 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
Ameya Prabhu
Vishaal Udandarao
Philip H. S. Torr
Matthias Bethge
Adel Bibi
Samuel Albanie
42
5
0
29 Feb 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez
John Yang
Alexander Wettig
Shunyu Yao
Kexin Pei
Ofir Press
Karthik Narasimhan
ELM
34
469
0
10 Oct 2023
Operationalising the Definition of General Purpose AI Systems: Assessing Four Approaches
Risto Uuk
C. I. Gutierrez
Alex Tamkin
26
2
0
05 Jun 2023
BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors
Kathryn Wantlin
Chenwei Wu
Shih-Cheng Huang
Oishi Banerjee
Farah Z. Dadabhoy
...
A. Adamson
Laura Heacock
G. Tison
Alex Tamkin
Pranav Rajpurkar
SSL
OOD
38
2
0
17 Apr 2023
Melting Pot 2.0
J. Agapiou
A. Vezhnevets
Edgar A. Duénez-Guzmán
Jayd Matyas
Yiran Mao
...
Sukhdeep Singh
Julia Haas
Igor Mordatch
D. Mobbs
Joel Z Leibo
30
31
0
24 Nov 2022
TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva
Tatiana Shavrina
Alena Fenogenova
Denis Shevelev
Nadezhda Katricheva
...
Svetlana Iordanskaia
Alena Spiridonova
Valentina Kurenshchikova
Ekaterina Artemova
Vladislav Mikhailov
AAML
45
10
0
23 Oct 2022
Voteñ'Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin
Vladislav Mikhailov
Mikhail Florinskiy
A. Kravchenko
E. Tutubalina
Tatiana Shavrina
Daniel Karabekyan
Ekaterina Artemova
24
8
0
11 Oct 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
John J. Nay
ELM
AILaw
88
27
0
14 Sep 2022
ASR in German: A Detailed Error Analysis
John M. Wirth
René Peinl
18
5
0
12 Apr 2022
Convolutional Neural Networks for Sentence Classification
Yoon Kim
AILaw
VLM
255
13,364
0
25 Aug 2014
1