Mapping global dynamics of benchmark creation and saturation in artificial intelligence

9 March 2022

Papers citing "Mapping global dynamics of benchmark creation and saturation in artificial intelligence"

Title
Multi-Modal Language Models as Text-to-Image Model Evaluators Jiahui Chen Candace Ross Reyhane Askari Hemmat Koustuv Sinha Melissa Hall M. Drozdzal Adriana Romero-Soriano EGVM 60 0 0 01 May 2025
Auditing the Ethical Logic of Generative AI Models W. Russell Neuman Chad Coleman Ali Dasdan Safinah Ali Manan Shah ELM LRM 72 1 0 24 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark Jasper Götting Pedro Medeiros Jon G Sanders Nathaniel Li Long Phan Karam Elabd Lennart Justen Dan Hendrycks Seth Donoughe ELM 55 2 0 21 Apr 2025
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices Anka Reuel Amelia F. Hardy Chandler Smith Max Lamparth Malcolm Hardy Mykel J. Kochenderfer ELM 81 17 0 20 Nov 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets Vipul Gupta Candace Ross David Pantoja R. Passonneau Megan Ung Adina Williams 76 1 0 26 Oct 2024
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework Esteban Garces Arias Hannah Blocher Julian Rodemann Meimingwei Li Christian Heumann Matthias Aßenmacher 25 1 0 24 Oct 2024
Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development Andrew Katz Gabriella Coloyan Fleming Joyce Main 31 4 0 28 Sep 2024
Benchmarks as Microscopes: A Call for Model Metrology Michael Stephen Saxon Ari Holtzman Peter West William Yang Wang Naomi Saphra 39 10 0 22 Jul 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards Zhimin Zhao A. A. Bangash F. Côgo Bram Adams Ahmed E. Hassan 59 1 0 04 Jul 2024
RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale Beck Labash August Rosedale Alex Reents Lucas Negritto Colin Wiel KELM 22 9 0 24 Jun 2024
Statistical Multicriteria Benchmarking via the GSD-Front Christoph Jansen G. Schollmeyer Julian Rodemann Hannah Blocher Thomas Augustin 41 4 0 06 Jun 2024
Philosophy of Cognitive Science in the Age of Deep Learning Raphaël Millière AI4CE NAI 40 3 0 07 May 2024
A Philosophical Introduction to Language Models - Part II: The Way Forward Raphaël Millière Cameron Buckner LRM 66 13 0 06 May 2024
Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks Guanhua Zhang Moritz Hardt 42 7 0 02 May 2024
Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress Ameya Prabhu Vishaal Udandarao Philip H. S. Torr Matthias Bethge Adel Bibi Samuel Albanie 42 5 0 29 Feb 2024
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Carlos E. Jimenez John Yang Alexander Wettig Shunyu Yao Kexin Pei Ofir Press Karthik Narasimhan ELM 34 469 0 10 Oct 2023
Operationalising the Definition of General Purpose AI Systems: Assessing Four Approaches Risto Uuk C. I. Gutierrez Alex Tamkin 26 2 0 05 Jun 2023
BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors Kathryn Wantlin Chenwei Wu Shih-Cheng Huang Oishi Banerjee Farah Z. Dadabhoy ... A. Adamson Laura Heacock G. Tison Alex Tamkin Pranav Rajpurkar SSL OOD 38 2 0 17 Apr 2023
Melting Pot 2.0 J. Agapiou A. Vezhnevets Edgar A. Duéñez-Guzmán Jayd Matyas Yiran Mao ... Sukhdeep Singh Julia Haas Igor Mordatch D. Mobbs Joel Z Leibo 30 31 0 24 Nov 2022
TAPE: Assessing Few-shot Russian Language Understanding Ekaterina Taktasheva Tatiana Shavrina Alena Fenogenova Denis Shevelev Nadezhda Katricheva ... Svetlana Iordanskaia Alena Spiridonova Valentina Kurenshchikova Ekaterina Artemova Vladislav Mikhailov AAML 45 10 0 23 Oct 2022
Vote'n'Rank: Revision of Benchmarking with Social Choice Theory Mark Rofin Vladislav Mikhailov Mikhail Florinskiy A. Kravchenko E. Tutubalina Tatiana Shavrina Daniel Karabekyan Ekaterina Artemova 24 8 0 11 Oct 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans John J. Nay ELM AILaw 88 27 0 14 Sep 2022
ASR in German: A Detailed Error Analysis John M. Wirth René Peinl 18 5 0 12 Apr 2022
Convolutional Neural Networks for Sentence Classification Yoon Kim AILaw VLM 255 13,364 0 25 Aug 2014