ResearchTrend.AI
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
arXiv:2410.20245, 26 October 2024
Vipul Gupta, Candace Ross, David Pantoja, R. Passonneau, Megan Ung, Adina Williams

Papers citing "Improving Model Evaluation using SMART Filtering of Benchmark Datasets"

50 / 60 papers shown
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen, Candace Ross, Reyhane Askari Hemmat, Koustuv Sinha, Melissa Hall, M. Drozdzal, Adriana Romero-Soriano
01 May 2025

What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross, Melissa Hall, Adriana Romero Soriano, Adina Williams
18 Dec 2024

Benchmarks as Microscopes: A Call for Model Metrology
Michael Stephen Saxon, Ari Holtzman, Peter West, William Y. Wang, Naomi Saphra
22 Jul 2024

Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, Hamidah Oderinwale, ..., Hanlin Li, Daphne Ippolito, Sara Hooker, Jad Kabbara, Sandy Pentland
20 Jul 2024

Qwen2 Technical Report
An Yang, Baosong Yang, Binyuan Hui, Jian Xu, Bowen Yu, ..., Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, Zhi-Wei Fan
15 Jul 2024
Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?
Nishant Balepur, Rachel Rudinger
02 Jul 2024

Changing Answer Order Can Decrease MMLU Accuracy
Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
27 Jun 2024

Are We Done with MMLU?
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, ..., R. McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
06 Jun 2024

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, ..., Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen
03 Jun 2024

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
03 Jun 2024
Beyond Performance: Quantifying and Mitigating Label Bias in LLMs
Philipp Benz, Maitreya Patel
04 May 2024

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, A. A. Awan, J. Aneja, Ahmed Hassan Awadallah, ..., Li Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou
22 Apr 2024

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
09 Apr 2024

Yi: Open Foundation Models by 01.AI
01.AI, Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai
07 Mar 2024

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, ..., Hao Zhang, Banghua Zhu, Michael I. Jordan, Joseph E. Gonzalez, Ion Stoica
07 Mar 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
26 Feb 2024

Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger
19 Feb 2024

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
Norah A. Alzahrani, H. A. Alyahya, Sultan Yazeed Alnumay, Muhtasim Tahmid, Shaykhah Alsubaie, ..., Saleh Soltan, Nathan Scales, Marie-Anne Lachaux, Samuel R. Bowman, Haidar Khan
01 Feb 2024

Investigating Data Contamination for Pre-training Language Models
Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo
11 Jan 2024

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
20 Nov 2023
What's In My Big Data?
Yanai Elazar, Akshita Bhagia, Ian H. Magnusson, Abhilasha Ravichander, Dustin Schwenk, ..., Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge
31 Oct 2023

Large Language Models Are Not Robust Multiple Choice Selectors
Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
07 Sep 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, ..., Dacheng Li, Eric Xing, Haotong Zhang, Joseph E. Gonzalez, Ion Stoica
09 Jun 2023

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, J. Huang
29 May 2023

Self-Improving-Leaderboard (SIL): A Call for Real-World Centric Natural Language Processing Leaderboards
Chanjun Park, Hyeonseok Moon, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Heu-Jeoung Lim
20 Mar 2023
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
Amro Abbas, Kushal Tirumala, Daniel Simig, Surya Ganguli, Ari S. Morcos
16 Mar 2023

Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views
Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Ann John, Yassine Benajiba, Miguel Ballesteros
23 Feb 2023

Real or Fake Text?: Investigating Human Ability to Detect Boundaries Between Human-Written and Machine-Generated Text
Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, Chris Callison-Burch
24 Dec 2022

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu
19 Dec 2022

Efficient Methods for Natural Language Processing: A Survey
Marcos Vinícius Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, ..., Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz
31 Aug 2022
GEMv2: Multilingual NLG Benchmarking in a Single Line of Code
Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Jinpeng Wang, Alexandros Papangelis, ..., Yacine Jernite, Yi Xu, Yisi Sang, Yixin Liu, Yufang Hou
22 Jun 2022

Dataset and Case Studies for Visual Near-Duplicates Detection in the Context of Social Media
Hana Matatov, Mor Naaman, Ofra Amir
14 Mar 2022

Mapping global dynamics of benchmark creation and saturation in artificial intelligence
Simon Ott, A. Barbosa-Silva, Kathrin Blagec, J. Brauner, Matthias Samwald
09 Mar 2022

ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
Neeraj Varshney, Swaroop Mishra, Chitta Baral
07 Mar 2022

Text and Code Embeddings by Contrastive Pre-Training
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, ..., Tabarak Khan, Toki Sherbakov, Joanne Jang, Peter Welinder, Lilian Weng
24 Jan 2022
AI and the Everything in the Whole Wide World Benchmark
Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily L. Denton, A. Hanna
26 Nov 2021

Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair
Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman
16 Nov 2021

Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
27 Oct 2021

A Survey on Cost Types, Interaction Schemes, and Annotator Performance Models in Selection Algorithms for Active Learning in Classification
M. Herde, Denis Huseljic, Bernhard Sick, A. Calma
23 Sep 2021

Analyzing the Granularity and Cost of Annotation in Clinical Sequence Labeling
Haozhan Sun, Chenchen Xu, H. Suominen
23 Aug 2021
Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, A. Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
14 Jul 2021

Comparing Test Sets with Item Response Theory
Clara Vania, Phu Mon Htut, William Huang, Dhara Mungra, Richard Yuanzhe Pang, Jason Phang, Haokun Liu, Kyunghyun Cho, Sam Bowman
01 Jun 2021

Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, ..., Pontus Stenetorp, Robin Jia, Joey Tianyi Zhou, Christopher Potts, Adina Williams
07 Apr 2021

Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, D. Song, Jacob Steinhardt
05 Mar 2021

Memorization vs. Generalization: Quantifying Data Leakage in NLP Performance Evaluation
Aparna Elangovan, Jiayuan He, Karin Verspoor
03 Feb 2021
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, ..., Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, Jiawei Zhou
02 Feb 2021

DynaSent: A Dynamic Benchmark for Sentiment Analysis
Christopher Potts, Zhengxuan Wu, Atticus Geiger, Douwe Kiela
30 Dec 2020

What do we expect from Multiple-choice QA Systems?
Krunal Shah, Nitish Gupta, Dan Roth
20 Nov 2020

Automatic Detection of Machine Generated Text: A Critical Survey
Ganesh Jawahar, Muhammad Abdul-Mageed, L. Lakshmanan
02 Nov 2020

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi
22 Sep 2020