ResearchTrend.AI
arXiv: 2402.11894
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

19 February 2024
Jiahao Ying
Yixin Cao
Yushi Bai
Qianru Sun
Bo Wang
Wei Tang
Zhaojun Ding
Yizhe Yang
Xuanjing Huang
Shuicheng Yan
    KELM

Papers citing "Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models"

8 / 8 papers shown
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
Xuzhao Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Tianwei Zhang
ALM
ELM
86
2
0
26 Apr 2025
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Yixin Cao
Jiahao Ying
Yansen Wang
Xipeng Qiu
Xuanjing Huang
Yugang Jiang
ELM
44
2
0
10 Apr 2025
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
110
137
0
03 Nov 2023
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Minghao Wu
Abdul Waheed
Chiyu Zhang
Muhammad Abdul-Mageed
Alham Fikri Aji
ALM
135
119
0
27 Apr 2023
Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals
Piotr Wojciech Mirowski
Kory W. Mathewson
Jaylen Pittman
Richard Evans
HAI
64
2
0
29 Sep 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
395
8,559
0
28 Jan 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
213
1,661
0
15 Oct 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,984
0
20 Apr 2018