ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2305.18486
  4. Cited By
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark
  Datasets

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

29 May 2023
Md Tahmid Rahman Laskar
M Saiful Bari
Mizanur Rahman
Md Amran Hossen Bhuiyan
Shafiq R. Joty
J. Huang
    LM&MA
    ELM
    ALM
ArXivPDFHTML

Papers citing "A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets"

49 / 49 papers shown
Title
Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes
Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes
Mohammad Aqib
Mohd Hamza
Qipei Mei
Ying Hei Chui
RALM
ELM
52
0
0
07 May 2025
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks
Elie Antoine
Frédéric Béchet
Géraldine Damnati
Philippe Langlais
56
1
0
29 Jan 2025
STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach -- A Case Study on Science Domains
STAYKATE: Hybrid In-Context Example Selection Combining Representativeness Sampling and Retrieval-based Approach -- A Case Study on Science Domains
Chencheng Zhu
Kazutaka Shimada
Tomoki Taniguchi
Tomoko Ohkuma
38
0
0
31 Dec 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
76
1
0
26 Oct 2024
Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding
Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding
Renato Vukovic
David Arps
Carel van Niekerk
Benjamin Matthias Ruppik
Hsien-chin Lin
Michael Heck
Milica Gašić
46
1
0
05 Aug 2024
Beyond Metrics: A Critical Analysis of the Variability in Large Language
  Model Evaluation Frameworks
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks
Marco AF Pimentel
Clément Christophe
Tathagata Raha
Prateek Munjal
Praveen K Kanithi
Shadab Khan
ELM
39
2
0
29 Jul 2024
A Systematic Survey and Critical Review on Evaluating Large Language
  Models: Challenges, Limitations, and Recommendations
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
29
28
0
04 Jul 2024
Are We Done with MMLU?
Are We Done with MMLU?
Aryo Pradipta Gema
Joshua Ong Jun Leang
Giwon Hong
Alessio Devoto
Alberto Carlo Maria Mancino
...
R. McHardy
Joshua Harris
Jean Kaddour
Emile van Krieken
Pasquale Minervini
ELM
60
30
0
06 Jun 2024
Large Language Models for Cyber Security: A Systematic Literature Review
Large Language Models for Cyber Security: A Systematic Literature Review
HanXiang Xu
Shenao Wang
Ningke Li
Kaidi Wang
Yanjie Zhao
Kai Chen
Ting Yu
Yang Liu
Haoyu Wang
37
23
0
08 May 2024
D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities
  of Large Language Models
D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models
Duygu Altinok
AI4MH
LRM
LM&MA
ELM
26
2
0
07 May 2024
Adapting Mental Health Prediction Tasks for Cross-lingual Learning via
  Meta-Training and In-context Learning with Large Language Model
Adapting Mental Health Prediction Tasks for Cross-lingual Learning via Meta-Training and In-context Learning with Large Language Model
Zita Lifelo
Huansheng Ning
Sahraoui Dhelim
AI4MH
48
0
0
13 Apr 2024
METAL: Towards Multilingual Meta-Evaluation
METAL: Towards Multilingual Meta-Evaluation
Rishav Hada
Varun Gumma
Mohamed Ahmed
Kalika Bali
Sunayana Sitaram
ELM
43
2
0
02 Apr 2024
What is different between these datasets?
What is different between these datasets?
Varun Babbar
Zhicheng Guo
Cynthia Rudin
59
1
0
08 Mar 2024
A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
Hanlei Jin
Yang Zhang
Dan Meng
Jun Wang
Jinghua Tan
68
80
0
05 Mar 2024
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival
  Human Crowd Accuracy
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
P. Schoenegger
Indre Tuminauskaite
Peter S. Park
Rafael Valdece Sousa Bastos
P. Tetlock
40
25
0
29 Feb 2024
A Comprehensive Evaluation of Quantization Strategies for Large Language
  Models
A Comprehensive Evaluation of Quantization Strategies for Large Language Models
Renren Jin
Jiangcun Du
Wuwei Huang
Wei Liu
Jian Luan
Bin Wang
Deyi Xiong
MQ
32
31
0
26 Feb 2024
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in
  Closed-Source LLMs
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Simone Balloccu
Patrícia Schmidtová
Mateusz Lango
Ondrej Dusek
SILM
ELM
PILM
21
156
0
06 Feb 2024
(Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection
(Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection
Francesco Periti
Haim Dubossarsky
Nina Tahmasebi
AI4MH
34
13
0
25 Jan 2024
Named Entity Recognition Under Domain Shift via Metric Learning for Life
  Sciences
Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences
Hongyi Liu
Qingyun Wang
Payam Karisani
Heng Ji
16
1
0
19 Jan 2024
The Earth is Flat? Unveiling Factual Errors in Large Language Models
The Earth is Flat? Unveiling Factual Errors in Large Language Models
Wenxuan Wang
Juluan Shi
Zhaopeng Tu
Youliang Yuan
Jen-tse Huang
Wenxiang Jiao
Michael R. Lyu
KELM
HILM
SyDa
47
1
0
01 Jan 2024
The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64
  Languages
The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
Chiyu Zhang
Khai Duy Doan
Qisheng Liao
Muhammad Abdul-Mageed
36
6
0
23 Oct 2023
Empirical Study of Zero-Shot NER with ChatGPT
Empirical Study of Zero-Shot NER with ChatGPT
Tingyu Xie
Qi Li
Jian Zhang
Yan Zhang
Zuozhu Liu
Hongwei Wang
LRM
ReLM
24
63
0
16 Oct 2023
ChatGPT & Mechanical Engineering: Examining performance on the FE
  Mechanical Engineering and Undergraduate Exams
ChatGPT & Mechanical Engineering: Examining performance on the FE Mechanical Engineering and Undergraduate Exams
Matthew Frenkel
Hebah Emara
31
2
0
26 Sep 2023
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls
  of Large Language Models on Bengali NLP
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
M. Kabir
Mohammed Saidul Islam
Md Tahmid Rahman Laskar
Mir Tafseer Nayeem
M Saiful Bari
Enamul Hoque
LM&MA
24
15
0
22 Sep 2023
The Potential and Pitfalls of using a Large Language Model such as
  ChatGPT or GPT-4 as a Clinical Assistant
The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical Assistant
Jingqing Zhang
Kai Sun
A. Jagadeesh
Mahta Ghahfarokhi
Deepa Gupta
Ashok Gupta
Vibhor Gupta
Yike Guo
LM&MA
AI4MH
ELM
32
11
0
16 Jul 2023
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of
  Large Language Models for Code Generation
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
ELM
ALM
189
799
0
02 May 2023
Deep Transfer Learning for Automatic Speech Recognition: Towards Better
  Generalization
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Hamza Kheddar
Yassine Himeur
S. Al-Maadeed
Abbes Amira
F. Bensaali
47
76
0
27 Apr 2023
Industrial Engineering with Large Language Models: A case study of
  ChatGPT's performance on Oil & Gas problems
Industrial Engineering with Large Language Models: A case study of ChatGPT's performance on Oil & Gas problems
O. Ogundare
S. Madasu
N. Wiggins
LLMAG
AI4CE
40
15
0
27 Apr 2023
ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time
ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time
Shangqing Tu
Chunyang Li
Jifan Yu
Xiaozhi Wang
Lei Hou
Juanzi Li
LLMAG
AI4MH
75
10
0
27 Apr 2023
ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking
  about
ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about
Aman Rangapur
Haoran Wang
AI4MH
39
3
0
06 Apr 2023
Towards Making the Most of ChatGPT for Machine Translation
Towards Making the Most of ChatGPT for Machine Translation
Keqin Peng
Liang Ding
Qihuang Zhong
Li Shen
Xuebo Liu
Min Zhang
Y. Ouyang
Dacheng Tao
LRM
91
210
0
24 Mar 2023
Can we trust the evaluation on ChatGPT?
Can we trust the evaluation on ChatGPT?
Rachith Aiyappa
Jisun An
Haewoon Kwak
Yong-Yeol Ahn
ELM
ALM
LLMAG
AI4MH
LRM
120
87
0
22 Mar 2023
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability
Aiwei Liu
Xuming Hu
Lijie Wen
Philip S. Yu
LMTD
AI4MH
82
149
0
12 Mar 2023
Do large language models resemble humans in language use?
Do large language models resemble humans in language use?
Zhenguang G. Cai
Xufeng Duan
David A. Haslett
Shuqi Wang
M. Pickering
ALM
81
37
0
10 Mar 2023
Ask and You Shall Receive (a Graph Drawing): Testing ChatGPT's Potential
  to Apply Graph Layout Algorithms
Ask and You Shall Receive (a Graph Drawing): Testing ChatGPT's Potential to Apply Graph Layout Algorithms
Sara Di Bartolomeo
Giorgio Severi
V. Schetinger
Cody Dunne
43
8
0
03 Mar 2023
Language Models are Multilingual Chain-of-Thought Reasoners
Language Models are Multilingual Chain-of-Thought Reasoners
Freda Shi
Mirac Suzgun
Markus Freitag
Xuezhi Wang
Suraj Srivats
...
Yi Tay
Sebastian Ruder
Denny Zhou
Dipanjan Das
Jason W. Wei
ReLM
LRM
172
327
0
06 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng-Zhen Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
253
1,073
0
05 Oct 2022
An Effective, Performant Named Entity Recognition System for Noisy
  Business Telephone Conversation Transcripts
An Effective, Performant Named Entity Recognition System for Noisy Business Telephone Conversation Transcripts
Xue-Yong Fu
Cheng Chen
Md Tahmid Rahman Laskar
TN ShashiBhushan
Simon Corston-Oliver
35
6
0
27 Sep 2022
ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity
  Linking
ReFinED: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking
Tom Ayoola
Shubhi Tyagi
Joseph Fisher
Christos Christodoulopoulos
Andrea Pierleoni
42
85
0
08 Jul 2022
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
PromptSource: An Integrated Development Environment and Repository for
  Natural Language Prompts
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
Stephen H. Bach
Victor Sanh
Zheng-Xin Yong
Albert Webson
Colin Raffel
...
Khalid Almubarak
Xiangru Tang
Dragomir R. Radev
Mike Tian-Jian Jiang
Alexander M. Rush
VLM
225
338
0
02 Feb 2022
TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning
TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning
Yixuan Su
Fangyu Liu
Zaiqiao Meng
Tian Lan
Lei Shu
Ehsan Shareghi
Nigel Collier
139
57
0
07 Nov 2021
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
213
1,657
0
15 Oct 2021
The Power of Scale for Parameter-Efficient Prompt Tuning
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
280
3,848
0
18 Apr 2021
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding
  and Generation
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu
Daya Guo
Shuo Ren
Junjie Huang
Alexey Svyatkovskiy
...
Nan Duan
Neel Sundaresan
Shao Kun Deng
Shengyu Fu
Shujie Liu
ELM
201
853
0
09 Feb 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
276
1,996
0
31 Dec 2020
A Survey on Recent Approaches for Natural Language Processing in
  Low-Resource Scenarios
A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios
Michael A. Hedderich
Lukas Lange
Heike Adel
Jannik Strötgen
Dietrich Klakow
202
286
0
23 Oct 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
  Understanding
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,959
0
20 Apr 2018
Teaching Machines to Read and Comprehend
Teaching Machines to Read and Comprehend
Karl Moritz Hermann
Tomás Kociský
Edward Grefenstette
L. Espeholt
W. Kay
Mustafa Suleyman
Phil Blunsom
178
3,510
0
10 Jun 2015
1