ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2312.16337
  4. Cited By
Task Contamination: Language Models May Not Be Few-Shot Anymore

Task Contamination: Language Models May Not Be Few-Shot Anymore

26 December 2023
Changmao Li
Jeffrey Flanigan
ArXivPDFHTML

Papers citing "Task Contamination: Language Models May Not Be Few-Shot Anymore"

26 / 26 papers shown
Title
Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment
Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment
Paul Tschisgale
Holger Maus
Fabian Kieser
Ben Kroehs
Stefan Petersen
Peter Wulff
ELM
LRM
41
0
0
14 May 2025
Towards Contamination Resistant Benchmarks
Towards Contamination Resistant Benchmarks
Rahmatullah Musawi
Sheng Lu
42
0
0
13 May 2025
Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models
Bye-bye, Bluebook? Automating Legal Procedure with Large Language Models
Matthew Dahl
AILaw
ELM
54
0
0
05 May 2025
Large Language Models Could Be Rote Learners
Large Language Models Could Be Rote Learners
Yuyang Xu
Renjun Hu
Haochao Ying
Jiangxu Wu
Xing Shi
Wei Lin
ELM
160
0
0
11 Apr 2025
Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning
Qiming Bao
Gael Gendron
A. Peng
Wanjun Zhong
N. Tan
Yang Chen
Michael Witbrock
Jiaheng Liu
LRM
ELM
68
2
0
20 Jan 2025
DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts
DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts
Tobias Braun
Mark Rothermel
Marcus Rohrbach
Anna Rohrbach
87
1
0
13 Dec 2024
Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages
Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages
Hoang Nguyen
Khyati Mahajan
Vikas Yadav
Philip S. Yu
Masoud Hashemi
Rishabh Maheshwary
Rishabh Maheshwary
47
0
0
04 Nov 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
59
3
0
24 Oct 2024
MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
M. Bueno
R. Lotufo
Rodrigo Nogueira
LRM
26
0
0
08 Oct 2024
How Much Can We Forget about Data Contamination?
How Much Can We Forget about Data Contamination?
Sebastian Bordt
Suraj Srinivas
Valentyn Boreiko
U. V. Luxburg
45
1
0
04 Oct 2024
Federated Instruction Tuning of LLMs with Domain Coverage Augmentation
Federated Instruction Tuning of LLMs with Domain Coverage Augmentation
Zezhou Wang
Yaxin Du
Zhuzhong Qian
Yugang Jiang
Zhuzhong Qian
Siheng Chen
FedML
137
0
0
30 Sep 2024
ASR Error Correction using Large Language Models
ASR Error Correction using Large Language Models
Rao Ma
Mengjie Qian
Mark J. F. Gales
Kate Knill
KELM
46
1
0
14 Sep 2024
Training on the Test Task Confounds Evaluation and Emergence
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo
Florian E. Dorner
Moritz Hardt
ELM
68
7
1
10 Jul 2024
A Systematic Survey and Critical Review on Evaluating Large Language
  Models: Challenges, Limitations, and Recommendations
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
29
27
0
04 Jul 2024
Benchmark Data Contamination of Large Language Models: A Survey
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
38
38
0
06 Jun 2024
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Feature contamination: Neural networks learn uncorrelated features and fail to generalize
Tianren Zhang
Chujie Zhao
Guanyu Chen
Yizhou Jiang
Feng Chen
OOD
MLT
OODD
77
3
0
05 Jun 2024
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Marianna Nezhurina
Lucia Cipolina-Kun
Mehdi Cherti
J. Jitsev
LLMAG
LRM
ELM
ReLM
58
25
0
04 Jun 2024
Hallucination-Free? Assessing the Reliability of Leading AI Legal
  Research Tools
Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools
Varun Magesh
Faiz Surani
Matthew Dahl
Mirac Suzgun
Christopher D. Manning
Daniel E. Ho
HILM
ELM
AILaw
27
66
0
30 May 2024
Had enough of experts? Quantitative knowledge retrieval from large language models
Had enough of experts? Quantitative knowledge retrieval from large language models
David Selby
Kai Spriestersbach
Yuichiro Iwashita
Dennis Bappert
Archana Warrier
Sumantrak Mukherjee
Muhammad Nabeel Asim
Koichi Kise
Sebastian Vollmer
41
0
0
12 Feb 2024
Don't Make Your LLM an Evaluation Benchmark Cheater
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
107
136
0
03 Nov 2023
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls
  of Large Language Models on Bengali NLP
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
M. Kabir
Mohammed Saidul Islam
Md Tahmid Rahman Laskar
Mir Tafseer Nayeem
M Saiful Bari
Enamul Hoque
LM&MA
24
15
0
22 Sep 2023
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
Kent K. Chang
Mackenzie Cramer
Sandeep Soni
David Bamman
RALM
145
111
0
28 Apr 2023
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
314
3,248
0
21 Mar 2022
Training language models to follow instructions with human feedback
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
313
11,953
0
04 Mar 2022
The Power of Scale for Parameter-Efficient Prompt Tuning
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester
Rami Al-Rfou
Noah Constant
VPVLM
280
3,848
0
18 Apr 2021
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit
  Reasoning Strategies
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
Mor Geva
Daniel Khashabi
Elad Segal
Tushar Khot
Dan Roth
Jonathan Berant
RALM
250
673
0
06 Jan 2021
1