The MiniPile Challenge for Data-Efficient Language Models

17 April 2023
Jean Kaddour
Abstract

The ever-growing diversity of pre-training text corpora has equipped language models with generalization capabilities across various downstream tasks. However, such diverse datasets are often too large for academic budgets; hence, most research on Transformer architectures, training procedures, optimizers, etc. gets conducted on smaller, homogeneous datasets. To this end, we present The MiniPile Challenge, where one pre-trains a language model on a diverse text corpus containing at most 1M documents. MiniPile is a 6GB subset of the deduplicated 825GB The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters. To verify MiniPile's suitability for language model pre-training, we use it to pre-train a BERT and T5 model, yielding a performance drop of only 1.9%/2.5% on the GLUE and SNI benchmarks compared to the original pre-trained checkpoints trained on 2.6x/745x the amount of data. MiniPile is available at https://huggingface.co/datasets/JeanKaddour/minipile.
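
The three-step curation process described in the abstract (embed, cluster, filter) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the encoder name, cluster count, and excluded-cluster list below are placeholder assumptions chosen only to make the pipeline concrete.

    # Rough sketch of an embed -> k-means -> filter pipeline, per the abstract.
    # Encoder, cluster count, and excluded clusters are illustrative assumptions.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    def curate_subset(documents, n_clusters=220, excluded_clusters=None):
        # (1) Infer an embedding for every document.
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
        embeddings = encoder.encode(documents, normalize_embeddings=True)

        # (2) Cluster the embedding space with k-means.
        kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
        labels = kmeans.fit_predict(embeddings)

        # (3) Drop documents that fall into clusters judged low-quality
        #     (e.g. by manual inspection of cluster samples).
        excluded = set(excluded_clusters or [])
        return [doc for doc, label in zip(documents, labels) if label not in excluded]

The released dataset itself can be loaded directly from the Hugging Face Hub, e.g. with datasets.load_dataset("JeanKaddour/minipile").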
