
GneissWeb: Preparing High Quality Data for LLMs at Scale

19 February 2025
Hajar Emami-Gohari
Swanand Ravindra Kadhe
Syed Yousaf Shah
Constantin Adam
Abdulhamid A. Adebayo
Praneet Adusumilli
Farhan Ahmed
Nathalie Baracaldo Angel
Santosh Borse
Yuan-Chi Chang
Xuan-Hong Dang
Nirmit Desai
Ravital Eres
Ran Iwamoto
Alexei Karve
Yan Koyfman
Wei-Han Lee
Changchang Liu
Boris Lublinsky
Takuyo Ohko
Pablo Pesce
Maroun Touma
Shiqiang Wang
Shalisha Witherspoon
Herbert Woisetschläger
David Wood
Kun-Lung Wu
Issei Yoshida
Syed Zawad
Petros Zerfos
Yi Zhou
Bishwaranjan Bhattacharjee
Abstract

Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost an LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained on the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained on GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
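The recipe outlined in the abstract pairs sharded exact sub-string deduplication with an ensemble of quality filters. The code below is only a minimal Python sketch of how such a pipeline could be wired together; the shard count, window size, and the two example filters (a length check and a line-repetition heuristic) are assumptions for illustration, not the paper's actual components or thresholds.

# Minimal sketch of a GneissWeb-style pipeline: shard the corpus, drop
# documents that repeat an exact sub-string already seen in the shard,
# then keep only documents passing every quality filter in the ensemble.
# Shard count, window size, and filter choices are illustrative assumptions.
import hashlib
from typing import Callable, Iterable, List

NUM_SHARDS = 8   # assumed shard count
WINDOW = 200     # assumed sub-string window size (characters)

def shard_id(doc: str) -> int:
    """Assign a document to a shard by hashing its content."""
    return int(hashlib.md5(doc.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def dedup_exact_substrings(docs: List[str]) -> List[str]:
    """Crude stand-in for exact sub-string deduplication within one shard."""
    seen: set = set()
    kept: List[str] = []
    for doc in docs:
        windows = {doc[i:i + WINDOW] for i in range(0, max(1, len(doc)), WINDOW)}
        if windows & seen:
            continue  # shares an exact sub-string with an earlier document
        seen |= windows
        kept.append(doc)
    return kept

def length_filter(doc: str) -> bool:
    """Reject very short documents (threshold is an assumption)."""
    return len(doc.split()) >= 50

def repetition_filter(doc: str) -> bool:
    """Reject documents dominated by a single repeated line."""
    lines = [ln for ln in doc.splitlines() if ln.strip()]
    if not lines:
        return False
    return max(lines.count(ln) for ln in set(lines)) / len(lines) < 0.3

QUALITY_FILTERS: List[Callable[[str], bool]] = [length_filter, repetition_filter]

def build_dataset(corpus: Iterable[str]) -> List[str]:
    shards: List[List[str]] = [[] for _ in range(NUM_SHARDS)]
    for doc in corpus:
        shards[shard_id(doc)].append(doc)
    output: List[str] = []
    for shard in shards:
        for doc in dedup_exact_substrings(shard):      # sharded exact dedup
            if all(f(doc) for f in QUALITY_FILTERS):   # filter ensemble
                output.append(doc)
    return output

The actual GneissWeb recipe operates at trillion-token scale with a different, carefully ablated set of filters; the paper describes the concrete ensemble behind the reported benchmark gains.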

View on arXiv
@article{gohari2025_2502.14907,
  title={ GneissWeb: Preparing High Quality Data for LLMs at Scale },
  author={ Hajar Emami Gohari and Swanand Ravindra Kadhe and Syed Yousaf Shah and Constantin Adam and Abdulhamid Adebayo and Praneet Adusumilli and Farhan Ahmed and Nathalie Baracaldo Angel and Santosh Borse and Yuan-Chi Chang and Xuan-Hong Dang and Nirmit Desai and Ravital Eres and Ran Iwamoto and Alexei Karve and Yan Koyfman and Wei-Han Lee and Changchang Liu and Boris Lublinsky and Takuyo Ohko and Pablo Pesce and Maroun Touma and Shiqiang Wang and Shalisha Witherspoon and Herbert Woisetschläger and David Wood and Kun-Lung Wu and Issei Yoshida and Syed Zawad and Petros Zerfos and Yi Zhou and Bishwaranjan Bhattacharjee },
  journal={arXiv preprint arXiv:2502.14907},
  year={ 2025 }
}