ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2411.16387
64
1

FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web

25 November 2024
Cheng-Wei Lin
Wan-Hsuan Hsieh
Kai-Xin Guan
Chan-Jan Hsu
Chia-Chen Kuo
Chuan-Lin Lai
Chung-Wei Chung
Ming-Jen Wang
Da-shan Shiu
ArXivPDFHTML
Abstract

The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts in the curation of such a dataset for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon this foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We came up with multiple stages of meticulously designed filters to cater to the linguistic difference between English and Traditional Chinese, to ensure comprehensiveness and quality. We determined effectiveness from querying dataset samples with three main objectives. Our code and datasets are publicly available.

View on arXiv
Comments on this paper