KréyoLID From Language Identification Towards Language Mining

9 March 2025

Abstract

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than with established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.

@article{dent2025_2503.06547,
  title={KréyoLID From Language Identification Towards Language Mining},
  author={Rasul Dent and Pedro Ortiz Suarez and Thibault Clérice and Benoît Sagot},
  journal={arXiv preprint arXiv:2503.06547},
  year={2025}
}
