Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.11546
Cited By
v1
v2 (latest)
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
17 February 2025
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander Fraser
Maosong Sun
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection"
22 / 22 papers shown
Title
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen
Wen Lai
Shuo Wang
Kangyang Luo
Alexander Fraser
Maosong Sun
62
0
0
20 May 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schutze
VLM
106
7
0
31 Oct 2024
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Guilherme Penedo
Hynek Kydlícek
Loubna Ben Allal
Anton Lozhkov
Margaret Mitchell
Colin Raffel
Leandro von Werra
Thomas Wolf
108
248
0
25 Jun 2024
LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback
Wen Lai
Mohsen Mesgar
Alexander Fraser
LRM
ALM
102
25
0
03 Jun 2024
Large Language Models for Education: A Survey
Hanyi Xu
Wensheng Gan
Zhenlian Qi
Jiayang Wu
Philip S. Yu
AI4Ed
ELM
122
17
0
12 May 2024
A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert
Graeme Nail
Nikolay Arefyev
Marta Bañón
Jelmer van der Linde
...
Gema Ramírez-Sánchez
Andrey Kutuzov
S. Pyysalo
Stephan Oepen
Jörg Tiedemann
VLM
89
23
0
20 Mar 2024
SaulLM-7B: A pioneering Large Language Model for Law
Pierre Colombo
T. Pires
Malik Boudiaf
Dominic Culver
Rui Melo
...
Andre F. T. Martins
Fabrizio Esposito
Vera Lúcia Raposo
Sofia Morgado
Michael Desa
ELM
AILaw
100
75
0
06 Mar 2024
Datasets for Large Language Models: A Comprehensive Survey
Yang Liu
Jiahuan Cao
Chongyu Liu
Kai Ding
Lianwen Jin
AILaw
59
70
0
28 Feb 2024
Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia
Tzu-Sheng Kuo
Aaron L Halfaker
Zirui Cheng
Jiwoo Kim
Meng-Hsin Wu
Tongshuang Wu
Kenneth Holstein
Haiyi Zhu
85
22
0
21 Feb 2024
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Ahmet Üstün
Viraat Aryabumi
Zheng-Xin Yong
Wei-Yin Ko
Daniel D'souza
...
Shayne Longpre
Niklas Muennighoff
Marzieh Fadaee
Julia Kreutzer
Sara Hooker
ALM
ELM
SyDa
LRM
89
226
0
12 Feb 2024
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen
Chien Van Nguyen
Viet Dac Lai
Hieu Man
Nghia Trung Ngo
Franck Dernoncourt
Ryan Rossi
Thien Huu Nguyen
102
107
0
17 Sep 2023
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Sneha Kudugunta
Isaac Caswell
Biao Zhang
Xavier Garcia
Christopher A. Choquette-Choo
...
Derrick Xin
Aditya Kusupati
Romi Stella
Ankur Bapna
Orhan Firat
109
136
0
09 Sep 2023
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
Ayyoob Imani
Peiqin Lin
Amir Hossein Kargaran
Silvia Severini
Masoud Jalili Sabet
...
Chunlan Ma
Helmut Schmid
André F. T. Martins
François Yvon
Hinrich Schütze
ALM
LRM
88
105
0
20 May 2023
XuanYuan 2.0: A Large Chinese Financial Chat Model with Hundreds of Billions Parameters
Xuanyu Zhang
Qing Yang
Dongliang Xu
ALM
OSLM
70
105
0
19 May 2023
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Colin Leong
Joshua Nemecek
Jacob Mansdorfer
Anna Filighera
A. Owodunni
Daniel Whitenack
VLM
AI4CE
142
28
0
26 Oct 2022
No Language Left Behind: Scaling Human-Centered Machine Translation
Nllb team
Marta R. Costa-jussá
James Cross
Onur cCelebi
Maha Elbayad
...
Alexandre Mourachko
C. Ropers
Safiyyah Saleem
Holger Schwenk
Jeff Wang
MoE
226
1,263
0
11 Jul 2022
Can Foundation Models Wrangle Your Data?
A. Narayan
Ines Chami
Laurel J. Orr
Simran Arora
Christopher Ré
LMTD
AI4CE
229
221
0
20 May 2022
Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji
Pedro Ortiz Suarez
Laurent Romary
Benoît Sagot
CLL
87
158
0
17 Jan 2022
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
Naman Goyal
Cynthia Gao
Vishrav Chaudhary
Peng-Jen Chen
Guillaume Wenzek
Da Ju
Sanjan Krishnan
MarcÁurelio Ranzato
Francisco Guzman
Angela Fan
99
587
0
06 Jun 2021
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau
Kartikay Khandelwal
Naman Goyal
Vishrav Chaudhary
Guillaume Wenzek
Francisco Guzmán
Edouard Grave
Myle Ott
Luke Zettlemoyer
Veselin Stoyanov
220
6,565
0
05 Nov 2019
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
439
20,181
0
23 Oct 2019
On the Use of ArXiv as a Dataset
Colin B. Clement
Matthew Bierbaum
K. O’Keeffe
Alexander A. Alemi
AI4CE
376
133
0
30 Apr 2019
1