CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

1 November 2019
Guillaume Wenzek
Marie-Anne Lachaux
Alexis Conneau
Vishrav Chaudhary
Francisco Guzmán
Armand Joulin
Edouard Grave

Papers citing "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data"

50 / 161 papers shown
Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda
Richard Kimera
DongNyeong Heo
Daniela N. Rim
Heeyoul Choi
173
0
0
05 May 2025
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
Fengze Liu
Weidong Zhou
Binbin Liu
Zhimiao Yu
Yifan Zhang
...
Yifeng Yu
Bingni Zhang
Xiaohuan Zhou
Taifeng Wang
Yong Cao
66
1
0
23 Apr 2025
ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese
H. Phung
Ngoc C. Lê
Van-Chien Nguyen
Hang Thi Nguyen
Thuy Phuong Thi Nguyen
77
1
0
21 Apr 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Yucheng Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
55
0
0
19 Apr 2025
Catch Me if You Search: When Contextual Web Search Results Affect the Detection of Hallucinations
Mahjabin Nahar
Eun-Ju Lee
Jin Won Park
Dongwon Lee
HILM
75
0
0
01 Apr 2025
ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Xiaoxuan Zhu
Zhouhong Gu
Baiqian Wu
Suhang Zheng
Tao Wang
Tianyu Li
Hongwei Feng
Yanghua Xiao
46
0
0
01 Apr 2025
The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation
Olivier Gouvert
Julie Hunter
Jérôme Louradour
Christophe Cerisara
Evan Dufraisse
Yaya Sy
Laura Rivière
Jean-Pierre Lorré
OpenLLM-France community
214
0
0
15 Mar 2025
Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Kashun Shum
Yuanmin Huang
Hongjian Zou
Qi Ding
Yixuan Liao
Xiao Chen
Qian Liu
Junxian He
67
2
0
02 Mar 2025
UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings
Layba Fiaz
Munief Hassan Tahir
Sana Shams
Sarmad Hussain
51
0
0
24 Feb 2025
Machine-generated text detection prevents language model collapse
George Drayson
Emine Yilmaz
Vasileios Lampos
DeLMO
64
0
0
21 Feb 2025
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
David Ifeoluwa Adelani
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
83
3
0
20 Feb 2025
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
A. Khan
Robert Underwood
Carlo Siebenschuh
Y. Babuji
Aswathy Ajith
Kyle Hippe
Ozan Gokdemir
Alexander Brace
Kyle Chard
Ian Foster
38
0
0
06 Nov 2024
FlexCAD: Unified and Versatile Controllable CAD Generation with Fine-tuned Large Language Models
Zhanwei Zhang
Shizhao Sun
Wenxiao Wang
D. Cai
Jiang Bian
AI4CE
36
1
0
05 Nov 2024
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Amir Hossein Kargaran
François Yvon
Hinrich Schütze
VLM
44
5
0
31 Oct 2024
Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification
Hsun-Yu Kuo
Yin-Hsiang Liao
Yu-Chieh Chao
Wei-Yun Ma
Pu-Jen Cheng
SyDa
53
3
0
28 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
37
3
0
21 Oct 2024
SHAKTI: A 2.5 Billion Parameter Small Language Model Optimized for Edge AI and Low-Resource Environments
Syed Abdul Gaffar Shakhadri
Kruthika KR
Rakshit Aralimatti
VLM
25
1
0
15 Oct 2024
Data Quality Control in Federated Instruction-tuning of Large Language Models
Yaxin Du
Guangyi Liu
Fengting Yuchi
W. Zhao
Jingjing Qu
Yanjie Wang
Siheng Chen
ALM
FedML
56
0
0
15 Oct 2024
Reverse Modeling in Large Language Models
S. Yu
Yuanchen Xu
Cunxiao Du
Yanying Zhou
Minghui Qiu
Q. Sun
Hao Zhang
Jiawei Wu
39
2
0
13 Oct 2024
Data Processing for the OpenGPT-X Model Family
Nicolò Brandizzi
Hammam Abdelwahab
Anirban Bhowmick
Lennard Helmer
Benny Jörg Stein
...
Georg Rehm
Dennis Wegener
Nicolas Flores-Herr
Joachim Kohler
Johannes Leveling
VLM
87
2
0
11 Oct 2024
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization
Yougang Lyu
Lingyong Yan
Zihan Wang
Dawei Yin
Pengjie Ren
Maarten de Rijke
Z. Z. Ren
63
6
0
10 Oct 2024
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling
David Grangier
Simin Fan
Skyler Seto
Pierre Ablin
44
3
0
30 Sep 2024
Open-World Evaluation for Retrieving Diverse Perspectives
Hung-Ting Chen
Eunsol Choi
35
0
0
26 Sep 2024
EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
Shaoxiong Ji
Zihao Li
Indraneil Paul
Jaakko Paavola
Peiqin Lin
...
Dayyán O'Brien
Hengyu Luo
Hinrich Schütze
Jörg Tiedemann
Barry Haddow
CLL
43
3
0
26 Sep 2024
Mixture of Diverse Size Experts
Manxi Sun
Wei Liu
Jian Luan
Pengzhi Gao
Bin Wang
MoE
28
1
0
18 Sep 2024
What is the Role of Small Models in the LLM Era: A Survey
Lihu Chen
Gaël Varoquaux
ALM
63
23
0
10 Sep 2024
Harvesting Textual and Structured Data from the HAL Publication Repository
Francis Kulumba
Wissam Antoun
Guillaume Vimont
Laurent Romary
40
2
0
30 Jul 2024
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
Satak Kumar Dey
Ruwad Naswan
Hasnaen Adil
Khondker Salman Sayeed
Haz Sameen Shahgir
39
0
0
29 Jun 2024
Multilingual Large Language Models and Curse of Multilinguality
Daniil Gurgurov
Tanja Bäumel
Tatiana Anikina
86
4
0
15 Jun 2024
Datasets for Multilingual Answer Sentence Selection
Matteo Gabburo
S. Campese
Federico Agostini
Alessandro Moschitti
46
0
0
14 Jun 2024
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner
Cody Blakeney
Kartik K. Sreenivasan
Max Marion
Matthew L. Leavitt
Mansheej Paul
43
24
0
30 May 2024
The Mosaic Memory of Large Language Models
Igor Shilov
Matthieu Meeus
Yves-Alexandre de Montjoye
47
3
0
24 May 2024
BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers
Ran Xu
Wenqi Shi
Yue Yu
Yuchen Zhuang
Yanqiao Zhu
M. D. Wang
Joyce C. Ho
Chao Zhang
Carl Yang
LM&MA
40
19
0
29 Apr 2024
Building a Large Japanese Web Corpus for Large Language Models
Naoaki Okazaki
Kakeru Hattori
Hirai Shota
Hiroki Iida
Masanari Ohi
Kazuki Fujii
Taishi Nakamura
Mengsay Loem
Rio Yokota
Sakae Mizuki
55
7
0
27 Apr 2024
JaFIn: Japanese Financial Instruction Dataset
Kota Tanabe
Masahiro Suzuki
Hiroki Sakaji
Itsuki Noda
47
1
0
14 Apr 2024
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
Xuan Xie
Jiayang Song
Zhehua Zhou
Yuheng Huang
Da Song
Lei Ma
OffRL
53
6
0
12 Apr 2024
Rho-1: Not All Tokens Are What You Need
Zheng-Wen Lin
Zhibin Gou
Yeyun Gong
Xiao Liu
Yelong Shen
...
Chen Lin
Yujiu Yang
Jian Jiao
Nan Duan
Weizhu Chen
CLL
50
57
0
11 Apr 2024
Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding
Ahmad Idrissi-Yaghir
Amin Dada
Henning Schäfer
Kamyar Arzideh
Giulia Baldini
...
Peter A. Horn
Christin Seifert
F. Nensa
Jens Kleesiek
Christoph M. Friedrich
AI4MH
39
2
0
08 Apr 2024
Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind
Hongchuan Zeng
Hongshen Xu
Lu Chen
Kai Yu
59
5
0
06 Apr 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
57
64
0
25 Mar 2024
Yi: Open Foundation Models by 01.AI
01.AI
Alex Young
Bei Chen
Chao Li
...
Yue Wang
Yuxuan Cai
Zhenyu Gu
Zhiyuan Liu
Zonghong Dai
OSLM
LRM
150
511
0
07 Mar 2024
Novi jezički modeli za srpski jezik
Mihailo Škorić
23
0
0
22 Feb 2024
TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese
N. Corrêa
Sophia Falk
Shiza Fatimah
Aniket Sen
N. D. Oliveira
30
9
0
30 Jan 2024
WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data
Maurice Weber
Carlo Siebenschuh
Rory Butler
Anton Alexandrov
Valdemar Thanner
...
Haris Jabbar
Ian Foster
Bo-wen Li
Rick L. Stevens
Ce Zhang
21
4
0
15 Dec 2023
Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation
Wonjun Lee
Gary Geunbae Lee
Yunsu Kim
31
0
0
06 Dec 2023
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Tong Zhou
Yubo Chen
Pengfei Cao
Kang Liu
Jun Zhao
Shengping Liu
29
3
0
21 Nov 2023
Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models
J. Michaelov
Catherine Arnett
Tyler A. Chang
Benjamin Bergen
36
12
0
15 Nov 2023
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
Nandan Thakur
Jianmo Ni
Gustavo Hernández Ábrego
John Wieting
Jimmy J. Lin
Daniel Cer
RALM
43
12
0
10 Nov 2023
Data Filtering Networks
Alex Fang
Albin Madappally Jose
Amit Jain
Ludwig Schmidt
Alexander Toshev
Vaishaal Shankar
CLIP
46
125
0
29 Sep 2023
DTrOCR: Decoder-only Transformer for Optical Character Recognition
Masato Fujitake
56
35
0
30 Aug 2023