arXiv:2402.19282
WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
29 February 2024
Jiantao Qiu
Haijun Lv
Zhenjiang Jin
Rui Wang
Wenchang Ning
Jia Yu
ChaoBin Zhang
Zhenxiang Li
Pei Chu
Yuan Qu
Jin Shi
Lindong Lu
Runyu Peng
Zhiyuan Zeng
Huanze Tang
Zhikai Lei
Jiawei Hong
Keyu Chen
Zhaoye Fei
R. Xu
Wei Li
Zhongying Tu
Dahua Lin
Yu Qiao
Hang Yan
Conghui He
Papers citing "WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset"
7 / 7 papers shown
Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Yishuo Wang
Z. Fu
Jie Cai
Peijun Tang
Hongya Lyu
...
Jie Zhou
Guoyang Zeng
Chaojun Xiao
Xu Han
Zhiyuan Liu
54
0
0
08 May 2025
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
Jiahui Peng
Ren Ma
Yucheng Wang
Tianyi Bai
Xingjian Wei
Jiantao Qiu
Chi Zhang
Ying Qian
Conghui He
55
0
0
19 Apr 2025
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Fan Zhou
Zengzhi Wang
Qian Liu
Junlong Li
Pengfei Liu
ALM
106
15
0
17 Feb 2025
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan
Shuo Zhang
Zerui Wang
Lijuan Jiang
Wenwen Qu
...
Dahua Lin
Yonggang Wen
Xin Jin
Tianwei Zhang
Peng Sun
73
8
0
29 Jul 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLM
OffRL
56
21
0
12 Jun 2024
Dial-insight: Fine-tuning Large Language Models with High-Quality Domain-Specific Data Preventing Capability Collapse
Jianwei Sun
Chaoyang Mei
Linlin Wei
Kaiyu Zheng
Na Liu
Ming Cui
Tianyi Li
ALM
48
4
0
14 Mar 2024
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
282
2,000
0
31 Dec 2020