ResearchTrend.AI
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Versions: v1, v2 (latest)

14 February 2025
Alexander Wettig
Kyle Lo
Sewon Min
Hannaneh Hajishirzi
Danqi Chen
Luca Soldaini

Papers citing "Organize the Web: Constructing Domains Enhances Pre-Training Data Curation"

11 / 11 papers shown
Essential-Web v1.0: 24T tokens of organized web data
Essential AI
Andrew Hojel
Michael Pust
Tim Romanski
Yash Vanjani
...
Platon Mazarakis
Saad Jamal
Saurabh Srivastava
Somanshu Singla
Ashish Vaswani
13
0
0
17 Jun 2025
Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi
Xi Ye
Hao Tang
Zhigang Zhu
Eunsol Choi
ReLM, LRM
14
0
0
10 Jun 2025
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
Thao Nguyen
Yang Li
O. Yu. Golovneva
Luke Zettlemoyer
Sewoong Oh
Ludwig Schmidt
Xian Li
OnRL
139
0
0
05 Jun 2025
Robust LLM Fingerprinting via Domain-Specific Watermarks
Thibaud Gloaguen
Robin Staab
Nikola Jovanović
Martin Vechev
WaLM
106
0
0
22 May 2025
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training
Dongyang Fan
Vinko Sabolčec
Martin Jaggi
51
0
0
22 May 2025
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
Kai Hua
Steven Wu
Ge Zhang
Ke Shen
LRM
80
0
0
12 May 2025
Semantic Probabilistic Control of Language Models
Kareem Ahmed
Catarina G Belém
Padhraic Smyth
Sameer Singh
110
1
0
04 May 2025
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
Albert Ge
Tzu-Heng Huang
John Cooper
Avi Trost
Ziyi Chu
Satya Sai Srinath Namburi GNVV
Ziyang Cai
Kendall Park
Nicholas Roberts
Frederic Sala
107
1
0
01 May 2025
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao
Yu Yang
Y. Fu
Xin Dong
Dan Su
...
Hongxu Yin
M. Patwary
Yingyan
Jan Kautz
Pavlo Molchanov
120
2
0
17 Apr 2025
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Dongyang Fan
Vinko Sabolčec
Matin Ansaripour
Ayush Kumar Tarun
Martin Jaggi
Antoine Bosselut
Imanol Schlag
63
1
0
08 Apr 2025
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Jake Poznanski
Jon Borchardt
Jason Dunkelberger
Regan Huff
Daniel Lin
Aman Rangapur
Christopher Wilhelm
Kyle Lo
Luca Soldaini
174
7
0
25 Feb 2025