2506.14111
Essential-Web v1.0: 24T tokens of organized web data
17 June 2025
Essential AI
Andrew Hojel
Michael Pust
Tim Romanski
Yash Vanjani
Ritvik Kapila
Mohit Parmar
Adarsh Chaluvaraju
Alok Tripathy
Anil Thomas
Ashish Tanwer
Darsh J Shah
Ishaan Shah
Karl Stratos
Khoi Nguyen
Kurt Smith
Michael Callahan
Peter Rushton
Philip Monk
Platon Mazarakis
Saad Jamal
Saurabh Srivastava
Somanshu Singla
Ashish Vaswani
Papers citing
"Essential-Web v1.0: 24T tokens of organized web data"
14 / 14 papers shown
Title
MegaMath: Pushing the Limits of Open Math Corpora
Fan Zhou
Zengzhi Wang
Nikhil Ranjan
Zhoujun Cheng
Liping Tang
Guowei He
Zhengzhong Liu
Eric P. Xing
LRM
116
3
0
03 Apr 2025
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Weizhe Yuan
Jane Dwivedi-Yu
Song Jiang
Karthik Padthe
Yang Li
...
Ilia Kulikov
Kyunghyun Cho
Yuandong Tian
Jason Weston
Xian Li
ReLM
LRM
116
20
0
18 Feb 2025
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Alexander Wettig
Kyle Lo
Sewon Min
Hannaneh Hajishirzi
Danqi Chen
Luca Soldaini
98
14
0
17 Feb 2025
Automatic Evaluation of Healthcare LLMs Beyond Question-Answering
Anna Arias-Duart
Pablo A. Martin-Torres
Daniel Hinjos
Pablo Bernabeu Perez
Lucia Urcelay-Ganzabal
Marta Gonzalez-Mallo
Ashwin Kumar Gururajan
Enrique Lopez-Cuena
Sergio Alvarez-Napagao
Dario Garcia-Gasulla
LM&MA
ELM
255
4
0
10 Feb 2025
RedPajama: an Open Dataset for Training Large Language Models
Maurice Weber
Daniel Y. Fu
Quentin Anthony
Yonatan Oren
S. Adams
...
Tri Dao
Percy Liang
Christopher Ré
Irina Rish
Ce Zhang
214
83
0
19 Nov 2024
Scaling Laws for Linear Complexity Language Models
Xuyang Shen
Dong Li
Ruitao Leng
Zhen Qin
Weigao Sun
Yiran Zhong
LRM
54
8
0
24 Jun 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
138
1,238
0
05 Feb 2024
Program Synthesis with Large Language Models
Jacob Austin
Augustus Odena
Maxwell Nye
Maarten Bosma
Henryk Michalewski
...
Ellen Jiang
Carrie J. Cai
Michael Terry
Quoc V. Le
Charles Sutton
ELM
AIMat
ReCod
ALM
200
2,004
0
16 Aug 2021
Evaluating Large Language Models Trained on Code
Mark Chen
Jerry Tworek
Heewoo Jun
Qiming Yuan
Henrique Pondé
...
Bob McGrew
Dario Amodei
Sam McCandlish
Ilya Sutskever
Wojciech Zaremba
ELM
ALM
233
5,635
0
07 Jul 2021
Explaining Neural Scaling Laws
Yasaman Bahri
Ethan Dyer
Jared Kaplan
Jaehoon Lee
Utkarsh Sharma
68
269
0
12 Feb 2021
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
608
4,893
0
23 Jan 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel
Noam M. Shazeer
Adam Roberts
Katherine Lee
Sharan Narang
Michael Matena
Yanqi Zhou
Wei Li
Peter J. Liu
AIMat
456
20,298
0
23 Oct 2019
PubMedQA: A Dataset for Biomedical Research Question Answering
Qiao Jin
Bhuwan Dhingra
Zhengping Liu
William W. Cohen
Xinghua Lu
387
910
0
13 Sep 2019
Enriching Word Vectors with Subword Information
Piotr Bojanowski
Edouard Grave
Armand Joulin
Tomas Mikolov
NAI
SSL
VLM
229
9,978
0
15 Jul 2016