arXiv:2211.04325
Will we run out of data? Limits of LLM scaling based on human-generated data
26 October 2022
Pablo Villalobos
A. Ho
J. Sevilla
T. Besiroglu
Lennart Heim
Marius Hobbhahn
ALM
Papers citing
"Will we run out of data? Limits of LLM scaling based on human-generated data"
24 / 74 papers shown
Title
The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Yanzhu Guo
Guokan Shang
Michalis Vazirgiannis
Chloé Clavel
34
50
0
16 Nov 2023
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Naomi Saphra
Eve Fleisig
Kyunghyun Cho
Adam Lopez
LRM
30
8
0
08 Nov 2023
Market Concentration Implications of Foundation Models
Jai Vipra
Anton Korinek
ELM
40
16
0
02 Nov 2023
Will Code Remain a Relevant User Interface for End-User Programming with Generative AI Models?
Advait Sarkar
26
16
0
01 Nov 2023
FedSplitX: Federated Split Learning for Computationally-Constrained Heterogeneous Clients
Jiyun Shin
Jinhyun Ahn
Honggu Kang
Joonhyuk Kang
FedML
65
6
0
23 Oct 2023
A State-Vector Framework for Dataset Effects
E. Sahak
Zining Zhu
Frank Rudzicz
30
1
0
17 Oct 2023
FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models
Tao Fan
Yan Kang
Guoqiang Ma
Weijing Chen
Wenbin Wei
Lixin Fan
Qiang Yang
40
57
0
16 Oct 2023
Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives
Elizabeth Seger
Noemi Dreksler
Richard Moulange
Emily Dardaman
Jonas Schuett
...
Emma Bluemke
Michael Aird
Patrick Levermore
Julian Hazell
Abhishek Gupta
20
40
0
29 Sep 2023
A Benchmark for Learning to Translate a New Language from One Grammar Book
Garrett Tanzer
Mirac Suzgun
Chenguang Xi
Dan Jurafsky
Luke Melas-Kyriazi
24
51
0
28 Sep 2023
Chain-of-Thought Reasoning is a Policy Improvement Operator
Hugh Zhang
David C. Parkes
ReLM
LM&Ro
LRM
31
12
0
15 Sep 2023
EarthPT: a time series foundation model for Earth Observation
Michael J. Smith
Luke Fleming
James E. Geach
AI4TS
22
7
0
13 Sep 2023
FDAPT: Federated Domain-adaptive Pre-training for Language Models
Lekang Jiang
F. Svoboda
Nicholas D. Lane
FedML
AI4CE
72
4
0
12 Jul 2023
When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions
Weiming Zhuang
Chen Chen
Lingjuan Lyu
Chong Chen
Yaochu Jin
AIFin
AI4CE
99
85
0
27 Jun 2023
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Guilherme Penedo
Quentin Malartic
Daniel Hesslow
Ruxandra-Aimée Cojocaru
Alessandro Cappelli
Hamza Alobeidli
B. Pannier
Ebtesam Almazrouei
Julien Launay
50
751
0
01 Jun 2023
Scaling Data-Constrained Language Models
Niklas Muennighoff
Alexander M. Rush
Boaz Barak
Teven Le Scao
Aleksandra Piktus
Nouamane Tazi
S. Pyysalo
Thomas Wolf
Colin Raffel
ALM
38
200
0
25 May 2023
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Fuzhao Xue
Yao Fu
Wangchunshu Zhou
Zangwei Zheng
Yang You
85
78
0
22 May 2023
A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers
Jordan Meadows
Marco Valentino
Damien Teney
André Freitas
35
8
0
21 May 2023
Pretraining Language Models with Human Preferences
Tomasz Korbak
Kejian Shi
Angelica Chen
Rasika Bhalerao
C. L. Buckley
Jason Phang
Sam Bowman
Ethan Perez
ALM
SyDa
36
207
0
16 Feb 2023
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao
Thomas Wang
Daniel Hesslow
Lucile Saulnier
Stas Bekman
...
Lintang Sutawika
Jaesung Tae
Zheng-Xin Yong
Julien Launay
Iz Beltagy
MoE
AI4CE
230
103
0
27 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
354
12,003
0
04 Mar 2022
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
242
593
0
14 Jul 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
282
1,996
0
31 Dec 2020
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
264
4,489
0
23 Jan 2020
AI safety via debate
G. Irving
Paul Christiano
Dario Amodei
204
201
0
02 May 2018