Will we run out of data? Limits of LLM scaling based on human-generated data

26 October 2022
Pablo Villalobos
A. Ho
J. Sevilla
T. Besiroglu
Lennart Heim
Marius Hobbhahn
    ALM

Papers citing "Will we run out of data? Limits of LLM scaling based on human-generated data"

50 / 74 papers shown
Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling
Hao Mark Chen
Guanxi Lu
Yasuyuki Okoshi
Zhiwen Mo
Masato Motomura
Hongxiang Fan
LRM
4
0
0
16 May 2025
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil
Amin Dada
Jean-Michel Attendu
Asma Ben Abacha
Alessandro Sordoni
Lucas Caccia
François Beaulieu
Thomas Lin
Jens Kleesiek
Paul Vozila
LM&MA
17
0
0
15 May 2025
Parallel Scaling Law for Language Models
Mouxiang Chen
Binyuan Hui
Zeyu Cui
Jiaxi Yang
Dayiheng Liu
Jianling Sun
Junyang Lin
Zhongxin Liu
MoE
LRM
37
0
0
15 May 2025
Position: Enough of Scaling LLMs! Lets Focus on Downscaling
Ayan Sengupta
Yash Goel
Tanmoy Chakraborty
34
0
0
02 May 2025
Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks
Yang Liu
Bingjie Yan
Tianyuan Zou
Jianqing Zhang
Zixuan Gu
...
Jiajian Li
Xiaozhou Ye
Ye Ouyang
Qiang Yang
Wenjie Qu
ALM
176
1
0
24 Apr 2025
Dargana: fine-tuning EarthPT for dynamic tree canopy mapping from space
Michael J. Smith
Luke Fleming
James E. Geach
Ryan J. Roberts
Freddie Kalaitzis
James Banister
29
0
0
24 Apr 2025
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training
Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang
MLLM
VLM
55
0
0
17 Apr 2025
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
Haris Riaz
Sourav Sanjukta Bhabesh
Vinayak Arannil
Miguel Ballesteros
Graham Horwood
SyDa
52
0
0
17 Apr 2025
Position: The Most Expensive Part of an LLM should be its Training Data
Nikhil Kandpal
Colin Raffel
31
0
0
16 Apr 2025
Integrating Cognitive Processing Signals into Language Models: A Review of Advances, Applications and Future Directions
Angela Lopez-Cardona
Sebastian Idesis
Ioannis Arapakis
31
0
0
09 Apr 2025
From Fairness to Truthfulness: Rethinking Data Valuation Design
Dongyang Fan
Tyler J. Rotello
Sai Praneeth Karimireddy
TDI
53
0
0
07 Apr 2025
Compression Laws for Large Language Models
Ayan Sengupta
Siddhant Chaudhary
Tanmoy Chakraborty
26
0
0
06 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Zihan Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
95
4
0
01 Apr 2025
Opportunities and Challenges of Frontier Data Governance With Synthetic Data
Madhavendra Thakur
Jason Hausenloy
51
0
0
21 Mar 2025
PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models
Kyeongkook Seo
Dong-Jun Han
Jaejun Yoo
45
0
0
11 Mar 2025
Position: Model Collapse Does Not Mean What You Think
Rylan Schaeffer
Joshua Kazdan
Alvan Caleb Arulandu
Sanmi Koyejo
71
0
0
05 Mar 2025
Machine Learners Should Acknowledge the Legal Implications of Large Language Models as Personal Data
Henrik Nolte
Michèle Finck
Kristof Meding
AILaw
PILM
84
0
0
03 Mar 2025
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu
Jackson Petty
Chuan Shi
William Merrill
Tal Linzen
AI4CE
66
1
0
26 Feb 2025
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models
Xu Chu
Zhixin Zhang
Tianyu Jia
Yujie Jin
77
0
0
25 Feb 2025
Forecasting Frontier Language Model Agent Capabilities
Govind Pimpale
Axel Højmark
Jérémy Scheurer
Marius Hobbhahn
LLMAG
ELM
49
1
0
21 Feb 2025
Grounding LLM Reasoning with Knowledge Graphs
Alfonso Amayuelas
Joy Prakash Sain
Simerjot Kaur
Charese Smiley
80
0
0
18 Feb 2025
Toward Neurosymbolic Program Comprehension
Alejandro Velasco
Aya Garryyeva
David Nader-Palacio
Antonio Mastropaolo
Denys Poshyvanyk
47
0
0
03 Feb 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team
Angang Du
Bofei Gao
Bowei Xing
Changjiu Jiang
...
Zhilin Yang
Zhiqi Huang
Zihao Huang
Ziyao Xu
Zhengyuan Yang
VLM
ALM
OffRL
AI4TS
LRM
117
150
0
22 Jan 2025
Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
Sharad Duwal
Suraj Prasai
Suresh Manandhar
CLL
84
1
0
18 Dec 2024
Towards Data Governance of Frontier AI Models
Jason Hausenloy
Duncan McClements
Madhavendra Thakur
77
1
0
05 Dec 2024
Are Transformers Truly Foundational for Robotics?
James A. R. Marshall
Andrew B. Barron
AI4CE
73
0
0
25 Nov 2024
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
Michael Pieler
Marco Bellagente
H. Teufel
Duy Phung
Nathan Cooper
...
Reshinth Adithyan
Zaid Alyafeai
Nikhil Pinnaparaju
Maksym Zhuravinskyi
Carlos Riquelme
32
1
0
28 Oct 2024
Props for Machine-Learning Security
Ari Juels
Farinaz Koushanfar
25
2
0
27 Oct 2024
Dynamic Vocabulary Pruning in Early-Exit LLMs
Jort Vincenti
Karim Abdel Sadek
Joan Velja
Matteo Nulli
Metod Jazbec
29
0
0
24 Oct 2024
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment
Elyas Obbad
Iddah Mlauzi
Brando Miranda
Rylan Schaeffer
Kamal Obbad
Suhana Bedi
Sanmi Koyejo
CVBM
53
0
0
23 Oct 2024
Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective
Zeyu Gan
Yong Liu
SyDa
46
1
0
02 Oct 2024
Synthetic continued pretraining
Zitong Yang
Neil Band
Shuangping Li
Emmanuel Candès
Tatsunori Hashimoto
CLL
SyDa
41
11
0
11 Sep 2024
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar
Divyanshu Kumar
Jatan Loya
Nitin Aravind Birur
Tanay Baswa
Sahil Agarwal
P. Harshangi
SyDa
55
5
0
14 Aug 2024
ABC Align: Large Language Model Alignment for Safety & Accuracy
Gareth Seneque
Lap-Hang Ho
Peter W. Glynn
Yinyu Ye
Jeffrey Molendijk
41
1
0
01 Aug 2024
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Tao Ge
Xin Chan
Dian Yu
Haitao Mi
Dong Yu
SyDa
122
97
0
28 Jun 2024
A social path to human-like artificial intelligence
Edgar A. Duénez-Guzmán
Suzanne Sadedin
Jane X. Wang
Kevin R. McKee
Joel Z Leibo
GNN
31
28
0
22 May 2024
Crowdsourcing with Enhanced Data Quality Assurance: An Efficient Approach to Mitigate Resource Scarcity Challenges in Training Large Language Models for Healthcare
Prosanta Barai
Gondy Leroy
Prakash Bisht
Joshua M Rothman
Sumi Lee
Jennifer G. Andrews
Sydney A Rice
Arif Ahmed
35
2
0
16 May 2024
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
Yang Bai
Ge Pei
Jindong Gu
Yong Yang
Xingjun Ma
33
10
0
09 May 2024
More Compute Is What You Need
Zhen Guo
62
0
0
30 Apr 2024
Advances and Open Challenges in Federated Learning with Foundation Models
Chao Ren
Han Yu
Hongyi Peng
Xiaoli Tang
Anran Li
...
A. Tan
Bo Zhao
Xiaoxiao Li
Zengxiang Li
Qiang Yang
FedML
AIFin
AI4CE
78
7
0
23 Apr 2024
TransformerFAM: Feedback attention is working memory
Dongseong Hwang
Weiran Wang
Zhuoyuan Huo
K. Sim
P. M. Mengibar
40
12
0
14 Apr 2024
Finding needles in a haystack: A Black-Box Approach to Invisible Watermark Detection
Minzhou Pan
Zhengting Wang
Xin Dong
Vikash Sehwag
Lingjuan Lyu
Xue Lin
40
3
0
23 Mar 2024
Enhancing Data Quality in Federated Fine-Tuning of Foundation Models
Wanru Zhao
Yaxin Du
Nicholas D. Lane
Siheng Chen
Yanfeng Wang
37
3
0
07 Mar 2024
Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts
Zewei Tian
Min Sun
Alex Liu
Shawon Sarkar
Jing Liu
40
5
0
06 Mar 2024
Video as the New Language for Real-World Decision Making
Sherry Yang
Jacob Walker
Jack Parker-Holder
Yilun Du
Jake Bruce
Andre Barreto
Pieter Abbeel
Dale Schuurmans
VGen
31
46
0
27 Feb 2024
Cleaner Pretraining Corpus Curation with Neural Web Scraping
Zhipeng Xu
Zhenghao Liu
Yukun Yan
Zhiyuan Liu
Ge Yu
Chenyan Xiong
CLIP
OnRL
27
4
0
22 Feb 2024
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization
Xuxi Chen
Zhendong Wang
Daouda Sow
Junjie Yang
Tianlong Chen
Yingbin Liang
Mingyuan Zhou
Zhangyang Wang
34
6
0
22 Feb 2024
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Matteo Pagliardini
Amirkeivan Mohtashami
F. Fleuret
Martin Jaggi
40
6
0
04 Feb 2024
Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
Zhen Qin
Daoyuan Chen
Bingchen Qian
Bolin Ding
Yaliang Li
Shuiguang Deng
FedML
40
32
0
11 Dec 2023
Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop
Martin Briesch
Dominik Sobania
Franz Rothlauf
35
55
0
28 Nov 2023