Delving into: the quantification of Ai-generated content on the internet (synthetic data)

Abstract

While it is increasingly evident that the internet is becoming saturated with content created by generative AI large language models, accurately measuring the scale of this phenomenon has proven challenging. By analyzing the frequency of specific keywords commonly used by ChatGPT, this paper demonstrates that such linguistic markers can be used effectively to estimate the presence of generative AI content online. The findings suggest that at least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely approaching 40%. Given the implications of autophagous loops, in which models are trained on their own synthetic output, this is a sobering realization.
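
The keyword-frequency idea can be illustrated with a minimal sketch. It is not the paper's method: the marker list, the function name `marker_rate`, and the toy page samples are all hypothetical, and the abstract does not say which keywords were used or how frequencies were turned into a percentage. The sketch only shows how one might measure the share of pages that contain at least one marker word, so that samples from different periods can be compared.

```python
import re

# Hypothetical marker words (assumption): the abstract does not list the
# keywords the paper actually analyzed.
MARKERS = ("delve", "delves", "delving")


def marker_rate(texts, markers=MARKERS):
    """Return the fraction of texts that contain at least one marker word."""
    if not texts:
        return 0.0
    # Whole-word, case-insensitive match on any of the marker words.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    hits = sum(1 for text in texts if pattern.search(text))
    return hits / len(texts)


# Toy samples (invented for illustration, not data from the paper).
older_pages = ["A short history of the town.", "Opening hours and prices."]
recent_pages = ["Let us delve into the details.", "Opening hours and prices."]

print(marker_rate(older_pages), marker_rate(recent_pages))
```

A higher marker rate in a recent sample than in an older baseline would be consistent with more AI-generated text, but turning that difference into a share of all web text requires calibration that this sketch does not attempt.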

@article{spennemann2025_2504.08755,
  title={Delving into: the quantification of Ai-generated content on the internet (synthetic data)},
  author={Dirk HR Spennemann},
  journal={arXiv preprint arXiv:2504.08755},
  year={2025}
}