ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1911.08915
25
1
v1v2 (latest)

Universal and non-universal text statistics: Clustering coefficient for language identification

18 November 2019
D. Espitia
H. Larralde
ArXiv (abs)PDFHTML
Abstract

In this work we analyze statistical properties of 91 relatively small texts in 7 different languages (Spanish, English, French, German, Turkish, Russian, Icelandic) as well as texts with randomly inserted spaces. Despite the size (around 11260 different words), the well known universal statistical laws -- namely Zipf and Herdan-Heap's laws -- are confirmed, and are in close agreement with results obtained elsewhere. We also construct a word co-occurrence network of each text. While the degree distribution is again universal, we note that the distribution of Clustering Coefficients, which depend strongly on the local structure of networks, can be used to differentiate between languages, as well as to distinguish natural languages from random texts.

View on arXiv
Comments on this paper