Variation in the vocabulary of Russian literary texts

Abstract

In this paper, data from a large online collection are used to study variation in the vocabulary of Russian literary texts. First, we find that variation in the vocabulary size of different authors is in good agreement with Heaps' law with parameter λ = 0.5, consistent with previous studies. The overall distribution of word frequencies has lighter tails than both the Zipf and lognormal laws predict. Next, we focus on the variation of word frequencies across texts. We confirm statistically that word frequencies vary significantly across texts, and find that the variance of the cross-text frequency distribution is in general higher for more frequent words. The dependence of the variance on the average word frequency follows a power law with exponent α = 1.25. The factor models applied to the data suggest that most of the cross-text variation is concentrated in fewer than 100 of the most frequent words and can be explained by about 10 factors. For less frequent words, the frequency data are more suitable for factor analysis if they are normalized to take into account the size of texts. The matrix of normalized frequencies for less frequent words exhibits properties characteristic of a large random matrix deformed by a low-rank matrix. As an example application, the spectral factors for the most frequent words are used to classify texts, and it is found that the k-means classification algorithm based on these factors can classify texts by author with an accuracy that varies from about 70% to 90% for prose writers, and from 60% to 80% for poets.
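
Heaps' law models vocabulary size V as a power of text length N, V ≈ K·N^λ, so the exponent can be read off as the slope of a straight line in log-log coordinates. The following is a minimal sketch of such a fit, not the estimation procedure used in the paper: the function name `fit_power_law`, the constant K = 30, and the synthetic data are assumptions made purely for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Ordinary least-squares fit of log(y) = log(k) + exponent * log(x).

    Returns the pair (k, exponent). Hypothetical helper for illustration.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the power-law exponent.
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = sxy / sxx
    k = math.exp(mean_y - exponent * mean_x)
    return k, exponent

# Synthetic data (an assumption, not taken from the paper): vocabulary sizes
# generated exactly from V = 30 * N**0.5 for a few text lengths N.
text_lengths = [1_000, 10_000, 100_000, 1_000_000]
vocab_sizes = [30.0 * n ** 0.5 for n in text_lengths]

k, lam = fit_power_law(text_lengths, vocab_sizes)
print(f"K = {k:.3f}, lambda = {lam:.3f}")
```

The same log-log regression could in principle be applied to pairs of (average word frequency, cross-text variance) to estimate the exponent α of the variance power law; real data would be noisy, so the fitted values would only approximate the reported parameters.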
