RAND-WALK: A Latent Variable Model Approach to Word Embeddings

Semantic word embeddings represent the meaning of a word via a vector, and are created by diverse methods including Vector Space Methods (VSMs) such as Latent Semantic Analysis (LSA), generative text models such as topic models, matrix factorization, neural nets, and energy-based models. Many of these use nonlinear operations on co-occurrence statistics, such as computing Pointwise Mutual Information (PMI). Some use hand-tuned hyperparameters and term reweighting. Often a generative model can help provide theoretical insight into such modeling choices, but there appears to be no such model to explain the above nonlinear models. For example, we know of no generative model for which the correct solution is the usual (dimension-restricted) PMI model. This paper gives a new generative model, a dynamic version of the log-linear topic model of Mnih and Hinton (2007), as well as a pair of training objectives called RAND-WALK to compute word embeddings. The methodological novelty is to use the prior to compute closed form expressions for word statistics. These provide an explanation for the PMI model and other recent models, as well as hyperparameter choices. Experimental support is provided for the generative model assumptions, the most important of which is that latent word vectors are spatially isotropic. The model also helps explain why linear algebraic structure arises in low-dimensional semantic embeddings. Such structure has been used to solve analogy tasks by Mikolov et al. (2013a) and many subsequent papers. This theoretical explanation leads to an improved analogy-solving method that raises success rates on analogy tasks by a few percent.
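For readers unfamiliar with the baseline the abstract refers to, the following is a minimal illustrative sketch (not the paper's code) of the "dimension-restricted PMI" construction: a PMI matrix is built from a toy co-occurrence count matrix and word vectors are obtained by truncating its SVD. The counts and dimension are hypothetical.

```python
import numpy as np

# Toy symmetric co-occurrence counts C[i, j] for a 3-word vocabulary
# (illustrative numbers only, not from the paper).
C = np.array([
    [10.0, 4.0, 1.0],
    [ 4.0, 8.0, 2.0],
    [ 1.0, 2.0, 6.0],
])

total = C.sum()
p_xy = C / total                        # joint co-occurrence probabilities
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal word probabilities
pmi = np.log(p_xy / (p_x * p_x.T))      # PMI(x, y) = log p(x, y) / (p(x) p(y))

# "Dimension restriction": keep only the top-d singular directions of the
# PMI matrix to get one low-dimensional vector per word.
d = 2
U, S, Vt = np.linalg.svd(pmi)
word_vectors = U[:, :d] * np.sqrt(S[:d])
print(word_vectors)
```

This is the standard SVD-of-PMI baseline; the paper's contribution is a generative model under which such low-dimensional embeddings (and the associated hyperparameter choices) arise as the correct solution, not this particular implementation.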