Random walks on discourse spaces: a new generative language model with applications to semantic word embeddings

Semantic word embeddings represent the meaning of a word as a vector. Methods to create them include Vector Space Methods (VSMs) such as Latent Semantic Analysis (LSA), matrix factorization, generative text models such as Topic Models, and neural nets. A flurry of work has resulted from the papers of Mikolov et al.~\cite{mikolov2013efficient}, which showed how to solve word analogy tasks very well by leveraging linear structure in word embeddings, even though the embeddings were created using highly nonlinear energy-based models. No clear explanation is known for why such linear structure emerges in low-dimensional embeddings. This paper presents a log-linear generative model---related to that of~\citet{mnih2007three}---that treats the generation of a text corpus as a random walk in a latent discourse space. A novel methodological twist is that the model is solved in closed form by integrating out the random walk, which yields a simple method for constructing word embeddings. Experiments are presented to support the modeling assumptions as well as the efficacy of the word embeddings for solving analogies. This simple model links, and provides theoretical support for, several prior methods for finding embeddings, and it offers interpretations of various linear algebraic structures in word embeddings obtained from nonlinear techniques.
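To make the generative process concrete, here is a minimal sketch of a random walk in a latent discourse space with log-linear word emission, assuming the emission probability takes the common inner-product form Pr[w | c] proportional to exp(<c, v_w>). All names, dimensions, and step sizes below are illustrative choices, not the paper's actual parameters or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size, n_steps = 10, 1000, 50

# Hypothetical word vectors; in the model these are latent parameters.
word_vectors = rng.normal(scale=1.0 / np.sqrt(d), size=(vocab_size, d))

def emit_word(c):
    """Sample a word from the log-linear distribution Pr[w | c] ~ exp(<c, v_w>)."""
    logits = word_vectors @ c
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(vocab_size, p=probs))

# Slow random walk of the discourse vector c_t: small Gaussian steps,
# renormalized so c_t stays near the unit sphere. Because consecutive
# discourse vectors are close, nearby words in the corpus share context.
c = rng.normal(size=d)
c /= np.linalg.norm(c)
corpus = [emit_word(c) for _ in range(n_steps)
          if (c := (lambda x: x / np.linalg.norm(x))(c + rng.normal(scale=0.05, size=d))) is not None]
```

Likewise illustrative is the standard vector-offset heuristic for the analogy tasks the abstract mentions: given trained embeddings, "a : b :: c : ?" is answered by the word whose vector is most cosine-similar to v_b - v_a + v_c, excluding the query words. This is the widely used evaluation recipe, not necessarily the paper's exact protocol.

```python
def solve_analogy(a, b, c, vectors):
    """Return the index of the word best completing a : b :: c : ?."""
    target = vectors[b] - vectors[a] + vectors[c]
    sims = vectors @ target / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))
    for w in (a, b, c):       # exclude the query words themselves
        sims[w] = -np.inf
    return int(np.argmax(sims))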