Improved Gibbs Sampling Parameter Estimators for Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for discovering the underlying structure of discrete data. LDA and its extensions have been successfully used for both unsupervised and supervised learning tasks across a variety of data types including textual, image, and biological data. After more than a decade of intensive research on training algorithms for LDA, the Collapsed Gibbs Sampler (CGS), in which the parameters are marginalized out, remains one of the most popular LDA inference algorithms. We introduce a novel approach for estimating LDA parameters from collapsed Gibbs samples, by leveraging the full conditional distributions over the latent variable assignments to efficiently average over multiple samples, for little more than the cost of drawing a single sample. We perform extensive empirical comparisons of our estimators with those of standard collapsed inference algorithms on real-world data for both unsupervised LDA and Prior-LDA, a supervised variant of LDA for multi-label classification. Our results show a consistent advantage of our approach over traditional CGS under all experimental conditions, and over Collapsed Variational Bayes (CVB0) inference under the majority of conditions. More broadly, our results highlight the importance of averaging over multiple samples in LDA parameter estimation, and the use of efficient computational techniques to do so.
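
To make the core idea concrete, the following is a minimal sketch, not the authors' implementation: in standard CGS, the topic-document (theta) and topic-word (phi) parameters are estimated from the hard assignment counts of a single sample, whereas the approach described above instead accumulates each token's full conditional distribution over topics as a soft count, implicitly averaging over every sample reachable in one Gibbs step (a Rao-Blackwellized estimate). All names (`cgs_p_estimates`, the count arrays `n_dk`, `n_kw`, `n_k`) and the exact interfaces are assumptions for illustration.

```python
import numpy as np

def cgs_p_estimates(docs, z, n_dk, n_kw, n_k, alpha, beta):
    """Illustrative sketch: estimate theta/phi from a collapsed Gibbs state
    by accumulating each token's full conditional p(z = k | rest) as a soft
    count, rather than using the hard counts of the single current sample.

    docs : list of lists of word ids (one inner list per document)
    z    : current topic assignment for each token, same shape as docs
    n_dk : (D, K) document-topic counts;  n_kw : (K, V) topic-word counts
    n_k  : (K,) topic totals;  alpha, beta : symmetric Dirichlet priors
    """
    K, V = n_kw.shape
    soft_dk = np.zeros_like(n_dk, dtype=float)  # expected doc-topic counts
    soft_kw = np.zeros_like(n_kw, dtype=float)  # expected topic-word counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Decrement the token's own assignment to get the "-di" counts.
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1
            # Full conditional over topics for this token (standard CGS form).
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            p /= p.sum()
            soft_dk[d] += p
            soft_kw[:, w] += p
            # Restore the counts: we only estimate, we do not resample here.
            n_dk[d, k_old] += 1; n_kw[k_old, w] += 1; n_k[k_old] += 1
    # Smoothed estimates built from soft rather than hard counts.
    theta = (soft_dk + alpha) / (soft_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (soft_kw + beta) / (soft_kw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi
```

Since the conditional `p` is exactly the distribution a CGS sweep would sample from, this estimation pass costs roughly the same as drawing one additional sample, which matches the abstract's claim of averaging over multiple samples "for little more than the cost of drawing a single sample."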