DOLDA - a regularized supervised topic model for high-dimensional multi-class regression

Abstract

We introduce Diagonal Orthant Latent Dirichlet Allocation (DOLDA), a supervised topic model for multi-class classification that can handle both many classes and many covariates. To handle many classes, we use the recently proposed Diagonal Orthant (DO) probit model together with an efficient horseshoe prior for variable selection and shrinkage. An important advantage of DOLDA is that the learned topics are directly connected to individual classes, without the need for a reference class. We propose a computationally efficient parallel Gibbs sampler for the new model. We study the model's properties on an IMDb dataset with roughly 8000 documents, and report preliminary results in a bug prediction context where 118 components are predicted using 100 topics from bug reports.
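
To make the "no reference class" point concrete, the following is a minimal sketch of how class probabilities can be computed under a DO probit model. It assumes the standard DO construction, in which each class k has its own linear predictor eta_k and the probability of class k is proportional to Phi(eta_k) times the product of Phi(-eta_j) over the other classes. The function names and the example predictor values are hypothetical and are not taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def do_probit_probs(eta):
    """Class probabilities under a diagonal orthant probit model (sketch).

    eta holds one linear predictor per class (hypothetical values here).
    The unnormalized weight of class k is
        Phi(eta[k]) * prod_{j != k} Phi(-eta[j]),
    and the weights are normalized to sum to one. Every class has its own
    predictor, so no class is singled out as a reference.
    """
    weights = []
    for k in range(len(eta)):
        w = normal_cdf(eta[k])
        for j, e in enumerate(eta):
            if j != k:
                w *= normal_cdf(-e)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical linear predictors for three classes.
probs = do_probit_probs([0.5, -1.0, 2.0])
```

The direct product is fine for a handful of classes. With many classes, as in the 118-component setting mentioned above, summing log-CDF terms and normalizing in log space would be a safer choice against numerical underflow.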
