For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(x) = P(x \mid h = f(x))\, P(h = f(x))$ if $P(x \mid h)$ has enough capacity to put no probability mass on any $x'$ for which $f(x') \neq h$, where $f$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f$ as the encoder and $P(x \mid h)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h = f(x)$, e.g., as in sparse autoencoders. Both the encoder and the decoder can be represented by deep neural networks and trained to maximize the average of the optimal log-likelihood $\log P(x)$. The objective is to learn an encoder $f$ that maps $X$ to $f(X)$, whose distribution is much simpler than that of $X$ itself, as estimated by $P(h)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such an architecture may be difficult, much better results can be obtained by pre-training and stacking several such levels, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
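Written out, the decomposition referred to above is simply the log of the stated factorization, with the training criterion taken (as an assumed detail) to be its average over the data distribution:

$$
\log P(x) \;=\; \underbrace{\log P\big(x \mid h = f(x)\big)}_{\text{reconstruction error}} \;+\; \underbrace{\log P\big(h = f(x)\big)}_{\text{regularizer on the code}},
\qquad
\max_{f,\,P}\; \mathbb{E}_{x \sim \text{data}}\Big[\log P\big(x \mid f(x)\big) + \log P\big(f(x)\big)\Big].
$$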
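The sketch below illustrates one level of such a model; it is not the authors' implementation. It assumes binary inputs, a binary code $h = f(x)$ obtained by thresholding a sigmoid, a factorized Bernoulli prior standing in for $P(h)$, and PyTorch for automatic differentiation; the class and parameter names (DiscreteAutoencoder, BinarizeSTE, n_hidden, ...) are illustrative.

# Minimal sketch of a one-level discrete autoencoder trained with the
# straight-through estimator (STE). Architecture sizes and the factorized
# Bernoulli code prior are assumptions, not details from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Hard threshold on the forward pass; identity (straight-through) backward pass."""
    @staticmethod
    def forward(ctx, probs):
        return (probs > 0.5).float()
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass the gradient through the discrete step unchanged

class DiscreteAutoencoder(nn.Module):
    def __init__(self, n_visible=784, n_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_visible, 256), nn.ReLU(),
                                     nn.Linear(256, n_hidden))
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 256), nn.ReLU(),
                                     nn.Linear(256, n_visible))
        # Logits of a factorized Bernoulli prior P(h), playing the role of the regularizer term.
        self.prior_logits = nn.Parameter(torch.zeros(n_hidden))

    def encode(self, x):
        return BinarizeSTE.apply(torch.sigmoid(self.encoder(x)))  # h = f(x) in {0,1}

    def forward(self, x):
        h = self.encode(x)
        recon_logits = self.decoder(h)
        # -log P(x | h = f(x)): Bernoulli reconstruction error.
        recon_nll = F.binary_cross_entropy_with_logits(recon_logits, x, reduction='none').sum(1)
        # -log P(h = f(x)) under the factorized Bernoulli code prior.
        log_p1 = F.logsigmoid(self.prior_logits)
        log_p0 = F.logsigmoid(-self.prior_logits)
        prior_nll = -(h * log_p1 + (1.0 - h) * log_p0).sum(1)
        return (recon_nll + prior_nll).mean()  # minimize -E[log P(x)]

    @torch.no_grad()
    def sample(self, n):
        # Ancestral sampling: h ~ P(h), then x ~ P(x | h).
        h = torch.bernoulli(torch.sigmoid(self.prior_logits).expand(n, -1))
        return torch.bernoulli(torch.sigmoid(self.decoder(h)))

# Usage on dummy binary data:
model = DiscreteAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.bernoulli(torch.full((32, 784), 0.3))
loss = model(x)
loss.backward()   # gradients reach the encoder only via the straight-through estimator
opt.step()
samples = model.sample(16)

Stacking, in the sense described above, would amount to pre-training one such level and then training a second level of the same form on the codes $h$ it produces, so that the top-level code distribution becomes easy to capture with a simple parametric model.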