Deep Directed Generative Autoencoders

2 October 2014
Sherjil Ozair
Yoshua Bengio
Abstract

For discrete data, the likelihood $P(x)$ can be rewritten exactly and parametrized into $P(X = x) = P(X = x \mid H = f(x))\,P(H = f(x))$ if $P(X \mid H)$ has enough capacity to put no probability mass on any $x'$ for which $f(x') \neq f(x)$, where $f(\cdot)$ is a deterministic discrete function. The log of the first factor gives rise to the log-likelihood reconstruction error of an autoencoder with $f(\cdot)$ as the encoder and $P(X \mid H)$ as the (probabilistic) decoder. The log of the second term can be seen as a regularizer on the encoded activations $h = f(x)$, e.g., as in sparse autoencoders. Both encoder and decoder can be represented by a deep neural network and trained to maximize the average of the optimal log-likelihood $\log p(x)$. The objective is to learn an encoder $f(\cdot)$ that maps $X$ to $f(X)$ such that $f(X)$ has a much simpler distribution than $X$ itself, estimated by $P(H)$. This "flattens the manifold" or concentrates probability mass in a smaller number of (relevant) dimensions over which the distribution factorizes. Generating samples from the model is straightforward using ancestral sampling. One challenge is that regular back-propagation cannot be used to obtain the gradient on the parameters of the encoder, but we find that using the straight-through estimator works well here. We also find that although optimizing a single level of such an architecture may be difficult, much better results can be obtained by pre-training and stacking them, gradually transforming the data distribution into one that is more easily captured by a simple parametric model.
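Read as a training recipe, the abstract amounts to: minimize the sum of a reconstruction term $-\log P(x \mid f(x))$ and a prior term $-\log P(f(x))$, back-propagating through the discrete code with the straight-through estimator, and sample ancestrally by first drawing $h \sim P(H)$ and then $x \sim P(X \mid H = h)$. The sketch below (PyTorch, not the authors' code) illustrates one level of such an architecture under assumed choices: a deterministic binary encoder, a factorized Bernoulli decoder and prior, and illustrative layer sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STBinarize(torch.autograd.Function):
    """Deterministic binarization h = 1[a > 0] whose backward pass is the
    straight-through estimator: the gradient is copied through unchanged."""
    @staticmethod
    def forward(ctx, a):
        return (a > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # treat dh/da as the identity

class DeepDirectedGenerativeAE(nn.Module):
    # Hypothetical sizes; the paper's experimental settings may differ.
    def __init__(self, x_dim=784, h_dim=200):
        super().__init__()
        # Deterministic discrete encoder f(.): x -> h in {0,1}^h_dim
        self.encoder = nn.Sequential(
            nn.Linear(x_dim, 500), nn.Tanh(), nn.Linear(500, h_dim))
        # Probabilistic decoder P(X | H): factorized Bernoulli over inputs
        self.decoder = nn.Sequential(
            nn.Linear(h_dim, 500), nn.Tanh(), nn.Linear(500, x_dim))
        # Simple parametric prior P(H): independent Bernoulli per code unit
        self.prior_logits = nn.Parameter(torch.zeros(h_dim))

    def loss(self, x):
        h = STBinarize.apply(self.encoder(x))  # h = f(x)
        # log P(X = x | H = f(x)): the reconstruction term
        log_px_h = -F.binary_cross_entropy_with_logits(
            self.decoder(h), x, reduction="none").sum(dim=1)
        # log P(H = f(x)): the regularizer on the encoded activations
        lp = self.prior_logits
        log_ph = (h * F.logsigmoid(lp)
                  + (1 - h) * F.logsigmoid(-lp)).sum(dim=1)
        return -(log_px_h + log_ph).mean()  # average negative log-likelihood

    @torch.no_grad()
    def sample(self, n=16):
        # Ancestral sampling: draw h ~ P(H), then x ~ P(X | H = h)
        probs = torch.sigmoid(self.prior_logits).expand(n, -1)
        h = torch.bernoulli(probs)
        return torch.bernoulli(torch.sigmoid(self.decoder(h)))
```

Stacking, as described in the abstract, would correspond to pre-training one such level and then training another on the codes $h$ it produces, so that each level's factorized prior has an easier distribution to fit.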

View on arXiv: https://arxiv.org/abs/1410.0630