Variational Autoencoder

The Variational Autoencoder (VAE) is a deep generative model that learns latent representations by jointly training a probabilistic encoder and decoder, optimizing a variational lower bound on the data log-likelihood.

ELBO(θ, ϕ; x) = E_{q_ϕ(z|x)}[log p_θ(x | z)] − KL(q_ϕ(z | x) ‖ p(z))

Introduced independently by Kingma and Welling (2013) and Rezende, Mohamed, and Wierstra (2014), the Variational Autoencoder (VAE) fuses variational inference with deep neural networks to create a powerful framework for unsupervised learning and generative modelling. The VAE posits a latent variable model p_θ(x, z) = p(z) p_θ(x | z), where z is a latent code and p_θ(x | z) is a neural network decoder. Because the true posterior p_θ(z | x) is intractable, a neural network encoder q_ϕ(z | x) approximates it, and both networks are trained jointly by maximizing the ELBO.

Architecture and Training

The encoder (or recognition network) q_ϕ(z | x) maps an input x to the parameters of a distribution over the latent space — typically a diagonal Gaussian with mean μ_ϕ(x) and variance σ²_ϕ(x). The decoder (or generative network) p_θ(x | z) maps a latent code z back to a distribution over the data space. The prior p(z) is usually a standard normal N(0, I).

VAE Objective (ELBO):
L(θ, ϕ; x) = E_{q_ϕ(z|x)}[log p_θ(x | z)] − KL(q_ϕ(z | x) ‖ p(z))

Reconstruction term: E_{q_ϕ(z|x)}[log p_θ(x | z)]
Regularization term: KL(q_ϕ(z | x) ‖ p(z))

The first term encourages faithful reconstruction of the input. The second term regularizes the approximate posterior toward the prior, ensuring the latent space has a coherent global structure. For Gaussian q and Gaussian prior, the KL term has a closed-form expression, while the reconstruction term is estimated via Monte Carlo sampling.
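To make the objective concrete, here is a minimal sketch of a VAE in PyTorch, assuming a Bernoulli decoder over binarized inputs; the names VAE and negative_elbo and the sizes x_dim, h_dim, z_dim are illustrative choices, not prescribed by the text. The KL term uses the closed-form expression for diagonal Gaussians, and the reconstruction term is a one-sample Monte Carlo estimate.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE: diagonal-Gaussian encoder, Bernoulli decoder (illustrative sketch)."""
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        # Encoder q_phi(z|x): shared hidden layer, then mean and log-variance heads
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): maps a latent code to Bernoulli logits over pixels
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, logits, mu, logvar):
    # Reconstruction term: one-sample estimate of -E_q[log p_theta(x|z)]
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
    # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # minimizing this maximizes the ELBO

Training would then minimize negative_elbo(x, *model(x)) over minibatches with any stochastic gradient optimizer.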

The Reparameterization Trick

A central technical innovation of the VAE is the reparameterization trick. Instead of sampling z ∼ q_ϕ(z | x) directly (which blocks gradient flow), we write z = μ_ϕ(x) + σ_ϕ(x) ⊙ ε, where ε ∼ N(0, I). This makes the sampling operation differentiable with respect to ϕ, enabling standard backpropagation through the stochastic layer. This trick is not specific to VAEs and has become a cornerstone of stochastic gradient variational inference more broadly.
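A tiny PyTorch sketch illustrates why the trick matters: because the noise ε is drawn outside the computation path of μ and log σ², gradients with respect to both parameters come out of ordinary backpropagation. The specific tensor values below are arbitrary illustrations.

import torch

mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

eps = torch.randn_like(mu)                  # randomness isolated in eps ~ N(0, I)
z = mu + torch.exp(0.5 * log_var) * eps     # reparameterized sample

z.sum().backward()
print(mu.grad)        # tensor([1., 1.]): deterministic path through mu
print(log_var.grad)   # 0.5 * sigma * eps: noisy but unbiased gradient signal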

Posterior Collapse

A well-known failure mode of VAEs is posterior collapse, where the encoder learns to ignore the input and produce q_ϕ(z | x) ≈ p(z) for all x. The model then relies entirely on the decoder's capacity, and the latent code becomes uninformative. Mitigation strategies include KL annealing (gradually increasing the KL weight during training), free bits (imposing a minimum information constraint per latent dimension), and using more expressive decoders or priors.
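A sketch of the two KL-side mitigations may help; it assumes the PyTorch setting of the earlier sketch, and the budget of 0.5 nats per dimension and the 10,000 annealing steps are illustrative hyperparameters, not values from the text.

import torch

def regularizer(mu, logvar, step, anneal_steps=10_000, free_bits=0.5):
    # Per-dimension KL of the diagonal-Gaussian posterior from N(0, I)
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # shape (batch, z_dim)
    # Free bits: stop penalizing a dimension once its batch-average KL
    # falls below the budget, so each latent retains some information.
    kl_per_dim = torch.clamp(kl_per_dim.mean(dim=0), min=free_bits)
    # KL annealing: ramp the weight from 0 to 1 so the decoder learns
    # to use z before the regularizer pulls q(z|x) toward the prior.
    beta = min(1.0, step / anneal_steps)
    return beta * kl_per_dim.sum()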

Extensions and Variants

Conditional VAEs (CVAEs) condition both encoder and decoder on auxiliary information (e.g., class labels), enabling controlled generation. β-VAE (Higgins et al., 2017) introduces a hyperparameter β > 1 weighting the KL term, encouraging disentangled latent representations at the cost of reconstruction fidelity. VQ-VAE (van den Oord et al., 2017) replaces the continuous latent space with a discrete codebook, combining the VAE framework with vector quantization for high-quality audio and image generation.
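Of these, the β-VAE change is the simplest to state in code: the loss is the same reconstruction-plus-KL sum, with a fixed weight on the KL term. A minimal sketch, where β = 4 is only an illustrative value and recon_loss and kl are assumed to be computed as in the earlier ELBO sketch:

def beta_vae_loss(recon_loss, kl, beta=4.0):
    # beta > 1 penalizes divergence from the prior more heavily,
    # trading reconstruction fidelity for more disentangled latents;
    # beta = 1 recovers the standard VAE objective.
    return recon_loss + beta * kl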

More expressive approximate posteriors can be obtained through normalizing flows applied to the encoder output, and hierarchical VAEs stack multiple layers of latent variables to capture structure at different scales. The NVAE (Vahdat & Kautz, 2020) demonstrated that deep hierarchical VAEs can produce image quality competitive with GANs.

Connections to Bayesian Inference

The VAE is a concrete instantiation of amortized variational inference: rather than optimizing a separate q for each observation (as in classical VI), a single encoder network learns to map any x to its approximate posterior. This amortization enables rapid inference at test time — a single forward pass through the encoder — and is central to the scalability of the approach.

From a Bayesian perspective, the decoder defines a likelihood, the prior over z constitutes a genuine prior belief, and the encoder performs approximate posterior inference. The marginal likelihood p_θ(x) is lower-bounded by the ELBO, connecting VAE training directly to the model evidence framework used in Bayesian model comparison.
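The lower-bound claim can be written out explicitly; in LaTeX, with p_θ(z | x) denoting the exact (intractable) posterior, the standard identity is:

\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathrm{ELBO}(\theta,\phi;x)}
  + \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)
  \;\ge\; \mathrm{ELBO}(\theta, \phi; x)

since the KL term is non-negative; the bound is tight exactly when the encoder matches the true posterior.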

"The VAE paper was a watershed: it showed that variational inference and deep learning are not just compatible but synergistic, each amplifying the power of the other."— Shakir Mohamed, 2015

Applications

VAEs have been applied to image generation and inpainting, molecular design (generating novel drug candidates), text generation, speech synthesis, anomaly detection (using the ELBO as an anomaly score), and representation learning for downstream tasks. In the sciences, VAEs are used to model galaxy morphologies, protein structures, and single-cell gene expression data. Their principled probabilistic foundation — grounded in variational Bayesian theory — distinguishes them from purely adversarial approaches and provides uncertainty estimates alongside generated samples.

Worked Example: 1D VAE Reconstruction

We demonstrate a simple 1D VAE on 10 data points. The encoder maps each input to a latent Normal distribution, and the decoder reconstructs the input. We compute the ELBO decomposition.

Given 10 inputs: 1.0, 2.5, 3.8, 1.2, 4.1, 2.0, 3.5, 0.8, 4.5, 1.5
Input mean = 2.49, SD = 1.31

Step 1: Encoder
z_mean = (x − 2.49)/1.31  (standardize)
z_logvar = log(0.5)  (learned variance)
Example: x = 4.5 → z_mean = 1.53, z_var = 0.5

Step 2: Decoder (Reconstruction)
x̂ = z · 1.31 + 2.49  (reverse transform)
At z = z_mean: x̂ = x (perfect reconstruction at the posterior mean)

Step 3: Loss Decomposition
Reconstruction loss (per sample): E[(x − x̂)²] ≈ 0.00 (evaluating at the posterior mean)
KL divergence: KL(N(z_mean, 0.5) ‖ N(0, 1))
  = 0.5(0.5 + z_mean² − 1 − log(0.5))
  For x = 4.5: KL = 0.5(0.5 + 2.34 − 1 + 0.69) ≈ 1.27
Average KL across samples ≈ 0.60
ELBO = −(recon_loss + KL) ≈ −0.60 (treating the reconstruction term as a squared error)

The VAE balances two objectives: reconstructing inputs accurately (low reconstruction loss) and keeping the latent representation close to the prior N(0,1) (low KL). Points far from the data mean (like x = 4.5) incur higher KL because their latent encodings are far from zero. This regularization prevents the VAE from simply memorizing inputs and encourages learning smooth, generalizable representations.
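As a quick arithmetic check of the decomposition, the following self-contained Python sketch recomputes the averages from the ten inputs; the fixed latent variance of 0.5 comes from Step 1, and everything else is derived from the data.

import math

xs = [1.0, 2.5, 3.8, 1.2, 4.1, 2.0, 3.5, 0.8, 4.5, 1.5]
mean = sum(xs) / len(xs)                                     # 2.49
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))   # population SD ≈ 1.32 (rounded to 1.31 above)
z_var = 0.5                                                  # fixed encoder variance from Step 1

kls = []
for x in xs:
    z_mean = (x - mean) / sd                # encoder: standardize
    x_hat = z_mean * sd + mean              # decoder at the posterior mean: x_hat == x
    kls.append(0.5 * (z_var + z_mean**2 - 1 - math.log(z_var)))

avg_kl = sum(kls) / len(kls)
print(round(avg_kl, 2))                     # ≈ 0.60
print(round(-(0.0 + avg_kl), 2))            # ELBO ≈ -0.60 (reconstruction loss ≈ 0 at the mean)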

