Exact Bayesian inference requires computing the posterior p(θ | y), which involves the often-intractable marginal likelihood p(y) = ∫ p(y | θ) p(θ) dθ. Variational Bayesian (VB) methods sidestep this integration by reformulating inference as optimization: find a distribution q(θ) from a tractable family Q that minimizes the Kullback–Leibler divergence KL(q ‖ p(· | y)). Because this KL divergence itself involves the intractable p(y), it cannot be minimized directly; instead, one maximizes the evidence lower bound (ELBO), which differs from the negative KL divergence only by the constant log p(y).
The Evidence Lower Bound
ELBO(q) = E_q[log p(y, θ)] − E_q[log q(θ)]
= E_q[log p(y | θ)] − KL(q(θ) ‖ p(θ))
Since KL(q ‖ p(· | y)) ≥ 0, the ELBO is always a lower bound on log p(y). Maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior. The second form of the ELBO — expected log-likelihood minus the KL divergence from the approximate posterior to the prior — makes the trade-off intuitive: the variational distribution must fit the data while remaining close to the prior.
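To make the bound concrete, here is a small sketch (not part of the original text) that estimates the ELBO by Monte Carlo for a Normal model with known noise variance, where log p(y) is available in closed form. The synthetic data, prior, and variable names are invented for illustration; with q equal to the exact posterior the estimate coincides with log p(y), and any other q falls below it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)            # synthetic data: y_i ~ N(theta, 1)
n = len(y)

# Prior theta ~ N(0, 1); the exact posterior is N(n*ybar/(n+1), 1/(n+1))
post_mean, post_var = n * y.mean() / (n + 1), 1.0 / (n + 1)

# Exact log evidence: marginally y ~ N(0, I + 1 1^T) under the prior
cov = np.eye(n) + np.ones((n, n))
log_evidence = stats.multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

def elbo(q_mean, q_var, num_samples=50_000):
    """Monte Carlo estimate of E_q[log p(y, theta)] - E_q[log q(theta)]."""
    theta = rng.normal(q_mean, np.sqrt(q_var), size=num_samples)
    log_lik = stats.norm.logpdf(y[None, :], loc=theta[:, None], scale=1.0).sum(axis=1)
    log_prior = stats.norm.logpdf(theta, 0.0, 1.0)
    log_q = stats.norm.logpdf(theta, q_mean, np.sqrt(q_var))
    return np.mean(log_lik + log_prior - log_q)

print("log p(y)               :", log_evidence)
print("ELBO at exact posterior:", elbo(post_mean, post_var))   # equals log p(y): KL is zero
print("ELBO at a poor q       :", elbo(0.0, 2.0))              # strictly smaller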
Mean-Field Variational Inference
The most common restriction on Q is the mean-field assumption: q(θ) = ∏ᵢ qᵢ(θᵢ), i.e. the blocks of parameters are treated as independent under q and each factor is optimized in turn. Under this factorization, the optimal form of each factor satisfies
log qᵢ*(θᵢ) = E₋ᵢ[log p(y, θ)] + const,
where E₋ᵢ denotes the expectation with respect to all factors qⱼ with j ≠ i.
This yields coordinate ascent variational inference (CAVI), which cycles through each factor, updating it while holding others fixed. For exponential-family models with conjugate priors, these updates have closed-form expressions, making CAVI extremely efficient. The ELBO is guaranteed to increase at each step, converging to a local optimum.
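As a concrete sketch of these coordinate updates, the code below runs CAVI for a Normal model with unknown mean and precision under independent (conditionally conjugate) priors, a standard textbook setting where each update is available in closed form. The data, hyperparameter values, and variable names are assumptions made for this illustration.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 0.5, size=100)           # synthetic observations: x_i ~ N(mu, 1/tau)
n, xbar = len(x), x.mean()

# Independent priors: mu ~ N(mu0, 1/tau0), tau ~ Gamma(a0, b0)
mu0, tau0, a0, b0 = 0.0, 0.01, 1.0, 1.0

# Mean-field factors q(mu) = N(m, v), q(tau) = Gamma(a, b)
a, b = a0, b0
for _ in range(50):                          # cycle through the factors until convergence
    e_tau = a / b                            # current E_q[tau]
    prec = tau0 + n * e_tau                  # update q(mu): precision and mean
    m = (tau0 * mu0 + e_tau * n * xbar) / prec
    v = 1.0 / prec
    # Update q(tau): E_q[sum (x_i - mu)^2] = sum (x_i - m)^2 + n * v
    a = a0 + n / 2.0
    b = b0 + 0.5 * (np.sum((x - m) ** 2) + n * v)

print("q(mu)  = N(%.3f, %.4f)" % (m, v))
print("q(tau) = Gamma(%.1f, %.3f), E[tau] = %.3f" % (a, b, a / b))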
Historical Context
Variational methods entered machine learning from statistical physics (mean-field theory) and information theory. Saul, Jaakkola, and Jordan (1996) applied variational bounds to Boltzmann machines and mixture models.
Attias, Ghahramani, and Beal developed variational Bayes for graphical models. Beal's 2003 thesis provided a comprehensive treatment of VB for a wide class of latent-variable models.
Kingma and Welling (2014) introduced the Variational Autoencoder (VAE), and concurrently Rezende et al. (2014) proposed stochastic backpropagation, marrying variational inference with deep learning.
Black-box variational inference (Ranganath et al., 2014), normalizing flows (Rezende & Mohamed, 2015), and amortized inference expanded the scope of VI to virtually arbitrary models.
Stochastic Variational Inference
Classical CAVI requires a full pass over the dataset at each iteration, making it prohibitive for large-scale problems. Stochastic variational inference (SVI), introduced by Hoffman et al. (2013), uses stochastic gradient ascent on the ELBO with minibatches of data, scaling inference to millions of observations. The reparameterization trick — expressing samples from q as deterministic transformations of noise — yields low-variance gradient estimates and underpins much of modern deep generative modelling.
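A minimal sketch of both ideas, assuming a toy model yᵢ ~ N(θ, 1) with prior θ ~ N(0, 1) so the exact answer is available for comparison: the minibatch rescales the likelihood term by N/B, and the reparameterization θ = m + exp(ρ)·ε gives pathwise gradients of the ELBO. The dataset, batch size, learning rate, and the Adam and iterate-averaging choices are all assumptions made for this illustration, not part of the original text.

import numpy as np

rng = np.random.default_rng(2)
N = 1_000
y = rng.normal(3.0, 1.0, size=N)                 # synthetic data: y_i ~ N(theta, 1), prior theta ~ N(0, 1)

# Variational family q(theta) = N(m, exp(rho)^2); theta = m + exp(rho) * eps reparameterizes samples
params = np.array([0.0, 0.0])                    # [m, rho]
mom1, mom2 = np.zeros(2), np.zeros(2)            # Adam moment estimates
lr, b1, b2, tiny, batch = 0.02, 0.9, 0.999, 1e-8, 100
tail = []

for t in range(1, 8001):
    m, rho = params
    idx = rng.integers(0, N, size=batch)         # minibatch of the data
    eps = rng.normal()
    theta = m + np.exp(rho) * eps                # reparameterized draw from q
    # Gradient of the log joint w.r.t. theta, with the likelihood term rescaled by N / batch
    g = (N / batch) * np.sum(y[idx] - theta) - theta
    grad = np.array([g, g * eps * np.exp(rho) + 1.0])   # pathwise dELBO/dm and dELBO/drho (+1 from the entropy of q)
    mom1 = b1 * mom1 + (1 - b1) * grad
    mom2 = b2 * mom2 + (1 - b2) * grad ** 2
    params = params + lr * (mom1 / (1 - b1 ** t)) / (np.sqrt(mom2 / (1 - b2 ** t)) + tiny)   # ascent on the ELBO
    if t > 6000:
        tail.append(params.copy())               # average the final iterates to smooth gradient noise

m, rho = np.mean(tail, axis=0)
print("SVI  : mean %.3f, var %.5f" % (m, np.exp(2 * rho)))
print("exact: mean %.3f, var %.5f" % (N * y.mean() / (N + 1), 1.0 / (N + 1)))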
The Choice of Divergence
Variational inference minimizes KL(q ‖ p), the "reverse" or "exclusive" KL divergence. This tends to produce approximations that are mode-seeking: q will concentrate on one mode of the posterior and underestimate variance. The "forward" or "inclusive" KL, KL(p ‖ q), used in expectation propagation, yields moment-matching, mass-covering approximations. The choice of divergence profoundly affects the quality and character of the approximation.
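The following sketch (an illustration, not from the original text) makes the contrast numerical: a single Gaussian is fitted to a well-separated two-component mixture by minimizing each divergence on a grid. The target, grid, and optimizer are arbitrary choices; the reverse-KL fit locks onto one mode, while the forward-KL fit matches the overall mean and variance.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Target "posterior": a well-separated two-component Gaussian mixture
grid = np.linspace(-10, 10, 4001)
dx = grid[1] - grid[0]
p = 0.5 * stats.norm.pdf(grid, -3, 0.7) + 0.5 * stats.norm.pdf(grid, 3, 0.7)

def kl(a, b):
    """Numerical KL(a || b) on the grid."""
    mask = a > 1e-12
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask] + 1e-300))) * dx

def gaussian(params):
    mean, log_sd = params
    return stats.norm.pdf(grid, mean, np.exp(log_sd))

# Reverse (exclusive) KL(q || p): mode-seeking
rev = minimize(lambda t: kl(gaussian(t), p), x0=[1.0, 0.0], method="Nelder-Mead")
# Forward (inclusive) KL(p || q): mass-covering
fwd = minimize(lambda t: kl(p, gaussian(t)), x0=[1.0, 0.0], method="Nelder-Mead")

print("reverse KL fit: mean %.2f, sd %.2f" % (rev.x[0], np.exp(rev.x[1])))  # locks onto one mode
print("forward KL fit: mean %.2f, sd %.2f" % (fwd.x[0], np.exp(fwd.x[1])))  # spans both modes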
Richer Variational Families
The mean-field assumption can be restrictive. Structured variational inference preserves selected dependencies between latent variables. Normalizing flows transform a simple base distribution through a sequence of invertible mappings, yielding flexible yet tractable densities. These approaches narrow the gap between variational and sampling-based methods, sometimes rivalling MCMC accuracy at a fraction of the computational cost.
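As a sketch of the bookkeeping behind flows, the code below pushes samples from a standard Normal base through a few planar transforms f(z) = z + u·tanh(wᵀz + b) and tracks log q with the change-of-variables formula. The parameters are small random values (untrained), chosen only so each map stays invertible; this shows the mechanics of density evaluation, not a fitted flow.

import numpy as np

rng = np.random.default_rng(3)
dim, n_flows = 2, 4

# Small random (untrained) planar-flow parameters (u, w, b); keeping them small keeps each map invertible
flows = [(0.3 * rng.normal(size=dim), 0.3 * rng.normal(size=dim), 0.3 * rng.normal())
         for _ in range(n_flows)]

z = rng.normal(size=(5, dim))                               # z0 ~ N(0, I): the simple base distribution
log_q = -0.5 * np.sum(z ** 2, axis=1) - 0.5 * dim * np.log(2 * np.pi)

for u, w, b in flows:                                       # f(z) = z + u * tanh(w.z + b)
    a = z @ w + b
    psi = (1 - np.tanh(a) ** 2)[:, None] * w                # gradient of tanh(w.z + b) w.r.t. z
    log_q -= np.log(np.abs(1.0 + psi @ u))                  # change of variables: minus log|det df/dz|
    z = z + np.tanh(a)[:, None] * u                         # push the samples through the map

print("transformed samples z_K:\n", z)
print("log q_K(z_K) under the flow:", log_q)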
Diagnostics and Limitations
Unlike MCMC, variational methods lack straightforward convergence diagnostics. The ELBO monitors optimization progress but does not indicate how close q is to p(θ | y). Pathological optima can arise in multimodal posteriors. Recent work on variational inference diagnostics — including Pareto-smoothed importance sampling (PSIS) checks proposed by Yao et al. (2018) — aims to flag unreliable approximations.
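A rough sketch of the idea behind such checks, using only raw importance ratios: draw from q, compute log p(y, θ) − log q(θ), and see how uneven the weights are. PSIS goes further by fitting a generalized Pareto distribution to the largest ratios and reporting the shape estimate k̂, with k̂ above roughly 0.7 flagging an unreliable approximation. The model, data, and the crude effective-sample-size summary below are illustrative assumptions, not the published procedure.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(1.5, 1.0, size=30)             # toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 1)
n = len(y)
post_mean, post_var = n * y.mean() / (n + 1), 1.0 / (n + 1)

def relative_ess(q_mean, q_var, S=20_000):
    """Draw from q and summarize the importance ratios log p(y, theta) - log q(theta)."""
    theta = rng.normal(q_mean, np.sqrt(q_var), size=S)
    log_joint = (stats.norm.logpdf(y[None, :], loc=theta[:, None], scale=1.0).sum(axis=1)
                 + stats.norm.logpdf(theta, 0.0, 1.0))
    log_r = log_joint - stats.norm.logpdf(theta, q_mean, np.sqrt(q_var))
    w = np.exp(log_r - log_r.max())           # unnormalized importance weights
    return (w.sum() ** 2 / (w ** 2).sum()) / S   # crude effective sample size, as a fraction of S

print("q = exact posterior :", relative_ess(post_mean, post_var))      # close to 1.0
print("q too narrow        :", relative_ess(post_mean, post_var / 4))  # much lower: heavy-tailed weights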
"Variational inference trades the asymptotic exactness of MCMC for speed, turning inference into optimization — a trade that is often spectacularly worthwhile."— David Blei, 2017
Worked Example: Mean-Field VI for a Normal Model
We observe 20 values and approximate the joint posterior of the mean μ and precision τ = 1/σ² using mean-field variational inference under a conjugate Normal-Inverse-Gamma prior, and compare the VI approximation to the exact posterior.
Prior (Normal-Inverse-Gamma): μ | σ² ~ N(μ₀ = 0, σ²/κ₀) with κ₀ = 0.1, σ² ~ IG(α₀ = 1, β₀ = 1)
Step 1: Exact Posterior
κₙ = κ₀ + n = 0.1 + 20 = 20.1
μₙ = (κ₀·μ₀ + n·x̄)/κₙ = (0 + 49.1)/20.1 = 2.443
αₙ = α₀ + n/2 = 1 + 10 = 11
βₙ = β₀ + ½·n·s² + ½·κ₀·n·(x̄ − μ₀)²/κₙ = 1 + 3.18 + 2.99 = 7.17
Step 2: VI Approximation (CAVI)
For this conjugate Normal model, the CAVI updates are available in closed form and converge within a few sweeps, giving the factorized approximation:
q(μ) = N(2.443, 0.0324)
q(τ) = Gamma(11, 7.17), E[τ] = 1.534
Step 3: Comparison
Exact E[μ | data] = 2.443, VI E[μ] = 2.443
KL(q ‖ p) is close to zero — the approximation is nearly indistinguishable from the exact posterior, though not exactly equal to it
For this conjugate Normal model, mean-field variational inference recovers the exact posterior mean of μ, and the factorized approximation is nearly indistinguishable from the truth. The agreement is so close because the exact Normal-Inverse-Gamma posterior couples μ and τ only weakly at n = 20, and its Student-t marginal for μ (22 degrees of freedom) is nearly Gaussian; strictly, though, the true posterior does not factorize, so the mean-field family cannot match it exactly, and q(μ) slightly understates the posterior variance, the usual signature of the reverse KL. For non-conjugate models the gap is typically larger, and the ELBO provides only a lower bound on the log marginal likelihood.
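The 20 observations behind the numbers above are not listed, so the sketch below reruns the calculation on a synthetic sample with a similar mean: it applies the exact Normal-Inverse-Gamma update from Step 1 and then runs CAVI under the same conjugate prior. The data, seed, and variable names are placeholders; note that the CAVI factor for τ has shape α₀ + (n + 1)/2 rather than the exact αₙ = α₀ + n/2, a small visible trace of the factorization that is negligible at n = 20.

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(2.45, 0.6, size=20)            # stand-in data; the original 20 values are not listed
n, xbar = len(x), x.mean()
ss = np.sum((x - xbar) ** 2)

# Conjugate prior: mu | tau ~ N(mu0, 1/(kappa0*tau)), tau ~ Gamma(alpha0, beta0)
mu0, kappa0, alpha0, beta0 = 0.0, 0.1, 1.0, 1.0

# Exact Normal-Inverse-Gamma update (Step 1)
kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
alpha_n = alpha0 + n / 2.0
beta_n = beta0 + 0.5 * ss + 0.5 * kappa0 * n * (xbar - mu0) ** 2 / kappa_n

# Mean-field CAVI under the same prior: q(mu) = N(m, v), q(tau) = Gamma(a, b)
a, b = alpha_n, beta_n                        # any positive initialization works
for _ in range(100):
    e_tau = a / b
    m, v = mu_n, 1.0 / (kappa_n * e_tau)      # the mean update does not depend on E[tau]
    # E_q[sum (x_i - mu)^2] = ss + n*(xbar - m)^2 + n*v ;  E_q[(mu - mu0)^2] = (m - mu0)^2 + v
    a = alpha0 + (n + 1) / 2.0
    b = beta0 + 0.5 * (ss + n * (xbar - m) ** 2 + n * v) + 0.5 * kappa0 * ((m - mu0) ** 2 + v)

print("exact: mu_n %.3f, kappa_n %.1f, alpha_n %.1f, beta_n %.3f" % (mu_n, kappa_n, alpha_n, beta_n))
print("CAVI : q(mu) = N(%.3f, %.4f), q(tau) = Gamma(%.2f, %.3f)" % (m, v, a, b))
print("exact marginal Var[mu] = %.4f vs CAVI Var[mu] = %.4f (slightly smaller)"
      % (beta_n / (kappa_n * (alpha_n - 1)), v))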