Variational Bayesian Methods

Variational Bayesian methods recast posterior inference as an optimization problem, seeking the member of a tractable family of distributions that is closest to the true posterior in Kullback–Leibler divergence.

q*(θ) = arg min_{q ∈ Q} KL(q(θ) ‖ p(θ | y)) = arg max_{q ∈ Q} ELBO(q)

Exact Bayesian inference requires computing the posterior p(θ | y), which involves the often-intractable marginal likelihood p(y) = ∫ p(y | θ) p(θ) dθ. Variational Bayesian (VB) methods sidestep this integration by reformulating inference as optimization: find a distribution q(θ) from a tractable family Q that minimizes the Kullback–Leibler divergence KL(q ‖ p(· | y)). Because this KL divergence itself involves p(y), the optimization is equivalently expressed as maximizing the evidence lower bound (ELBO).

The Evidence Lower Bound

ELBO Decomposition
log p(y) = ELBO(q) + KL(q(θ) ‖ p(θ | y))

ELBO(q) = E_q[log p(y, θ)] − E_q[log q(θ)]
            = E_q[log p(y | θ)] − KL(q(θ) ‖ p(θ))

Since KL(q ‖ p(· | y)) ≥ 0, the ELBO is always a lower bound on log p(y). Maximizing the ELBO is equivalent to minimizing the KL divergence to the posterior. The second form of the ELBO — expected log-likelihood minus the KL divergence between the approximate posterior and the prior — makes the trade-off intuitive: the variational distribution must fit the data while remaining close to the prior.
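The decomposition is easy to check numerically. The sketch below is an added illustration, not part of the original text: the model (yᵢ ~ N(θ, 1) with prior θ ~ N(0, 1)), the synthetic data, and the choice of q are arbitrary assumptions, picked because the posterior, the marginal likelihood, and the Gaussian-to-Gaussian KL are then available in closed form while the ELBO is estimated by Monte Carlo.

# Numerical check of the identity  log p(y) = ELBO(q) + KL(q || p(theta | y))
# for y_i ~ N(theta, 1) with prior theta ~ N(0, 1)  (illustrative sketch).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=30)                  # synthetic observations
n, ybar = len(y), y.mean()

# Exact posterior N(post_mean, post_var) and exact log evidence log p(y)
post_var = 1.0 / (1.0 + n)
post_mean = post_var * n * ybar
log_py = stats.multivariate_normal.logpdf(y, mean=np.zeros(n),
                                          cov=np.eye(n) + np.ones((n, n)))

# An arbitrary Gaussian variational distribution q(theta) = N(m, s^2)
m, s = 0.8, 0.5

# Monte Carlo estimate of ELBO(q) = E_q[log p(y | theta) + log p(theta) - log q(theta)]
theta = rng.normal(m, s, size=100_000)
loglik = stats.norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)
elbo = (loglik + stats.norm.logpdf(theta, 0.0, 1.0)
        - stats.norm.logpdf(theta, m, s)).mean()

# Closed-form KL between the Gaussian q and the exact Gaussian posterior
kl = (np.log(np.sqrt(post_var) / s)
      + (s**2 + (m - post_mean)**2) / (2 * post_var) - 0.5)

print(f"log p(y)  = {log_py:.2f}")
print(f"ELBO + KL = {elbo + kl:.2f}")              # agrees up to Monte Carlo error

Any q gives the same sum; a better q simply shifts more of log p(y) from the KL term into the ELBO.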

Mean-Field Variational Inference

The most common restriction on Q is the mean-field assumption: q(θ) = ∏ᵢ qᵢ(θᵢ), in which the parameters (or latent variables) are partitioned into blocks that are treated as independent under q. Under this factorization, the optimal form of each factor satisfies:

Coordinate Ascent Update
log q*ⱼ(θⱼ) = E_{q₋ⱼ}[log p(y, θ)] + const

This yields coordinate ascent variational inference (CAVI), which cycles through each factor, updating it while holding others fixed. For exponential-family models with conjugate priors, these updates have closed-form expressions, making CAVI extremely efficient. The ELBO is guaranteed to increase at each step, converging to a local optimum.
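As a concrete sketch — added here for illustration, with simulated data and arbitrary hyperparameters — the canonical example is the Normal model with unknown mean μ and precision τ under a conjugate prior μ | τ ~ N(μ₀, 1/(κ₀τ)), τ ~ Gamma(α₀, β₀). Both mean-field factors have closed-form updates obtained from the coordinate ascent rule above.

# CAVI for x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(kappa0*tau)), tau ~ Gamma(alpha0, beta0),
# with mean-field family q(mu, tau) = q(mu) q(tau)  (illustrative sketch, simulated data).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.5, 0.6, size=20)                    # simulated observations
n, xbar = len(x), x.mean()

mu0, kappa0, alpha0, beta0 = 0.0, 0.1, 1.0, 1.0      # prior hyperparameters (arbitrary)

# q(mu) = N(m, 1/lam), q(tau) = Gamma(a, b); the shape a is fixed by the model
a = alpha0 + (n + 1) / 2
b = a                                                # initialize so that E[tau] = a/b = 1

for it in range(100):
    # Update q(mu) holding q(tau) fixed: only E[tau] = a/b is needed
    lam = (kappa0 + n) * (a / b)
    m = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    # Update q(tau) holding q(mu) fixed: needs E[(mu - c)^2] = (m - c)^2 + 1/lam
    b = beta0 + 0.5 * (kappa0 * ((m - mu0) ** 2 + 1 / lam)
                       + ((x - m) ** 2).sum() + n / lam)

print(f"q(mu)  = N({m:.3f}, {1 / lam:.4f})")
print(f"q(tau) = Gamma({a:.1f}, {b:.3f}),  E[tau] = {a / b:.3f}")

Each update can only increase the ELBO, so monitoring the variational parameters (or the ELBO itself) between sweeps gives a simple convergence check.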

Historical Context

1990s

Variational methods entered machine learning from statistical physics (mean-field theory) and information theory. Saul, Jaakkola, and Jordan (1996) applied variational bounds to Boltzmann machines and mixture models.

1999–2003

Attias, Ghahramani, and Beal developed variational Bayes for graphical models. Beal's 2003 thesis provided a comprehensive treatment of VB for a wide class of latent-variable models.

2013

Kingma and Welling introduced the Variational Autoencoder (VAE), and concurrently Rezende et al. proposed stochastic backpropagation, marrying variational inference with deep learning.

2014–present

Black-box variational inference (Ranganath et al., 2014), normalizing flows (Rezende & Mohamed, 2015), and amortized inference expanded the scope of VI to virtually arbitrary models.

Stochastic Variational Inference

Classical CAVI requires a full pass over the dataset at each iteration, making it prohibitive for large-scale problems. Stochastic variational inference (SVI), introduced by Hoffman et al. (2013), uses stochastic gradient ascent on the ELBO with minibatches of data, enabling scalability to millions of observations. The reparameterization trick — expressing samples from q as deterministic transformations of noise — allows low-variance gradient estimates, and underpins much of modern deep generative modelling.
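The sketch below illustrates both ideas on a deliberately simple model; it is an added example, and the model (yᵢ ~ N(μ, 1) with prior μ ~ N(0, 10)), the use of PyTorch for automatic differentiation, and every numerical setting are assumptions of mine rather than anything prescribed by the text. The minibatch log-likelihood is rescaled by N/B so its expectation equals the full-data term, and the draw from q is reparameterized as μ = m + s·ε so gradients flow into the variational parameters.

# Stochastic variational inference with the reparameterization trick (illustrative sketch).
# Toy model: y_i ~ N(mu, 1), prior mu ~ N(0, 10); variational family q(mu) = N(m, s^2).
import torch
import torch.nn.functional as F
from torch.distributions import Normal
from torch.optim import Adam

torch.manual_seed(0)
N, B = 100_000, 256
data = torch.randn(N) + 3.0                      # synthetic data with true mean 3

m = torch.zeros(1, requires_grad=True)           # variational mean of q(mu)
rho = torch.zeros(1, requires_grad=True)         # scale s = softplus(rho) stays positive
opt = Adam([m, rho], lr=0.05)
prior = Normal(0.0, 10.0)

for step in range(2_000):
    opt.zero_grad()
    batch = data[torch.randint(0, N, (B,))]      # random minibatch of B observations
    s = F.softplus(rho)
    mu = m + s * torch.randn(1)                  # reparameterized draw: mu = m + s * eps
    q = Normal(m, s)
    loglik = Normal(mu, 1.0).log_prob(batch).mean() * N   # rescaled minibatch likelihood
    elbo = loglik + prior.log_prob(mu) - q.log_prob(mu)
    (-elbo.sum()).backward()                     # noisy gradient ascent on the ELBO
    opt.step()

print(m.item(), F.softplus(rho).item())          # approximate posterior mean and sd of mu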

KL Direction Matters

Variational inference minimizes KL(q ‖ p), the "reverse" or "exclusive" KL divergence. This tends to produce approximations that are mode-seeking: q will concentrate on one mode of the posterior and underestimate variance. The "forward" or "inclusive" KL, KL(p ‖ q), used in expectation propagation, yields moment-matching, mass-covering approximations. The choice of divergence profoundly affects the quality and character of the approximation.
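A small numerical experiment — added here as an illustration, with an arbitrary bimodal target and a crude grid search standing in for a proper optimizer — makes the contrast concrete: the reverse-KL fit sits on one mode with roughly unit standard deviation, while the forward-KL fit matches the mixture's overall mean and variance.

# Reverse KL (mode-seeking) vs forward KL (mass-covering) for a single Gaussian q
# fitted to a two-component mixture p; both divergences evaluated on a grid.
import numpy as np
from scipy import stats

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * stats.norm.pdf(x, -3, 1) + 0.5 * stats.norm.pdf(x, 3, 1)   # bimodal target

def reverse_kl(m, s):      # KL(q || p) = E_q[log q - log p]
    q = stats.norm.pdf(x, m, s)
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

def forward_kl(m, s):      # KL(p || q) = E_p[log p - log q]
    q = stats.norm.pdf(x, m, s)
    return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dx

means = np.linspace(-6, 6, 61)
sds = np.linspace(0.3, 6.0, 58)
rev = min((reverse_kl(m, s), m, s) for m in means for s in sds)
fwd = min((forward_kl(m, s), m, s) for m in means for s in sds)

print("reverse KL argmin (mode-seeking): ", rev[1:])   # near one mode, sd close to 1
print("forward KL argmin (mass-covering):", fwd[1:])   # near 0, sd close to sqrt(1 + 3^2)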

Richer Variational Families

The mean-field assumption can be restrictive. Structured variational inference preserves selected dependencies. Normalizing flows transform a simple base distribution through a sequence of invertible mappings, yielding flexible yet tractable densities. These approaches close the gap between variational and sampling-based methods, sometimes rivalling MCMC accuracy at a fraction of the computational cost.
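The change-of-variables mechanics behind flow-based families can be shown in one dimension; the sketch below is an added illustration with an arbitrary planar-style map and parameter values. Sampling is a forward pass through the map, and the log-density of each sample needs only the base density and log |f′(z)|.

# One-layer, planar-style flow in 1D:  z ~ N(0, 1),  x = f(z) = z + u * tanh(w*z + b),
# which is invertible whenever u*w > -1; density follows from the change of variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
u, w, b = 2.0, 1.5, -0.5                 # flow parameters (u*w = 3 > -1, so f is monotone)

z = rng.normal(size=100_000)             # samples from the base distribution
x = z + u * np.tanh(w * z + b)           # transformed samples
fprime = 1 + u * w * (1 - np.tanh(w * z + b) ** 2)
log_q = stats.norm.logpdf(z) - np.log(np.abs(fprime))   # log q(x) at each sample

# The pushforward is skewed and non-Gaussian even though the base is N(0, 1)
print("mean:", x.mean(), " skewness:", stats.skew(x))

Stacking several such maps, each with its own parameters, and optimizing them through the ELBO is precisely what flow-based variational families do.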

Diagnostics and Limitations

Unlike MCMC, variational methods lack straightforward convergence diagnostics. The ELBO monitors optimization progress but does not indicate how close q is to p(θ | y). Pathological optima can arise in multimodal posteriors. Recent work on variational inference diagnostics — including Pareto-smoothed importance sampling (PSIS) checks proposed by Yao et al. (2018) — aims to flag unreliable approximations.
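The core of these importance-sampling checks fits in a few lines. The sketch below is an added illustration and computes only raw importance ratios and an effective-sample-size fraction; PSIS additionally fits a generalized Pareto distribution to the largest ratios and reports its shape parameter k̂. Draws from q are reweighted by p(y, θ)/q(θ): near-uniform weights indicate a good approximation, heavy-tailed weights flag an unreliable one.

# Importance-weight check of a variational approximation (raw weights only).
# Toy conjugate model: y_i ~ N(theta, 1), theta ~ N(0, 1), so the exact posterior is known.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=30)
n, ybar = len(y), y.mean()
post_var = 1.0 / (1.0 + n)
post_mean = post_var * n * ybar

def ess_fraction(m, s, S=20_000):
    """Draw from q = N(m, s^2), weight by p(y, theta) / q(theta), return ESS / S."""
    theta = rng.normal(m, s, size=S)
    log_w = (stats.norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)
             + stats.norm.logpdf(theta, 0.0, 1.0)
             - stats.norm.logpdf(theta, m, s))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2) / S

print("q equal to the posterior:", ess_fraction(post_mean, np.sqrt(post_var)))        # 1.0
print("underdispersed q:        ", ess_fraction(post_mean, 0.3 * np.sqrt(post_var)))  # far below 1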

"Variational inference trades the asymptotic exactness of MCMC for speed, turning inference into optimization — a trade that is often spectacularly worthwhile." — David Blei, 2017

Worked Example: Mean-Field VI for a Normal Model

We observe 20 values and approximate the joint posterior of the mean μ and precision τ = 1/σ² using mean-field variational inference under a conjugate Normal–Inverse-Gamma prior. We compare the VI approximation to the exact posterior.

Given n = 20 observations with x̄ = 2.455, s² = Σ(xᵢ − x̄)²/n = 0.318
Prior: μ | σ² ~ N(μ₀ = 0, σ²/κ₀) with κ₀ = 0.1; σ² ~ IG(α₀ = 1, β₀ = 1)

Step 1: Exact Posterior
κₙ = κ₀ + n = 0.1 + 20 = 20.1
μₙ = (κ₀·μ₀ + n·x̄)/κₙ = (0 + 49.1)/20.1 = 2.443
αₙ = α₀ + n/2 = 1 + 10 = 11
βₙ = β₀ + ½·n·s² + ½·κ₀·n·(x̄ − μ₀)²/κₙ = 1 + 3.18 + 0.30 = 4.48

Step 2: VI Approximation (CAVI)
For this conjugate model the CAVI updates are closed-form and converge in a few iterations:
q(μ) = N(2.443, 0.0203)
q(τ) = Gamma(11.5, 4.68), E[τ] = 2.455

Step 3: Comparison
Exact E[μ | data] = 2.443, VI E[μ] = 2.443
Exact Var[μ | data] = 0.0223 (Student-t marginal, 22 df), VI Var[μ] = 0.0203
KL(q ‖ p) is small but nonzero — mean-field VI is nearly, though not exactly, correct here

For this conjugate Normal model, mean-field VI recovers the exact posterior mean and comes very close on everything else, but it is not exact: the true posterior couples μ and τ (the exact marginal for μ is a Student-t with 2αₙ = 22 degrees of freedom), whereas the mean-field family forces them to be independent, so q(μ) is a Normal with slightly lighter tails and an underestimated variance. For non-conjugate models the gap is typically larger, with the ELBO providing a lower bound on the log marginal likelihood.
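The numbers above can be reproduced from the summary statistics alone; the sketch below, added for illustration, computes the exact Normal–Inverse-Gamma posterior parameters and then iterates the CAVI updates to their fixed point.

# Worked example: exact conjugate posterior vs. mean-field CAVI fixed point,
# reconstructed from the summary statistics (illustrative sketch).
import numpy as np

n, xbar, s2 = 20, 2.455, 0.318                     # s2 = (1/n) * sum((x_i - xbar)^2)
mu0, kappa0, alpha0, beta0 = 0.0, 0.1, 1.0, 1.0

# Exact Normal-Inverse-Gamma posterior
kappa_n = kappa0 + n
mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
alpha_n = alpha0 + n / 2
beta_n = beta0 + 0.5 * n * s2 + 0.5 * kappa0 * n * (xbar - mu0) ** 2 / kappa_n
print(f"exact: mu_n = {mu_n:.3f}, alpha_n = {alpha_n:.0f}, beta_n = {beta_n:.2f}")

# CAVI fixed point for q(mu) = N(mu_n, v), q(tau) = Gamma(a, b)
a = alpha0 + (n + 1) / 2
b = beta_n                                          # any positive start converges
for _ in range(100):
    v = 1.0 / (kappa_n * a / b)
    b = beta0 + 0.5 * (kappa0 * ((mu_n - mu0) ** 2 + v)
                       + n * s2 + n * ((xbar - mu_n) ** 2 + v))
print(f"VI:    q(mu) = N({mu_n:.3f}, {v:.4f}), q(tau) = Gamma({a:.1f}, {b:.2f}), E[tau] = {a/b:.3f}")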

