
Conjugate Prior

A conjugate prior is a prior distribution that, when combined with a particular likelihood function, produces a posterior distribution belonging to the same parametric family as the prior.

If p(θ) is in family F and p(θ|x) is also in F for all x, then F is conjugate to the likelihood.

Conjugate priors occupy a central place in Bayesian analysis because they transform the otherwise difficult problem of computing the posterior into a simple parameter update. When the prior and likelihood are conjugate, the posterior's parameters can be written as closed-form functions of the prior's parameters and the observed data. No integration is needed, no MCMC must be run, and the entire inferential pipeline reduces to algebra.

This algebraic convenience made Bayesian computation tractable long before modern sampling methods existed. From the 1950s through the 1980s, conjugate analysis was essentially the only practical Bayesian approach for most applied problems.

General Pattern

Prior:       p(θ | η₀)   ∈  Family F
Likelihood:  p(x | θ)
Posterior:   p(θ | x)  =  p(x | θ) · p(θ | η₀) / p(x)   ∈  Family F with updated parameters η₁
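
To make the "update is algebra" point concrete, here is a minimal Python sketch using the Beta-Binomial pair introduced below; the function name is illustrative, not from any particular library.

```python
# Minimal sketch: a conjugate update is pure hyperparameter algebra.
# Uses the Beta-Binomial pair discussed later in this section.

def beta_binomial_update(alpha, beta, successes, failures):
    """Prior Beta(alpha, beta) + Binomial data -> posterior Beta."""
    return alpha + successes, beta + failures

print(beta_binomial_update(2, 2, 7, 3))  # -> (9, 5)
```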

The Exponential Family Foundation

Conjugacy arises naturally within the exponential family of distributions. A likelihood in the exponential family has the form p(x | θ) = h(x) · exp(η(θ)ᵀ T(x) − A(θ)), where T(x) is the sufficient statistic, η(θ) is the natural parameter, and A(θ) is the log-partition function. The conjugate prior for any exponential-family likelihood takes the form p(θ | χ, ν) ∝ exp(η(θ)ᵀ χ − ν · A(θ)), where χ and ν are the prior's hyperparameters.

After observing data x₁, ..., xₙ, the posterior hyperparameters update as χ → χ + Σᵢ T(xᵢ) and ν → ν + n. The prior hyperparameters therefore have a direct interpretation: χ is the total sufficient statistic contributed by a set of pseudo-observations, and ν is the number of such pseudo-observations.

Exponential Family Conjugate Update

Prior:       p(θ | χ₀, ν₀)  ∝  exp(η(θ)ᵀ χ₀ − ν₀ · A(θ))
Posterior:   p(θ | x₁,…,xₙ)  ∝  exp(η(θ)ᵀ (χ₀ + Σᵢ T(xᵢ)) − (ν₀ + n) · A(θ))
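
As a sketch, the generic update can be written in a few lines of Python. The Bernoulli mapping in the comments (χ₀ = α₀ − 1, ν₀ = α₀ + β₀ − 2) follows from the parametrization above and is one convention among several.

```python
import numpy as np

def conjugate_update(chi0, nu0, data, T):
    """Generic exponential-family conjugate update:
    chi -> chi + sum_i T(x_i),  nu -> nu + n."""
    chi1 = chi0 + np.sum([T(x) for x in data], axis=0)
    return chi1, nu0 + len(data)

# Bernoulli likelihood: T(x) = x, and the prior density is proportional
# to theta^chi0 * (1 - theta)^(nu0 - chi0), so chi0 = 1, nu0 = 2 encodes
# a Beta(2, 2) prior in this parametrization.
data = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]            # 7 successes, 3 failures
chi1, nu1 = conjugate_update(1, 2, data, T=lambda x: x)
print(chi1, nu1)  # 8, 12 -> theta^8 (1-theta)^4, i.e. a Beta(9, 5) posterior
```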

Classic Conjugate Pairs

Several conjugate pairs appear throughout applied Bayesian analysis. The Beta-Binomial pair is perhaps the most widely taught: a Beta(α, β) prior on the success probability of a Binomial likelihood yields a Beta(α + s, β + n − s) posterior, where s is the number of successes in n trials. The Normal-Normal pair is equally fundamental: a Normal prior on the mean of a Normal likelihood (with known variance) yields a Normal posterior whose mean is a precision-weighted average of the prior mean and the sample mean.
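
A minimal sketch of the Normal-Normal update, assuming a known observation variance; the function name and example values are illustrative.

```python
import numpy as np

def normal_normal_update(mu0, tau0_sq, sigma_sq, data):
    """Posterior for a Normal mean given a Normal(mu0, tau0_sq) prior
    and Normal data with known variance sigma_sq. The posterior mean
    is a precision-weighted average of prior mean and sample mean."""
    n = len(data)
    prior_prec = 1.0 / tau0_sq           # prior precision
    data_prec = n / sigma_sq             # total data precision
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * np.mean(data))
    return post_mean, post_var

print(normal_normal_update(0.0, 1.0, 1.0, [1.2, 0.8, 1.0]))  # (0.75, 0.25)
```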

Other important conjugate pairs include the Gamma-Poisson (for count data), the Gamma-Exponential (for rate parameters), the Dirichlet-Multinomial (for categorical proportions), the Normal-Inverse-Gamma (for the mean and variance of Normal data), and the Inverse-Wishart-Normal (for covariance matrices in multivariate settings).
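
The Gamma-Poisson pair follows the same additive pattern; a sketch assuming the shape-rate convention for the Gamma distribution.

```python
def gamma_poisson_update(shape, rate, counts):
    """Gamma(shape, rate) prior on a Poisson rate lambda; the
    posterior is Gamma(shape + sum(counts), rate + n)."""
    return shape + sum(counts), rate + len(counts)

shape1, rate1 = gamma_poisson_update(2.0, 1.0, [3, 5, 4])
print(shape1 / rate1)  # posterior mean of the rate: 14/4 = 3.5
```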

The Beta-Binomial: A Canonical Example

Suppose you observe s = 7 successes in n = 10 Bernoulli trials and begin with a Beta(2, 2) prior on the success probability θ. The posterior is Beta(2 + 7, 2 + 3) = Beta(9, 5). The prior mean was 0.5; the posterior mean is 9/14 ≈ 0.643. The data pull the estimate toward the observed proportion 0.7, while the prior exerts a moderate shrinkage effect. Had you started with a Beta(1, 1) — the uniform prior — the posterior mean would have been 8/12 ≈ 0.667, closer to the data.
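
These numbers can be checked directly with SciPy; a quick sketch, assuming scipy is available.

```python
from scipy import stats

posterior = stats.beta(2 + 7, 2 + 3)      # Beta(9, 5)
print(stats.beta(2, 2).mean())            # prior mean: 0.5
print(posterior.mean())                   # 9/14 ≈ 0.6429
print(stats.beta(1 + 7, 1 + 3).mean())    # uniform Beta(1, 1) prior -> 8/12 ≈ 0.6667
```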

Interpretive Advantages

Conjugate priors offer profound interpretive clarity. Because the prior hyperparameters update additively with the sufficient statistics of the data, they can always be understood as encoding "prior data" — fictitious observations that represent prior knowledge. The hyperparameter ν₀ in the conjugate prior framework functions as a prior sample size, directly controlling the strength of the prior relative to the data. When ν₀ is small relative to n, the posterior is dominated by the data; when ν₀ is large, the prior dominates.

This interpretation makes prior elicitation concrete. Instead of asking an expert to specify abstract distributional parameters, one can ask: "How many observations' worth of confidence do you place in your prior belief?" The answer maps directly to the conjugate prior's hyperparameters.
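
For a Beta prior this mapping can be written directly; a sketch under the common convention that α + β plays the role of the effective prior sample size (the helper name is illustrative).

```python
def beta_from_elicitation(prior_mean, effective_n):
    """Translate an elicited mean and an 'observations' worth of
    confidence' count into Beta hyperparameters:
    alpha = mean * n, beta = (1 - mean) * n."""
    return prior_mean * effective_n, (1.0 - prior_mean) * effective_n

print(beta_from_elicitation(0.5, 4))  # -> (2.0, 2.0), i.e. Beta(2, 2)
```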

Limitations and Modern Context

Conjugacy constrains the form of the prior, which may not faithfully represent actual prior knowledge. A researcher with genuine reason to believe a parameter follows a distribution outside the conjugate family must either sacrifice computational convenience or distort their beliefs. With the advent of MCMC methods in the 1990s and variational inference in the 2000s, this trade-off has become less severe. Modern probabilistic programming languages like Stan and PyMC allow arbitrary prior specifications.

Nevertheless, conjugate priors remain valuable as default choices, interpretive benchmarks, and components of larger hierarchical models. They also play a key role in variational inference, where the optimal approximate posterior in a mean-field family often takes conjugate form.

"The conjugate prior is not merely a computational convenience. It is the unique prior that can be interpreted as additional data from the same generating process." — Howard Raiffa and Robert Schlaifer, Applied Statistical Decision Theory (1961)

Historical Development

1961

Raiffa and Schlaifer systematize conjugate analysis in Applied Statistical Decision Theory, providing tables of conjugate pairs for common likelihoods.

1979

Diaconis and Ylvisaker formalize the connection between conjugate priors and linear posterior expectations within exponential families.

1990s

MCMC methods free practitioners from the conjugacy constraint, but conjugate priors remain widely used for their interpretive clarity and computational speed.

Example: Beta-Binomial Conjugacy in Clinical Trials

A pharmaceutical company is testing a new drug. Before the trial, prior studies suggest the drug works in about 50% of patients, but with considerable uncertainty. The statistician chooses a Beta(2, 2) prior, which is centered at 0.5 with mild concentration.

Step 1: Define the prior

Prior: Beta(α₀ = 2, β₀ = 2)
Prior mean: α₀ / (α₀ + β₀) = 2/4 = 0.50
Effective sample size: α₀ + β₀ = 4 pseudo-observations

The trial enrolls 20 patients. Of these, 14 respond successfully and 6 do not (s = 14, f = 6).

Step 2: Conjugate update

Posterior: Beta(α₀ + s, β₀ + f) = Beta(2 + 14, 2 + 6) = Beta(16, 8)
Posterior mean: 16 / (16 + 8) = 16/24 ≈ 0.667
Posterior mode: (16 − 1) / (16 + 8 − 2) = 15/22 ≈ 0.682

The posterior mean (0.667) lies between the prior mean (0.50) and the sample proportion (14/20 = 0.70), pulled slightly toward the prior. With only 4 pseudo-observations against 20 real observations, the data dominate — exactly as conjugacy predicts. A grid approximation over 200 parameter values produces the same posterior mean to four decimal places, confirming that conjugacy gives the exact answer with simple arithmetic.
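
The grid check mentioned above can be reproduced in a few lines; a sketch assuming NumPy and SciPy, with the grid size (200 points) taken from the text.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 200)               # 200-point grid
unnorm = stats.beta(2, 2).pdf(theta) * stats.binom.pmf(14, 20, theta)
posterior_grid = unnorm / unnorm.sum()               # normalize on the grid

grid_mean = (theta * posterior_grid).sum()
exact_mean = stats.beta(16, 8).mean()                # conjugate answer: 2/3
print(round(grid_mean, 4), round(exact_mean, 4))     # both ≈ 0.6667
```

Normalizing on the grid sidesteps the integral entirely; the agreement with the closed-form Beta(16, 8) answer is exactly what conjugacy guarantees.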

