Bayesian Statistics

Marginal Likelihood

The marginal likelihood — also called the evidence or integrated likelihood — is the probability of the observed data averaged over all parameter values weighted by the prior, serving as the denominator of Bayes' theorem and the key quantity in Bayesian model comparison.

P(x) = ∫ P(x | θ) π(θ) dθ

The marginal likelihood P(x) is the total probability of the observed data under a given model, obtained by integrating (or summing) the likelihood over all possible parameter values, weighted by the prior distribution. It appears as the denominator of Bayes' theorem, where it serves as a normalizing constant ensuring the posterior integrates to one. But its importance extends far beyond normalization: the marginal likelihood is the cornerstone of Bayesian model comparison, embodying an automatic trade-off between model fit and model complexity known as the Bayesian Occam's razor.

Computing the marginal likelihood is, in general, the central computational challenge of Bayesian statistics. For conjugate models it has a closed form. For most realistic models, it requires numerical integration, Monte Carlo estimation, or analytical approximation — and each of these approaches has spawned a rich literature of its own.

Marginal Likelihood — Continuous Case
P(x | M) = ∫ P(x | θ, M) · π(θ | M) dθ

Marginal Likelihood — Discrete Case
P(x | M) = Σᵢ P(x | θᵢ, M) · π(θᵢ | M)

Role in Bayes' Theorem
π(θ | x, M) = P(x | θ, M) · π(θ | M) / P(x | M)
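
For a conjugate model the integral can be evaluated exactly. As a concrete illustration, here is a minimal Python sketch (assuming NumPy and SciPy are available; the counts and prior parameters are arbitrary) of the closed-form marginal likelihood of a Beta-Binomial model, where the integral reduces to a ratio of Beta functions:

# Closed-form marginal likelihood of a Beta-Binomial model:
# data = k successes in n Bernoulli trials, likelihood Binomial(n, theta),
# prior theta ~ Beta(a, b).  Then
#   P(x) = C(n, k) * B(a + k, b + n - k) / B(a, b)
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_likelihood(k, n, a, b):
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return log_binom + betaln(a + k, b + n - k) - betaln(a, b)

# Example: 7 successes in 10 trials under a uniform Beta(1, 1) prior
print(np.exp(log_marginal_likelihood(7, 10, 1, 1)))   # 1/11, about 0.0909

Under the uniform prior every count from 0 to n is equally probable before the data arrive, so the marginal likelihood of any observed count is 1/(n + 1). This small example is reused in the computation sketches further below.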

The Bayesian Occam's Razor

The marginal likelihood automatically penalizes model complexity without any explicit penalty term. A complex model with many parameters can fit a wide variety of datasets. This means its prior predictive probability is spread thinly across a large space of possible data. A simpler model concentrates its prior predictions on a smaller set of data patterns. If the observed data happen to fall within that concentrated region, the simpler model achieves a higher marginal likelihood — even though the complex model could fit the data equally well or better at its optimal parameter values.

Bayesian Occam's Razor — Intuition
Complex model: spreads probability across many possible datasets
  → low probability for any specific dataset
Simple model: concentrates probability on fewer possible datasets
  → high probability for datasets in that concentrated region

Result
When both models can explain the data, the simpler model
achieves a higher marginal likelihood — a built-in parsimony preference.

This automatic complexity penalty is one of the most elegant features of Bayesian inference. It requires no ad hoc penalty terms, no cross-validation, and no separate model selection criterion. The marginal likelihood directly measures the model's ability to predict the data, averaged over the prior uncertainty in its parameters. More complex models are penalized because they waste probability on data patterns that were not observed.
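
The spreading effect can be tabulated directly with the same Beta-Binomial machinery. In the sketch below (an illustration assuming NumPy and SciPy; the Beta(20, 20) and Beta(1, 1) priors are arbitrary stand-ins for a committed model and a vague one), each model's prior predictive probability is computed for every possible dataset, that is, every count k from 0 to n:

# Prior predictive distributions of two Beta-Binomial models for k successes
# in n = 10 trials.  The "simple" model commits to theta near 0.5 via a
# Beta(20, 20) prior; the vaguer model uses a Beta(1, 1) prior and therefore
# spreads its predictions over a much wider range of datasets.
import numpy as np
from scipy.special import betaln, gammaln

def prior_predictive(n, a, b):
    k = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return np.exp(log_binom + betaln(a + k, b + n - k) - betaln(a, b))

n = 10
simple = prior_predictive(n, 20, 20)   # concentrated near k = 5
vague  = prior_predictive(n, 1, 1)     # uniform over k = 0..10

print(simple[5], vague[5])   # roughly 0.22 vs 0.09: the committed model wins at k = 5
print(simple[9], vague[9])   # the vague model wins far from its concentration

Both rows of probabilities sum to one; the two models simply allocate that fixed budget over possible datasets in different ways.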

Why Not Just Compare Likelihoods?

A model's maximized likelihood — the likelihood evaluated at the MLE — always increases (or stays the same) when parameters are added. A polynomial of degree 10 will always fit the data at least as well as a polynomial of degree 3. But the marginal likelihood averages over parameters rather than maximizing, so adding useless parameters hurts: they dilute the prior probability without improving predictions. This is why the Bayes factor is fundamentally different from a likelihood ratio test — and why it naturally avoids overfitting.
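
A small numerical experiment makes the contrast explicit. The sketch below (assuming NumPy and SciPy; the noise level, prior scale, and set of degrees are illustrative choices, not taken from the text) fits polynomials of increasing degree to data simulated from a straight line, comparing the maximized log-likelihood with the closed-form log marginal likelihood of a linear model with a Gaussian prior on its coefficients and known noise level:

# Maximized likelihood vs. marginal likelihood for polynomial regression.
# Model: y = Phi(x) w + noise, noise sd sigma known, prior w ~ N(0, tau^2 I).
# Integrating w out gives y ~ N(0, sigma^2 I + tau^2 Phi Phi^T) in closed form.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, size=x.size)   # data from a straight line
sigma, tau = 0.3, 1.0

def design(x, degree):
    return np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree

for degree in (1, 3, 10):
    Phi = design(x, degree)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)            # least-squares (MLE) fit
    max_loglik = norm.logpdf(y, Phi @ w_hat, sigma).sum()      # maximized log-likelihood
    cov = sigma**2 * np.eye(x.size) + tau**2 * Phi @ Phi.T     # prior predictive covariance
    log_evidence = multivariate_normal.logpdf(y, mean=np.zeros(x.size), cov=cov)
    print(degree, round(max_loglik, 2), round(log_evidence, 2))

With data generated from a line, the maximized log-likelihood never decreases as the degree grows, while the log evidence is typically highest for the simplest model that explains the data.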

Bayes Factors

The ratio of marginal likelihoods under two competing models is the Bayes factor, the Bayesian tool for model comparison:

Bayes Factor
BF₁₂ = P(x | M₁) / P(x | M₂)

Relationship to Posterior Model Odds
P(M₁ | x) / P(M₂ | x) = BF₁₂ · P(M₁) / P(M₂)

Interpretation (Kass and Raftery, 1995)
BF₁₂ = 1–3:        barely worth mentioning
BF₁₂ = 3–20:       positive evidence for M₁
BF₁₂ = 20–150:     strong evidence
BF₁₂ > 150:        very strong evidence

The Bayes factor quantifies how much the data shift the odds in favor of one model versus another. A Bayes factor of 10 means the data are 10 times more probable under model 1 than under model 2. Unlike p-values, the Bayes factor can provide evidence for the null hypothesis, not just against it: a BF₁₂ of 1/10 means the data favor M₂ by the same factor of ten.
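
In practice the comparison is done on the log scale, since marginal likelihoods are often vanishingly small numbers. A minimal sketch with purely hypothetical log evidences for the two models:

# Bayes factor and posterior model odds from two (hypothetical) log evidences.
import numpy as np

log_evidence_m1 = -12.3   # hypothetical log P(x | M1)
log_evidence_m2 = -14.8   # hypothetical log P(x | M2)

bf_12 = np.exp(log_evidence_m1 - log_evidence_m2)
prior_odds = 1.0                       # P(M1) / P(M2); equal prior model weights
posterior_odds = bf_12 * prior_odds

print(f"BF12 = {bf_12:.1f}")                     # about 12: positive evidence for M1
print(f"posterior odds = {posterior_odds:.1f}")  # equals BF12 under equal prior odds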

Computation

The marginal likelihood is an integral over the entire parameter space — often high-dimensional, with complex geometry. For conjugate models (Beta-Binomial, a Normal likelihood with a Normal-Inverse-Gamma prior, and so on), closed-form solutions exist. For everything else, approximation is necessary.

Laplace Approximation

The Laplace approximation replaces the integrand with a Gaussian centered at the posterior mode. For d-dimensional θ, the result is approximately P(x) ≈ (2π)^(d/2) |Σ̂|^(1/2) · L(θ̂) · π(θ̂), where Σ̂ is the inverse of the negative Hessian of the log-posterior at its mode. The Bayesian Information Criterion (BIC) can be derived as a rough version of this approximation.
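
A minimal one-dimensional sketch (assuming NumPy and SciPy) applies this formula to the Beta-Binomial example from above, reparameterized to the log-odds scale so the parameter is unconstrained; the exact answer, 1/11 ≈ 0.0909, is available for comparison:

# Laplace approximation to the marginal likelihood of the Beta-Binomial example
# (k = 7 successes in n = 10 trials, Beta(1, 1) prior), on the log-odds scale.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

k, n = 7, 10
log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def neg_log_post(z):
    theta = 1.0 / (1.0 + np.exp(-z[0]))                # inverse logit
    log_lik = log_binom + k * np.log(theta) + (n - k) * np.log(1.0 - theta)
    log_prior = np.log(theta) + np.log(1.0 - theta)    # Beta(1, 1) prior plus the
    return -(log_lik + log_prior)                      # Jacobian of the logit map

res = minimize(neg_log_post, x0=np.array([0.0]))       # find the posterior mode
z_hat = res.x

eps = 1e-4                                             # numerical Hessian at the mode
h = (neg_log_post(z_hat + eps) - 2 * neg_log_post(z_hat) + neg_log_post(z_hat - eps)) / eps**2

d = 1                                                  # dimension of the parameter
log_evidence = 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(h) - neg_log_post(z_hat)
print(np.exp(log_evidence), 1.0 / (n + 1))             # about 0.089 vs exact 0.0909

The approximation is already within a few percent here; its accuracy generally improves as the posterior becomes more nearly Gaussian, for example with larger samples.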

Monte Carlo Methods

Importance sampling, bridge sampling, and nested sampling provide stochastic estimates of the marginal likelihood. Nested sampling, introduced by John Skilling in 2004, is particularly well-suited to the task: it transforms the multi-dimensional integral into a one-dimensional integral over the prior mass, making it tractable even in high dimensions. It has become a standard tool in astrophysics and cosmology.
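
Nested sampling takes some machinery to implement, but a plain importance-sampling estimator already conveys the Monte Carlo idea. The sketch below (assuming NumPy and SciPy, with a Beta(8, 4) proposal chosen by hand to resemble the posterior) targets the same Beta-Binomial evidence, whose exact value is 1/11:

# Importance-sampling estimate of the marginal likelihood:
#   P(x) = E_q[ P(x | theta) * prior(theta) / q(theta) ],  theta ~ q
import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(1)
k, n = 7, 10

theta = rng.beta(8, 4, size=100_000)      # draws from the proposal q = Beta(8, 4)
weights = binom.pmf(k, n, theta) * beta(1, 1).pdf(theta) / beta(8, 4).pdf(theta)
print(weights.mean(), 1.0 / (n + 1))      # Monte Carlo estimate vs exact 0.0909...

The estimator works well here because the proposal is close to the posterior; with a poorly matched proposal the weights become extremely uneven and the estimate degrades, which is part of what motivates bridge sampling and nested sampling.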

Harmonic Mean Estimator

The harmonic mean of likelihood values from MCMC samples provides an unbiased estimate of 1/P(x), and hence an estimate of P(x). However, this estimator has infinite variance in many settings and is widely regarded as unreliable. As Radford Neal memorably put it, it is "the worst Monte Carlo method ever."
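
The problem is easy to reproduce. The following sketch (assuming NumPy and SciPy) applies the harmonic mean estimator to the same Beta-Binomial example, where the exact evidence is 1/11 ≈ 0.0909:

# Harmonic mean estimator of the marginal likelihood from posterior draws.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
k, n = 7, 10

estimates = []
for _ in range(5):
    theta = rng.beta(1 + k, 1 + (n - k), size=100_000)   # posterior draws, Beta(8, 4)
    estimates.append(1.0 / np.mean(1.0 / binom.pmf(k, n, theta)))
print(estimates)

The five estimates vary from run to run and tend to overshoot the exact value, because the average of the reciprocal likelihood is dominated by rarely sampled draws in the tails; its variance under this posterior is infinite.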

Sensitivity to the Prior

The marginal likelihood depends directly on the prior: it averages the likelihood over the entire prior distribution, so prior mass placed on parameter values that predict the data poorly lowers the evidence. This is a critical difference from posterior inference, which is often robust to reasonable changes in the prior. Two analysts who use proper but different priors will compute different marginal likelihoods and different Bayes factors. With improper priors (such as a flat prior on the real line), the marginal likelihood is undefined because the prior does not integrate to a finite value.

This sensitivity makes prior specification especially important in model comparison. Many methodologists recommend using proper, carefully chosen priors for model comparison even when vague priors suffice for parameter estimation. Intrinsic Bayes factors and fractional Bayes factors are techniques designed to handle this issue by calibrating the prior using a portion of the data.
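
The dependence is easy to demonstrate in a toy normal-mean problem. In the sketch below (assuming NumPy and SciPy; the sample size, observed mean, and grid of prior scales are illustrative), H₀ fixes the mean at zero while H₁ places a N(0, τ²) prior on it, and the same data produce markedly different Bayes factors as τ grows:

# Sensitivity of the Bayes factor to the prior width.
# Data summary: sample mean xbar of n observations with known sigma.
# H0: mu = 0.  H1: mu ~ N(0, tau^2), so marginally xbar ~ N(0, sigma^2/n + tau^2).
import numpy as np
from scipy.stats import norm

n, sigma, xbar = 25, 1.0, 0.4
se = sigma / np.sqrt(n)

for tau in (0.1, 1.0, 10.0, 100.0):
    m0 = norm.pdf(xbar, 0.0, se)                          # evidence under H0
    m1 = norm.pdf(xbar, 0.0, np.sqrt(se**2 + tau**2))     # evidence under H1
    print(tau, round(m0 / m1, 2))                         # BF01 for this prior width

As the prior on the alternative is made vaguer, the same data push the Bayes factor further toward H₀, which is exactly why vague priors that are harmless for estimation can be hazardous for model comparison.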

Connections to Information Theory

The logarithm of the marginal likelihood decomposes as:

Log Marginal Likelihood Decomposition
log P(x | M) = [expected log-likelihood] − KL(posterior ‖ prior)

log P(x | M) = E_{π(θ|x)}[log P(x | θ)] − KL(π(θ|x) ‖ π(θ))

Interpretation
Term 1: how well the model fits (goodness of fit)
Term 2: how much the data changed the prior (complexity penalty)

This decomposition makes the Occam's razor explicit. The first term rewards models that fit the data well. The second penalizes models whose posterior differs greatly from the prior — meaning the data forced a large shift in beliefs, which happens when the model is too flexible. The marginal likelihood is high when the model fits well and the prior was already in the right neighborhood — that is, when the model made good predictions before seeing the data.
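
The decomposition can be checked numerically in a conjugate setting where every term has a closed form. The sketch below (assuming NumPy and SciPy; the data and prior scale are illustrative) uses a normal model with known variance and a normal prior on the mean:

# Check: log P(x) = E_post[log P(x | mu)] - KL(posterior || prior)
# Model: x_i ~ N(mu, sigma^2) with sigma known, prior mu ~ N(m0, s0^2).
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(3)
sigma, m0, s0 = 1.0, 0.0, 2.0
x = rng.normal(1.5, sigma, size=12)
n = x.size

# posterior over mu is N(mn, sn2)
sn2 = 1.0 / (1.0 / s0**2 + n / sigma**2)
mn = sn2 * (m0 / s0**2 + n * x.mean() / sigma**2)

# exact log marginal likelihood: x ~ N(m0 * 1, sigma^2 I + s0^2 * 1 1^T)
cov = sigma**2 * np.eye(n) + s0**2 * np.ones((n, n))
log_evidence = multivariate_normal.logpdf(x, mean=np.full(n, m0), cov=cov)

# the two terms of the decomposition, both in closed form
expected_loglik = norm.logpdf(x, mn, sigma).sum() - n * sn2 / (2 * sigma**2)
kl = np.log(s0 / np.sqrt(sn2)) + (sn2 + (mn - m0)**2) / (2 * s0**2) - 0.5

print(log_evidence, expected_loglik - kl)   # the two numbers agree

Making the prior more diffuse (larger s0) barely changes the fit term, since the data dominate the posterior, but it inflates the KL term and therefore lowers the evidence.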

"The marginal likelihood is the probability that the model assigned to the data before seeing it. If a model predicted the data well in advance, it deserves credit. If it must contort its parameters to fit, it does not." — David MacKay, Information Theory, Inference, and Learning Algorithms (2003)

Example: Choosing Between Two Earthquake Models

A seismologist has recorded 15 earthquakes in a region over 10 years. She considers two models for predicting earthquake frequency:

Model A (Simple): Earthquakes follow a Poisson process with a constant rate λ, with a Gamma(2, 1) prior on λ.

Model B (Complex): Earthquakes follow a two-state model with a "quiet" rate λ₁ and an "active" rate λ₂, plus a switching probability, with priors on all three parameters.

Computing the Marginal Likelihood

For each model, the marginal likelihood integrates the likelihood over all possible parameter values, weighted by the prior:

Marginal Likelihood
P(data | Model) = ∫ P(data | θ, Model) · P(θ | Model) dθ

Model A must integrate over one parameter (λ). Model B must integrate over three (λ₁, λ₂, and the switching probability). Even if Model B can fit the data better at its best parameter values, it must spread its prior probability over a much larger parameter space. Most of that space predicts data that looks nothing like what was observed.
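
For Model A this integral has a closed Gamma-Poisson form. The sketch below (assuming NumPy and SciPy) evaluates it for a hypothetical set of yearly counts summing to 15, reading the Gamma(2, 1) prior as shape 2 and rate 1; it illustrates the calculation rather than reproducing the figures quoted below:

# Closed-form marginal likelihood of Model A: yearly counts n_1..n_T,
# each Poisson(lambda), with a Gamma(shape a, rate b) prior on lambda:
#   P(data | A) = b^a / Gamma(a) * Gamma(a + S) / (b + T)^(a + S) / prod(n_i!)
# where S is the total count and T the number of years.
import numpy as np
from scipy.special import gammaln

counts = np.array([1, 2, 0, 3, 1, 2, 1, 0, 2, 3])   # hypothetical yearly counts, sum 15
a, b = 2.0, 1.0
S, T = counts.sum(), counts.size

log_ml = (a * np.log(b) - gammaln(a)
          + gammaln(a + S) - (a + S) * np.log(b + T)
          - np.sum(gammaln(counts + 1)))
print(np.exp(log_ml))   # marginal likelihood of these ten counts under Model A

Model B has no comparably simple closed form; its evidence would typically be estimated with one of the methods described in the Computation section above.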

Result
P(data | Model A) = 0.0042
P(data | Model B) = 0.0018

Bayes Factor = 0.0042 / 0.0018 ≈ 2.3 in favor of Model A

The Bayesian Occam's Razor

The marginal likelihood automatically penalizes model complexity. Model B has more parameters and can fit a wider range of possible datasets — but that flexibility is a liability when the data are consistent with the simpler model. The marginal likelihood rewards models that predicted the observed data in advance, not models that can be contorted to fit it after the fact. This built-in complexity penalty is why the marginal likelihood is the gold standard for Bayesian model comparison.

