The term "evidence" in the context of Bayes' theorem has a precise technical meaning: it is the total probability of the observed data, computed by averaging the likelihood over all hypotheses or parameter values weighted by their prior probabilities. Denoted P(E) or P(x), it appears as the denominator of Bayes' theorem and is variously called the evidence, the marginal likelihood, the prior predictive probability, or the normalizing constant. Each name highlights a different aspect of the same quantity.
As a normalizing constant, the evidence ensures that the posterior distribution integrates to one. For many applications — particularly parameter estimation — the evidence can be ignored, since the unnormalized posterior π(θ | x) ∝ L(θ; x) · π(θ) is sufficient for MCMC sampling and for computing posterior means, medians, and credible intervals. But when comparing models or quantifying the total weight of data against a hypothesis, the evidence takes center stage.
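To make this concrete, here is a minimal Python sketch of a random-walk Metropolis sampler that works entirely with the unnormalized posterior. The Normal model, the simulated data, and the step size are hypothetical choices made for illustration; the point is that P(E) cancels in the acceptance ratio and never has to be computed.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical model: x_i ~ Normal(theta, 1) with prior theta ~ Normal(0, 10)
    x = rng.normal(1.5, 1.0, size=50)

    def log_unnorm_posterior(theta):
        # log L(theta; x) + log pi(theta), each up to an additive constant;
        # the evidence P(E) is never needed.
        log_lik = -0.5 * np.sum((x - theta) ** 2)
        log_prior = -0.5 * theta ** 2 / 10.0 ** 2
        return log_lik + log_prior

    def metropolis(n_steps=5000, step=0.2):
        theta, samples = 0.0, []
        for _ in range(n_steps):
            proposal = theta + step * rng.normal()
            # The acceptance ratio uses only the unnormalized posterior:
            # the unknown constant P(E) cancels.
            if np.log(rng.uniform()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
                theta = proposal
            samples.append(theta)
        return np.array(samples)

    samples = metropolis()
    print("posterior mean ~", samples[1000:].mean())  # close to x.mean() for this model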
Computing P(E) — Discrete Hypotheses
P(E) = Σᵢ P(E | Hᵢ) · P(Hᵢ)

Computing P(E) — Continuous Parameter
P(E) = ∫ P(E | θ) · π(θ) dθ
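Both formulas can be evaluated directly when the problem is small. The following Python sketch uses made-up numbers: a three-hypothesis discrete example for the sum, and a hypothetical Beta-Binomial setup for the integral, evaluated by numerical quadrature.

    import numpy as np
    from scipy import integrate, stats

    # Discrete case: P(E) = sum_i P(E | H_i) * P(H_i)
    likelihoods = np.array([0.80, 0.30, 0.05])  # P(E | H_i), hypothetical
    priors = np.array([0.20, 0.50, 0.30])       # P(H_i), sums to 1
    evidence_discrete = np.sum(likelihoods * priors)
    print(evidence_discrete)  # 0.325

    # Continuous case: P(E) = integral of P(E | theta) * pi(theta) dtheta
    # Hypothetical data: k = 7 successes in n = 10 trials, theta ~ Beta(2, 2) prior
    n, k = 10, 7
    integrand = lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, 2, 2)
    evidence_continuous, _ = integrate.quad(integrand, 0.0, 1.0)
    print(evidence_continuous)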
Why the Normalizing Constant Matters
It is tempting to dismiss P(E) as mere bookkeeping — a constant that washes out in proportional calculations. But this view misses its deeper significance. The evidence is the prior predictive probability of the data: how probable the observed data were before they were observed, given the model as a whole (including the prior). A model that predicted the data well — that assigned high prior predictive probability to data resembling what was actually observed — receives a high evidence value. A model that spread its probability thinly across many possible datasets, or that concentrated probability on data very different from what was observed, receives a low evidence value.
When comparing two models M₁ and M₂, the Bayes factor BF₁₂ = P(x | M₁) / P(x | M₂) is the ratio of their evidence values. This ratio directly measures which model predicted the observed data better. Because the evidence integrates over the prior, complex models that spread probability thinly are automatically penalized — the Bayesian Occam's razor. The evidence is therefore not just a normalizing constant; it is the single number that summarizes a model's predictive performance.
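As an illustration of this Occam penalty, the sketch below compares a simple point hypothesis against a diffuse alternative on invented coin-flip data; the models, counts, and priors are assumptions made for the example, not taken from the text above.

    from scipy import integrate, stats

    # Hypothetical comparison on n coin flips with k heads:
    #   M1: theta fixed at 0.5 (simple model)
    #   M2: theta ~ Uniform(0, 1) (spreads prior probability widely)
    n, k = 20, 14

    evidence_m1 = stats.binom.pmf(k, n, 0.5)

    # Evidence for M2 marginalizes the likelihood over the uniform prior;
    # analytically this equals 1 / (n + 1).
    evidence_m2, _ = integrate.quad(lambda t: stats.binom.pmf(k, n, t), 0.0, 1.0)

    bf_12 = evidence_m1 / evidence_m2
    print(f"P(x|M1) = {evidence_m1:.4f}, P(x|M2) = {evidence_m2:.4f}, BF12 = {bf_12:.2f}")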
The Law of Total Probability
The computation of P(E) is a direct application of the law of total probability. In the discrete case, one enumerates all mutually exclusive hypotheses, weights the likelihood of the evidence under each by the prior probability of that hypothesis, and sums. In the continuous case, the sum becomes an integral. This is the same operation as marginalizing a joint distribution — hence the name "marginal likelihood."
Example: Medical Screening
P(+Test) = P(+Test | Disease) · P(Disease) + P(+Test | Healthy) · P(Healthy)
         = 0.99 × 0.001 + 0.01 × 0.999
         = 0.00099 + 0.00999 = 0.01098
In the medical screening example, the evidence P(+Test) = 0.01098 tells us that about 1.1% of the population will test positive, combining both true positives (those with the disease) and false positives (healthy individuals who test positive). This total rate is what anchors the posterior probability of disease given a positive test to its surprisingly low value of about 9%.
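A few lines of Python reproduce these numbers; the variable names are illustrative only.

    sens = 0.99          # P(+Test | Disease)
    false_pos = 0.01     # P(+Test | Healthy)
    prevalence = 0.001   # P(Disease)

    evidence = sens * prevalence + false_pos * (1 - prevalence)  # P(+Test)
    posterior = sens * prevalence / evidence                     # P(Disease | +Test)
    print(evidence)   # 0.01098 -> about 1.1% of the population tests positive
    print(posterior)  # ~0.0902 -> about a 9% chance of disease given a positive test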
Evidence and Surprise
The evidence also connects to information-theoretic notions of surprise. Data that are highly improbable under the model — low P(E) — are surprising and carry more information. The negative log-evidence, −log P(E), is the surprisal or self-information of the data under the model. Accumulated across observations, it measures the total information the data provide.
This perspective is central to minimum description length (MDL) approaches and to the connection between Bayesian inference and information theory. The model with the highest evidence (lowest surprisal) is the one that compresses the data most efficiently — it is the model that, in a precise sense, best "understands" the data.
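As a small illustration of this coding view, the sketch below converts two hypothetical evidence values into code lengths in bits; the numbers are invented.

    import numpy as np

    evidence_m1 = 0.0370   # P(x | M1), hypothetical
    evidence_m2 = 0.0476   # P(x | M2), hypothetical

    surprisal_m1 = -np.log2(evidence_m1)   # code length in bits
    surprisal_m2 = -np.log2(evidence_m2)
    print(f"M1: {surprisal_m1:.2f} bits, M2: {surprisal_m2:.2f} bits")
    # The model with the shorter code length is the one with the higher evidence.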
Computational Challenges
For most models of practical interest, the evidence integral is intractable. The integral is over the full parameter space, which may be high-dimensional, and the integrand (likelihood × prior) may be highly concentrated in a small region with complex geometry. This computational difficulty is the central challenge of Bayesian statistics.
Methods for estimating the evidence include:
Conjugate analysis. When the prior is conjugate to the likelihood, the evidence has a closed form. For the Beta-Binomial model: P(k | n, α, β) = C(n,k) · B(α + k, β + n − k) / B(α, β), where B is the beta function; a numerical check of this formula appears in the sketch after this list.
Laplace approximation. Approximate the log-integrand as quadratic around its mode, yielding a Gaussian integral with a known solution.
Nested sampling. Transform the multi-dimensional integral into a one-dimensional integral over prior mass, estimated by drawing samples from nested constrained priors.
Variational lower bound (ELBO). Variational inference maximizes a lower bound on log P(E) over a family of variational approximations q(θ) to the posterior; this simultaneously approximates the posterior and provides an (under-)estimate of the log-evidence. The bound is tight when q(θ) = π(θ | x).
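Of these methods, the conjugate case is the easiest to check numerically. The sketch below implements the Beta-Binomial closed form quoted above and compares it against brute-force quadrature of likelihood × prior; the data counts and prior parameters are made up for the example.

    import numpy as np
    from scipy import integrate, stats
    from scipy.special import betaln, gammaln

    # Closed-form Beta-Binomial evidence:
    # P(k | n, alpha, beta) = C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)
    def log_evidence_beta_binomial(k, n, alpha, beta):
        log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
        return log_choose + betaln(alpha + k, beta + n - k) - betaln(alpha, beta)

    # Hypothetical data: k = 7 successes in n = 10 trials, Beta(2, 2) prior
    n, k, alpha, beta = 10, 7, 2.0, 2.0
    exact = np.exp(log_evidence_beta_binomial(k, n, alpha, beta))

    # Cross-check by direct numerical integration of likelihood * prior
    integrand = lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, alpha, beta)
    numeric, _ = integrate.quad(integrand, 0.0, 1.0)

    print(exact, numeric)  # the two values should agree to quadrature precision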
Evidence vs. Support
It is worth distinguishing the evidence P(E) from the informal notion of "evidence for a hypothesis." In everyday language, we say the data provide evidence for H if they make H more probable. In the Bayesian framework, this is formalized by the likelihood ratio: P(E | H) / P(E | ¬H). Data provide evidence for H precisely when this ratio exceeds 1 — that is, when the data are more probable under H than under its alternative. The normalizing constant P(E), by contrast, is not about any particular hypothesis; it is about the data's overall probability under the full model.
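Using the screening numbers from the earlier example, the likelihood ratio and the posterior it implies can be computed in a few lines; the odds-form update is standard, and the variable names are illustrative.

    p_pos_given_disease = 0.99
    p_pos_given_healthy = 0.01
    prior_odds = 0.001 / 0.999

    likelihood_ratio = p_pos_given_disease / p_pos_given_healthy  # 99: the data favor Disease
    posterior_odds = prior_odds * likelihood_ratio
    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(likelihood_ratio, posterior_prob)  # 99.0, ~0.0902 (the ~9% figure from before)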
Historical Notes
1763: Bayes' original essay computes the posterior for a binomial parameter using what amounts to a marginal likelihood calculation: integrating the product of likelihood and uniform prior over the unit interval.
1939: Harold Jeffreys develops the Bayes factor as a tool for hypothesis testing, placing the marginal likelihood (evidence) at the center of model comparison for the first time.
1995: Robert Kass and Adrian Raftery publish their influential review "Bayes Factors" in the Journal of the American Statistical Association, standardizing interpretation scales and computational methods.
2004: John Skilling introduces nested sampling, providing a general-purpose algorithm specifically designed for evidence computation.
"The evidence is not just a normalizing constant — it is the predictive score of the model. The model that wins is the one that saw the data coming." — David MacKay, Information Theory, Inference, and Learning Algorithms (2003)
Example: Is This Email Spam or Not?
A spam filter evaluates an incoming email containing the word "lottery." The filter maintains two models: Spam and Legitimate. It needs to compute P(Spam | "lottery") using Bayes' theorem — and the evidence term P("lottery") is the key to making the posterior a valid probability.
The Role of Evidence
The numerator is easy to estimate from training data: suppose training gives P("lottery" | Spam) = 0.28 and P("lottery" | Legit) = 0.01, with priors P(Spam) = 0.40 and P(Legit) = 0.60. But what about the denominator, the evidence P("lottery")?
P("lottery") = P("lottery" | Spam) · P(Spam) + P("lottery" | Legit) · P(Legit)
             = (0.28 × 0.40) + (0.01 × 0.60)
             = 0.112 + 0.006
             = 0.118
Now the posterior:
P(Spam | "lottery") = (0.28 × 0.40) / 0.118 = 0.112 / 0.118 ≈ 0.95
Without dividing by P("lottery"), the numerator 0.112 is just an unnormalized score — it doesn't mean anything as a probability on its own. The evidence term ensures that P(Spam | "lottery") + P(Legit | "lottery") = 1. It accounts for the overall prevalence of the word "lottery" across all emails, both spam and legitimate. The rarer the word is overall, the more diagnostic it becomes — and the evidence term is what captures this.
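The arithmetic is easy to verify; the following sketch simply recomputes the example's numbers and confirms that the two posteriors sum to one.

    p_word_given_spam = 0.28
    p_word_given_legit = 0.01
    p_spam, p_legit = 0.40, 0.60

    evidence = p_word_given_spam * p_spam + p_word_given_legit * p_legit  # P("lottery") = 0.118
    post_spam = p_word_given_spam * p_spam / evidence      # ~0.949
    post_legit = p_word_given_legit * p_legit / evidence   # ~0.051
    print(evidence, post_spam, post_legit, post_spam + post_legit)  # posteriors sum to 1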