Bayesian Statistics

Likelihood Function

The likelihood function L(θ | x) measures how well each possible parameter value explains the observed data, serving as the bridge between data and inference in both Bayesian and frequentist statistics.

L(θ | x) = P(x | θ), viewed as a function of θ for fixed data x

The likelihood function is one of the most important concepts in all of statistics. Given observed data x, the likelihood L(θ | x) is defined as the probability (or probability density) of the data viewed as a function of the parameter θ, with x held fixed. It is numerically identical to the sampling distribution P(x | θ), but its interpretation is fundamentally different: where the sampling distribution asks "given this parameter value, how probable are these data?", the likelihood asks "given these data, how well does this parameter value explain them?"

This reversal of roles — from varying data to varying parameters — is the conceptual key to statistical inference. The likelihood function encodes everything the data have to say about the parameter. In Bayesian inference, it is multiplied by the prior to produce the posterior. In frequentist inference, it is maximized to produce the MLE. In both frameworks, the likelihood is the data's contribution to inference.

Likelihood Function L(θ | x) = P(x | θ)   [viewed as a function of θ for fixed x]

For Independent Observations L(θ | x₁, …, xₙ) = ∏ᵢ₌₁ⁿ P(xᵢ | θ)

Log-Likelihood ℓ(θ | x) = log L(θ | x) = Σᵢ₌₁ⁿ log P(xᵢ | θ)
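
As a concrete sketch of these definitions (in Python with NumPy, using an illustrative Bernoulli model that is not part of the text above), the likelihood and log-likelihood of a fixed dataset can be evaluated over a grid of candidate parameter values:

```python
import numpy as np

# Fixed data: 10 Bernoulli trials with 7 successes (illustrative values)
x = np.array([1, 1, 0, 1, 1, 0, 1, 0, 1, 1])
n, k = len(x), x.sum()

# Grid of candidate parameter values theta in (0, 1)
theta = np.linspace(0.01, 0.99, 99)

# Likelihood: product of P(x_i | theta) over the independent observations
likelihood = theta**k * (1 - theta)**(n - k)

# Log-likelihood: sum of log P(x_i | theta), numerically safer for large n
log_likelihood = k * np.log(theta) + (n - k) * np.log(1 - theta)

# The grid value with the highest likelihood sits near k/n = 0.7,
# anticipating the maximum likelihood estimator discussed later
print("grid maximizer of L(theta):", theta[np.argmax(likelihood)])
```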

Likelihood Is Not a Probability

A critical distinction: the likelihood function is not a probability distribution over θ. It does not integrate to 1 over the parameter space (and need not even integrate to a finite value). Two likelihoods that are proportional — L₁(θ) = c · L₂(θ) for some constant c > 0 — carry the same information about θ. This is why the likelihood is often written with a proportionality sign: L(θ | x) ∝ P(x | θ).

The distinction matters practically. One cannot read off a "probability that θ = 0.5" from the likelihood function. To obtain a probability distribution over θ, the Bayesian multiplies the likelihood by a prior and normalizes. The frequentist avoids distributions over θ entirely, instead using the likelihood to construct estimators and test statistics whose properties are evaluated across repeated samples.
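
A quick numerical check makes the point concrete. In a sketch reusing the Bernoulli likelihood from the earlier example (7 successes in 10 trials), integrating L(θ) over the parameter space gives a number nowhere near 1; only after normalization (equivalent here to using a flat prior) does it become a genuine density:

```python
from scipy.integrate import quad

n, k = 10, 7                                   # 7 successes in 10 Bernoulli trials

# Likelihood of these fixed data as a function of theta
L = lambda theta: theta**k * (1 - theta)**(n - k)

area, _ = quad(L, 0.0, 1.0)
print(f"integral of L(theta) over [0, 1]: {area:.5f}")    # about 0.00076, not 1

# Normalizing recovers a genuine probability density over theta
posterior = lambda theta: L(theta) / area
check, _ = quad(posterior, 0.0, 1.0)
print(f"integral of the normalized version: {check:.5f}") # 1.00000
```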

The Likelihood Principle

The likelihood principle states that all of the evidence about θ provided by the data is contained in the likelihood function. Two datasets from different experiments that yield proportional likelihood functions are evidentially equivalent. Bayesian methods automatically satisfy this principle — the posterior depends on the data only through the likelihood. Many frequentist methods violate it, since p-values and confidence intervals depend on the sample space (outcomes that could have occurred but did not). Birnbaum's theorem (1962) showed that the likelihood principle follows from two widely accepted evidential principles.

Maximum Likelihood Estimation

The maximum likelihood estimator (MLE), introduced by R. A. Fisher in the 1920s, is the parameter value that maximizes the likelihood function:

Maximum Likelihood Estimator θ̂_MLE = arg max_θ L(θ | x) = arg max_θ ℓ(θ | x)

Score Equation ∂ℓ(θ | x) / ∂θ |_{θ=θ̂} = 0

Asymptotic Distribution θ̂_MLE ~ N(θ₀, I(θ₀)⁻¹/n)   as n → ∞, where θ₀ is the true parameter value and I(θ₀) is the Fisher information per observation

The MLE has many desirable properties: consistency, asymptotic efficiency, asymptotic normality, and invariance under reparameterization. Fisher regarded it as the ideal frequentist estimator. From a Bayesian perspective, the MLE is the posterior mode under a flat prior — and the Bernstein–von Mises theorem shows that Bayesian posteriors center on the MLE as sample sizes grow.
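
A minimal numerical sketch of maximum likelihood estimation (assuming SciPy is available, with a Poisson model and simulated data chosen purely for illustration): minimize the negative log-likelihood and compare with the closed-form answer.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.2, size=200)            # simulated counts; true rate 4.2

# Negative log-likelihood of the data under a Poisson(rate) model
def neg_log_lik(rate):
    return -poisson.logpmf(x, rate).sum()

# Numerical MLE: minimize the negative log-likelihood over a bounded interval
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")

print("numerical MLE:", res.x)
print("closed form  :", x.mean())             # for the Poisson model the MLE is the sample mean
```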

The Likelihood in Bayesian Inference

In Bayesian statistics, the likelihood plays a dual role. First, it is the weight function that transforms the prior into the posterior: regions of the parameter space where the likelihood is high receive increased posterior mass, while regions where it is low are downweighted. Second, the shape of the likelihood — its curvature, width, and symmetry — determines how informative the data are. A sharply peaked likelihood conveys precise information and dominates the prior; a flat likelihood is uninformative and lets the prior persist.

Bayesian Role of the Likelihood π(θ | x) ∝ L(θ | x) · π(θ)

Information Content Observed Fisher Information:  J(θ̂) = −∂²ℓ(θ | x) / ∂θ² |_{θ=θ̂}
Large J(θ̂) → sharply peaked likelihood → data are informative
Small J(θ̂) → flat likelihood → data are uninformative
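
Both roles can be illustrated with a short grid computation (the Beta(2, 2) prior and the 7-successes-in-10-trials data are hypothetical choices made only for this sketch): the prior is reweighted by the likelihood and renormalized, and the curvature of the log-likelihood at its peak gives the observed information.

```python
import numpy as np
from scipy.stats import beta

n, k = 10, 7                                   # 7 successes in 10 trials (illustrative)
theta = np.linspace(0.001, 0.999, 999)
dt = theta[1] - theta[0]

prior = beta.pdf(theta, 2, 2)                  # Beta(2, 2) prior, chosen only for the sketch
likelihood = theta**k * (1 - theta)**(n - k)

# Role 1: the likelihood reweights the prior; normalizing gives the posterior
posterior = prior * likelihood
posterior /= posterior.sum() * dt
print("posterior mean:", np.sum(theta * posterior) * dt)        # pulled toward k/n = 0.7

# Role 2: the curvature of the log-likelihood at its peak measures informativeness
theta_hat = k / n
obs_info = k / theta_hat**2 + (n - k) / (1 - theta_hat)**2      # J(theta_hat) = -l''(theta_hat)
print("observed information J(theta_hat):", obs_info)           # grows with sample size
```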

Likelihood Ratios

The likelihood ratio compares how well two parameter values explain the data:

Likelihood Ratio Λ(θ₁, θ₂) = L(θ₁ | x) / L(θ₂ | x)

Interpretation Λ > 1: the data are more probable under θ₁ than θ₂
Λ = 10: the data are 10 times more probable under θ₁
Λ = 1: the data do not discriminate between θ₁ and θ₂

Likelihood ratios are the fundamental measure of evidential strength. In the odds form of Bayes' theorem, the likelihood ratio is the factor that converts prior odds to posterior odds. In forensic statistics, the likelihood ratio is the recommended measure for presenting the weight of DNA evidence, fingerprint comparisons, and other forensic data. In sequential analysis, the log-likelihood ratio accumulates evidence over time — a framework Alan Turing used at Bletchley Park for breaking the Enigma cipher.
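
The odds-form bookkeeping is simple enough to show in a few lines (all numbers below are hypothetical, chosen only to display the arithmetic):

```python
# Odds form of Bayes' theorem: posterior odds = likelihood ratio x prior odds
prior_odds = 1 / 4            # H1 judged four times less probable than H2 before seeing data
likelihood_ratio = 10         # data are 10 times more probable under H1 than under H2

posterior_odds = likelihood_ratio * prior_odds             # 2.5 : 1 in favor of H1
posterior_prob_H1 = posterior_odds / (1 + posterior_odds)

print("posterior odds for H1 :", posterior_odds)            # 2.5
print(f"posterior P(H1 | data): {posterior_prob_H1:.3f}")   # 0.714
```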

Historical Development

1763–1812

Bayes and Laplace use what we would now call likelihoods in their work on inverse probability, though the concept is not distinguished from the sampling distribution.

1921–1922

R. A. Fisher introduces the term "likelihood" and distinguishes it sharply from probability. He defines maximum likelihood estimation and establishes its asymptotic properties.

1962

Allan Birnbaum proves that the sufficiency and conditionality principles jointly entail the likelihood principle — that all evidential content resides in the likelihood function.

1980s–present

Likelihoods for complex models (hierarchical models, latent variable models, survival models) become tractable through EM algorithms, MCMC, and numerical integration. The likelihood remains the universal interface between data and inference.

Likelihoods for Common Models

The form of the likelihood varies with the statistical model. For a Bernoulli model with parameter p and data consisting of k successes in n trials, L(p) = p^k(1−p)^(n−k). For a normal model with known variance σ², the likelihood for the mean μ is L(μ) ∝ exp(−n(μ − x̄)²/(2σ²)) — a Gaussian centered on the sample mean. For a Poisson model with rate λ, L(λ) ∝ e^(−nλ) · λ^(Σxᵢ), the constant factor 1/∏ᵢ xᵢ! having been dropped because it does not involve λ. In each case, the likelihood captures the data's message about the parameter through the sufficient statistics (k, x̄, and Σxᵢ respectively).
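
A brief sketch (using SciPy distributions for the cross-checks, with simulated data) confirms that these compact forms agree with the full per-observation computation, up to constants that cancel in likelihood ratios:

```python
import numpy as np
from scipy.stats import bernoulli, poisson

rng = np.random.default_rng(1)

# Bernoulli(p): the log-likelihood depends on the data only through k = sum(x) and n
x_b = rng.integers(0, 2, size=20)
k, n = x_b.sum(), len(x_b)
loglik_bern = lambda p: k * np.log(p) + (n - k) * np.log(1 - p)
print(np.isclose(loglik_bern(0.4), bernoulli.logpmf(x_b, 0.4).sum()))   # True: exact match

# Normal(mu) with known sigma: depends only on the sample mean x-bar (up to a constant)
x_n = rng.normal(5.0, 2.0, size=30)
sigma, xbar, m = 2.0, x_n.mean(), len(x_n)
loglik_norm = lambda mu: -m * (mu - xbar) ** 2 / (2 * sigma**2)

# Poisson(lam): depends only on sum(x) and n (up to a constant not involving lam)
x_p = rng.poisson(3.0, size=25)
s, r = x_p.sum(), len(x_p)
loglik_pois = lambda lam: -r * lam + s * np.log(lam)

# Constants cancel in differences, so likelihood ratios agree with the full computation
print(np.isclose(loglik_pois(2.5) - loglik_pois(3.5),
                 poisson.logpmf(x_p, 2.5).sum() - poisson.logpmf(x_p, 3.5).sum()))  # True
```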

"What the use of P [the likelihood] implies, therefore, is that the mathematical concept of probability is inadequate to express our mental confidence or diffidence in making such inferences, and that the mathematical quantity which appears to be appropriate for expressing the degree of our confidence is the likelihood." — R. A. Fisher, "On the Mathematical Foundations of Theoretical Statistics" (1922)

Fisher's insistence on separating likelihood from probability was a conceptual breakthrough. It clarified the logic of statistical inference and gave both Bayesians and frequentists a common foundation. The likelihood function remains the one component of inference on which virtually all statisticians agree.

Example: Identifying a Mystery Animal from Footprints

A wildlife biologist finds large paw prints near a campsite in Montana. Three hypotheses are on the table: grizzly bear, mountain lion, or large dog. She measures the print width at 13 cm and asks: under each hypothesis, how likely is a 13 cm print?

Computing Likelihoods

From her reference database, she knows the distribution of paw-print widths for each species:

Likelihood of a 13 cm Print Under Each Hypothesis L(Grizzly) = P(13 cm | Grizzly) = 0.35   (13 cm is common for grizzlies)
L(Mountain Lion) = P(13 cm | Mountain Lion) = 0.05   (too wide for most lions)
L(Large Dog) = P(13 cm | Large Dog) = 0.12   (possible but unusual)

What the Likelihood Function Tells Us

The likelihood function ranks the hypotheses by how well each predicts the observed evidence. The 13 cm print is 7 times more likely under the grizzly hypothesis than the mountain lion hypothesis, and about 3 times more likely than the large dog hypothesis.

Likelihood Ratios LR(Grizzly vs. Lion) = 0.35 / 0.05 = 7.0
LR(Grizzly vs. Dog) = 0.35 / 0.12 ≈ 2.9

The likelihood tells you the evidential strength of the data alone. It doesn't tell you the probability that a grizzly made the print — for that, you also need priors (How common are grizzlies in this area? Are dogs allowed on this trail?). But the likelihood is the part of the evidence that both Bayesians and frequentists agree on. It captures exactly what the data say, no more and no less.
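
To make that division of labor explicit, the following sketch combines the likelihoods from the example with priors that are purely hypothetical (the prevalence numbers are invented for illustration and are not part of the example):

```python
# Likelihoods from the footprint example: how well each hypothesis predicts a 13 cm print
likelihood = {"grizzly": 0.35, "mountain_lion": 0.05, "large_dog": 0.12}

# Hypothetical priors for this trail -- invented numbers, used only to show the mechanics
prior = {"grizzly": 0.10, "mountain_lion": 0.10, "large_dog": 0.80}

# Posterior over the three hypotheses: prior x likelihood, then normalize
unnormalized = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: w / total for h, w in unnormalized.items()}

for h, p in posterior.items():
    print(f"{h:14s} posterior = {p:.3f}")
# With dogs assumed far more common than bears, the dog hypothesis ends up most probable
# even though the grizzly hypothesis has the highest likelihood.
```

Changing the hypothetical priors changes the posterior, but the likelihoods, and hence the likelihood ratios, stay fixed: that is exactly the separation between evidence and prior belief described above.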

Likelihood Is Not Probability

A critical subtlety: L(Grizzly) = 0.35 does not mean there is a 35% chance a grizzly made the print. It means that if a grizzly did make the print, there would be a 35% chance the print would be exactly 13 cm wide. The likelihood function asks "How well does each hypothesis predict the data?" — not "How probable is each hypothesis?" Confusing the two is a common error and the source of many statistical fallacies.

Extending the Example: Several Prints

Suppose the biologist measures several paw-print widths along the trail rather than a single one. Treating the prints as independent observations, the likelihood of the whole dataset under each hypothesis (Grizzly Bear, mean width 13 cm; Mountain Lion, mean 8 cm; Large Dog, mean 7 cm) is the product of the per-print likelihoods, and the hypothesis that best predicts the full set of measurements has the highest likelihood. Likelihood ratios then summarize how strongly the data favor one hypothesis over another, as in the sketch below.
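
A sketch of that computation, assuming a normal measurement model for print width under each hypothesis, with the means given above and an illustrative common standard deviation of 1.5 cm:

```python
import numpy as np
from scipy.stats import norm

# Paw-print widths in cm measured along a trail (illustrative data)
widths = np.array([12.5, 13.2, 11.8, 13.0])

# Measurement model per hypothesis: Normal(mean, sd). The means follow the text;
# the common 1.5 cm standard deviation is an assumption made for this sketch.
means = {"grizzly": 13.0, "mountain_lion": 8.0, "large_dog": 7.0}
sd = 1.5

# Log-likelihood of the whole dataset under each hypothesis (independent prints)
loglik = {h: norm.logpdf(widths, mu, sd).sum() for h, mu in means.items()}

best = max(loglik, key=loglik.get)
for h, ll in loglik.items():
    lr = np.exp(loglik[best] - ll)             # likelihood ratio: best hypothesis vs. this one
    print(f"{h:14s} log-likelihood = {ll:8.2f}   LR(best vs. this) = {lr:.3g}")
```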
