Bayesian Statistics

Prior Probability

The prior probability distribution encodes an agent's beliefs about an unknown quantity before observing data, serving as the starting point for Bayesian inference and the mechanism by which background knowledge enters the analysis.

π(θ) — the probability distribution over θ before observing data

In Bayesian statistics, every inference begins with a prior distribution — a probability distribution π(θ) that represents the analyst's beliefs about the parameter θ before any data from the current study are observed. The prior is not a guess or an afterthought; it is a fundamental component of the Bayesian framework, carrying the same formal status as the likelihood function. When combined with the likelihood via Bayes' theorem, the prior yields the posterior distribution, which synthesizes prior knowledge with observed evidence.

The choice of prior is one of the most distinctive and most debated aspects of Bayesian statistics. It is where background knowledge, expert judgment, physical constraints, and sometimes principled ignorance enter the analysis. Done well, it improves inference by incorporating genuine information. Done carelessly, it can distort conclusions. The rich theory of prior specification — conjugate priors, reference priors, weakly informative priors, maximum entropy priors — reflects decades of effort to navigate this balance.

Bayes' Theorem with Explicit Prior

π(θ | x) = L(θ; x) · π(θ) / ∫ L(θ; x) · π(θ) dθ

Where:
π(θ)      →  Prior distribution — beliefs about θ before data
L(θ; x)   →  Likelihood — probability of data x given θ
π(θ | x)  →  Posterior — updated beliefs after observing data x
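As a minimal numerical illustration of this formula, the sketch below approximates the posterior on a grid and normalizes by the integral in the denominator. The Beta(2, 2) prior, Bernoulli likelihood, and data (7 successes in 10 trials) are illustrative choices, not taken from the text above.

```python
import numpy as np

# Grid approximation of Bayes' theorem: posterior ∝ likelihood × prior.
# Illustrative setup: θ is a Bernoulli success probability, the prior is
# Beta(2, 2), and the data are 7 successes in 10 trials.
theta = np.linspace(0.001, 0.999, 999)            # grid over the parameter
dtheta = theta[1] - theta[0]
prior = theta * (1 - theta)                       # Beta(2, 2) kernel: θ^(2-1) · (1-θ)^(2-1)
likelihood = theta**7 * (1 - theta)**3            # probability of the observed data given θ

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)   # divide by the evidence integral

print("Posterior mean: %.3f" % ((theta * posterior).sum() * dtheta))   # ≈ 9/14 ≈ 0.643
```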

Types of Priors

Informative Priors

An informative prior concentrates its probability mass in a specific region of the parameter space, reflecting genuine prior knowledge. A pharmacologist estimating a drug's half-life might use a log-normal prior centered on values reported in previous studies. A physicist modeling a fundamental constant might use a prior tightly concentrated around its CODATA value. Informative priors allow the analyst to leverage existing knowledge, producing sharper posteriors — especially when the current data are sparse.
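A minimal sketch of what such an informative prior might look like for the half-life example; the specific numbers (a typical value near 6 hours and the spread) are hypothetical, not drawn from any pharmacological source.

```python
from scipy import stats

# Hypothetical informative prior for a drug half-life (hours).
# Suppose earlier studies report values clustered around 6 h; a log-normal
# prior centred at log(6) with a log-scale sd of 0.25 encodes that belief.
prior = stats.lognorm(s=0.25, scale=6.0)   # scale = exp(mean of log half-life)

# The prior puts roughly 95% of its mass between about 3.7 and 9.8 hours,
# so it is informative but not dogmatic.
print(prior.ppf([0.025, 0.5, 0.975]))
```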

Weakly Informative Priors

Weakly informative priors, championed by Andrew Gelman and others, are designed to regularize inference without dominating it. They rule out implausible parameter values while remaining broad enough to let the data drive the posterior. A weakly informative prior for a regression coefficient might be a normal distribution with a standard deviation large enough to cover all reasonable effect sizes but small enough to penalize absurdly large ones. This approach has become standard practice in applied Bayesian modeling.
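A small sketch of this idea for a standardized regression coefficient; the Normal(0, 2.5) scale below is an illustrative choice in the spirit of this approach, not a universal recommendation.

```python
from scipy import stats

# Hypothetical weakly informative prior for a standardized regression
# coefficient: Normal(0, 2.5). Broad enough to cover plausible effect sizes,
# tight enough to penalize absurd ones.
prior = stats.norm(loc=0.0, scale=2.5)

print("P(|beta| < 5)  = %.3f" % (prior.cdf(5) - prior.cdf(-5)))   # ≈ 0.95
print("P(|beta| > 25) = %.1e" % (2 * prior.sf(25)))               # vanishingly small
```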

Non-Informative and Reference Priors

Non-informative (or "objective") priors attempt to encode ignorance — to let the data speak for themselves. The oldest example is the uniform prior, used by Laplace in his rule of succession. Harold Jeffreys proposed priors based on the Fisher information matrix, yielding invariance under reparameterization. José Bernardo developed reference priors that maximize the expected Kullback-Leibler divergence between prior and posterior, formalizing the idea of "learning the most from the data."

Jeffreys Prior

π(θ) ∝ √det(I(θ))

Where I(θ) = Fisher information matrix
I(θ)ᵢⱼ = −E[∂² log L(θ; x) / ∂θᵢ ∂θⱼ]

Example: Bernoulli Parameter

I(p) = 1 / [p(1−p)]
π(p) ∝ p^(−1/2) · (1−p)^(−1/2) = Beta(1/2, 1/2)
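The Bernoulli result can be checked numerically: the expected negative second derivative of the log-likelihood should match 1 / [p(1−p)]. A minimal sketch:

```python
# Numerical check of the Bernoulli example: the Fisher information
# I(p) = E[-d^2 log L / dp^2] should equal 1 / (p(1-p)).
def fisher_info(p):
    d2_x1 = -1.0 / p**2            # d^2/dp^2 of log p        (observation x = 1)
    d2_x0 = -1.0 / (1 - p)**2      # d^2/dp^2 of log(1 - p)   (observation x = 0)
    return -(p * d2_x1 + (1 - p) * d2_x0)   # expectation over x ~ Bernoulli(p)

for p in (0.1, 0.5, 0.9):
    print(p, fisher_info(p), 1 / (p * (1 - p)))   # the two columns agree
```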

Conjugate Priors

A prior is conjugate to a likelihood if the posterior belongs to the same distributional family as the prior. Conjugacy yields closed-form posteriors, eliminating the need for numerical integration. The Beta-Binomial, Normal-Normal, and Gamma-Poisson pairs are classic examples. While computational advances have reduced the necessity of conjugate analysis, conjugate priors remain invaluable for building intuition, teaching, and as components of larger hierarchical models.
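A minimal sketch of the classic Beta-Binomial case (the prior and counts are illustrative): because of conjugacy, the posterior is a one-line computation rather than an integral.

```python
# Conjugate Beta-Binomial update: a Beta(a, b) prior combined with k successes
# in n Bernoulli trials gives a Beta(a + k, b + n - k) posterior in closed form.
def beta_binomial_update(a, b, successes, trials):
    return a + successes, b + trials - successes

a_post, b_post = beta_binomial_update(a=2, b=2, successes=7, trials=10)
print("Posterior: Beta(%d, %d), mean %.3f" % (a_post, b_post, a_post / (a_post + b_post)))
# Beta(9, 5) with mean 0.643 — matching the grid approximation sketched earlier
```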

Improper Priors

Some reference priors do not integrate to a finite value — they are improper. The flat prior π(θ) = 1 on the real line is the simplest example. Improper priors can still yield proper (normalizable) posteriors if the likelihood function is sufficiently informative. However, they must be used with care: in model comparison, improper priors lead to undefined Bayes factors, and in some models they can produce improper posteriors.
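A minimal sketch of this point, assuming a normal likelihood with known σ: the flat prior on the mean is improper, yet the resulting posterior is the proper distribution N(x̄, σ²/n).

```python
import numpy as np

# Sketch: a flat (improper) prior π(mu) = 1 combined with a normal likelihood
# with known sigma still yields a proper posterior, N(xbar, sigma^2 / n).
rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=3.0, scale=sigma, size=25)   # simulated data (illustrative)

post_mean = x.mean()                 # posterior mean under the flat prior
post_sd = sigma / np.sqrt(len(x))    # posterior standard deviation
print("Posterior: N(%.3f, %.3f^2)" % (post_mean, post_sd))
```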

Cromwell's Rule

Dennis Lindley formulated Cromwell's Rule: never assign probability 0 or 1 to any proposition that is not a logical certainty. If π(θ) = 0 for some value of θ, no amount of data can ever make the posterior positive there — the prior has permanently excluded that possibility. In Cromwell's own words to the Church of Scotland: "I beseech you, in the bowels of Christ, think it possible that you may be mistaken." The rule is a practical safeguard against overconfident priors.
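A toy illustration of the rule, with an illustrative grid and counts: a prior that puts zero mass on part of the parameter space can never be pulled back there by the data.

```python
import numpy as np

# Cromwell's rule in miniature: a prior that assigns zero mass to p > 0.5
# can never recover, no matter how strongly the data favour large p.
p = np.linspace(0.001, 0.999, 999)
prior = np.where(p <= 0.5, 1.0, 0.0)          # dogmatic prior: p > 0.5 "impossible"
likelihood = p**95 * (1 - p)**5               # 95 successes in 100 trials

posterior = prior * likelihood
posterior /= posterior.sum()
print("Posterior P(p > 0.5):", posterior[p > 0.5].sum())   # exactly 0
```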

The Role of the Prior in Practice

In small-sample settings, the prior exerts substantial influence on the posterior. With 5 observations from a normal distribution, even a mildly informative prior on the mean can noticeably shift the posterior relative to the maximum likelihood estimate. This is a feature, not a bug: with little data, it is rational to lean on prior knowledge.

As the sample size grows, the likelihood dominates. Under regularity conditions, the posterior concentrates around the true parameter value regardless of the prior (provided the prior gives that value positive density) — a property known as posterior consistency; the Bernstein–von Mises theorem sharpens this by describing the posterior's limiting normal shape. Two analysts with different priors will, with enough data, reach essentially identical conclusions. The prior matters most when data are scarce and least when data are abundant.

Bernstein–von Mises Theorem (Informal)

As n → ∞, the posterior π(θ | x₁, …, xₙ) converges to N(θ̂, I(θ̂)⁻¹/n)

Where:
θ̂     →  Maximum likelihood estimate
I(θ̂)  →  Fisher information at θ̂

The posterior becomes approximately normal and independent of the prior.
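A brief simulation of the prior "washing out", using the conjugate Beta update from earlier with two illustrative priors and a hypothetical true value p = 0.3:

```python
import numpy as np

# Posterior means for a Bernoulli probability under two very different Beta
# priors converge as the sample size grows (simulated data, true p = 0.3).
rng = np.random.default_rng(1)
priors = {"Beta(1, 1)": (1, 1), "Beta(20, 2)": (20, 2)}

for n in (10, 100, 10_000):
    k = rng.binomial(n, 0.3)                     # successes in n trials
    for name, (a, b) in priors.items():
        print("n = %6d  %-11s posterior mean %.3f" % (n, name, (a + k) / (a + b + n)))
```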

Prior Elicitation

In applied work, specifying the prior is often the most challenging step. Prior elicitation is the process of translating an expert's knowledge into a formal probability distribution. Techniques include:

Quantile-based elicitation. Ask the expert: "What value is the parameter equally likely to be above or below?" (the median). "What is the value such that you'd be surprised if the parameter exceeded it?" (an upper quantile). Fit a distribution to the elicited quantiles (a short sketch follows this list).

Predictive elicitation. Rather than asking about parameters directly, ask about observable quantities: "If we ran this experiment, what range of outcomes would you expect?" Invert the model to find the prior over parameters that induces the stated predictive distribution.

Historical data. Use data from previous studies as the basis for the prior. If a meta-analysis of 20 clinical trials yields a pooled effect estimate and confidence interval, these can be translated into an informative prior for a new trial.
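As a sketch of the quantile-based approach above, with entirely hypothetical numbers: suppose the expert states a median of 10 and would be surprised if the parameter exceeded 25 (treated here as the 90th percentile). A log-normal prior can be fitted to both statements.

```python
import numpy as np
from scipy import stats

# Fit a log-normal prior to two elicited quantiles (hypothetical numbers).
median, q90 = 10.0, 25.0
mu = np.log(median)                                   # log-normal median = exp(mu)
sigma = (np.log(q90) - mu) / stats.norm.ppf(0.90)     # solve for the log-scale spread

prior = stats.lognorm(s=sigma, scale=np.exp(mu))
print("Fitted prior quantiles (5%, 50%, 90%):", prior.ppf([0.05, 0.50, 0.90]))
```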

Sensitivity Analysis

Because the prior involves judgment, responsible Bayesian analysis includes prior sensitivity analysis: checking whether the conclusions change meaningfully under alternative reasonable priors. If the posterior is robust to a range of priors, the specific choice matters little and the conclusions are trustworthy. If the posterior is highly sensitive to the prior, the data are insufficient to overwhelm prior assumptions, and the analyst should report this honestly.

Sensitivity analysis can range from informal (rerunning the analysis with a few different priors) to formal (computing the range of posteriors over a class of priors, as in robust Bayesian analysis). Either way, it is an essential part of Bayesian practice — and it has no natural counterpart in frequentist statistics, where hidden assumptions are harder to identify and stress-test.
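A minimal sketch of the informal end of this spectrum, reusing the conjugate Beta-Binomial update with illustrative counts and three candidate priors:

```python
# Informal prior sensitivity analysis for a Beta-Binomial model
# (illustrative data): rerun the same update under several reasonable priors.
data_successes, data_trials = 14, 40

for name, (a, b) in {"Beta(1, 1)": (1, 1),
                     "Beta(0.5, 0.5)": (0.5, 0.5),
                     "Beta(5, 5)": (5, 5)}.items():
    a_post = a + data_successes
    b_post = b + data_trials - data_successes
    print("%-14s posterior mean %.3f" % (name, a_post / (a_post + b_post)))
# If these means were far apart, the data would not be overwhelming the prior.
```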

Philosophical Significance

The prior is the lightning rod of the Bayesian-frequentist debate. Critics charge that priors introduce subjective bias, that different priors yield different conclusions, and that the choice of prior is arbitrary. Defenders reply that all statistical methods involve assumptions — frequentist methods simply hide them in the choice of test statistic, significance level, and stopping rule. The prior makes assumptions explicit, inspectable, and subject to criticism.

"The subjectivist states his judgments, whereas the objectivist sweeps them under the rug by calling assumptions knowledge and reporting opinions as facts." — L. J. Savage, The Foundations of Statistics (1954)

De Finetti's representation theorem provides the deepest justification: if an agent judges observations to be exchangeable, the existence of a prior is not an assumption but a mathematical consequence. The prior is the price of coherence — and coherence is not optional for rational agents.

Example: Estimating a New Restaurant's Quality

You're visiting a new city and considering a restaurant with no online reviews yet. Before stepping inside, you already have some beliefs about how likely it is to be good — that's your prior probability.

Building the Prior

From experience, you know that restaurants in this neighborhood's price range are "good" about 60% of the time. You also notice the restaurant has a James Beard Award sticker in the window, which you've seen at good restaurants 80% of the time and mediocre ones only 10% of the time. You combine these pieces of prior information:

Prior Construction

Base rate: P(Good) = 0.60
After noticing the award sticker (before eating):

P(Good | Sticker) = P(Sticker | Good) · P(Good) / P(Sticker)
                  = (0.80 × 0.60) / (0.80 × 0.60 + 0.10 × 0.40)
                  = 0.48 / 0.52
                  ≈ 0.923

Before even ordering, your prior probability that this is a good restaurant is about 92%. This prior will now serve as the starting point when you begin collecting data — the appetizer, the service, the main course — each of which will update your belief further.
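As a sketch of how that updating might continue — the appetizer probabilities below are hypothetical, not part of the example above — each new observation applies Bayes' theorem again, with the previous posterior acting as the new prior:

```python
# Continuing the restaurant example with hypothetical numbers: an excellent
# appetizer updates the 0.923 prior by another application of Bayes' theorem.
p_good = 0.923                 # prior from the sticker calculation above
p_obs_given_good = 0.70        # hypothetical: good restaurants serve a great appetizer 70% of the time
p_obs_given_bad = 0.20         # hypothetical: mediocre ones only 20% of the time

numerator = p_obs_given_good * p_good
posterior = numerator / (numerator + p_obs_given_bad * (1 - p_good))
print("P(Good | sticker, great appetizer) = %.3f" % posterior)   # ≈ 0.977
```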

Strong vs. Weak Priors

If you had arrived with no knowledge at all — no neighborhood base rate, no award sticker — you might use an "uninformative" prior: 50/50 good or not. This weak prior would let the data (your meal experience) dominate quickly. But with the award sticker, your prior is strong, meaning you'd need a truly terrible meal to overcome that 92% starting belief. The strength of a prior reflects how much evidence you'd need to change your mind — and that's exactly what prior probability quantifies.

Interactive Calculator

Each row is a product review (positive or negative). The calculator shows how the same data leads to different posteriors depending on the prior. A weak prior Beta(1,1) lets data dominate; a strong prior Beta(20,20) resists change. Watch how they diverge early but converge as data accumulates.
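For readers who prefer code to widgets, here is a sketch of the same comparison under stated assumptions (simulated reviews with a hypothetical true positive rate of 0.8), using the conjugate Beta update:

```python
import numpy as np

# The positive-review rate of a product is updated review by review under a
# weak Beta(1, 1) prior and a strong Beta(20, 20) prior (simulated data).
rng = np.random.default_rng(7)
reviews = (rng.random(200) < 0.8).astype(int)     # 1 = positive review
cum_pos = np.cumsum(reviews)

for label, (a, b) in (("weak Beta(1, 1)", (1, 1)), ("strong Beta(20, 20)", (20, 20))):
    for n in (5, 20, 200):
        mean = (a + cum_pos[n - 1]) / (a + b + n)
        print("%-19s after %3d reviews: posterior mean %.3f" % (label, n, mean))
```

In a run like this, the two posteriors differ sharply after a handful of reviews but move much closer together once a couple hundred have accumulated.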

