Bayesian Statistics

Bayes' Theorem

The foundational equation of Bayesian statistics — describing how to update the probability of a hypothesis in light of new evidence.

P(H | E) = P(E | H) · P(H) / P(E)

Bayes' Theorem takes three inputs — a prior belief, the likelihood of observed evidence under that belief, and the overall probability of the evidence — and returns a posterior belief that precisely quantifies how the evidence should shift one's confidence. Its beauty lies in its simplicity and universality. Whether applied to medical diagnosis, criminal forensics, machine learning, or the search for a lost submarine, the same equation governs rational belief revision.

It is not merely a tool within statistics; it is, as E. T. Jaynes argued, the unique extension of logic to situations of uncertainty.

Bayes' Theorem — Full Form

P(H | E)  =  P(E | H) · P(H)  /  P(E)

Where:

P(H | E)  →  Posterior — updated probability of H after observing E
P(E | H)  →  Likelihood — probability of E if H is true
P(H)      →  Prior — initial probability of H before observing E
P(E)      →  Evidence — total probability of E across all hypotheses

Each component carries distinct meaning. The prior P(H) encodes everything known about the hypothesis before the current evidence arrives — previous experiments, expert judgment, physical constraints, or formal symmetry arguments. The likelihood P(E | H) measures how well the hypothesis predicts the specific evidence observed. The evidence P(E) normalizes the result, ensuring the posterior is a valid probability distribution. And the posterior P(H | E) is the complete output: a new probability that synthesizes prior knowledge with observed data.
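In code, the whole update is a one-liner. Here is a minimal sketch in Python for the discrete case (the function and variable names are illustrative, not from any library):

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

# For a binary hypothesis, the evidence term is
# P(E) = P(E|H) * P(H) + P(E|~H) * P(~H).
prior = 0.3          # P(H)
likelihood = 0.8     # P(E | H)
lik_not = 0.2        # P(E | ~H)
evidence = likelihood * prior + lik_not * (1 - prior)   # 0.38

print(round(posterior(prior, likelihood, evidence), 4))  # ≈ 0.6316
```

Note that the hypothesis that predicted the evidence well (0.8 vs. 0.2) more than doubled its probability, from 0.30 to about 0.63.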

Historical Origins

The theorem's history is entwined with the problem of "inverse probability" — reasoning backward from observed effects to probable causes. While direct probability is straightforward (given a fair coin, what is the probability of seven heads in ten flips?), inverse probability is subtle (given seven heads in ten flips, what can we conclude about the coin's fairness?).

1763

Thomas Bayes' manuscript, An Essay towards solving a Problem in the Doctrine of Chances, is published posthumously by Richard Price in the Philosophical Transactions of the Royal Society.

1774

Pierre-Simon Laplace independently discovers and substantially generalizes the result, applying inverse probability to celestial mechanics, demographic estimation, and the theory of errors.

1920s–1930s

The frequentist revolution: R. A. Fisher, Jerzy Neyman, and Egon Pearson develop maximum likelihood, hypothesis testing, and confidence intervals. Inverse probability falls out of mainstream use.

1939–1961

Harold Jeffreys publishes Theory of Probability. L. J. Savage's The Foundations of Statistics (1954) provides axiomatic foundations for subjective probability. Dennis Lindley begins systematic development of Bayesian inference.

1990s

The computational revolution. Gelfand and Smith (1990) demonstrate Gibbs sampling. BUGS software makes MCMC accessible. Complex hierarchical models can be fitted routinely. The Bayesian resurgence begins.

2010s–present

Stan, probabilistic programming, and variational inference bring Bayesian methods to every scientific discipline and industry. Bayesian deep learning, causal inference, and adaptive clinical trials are active frontiers.

The Mechanics of Updating

What makes Bayes' Theorem so powerful is its iterative nature. The posterior from one round of evidence becomes the prior for the next, creating a principled mechanism for sequential learning. With each new observation, beliefs are refined — sharpened when evidence is consistent, and pulled in new directions when it is surprising.

The Odds Form

Posterior Odds  =  Likelihood Ratio  ×  Prior Odds

O(H | E)  =  [ P(E | H) / P(E | ¬H) ]  ×  O(H)

The odds form is especially intuitive. The likelihood ratio — how much more probable the evidence is under the hypothesis than under its negation — quantifies the diagnostic strength of the evidence. Evidence equally likely under both hypotheses (likelihood ratio = 1) does not shift the odds. Evidence ten times more likely under H multiplies the odds by ten.

Taking logarithms converts multiplication to addition. Each piece of evidence contributes a "weight of evidence" measured in units that Alan Turing called bans and decibans during his cryptanalytic work at Bletchley Park — one of the earliest and most consequential applications of sequential Bayesian reasoning.
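A sketch of the odds-form update and the deciban scale (the helper names here are ours, not standard library functions):

```python
import math

def update_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes: posterior odds = likelihood ratio * prior odds."""
    return likelihood_ratio * prior_odds

def decibans(likelihood_ratio):
    """Turing's weight of evidence: 10 * log10(LR)."""
    return 10 * math.log10(likelihood_ratio)

# Prior odds 1:99; evidence ten times more likely under H.
post = update_odds(1 / 99, 10.0)
print(round(post, 4))    # ≈ 0.101 — odds shift from 1:99 to about 1:9.9

# On the log scale, independent pieces of evidence simply add.
print(decibans(10.0))                    # 10.0 decibans each
print(decibans(10.0) + decibans(10.0))   # two such pieces: 20.0 decibans
```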

Why the Prior Matters Less Than You Think

A common objection to Bayesian reasoning is that the prior is "subjective." But under mild regularity conditions, agents with different priors who observe the same data will see their posteriors converge as the sample size grows. The data eventually overwhelm the prior — a property known as Bayesian consistency.
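This convergence is easy to see numerically. The sketch below uses the conjugate Beta-Binomial model for a coin's bias — a standard textbook setup chosen for illustration, not something discussed above — where a Beta(a, b) prior updated on h heads in n flips has posterior mean (a + h) / (a + b + n):

```python
def beta_posterior_mean(a, b, heads, flips):
    """Posterior mean of a coin's bias under a Beta(a, b) prior,
    after observing `heads` successes in `flips` trials."""
    return (a + heads) / (a + b + flips)

# Two agents with sharply different priors about P(heads):
# an optimist, Beta(20, 2) (prior mean ~0.91), and a
# skeptic, Beta(2, 20) (prior mean ~0.09).
for flips in (10, 100, 10_000):
    heads = round(0.6 * flips)          # data from a coin with 60% heads
    p1 = beta_posterior_mean(20, 2, heads, flips)
    p2 = beta_posterior_mean(2, 20, heads, flips)
    print(flips, round(p1, 3), round(p2, 3))
```

At 10 flips the two posteriors are far apart; by 10,000 flips both sit within a fraction of a percent of 0.6. The data overwhelm the prior.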

In finite samples, the prior does matter. But this is a feature, not a bug: the prior makes assumptions explicit and testable, rather than hiding them in the choice of estimator or test.

The Normalizing Constant

The denominator P(E) — the marginal likelihood or evidence — is obtained by summing (or integrating) the numerator over all hypotheses:

Marginal Likelihood (Discrete)

P(E)  =  Σᵢ P(E | Hᵢ) · P(Hᵢ)

Marginal Likelihood (Continuous)

P(E)  =  ∫ P(E | θ) · π(θ) dθ

For simple problems this integral is easy. For complex models — those with high-dimensional parameter spaces, hierarchical structure, or non-conjugate priors — computing P(E) is the central computational challenge of Bayesian statistics. The development of MCMC, variational inference, nested sampling, and other approximation methods has been driven largely by the need to evaluate or circumvent this integral.
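In the discrete case the computation is just a weighted sum. A minimal sketch with hypothetical numbers for three competing hypotheses:

```python
def marginal_likelihood(likelihoods, priors):
    """Discrete evidence term: P(E) = sum_i P(E|H_i) * P(H_i)."""
    return sum(l * p for l, p in zip(likelihoods, priors))

# Hypothetical three-hypothesis problem (numbers for illustration only)
priors      = [0.5, 0.3, 0.2]   # P(H_i)
likelihoods = [0.1, 0.4, 0.7]   # P(E | H_i)

p_e = marginal_likelihood(likelihoods, priors)           # 0.31
posteriors = [l * p / p_e for l, p in zip(likelihoods, priors)]

print(round(p_e, 3))                        # 0.31
print([round(x, 3) for x in posteriors])    # sums to 1 by construction
```

Dividing by P(E) is what guarantees the posteriors sum to one — which is why this term is also called the normalizing constant.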

When comparing two models, the marginal likelihood takes on a starring role. The Bayes factor — the ratio of marginal likelihoods — measures the evidence one model provides over another. Because a complex model must spread its prior predictions over a wider range of possible data, it pays an automatic complexity penalty: the Bayesian Occam's razor.

A Worked Example: Medical Diagnosis

A screening test for a disease has 99% sensitivity and 99% specificity. The disease affects 1 in 1,000 people. A patient tests positive. What is the probability they have the disease?

Given:

P(Disease) = 0.001     P(+Test | Disease) = 0.99     P(+Test | Healthy) = 0.01

Applying Bayes' Theorem:

P(Disease | +Test)  =  0.99 × 0.001  /  (0.99 × 0.001 + 0.01 × 0.999)
                    =  0.00099 / 0.01089
                    ≈  9.1%

Despite 99% accuracy in both directions, a positive result means only a 9% chance of disease. The low base rate (1 in 1,000) means false positives from the healthy population vastly outnumber true positives. This result — which surprises most people, including many physicians — is a direct consequence of Bayes' Theorem.

Applications

Spam Filtering

Paul Graham's 2002 essay "A Plan for Spam" introduced Bayesian spam filtering, computing P(spam | words) for each email. The Naive Bayes classifier proved remarkably effective, establishing Bayesian reasoning as a practical tool in consumer technology.
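A toy version of the idea — a from-scratch sketch, not Graham's actual algorithm, with made-up word frequencies — treats each word as independent evidence and accumulates log-odds:

```python
import math

# Illustrative per-word likelihoods: P(word | spam) and P(word | ham)
p_word_spam = {"free": 0.30, "money": 0.25, "meeting": 0.01}
p_word_ham  = {"free": 0.02, "money": 0.03, "meeting": 0.20}

def spam_posterior(words, p_spam=0.5):
    """Naive Bayes: assume words are independent given the class,
    so each word's log likelihood ratio simply adds to the log-odds."""
    log_odds = math.log(p_spam / (1 - p_spam))
    for w in words:
        log_odds += math.log(p_word_spam[w] / p_word_ham[w])
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(round(spam_posterior(["free", "money"]), 3))   # 0.992
print(round(spam_posterior(["meeting"]), 3))         # 0.048
```

The independence assumption is clearly false for real text — hence "naive" — yet the classifier works well in practice.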

Search and Rescue

Bayesian search theory maintains a posterior probability map over possible target locations. After each search pass, the map is updated based on detection probabilities and areas covered. This approach was used to locate the USS Scorpion (1968), Steve Fossett's crash site (2008), and Air France Flight 447's black boxes (2011).

Machine Learning

Bayesian methods underpin Gaussian processes, Bayesian neural networks, Bayesian optimization, and latent Dirichlet allocation. The posterior provides not only predictions but calibrated confidence estimates — critical for safety-sensitive applications.

Forensic Evidence

DNA evidence and fingerprint evaluation are fundamentally Bayesian problems. The likelihood ratio is the correct measure of evidential strength. Confusing the likelihood with the posterior produces the "prosecutor's fallacy," a documented source of wrongful convictions.

"The theory of probabilities is at bottom nothing but common sense reduced to calculus; it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which they are often unable to account." — Pierre-Simon Laplace, Théorie analytique des probabilités (1812)

Philosophical Significance

Bayes' Theorem is a straightforward consequence of the axioms of probability — it is not in dispute as a mathematical identity. What is debated is whether probability should be interpreted as degree of belief (the Bayesian view) or as long-run frequency (the frequentist view).

Cox's theorem (1946) showed that any consistent system for reasoning under uncertainty must be isomorphic to probability theory. De Finetti's representation theorem showed that exchangeability implies a latent parameter with a prior distribution. Together, they suggest Bayesian updating is not one method among many — it is the unique rational method.

These results do not settle all debates. The choice of prior remains contested, and there are settings where Bayesian methods perform poorly. But as a framework for coherent reasoning under uncertainty, Bayes' Theorem has no rival.

Example

You're a college student and you've been feeling under the weather. You have a sore throat. Two illnesses are on your mind: a common cold, which 20% of students currently have, and strep throat, which only 2% of students have. You know that 40% of people with a cold get a sore throat, while 90% of people with strep throat get one. The question is: given that you have a sore throat, how likely is it that you have strep vs. a cold?

Setting Up the Problem

First, identify what you know. These are the priors — the base rates of each illness before considering the sore throat evidence:

Priors (Base Rates):

P(Cold) = 0.20
P(Strep) = 0.02
P(Neither) = 0.78

Next, the likelihoods — how probable a sore throat is under each hypothesis:

Likelihoods:

P(Sore Throat | Cold) = 0.40
P(Sore Throat | Strep) = 0.90
P(Sore Throat | Neither) = 0.02

Computing P(Sore Throat)

Before we can apply Bayes' Theorem, we need the total probability of having a sore throat — regardless of cause. This is the evidence term, computed by summing over all possibilities:

Total Probability of Evidence:

P(Sore Throat)  =  P(ST|Cold)·P(Cold) + P(ST|Strep)·P(Strep) + P(ST|Neither)·P(Neither)
                =  (0.40 × 0.20) + (0.90 × 0.02) + (0.02 × 0.78)
                =  0.080 + 0.018 + 0.0156
                =  0.1136

So about 11.4% of students are walking around with a sore throat right now.

Applying Bayes' Theorem

Now we update each hypothesis:

Posterior — Cold:

P(Cold | Sore Throat)  =  P(ST|Cold) · P(Cold) / P(ST)
                       =  (0.40 × 0.20) / 0.1136
                       ≈  70.4%

Posterior — Strep:

P(Strep | Sore Throat)  =  P(ST|Strep) · P(Strep) / P(ST)
                        =  (0.90 × 0.02) / 0.1136
                        ≈  15.8%

Posterior — Neither:

P(Neither | Sore Throat)  =  (0.02 × 0.78) / 0.1136  ≈  13.7%
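The whole example fits in a few lines of Python:

```python
# The sore-throat example as code
priors = {"cold": 0.20, "strep": 0.02, "neither": 0.78}
lik    = {"cold": 0.40, "strep": 0.90, "neither": 0.02}  # P(sore throat | H)

# Evidence term: total probability of a sore throat
p_st = sum(lik[h] * priors[h] for h in priors)           # 0.1136

# One Bayes update per hypothesis
posteriors = {h: lik[h] * priors[h] / p_st for h in priors}

print(round(p_st, 4))
print({h: round(p, 3) for h, p in posteriors.items()})
# cold ≈ 0.704, strep ≈ 0.158, neither ≈ 0.137
```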

What the Numbers Tell Us

Before noticing the sore throat, you'd have put only a 2% chance on strep. After, that jumps to nearly 16% — an eightfold increase. The cold, already common, becomes the dominant explanation at 70%. The sore throat was evidence, and Bayes' Theorem told you exactly how much to update each belief.

The Key Insight

Notice that strep throat's probability multiplied by roughly 8× (from 2% to 15.8%) while the cold only went from 20% to 70.4% — about a 3.5× increase. This is because a sore throat is more diagnostic of strep (90% of strep patients get one) than of a cold (only 40%). The likelihood ratio determines the strength of the update, while the prior determines the starting point. Even strong evidence can't overcome a very low base rate — which is exactly why strep, despite being powerfully associated with sore throats, is still not the most probable explanation.

If you then took a rapid strep test and it came back positive, you could run Bayes' Theorem again using today's posteriors as tomorrow's priors. Each new piece of evidence refines the picture — that's the iterative power of Bayesian updating.
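That second round is mechanical once the update is written as a function. In this sketch the rapid test's characteristics (86% sensitivity, 5% chance of a positive without strep) are illustrative assumptions, not real test data:

```python
def update(priors, likelihoods):
    """One round of Bayesian updating over competing hypotheses."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Round 1: the sore throat (numbers from the example above)
beliefs = {"cold": 0.20, "strep": 0.02, "neither": 0.78}
beliefs = update(beliefs, {"cold": 0.40, "strep": 0.90, "neither": 0.02})

# Round 2: a positive rapid strep test. The likelihoods here
# (86% sensitivity, 5% false-positive rate) are hypothetical.
beliefs = update(beliefs, {"cold": 0.05, "strep": 0.86, "neither": 0.05})

print({h: round(p, 3) for h, p in beliefs.items()})
# strep is now the leading hypothesis (~76%)
```

Yesterday's posterior is today's prior: the same function handles every round.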

