In 1962, Allan Birnbaum published a result that sent a shockwave through the foundations of statistics. He proved that two principles almost universally accepted by statisticians — the sufficiency principle and the conditionality principle — logically entail a third principle, the likelihood principle, which most frequentist methods violate. The theorem forced a stark choice: abandon one of two seemingly innocuous premises, or accept a conclusion that undermines the foundations of significance testing, confidence intervals, and p-values.
The theorem remains one of the most discussed and debated results in statistical foundations. It is not a theorem about how to do statistics; it is a theorem about what evidence means — and its implications reach directly into the heart of the Bayesian-frequentist divide.
In symbols, the likelihood principle at the heart of the theorem reads:
Ev(E, x) = Ev(E′, x′) whenever L(θ; x) ∝ L(θ; x′)
where:
Ev(E, x) → the evidential meaning of outcome x from experiment E
L(θ; x) → the likelihood function for parameter θ given data x
∝ → proportional as functions of θ
The Three Principles
Birnbaum's theorem connects three principles about statistical evidence. To state them precisely, we need the concept of an experiment E — a specification of the sample space, the family of probability distributions indexed by a parameter θ, and the observation procedure — and an outcome x of that experiment.
The Sufficiency Principle (SP)
If T is a sufficient statistic for θ in experiment E, and two outcomes x and x′ satisfy T(x) = T(x′), then x and x′ carry the same evidence about θ.
Intuition: A sufficient statistic captures everything the data say about the parameter. Two datasets yielding the same sufficient statistic are evidentially equivalent.
The sufficiency principle is nearly uncontroversial. Fisher championed it. Neyman and Pearson relied on it. It is built into the foundations of both Bayesian and frequentist statistics. A statistician who uses T(x) = (sample mean, sample variance) to summarize a normal sample is implicitly invoking sufficiency — the individual data points, once summarized, add nothing about the parameters.
The Conditionality Principle (CP)
If an experiment E is a mixture of component experiments — say, a coin is flipped to determine whether experiment E₁ or E₂ is performed — then the evidential meaning of the outcome depends only on the component experiment that was actually performed, not on the experiment that might have been performed but was not.
If E is a mixture of E₁ and E₂, and the coin lands on E₁, yielding outcome x₁, then:
Ev(E, (1, x₁)) = Ev(E₁, x₁)
Intuition: The experiment you didn't perform is irrelevant to the evidence. Only the experiment actually conducted matters.
The conditionality principle also seems difficult to deny. If a fair coin determines whether you use a highly precise instrument or a crude one, and the coin selects the precise instrument, your evidence should be evaluated as if you had simply chosen the precise instrument from the start. The crude instrument sitting unused in the drawer cannot affect what the precise instrument tells you.
Even many frequentists accept conditionality. Fisher explicitly endorsed it through his theory of ancillary statistics — statistics that carry no information about θ but affect the precision of the inference. Fisher argued that inference should be conducted conditional on the observed value of any ancillary statistic.
The Likelihood Principle (LP)
Two outcomes from possibly different experiments that generate proportional likelihood functions carry identical evidence about the parameter.
If L(θ; x) ∝ L(θ; x′) as functions of θ, then Ev(E, x) = Ev(E′, x′).
Intuition: All the evidence about θ is contained in the likelihood function. How the data were collected — the stopping rule, the sample space, the outcomes that could have occurred but did not — is irrelevant once the likelihood function is known.
The likelihood principle is far more contentious. It directly contradicts standard frequentist practice, in which the sampling distribution — the set of all possible outcomes and their probabilities — plays a central role. Under the likelihood principle, p-values are incoherent, because they depend on tail probabilities (outcomes more extreme than what was observed), which the likelihood principle declares irrelevant.
The Proof: An Elegant Construction
Birnbaum's proof is remarkably concise. Its power lies in a clever construction that bridges the three principles through a mixture experiment.
Suppose experiments E₁ and E₂ both involve the same parameter θ, and suppose outcomes x₁ from E₁ and x₂ from E₂ yield proportional likelihood functions: L₁(θ; x₁) ∝ L₂(θ; x₂). We want to show that Ev(E₁, x₁) = Ev(E₂, x₂).
Construct a mixture experiment E*: flip a fair coin. If heads, perform E₁. If tails, perform E₂. Now consider the statistic T defined on outcomes of E* by:
T(1, x) = x (if E₁ was selected, T is the outcome itself)
T(2, x) = x₁* (if E₂ was selected and L₂(θ;x) ∝ L₁(θ;x₁*), set T = x₁*)
That is, T maps any E₂-outcome to whichever E₁-outcome has a proportional likelihood.
The key insight is that T is a sufficient statistic for θ in the mixture experiment E*. This can be verified directly: the only outcomes mapped to T = x₁ are (1, x₁) and (2, x₂), and P(outcome = (1, x₁) | T = x₁) = ½L₁(θ; x₁) / [½L₁(θ; x₁) + ½L₂(θ; x₂)]. Since L₂(θ; x₂) = c · L₁(θ; x₁) for some constant c > 0, this probability equals 1/(1 + c) for every θ. The conditional distribution of the full outcome given T therefore does not depend on θ, which is precisely the definition of sufficiency.
Now chain the two principles together:
Ev(E₁, x₁) = Ev(E*, (1, x₁)) (by Conditionality Principle)
= Ev(E*, (2, x₂)) (by Sufficiency Principle, since T(1, x₁) = T(2, x₂))
= Ev(E₂, x₂) (by Conditionality Principle)
Therefore: Ev(E₁, x₁) = Ev(E₂, x₂) — the Likelihood Principle. ∎
The proof is three lines. The conditionality principle equates the evidence from each component experiment with the evidence from the mixture. The sufficiency principle equates two mixture outcomes that map to the same sufficient statistic. Together, they establish that any two outcomes with proportional likelihoods carry the same evidence — regardless of which experiment produced them.
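The sufficiency of T can also be checked numerically. The following sketch (Python, using for concreteness the binomial/negative-binomial pair from the coin example in the next section) computes the conditional probability of having landed in component E₁ given T, and shows that it is the same for every value of θ:

```python
from math import comb

# Likelihoods of the two component experiments for the same data
# (9 heads, 3 tails); they are proportional as functions of theta.
def lik1(theta):            # E1: binomial, 9 heads in 12 fixed flips
    return comb(12, 9) * theta**9 * (1 - theta)**3

def lik2(theta):            # E2: negative binomial, 9 heads before the 3rd tail
    return comb(11, 2) * theta**9 * (1 - theta)**3

# A fair coin picks the component, so each branch gets weight 1/2.
for theta in (0.2, 0.5, 0.8):
    p_comp1_given_T = 0.5 * lik1(theta) / (0.5 * lik1(theta) + 0.5 * lik2(theta))
    print(theta, round(p_comp1_given_T, 6))   # prints 0.8 for every theta
```

The ratio is constant in θ because the two likelihoods differ only by the factor c = C(11,2)/C(12,9) = 1/4, so the conditional probability is 1/(1 + c) = 0.8 regardless of the parameter.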
Why This Matters: The Stopping Rule Principle
The most practically consequential implication of the likelihood principle is the stopping rule principle: the reason you stopped collecting data is irrelevant to the evidence the data provide.
Experiment E₁: Flip a coin exactly 12 times. Observe 9 heads and 3 tails.
Experiment E₂: Flip a coin until 3 tails appear. This also yields 9 heads and 3 tails.
The likelihood functions are L₁(θ) = C(12,9)·θ⁹(1−θ)³ and L₂(θ) = C(11,2)·θ⁹(1−θ)³. These are proportional as functions of θ (they differ only in the combinatorial constant). By the likelihood principle, the evidence about θ is identical.
Yet the frequentist p-values differ. Under E₁ (binomial), the one-sided p-value for testing θ = 0.5 is about 0.073. Under E₂ (negative binomial), it is about 0.033. The same data yield different conclusions — solely because of the experimenter's intention about when to stop.
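These tail probabilities are easy to verify. A minimal check in Python, using exact binomial sums rather than any statistical library:

```python
from math import comb

# One-sided p-values for H0: theta = 0.5 against H1: theta > 0.5.

# E1: fixed design, 12 flips; "at least as extreme" means 9 or more heads.
p_binomial = sum(comb(12, k) for k in range(9, 13)) / 2**12

# E2: flip until the 3rd tail; 9 or more heads means the 3rd tail arrives on
# flip 12 or later, i.e. at most 2 tails (at least 9 heads) in the first 11 flips.
p_neg_binomial = sum(comb(11, k) for k in range(9, 12)) / 2**11

print(round(p_binomial, 3), round(p_neg_binomial, 3))   # 0.073 0.033
```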
For a Bayesian, this is no surprise. The posterior distribution depends on the data only through the likelihood, and the two likelihoods are proportional, so the posteriors are identical. The stopping rule is irrelevant. Birnbaum's theorem shows that this Bayesian position is not an arbitrary choice — it is the logical consequence of sufficiency and conditionality.
Reactions and Controversy
Birnbaum's theorem provoked intense debate that continues to this day. The reactions fall into several camps:
The Bayesian Response
Bayesians regarded the theorem as a vindication. Since Bayesian inference automatically satisfies the likelihood principle — the posterior depends on the data only through the likelihood — Birnbaum's result showed that any statistician who accepted sufficiency and conditionality was implicitly committed to a Bayesian-compatible framework. Dennis Lindley, I. J. Good, and others cited the theorem as evidence that frequentist methods were internally inconsistent.
The Frequentist Response
Frequentists responded in several ways. Some, like Deborah Mayo, challenged the proof itself, arguing that Birnbaum's notion of "evidential equivalence" is ambiguous and that the proof conflates different senses of the term. Others accepted the theorem but rejected the conditionality principle in its strong form — arguing that it applies only to certain types of ancillary statistics, not to arbitrary mixture experiments. Still others accepted sufficiency but maintained that the version of conditionality used in the proof is subtly different from Fisher's original principle.
Birnbaum's Own Ambivalence
Perhaps most remarkably, Birnbaum himself remained conflicted. Despite proving the theorem, he continued to explore frequentist concepts of evidence and never fully embraced the Bayesian position. In later writings he questioned whether the likelihood principle was as constraining as it appeared, and he explored alternative formulations that might preserve some role for sampling distributions.
Timeline
Allan Birnbaum publishes "On the Foundations of Statistical Inference" in the Journal of the American Statistical Association (1962), presenting the theorem and its proof.
The paper appears with extensive discussion by L. J. Savage, George Barnard, Oscar Kempthorne, David Cox, and others — reflecting the immediate recognition of its foundational importance.
George Barnard, who had been sympathetic to the likelihood principle, begins to retreat from its strong form, citing practical difficulties with pure likelihood methods.
Birnbaum dies at age 53, leaving the foundational questions he raised still unresolved.
Deborah Mayo publishes detailed critiques arguing that Birnbaum's proof is flawed — specifically, that the sufficiency and conditionality principles used in the proof are stronger than the "weak" versions most statisticians accept. Michael Evans and others offer rebuttals.
A special issue of Statistical Science revisits Birnbaum's theorem on its 50th anniversary, featuring contributions from Mayo, Evans, Martin, and others — demonstrating the theorem's continued relevance to foundational debates.
Formal Statement
Let E = (𝒳, {Pθ : θ ∈ Θ}) denote an experiment with sample space 𝒳 and a parametric family of distributions indexed by θ. An evidence function Ev assigns to each (experiment, outcome) pair an element of some evidence space, representing the evidential meaning of that outcome.
Sufficiency Principle (SP) If T is a sufficient statistic for θ in E and two outcomes satisfy T(x) = T(x′),
then Ev(E, x) = Ev(E, x′).
Conditionality Principle (CP) If E* is a mixture of E₁ and E₂ with mixing variable J independent of θ,
and J = j is observed, then Ev(E*, (j, x)) = Ev(Eⱼ, x).
Likelihood Principle (LP) If L_E(θ; x) = c · L_{E′}(θ; x′) for all θ ∈ Θ and some c > 0,
then Ev(E, x) = Ev(E′, x′).
Birnbaum's theorem states: SP + CP ⟹ LP. The converse implications (LP ⟹ SP and LP ⟹ CP) are straightforward, so in fact SP + CP ⟺ LP.
Implications for Frequentist Methods
If the likelihood principle holds, several standard frequentist procedures become problematic:
P-values depend on the probability of outcomes "more extreme" than the observed data. But what counts as extreme depends on the sample space, which the likelihood principle declares irrelevant. Two experiments with proportional likelihoods but different sample spaces yield different p-values — a violation of the likelihood principle.
Confidence intervals are constructed to have a specified coverage probability across repeated sampling. Their evidential interpretation for the data at hand is not guaranteed. A 95% confidence interval does not mean there is a 95% probability that the parameter lies within it — that would be a Bayesian credible interval.
Unbiasedness is a property of an estimator across all possible samples. It says nothing about the quality of the estimate from the sample actually obtained. The likelihood principle focuses entirely on the observed sample.
Power of a test depends on the probability of rejection under alternative hypotheses — a calculation over the sample space that the likelihood principle renders irrelevant to evidential assessment.
Connection to Bayesian Inference
Bayesian inference automatically satisfies the likelihood principle. The posterior distribution is
π(θ | x) = L(θ; x) · π(θ) / ∫ L(θ′; x) · π(θ′) dθ′
where π(θ) is the prior. The normalizing integral in the denominator does not depend on θ, so any constant factor in the likelihood cancels.
Consequence If L(θ; x) ∝ L(θ; x′) for all θ, then π(θ | x) = π(θ | x′).
The posterior depends on the data only through the likelihood function.
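A minimal sketch of this consequence, assuming a uniform prior and a simple grid approximation (the numbers reuse the 9-heads/3-tails example): the binomial and negative-binomial likelihoods differ only by a constant factor, and the normalization step removes it, so the two posteriors coincide.

```python
from math import comb

thetas = [i / 200 for i in range(1, 200)]     # grid over (0, 1)
prior = [1.0] * len(thetas)                   # uniform prior (an assumption of this sketch)

def posterior(likelihood):
    # Multiply by the prior and normalize; any constant in the likelihood cancels here.
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

lik_fixed = [comb(12, 9) * t**9 * (1 - t)**3 for t in thetas]       # binomial design
lik_sequential = [comb(11, 2) * t**9 * (1 - t)**3 for t in thetas]  # stop-at-3-tails design

post_a = posterior(lik_fixed)
post_b = posterior(lik_sequential)

print(max(abs(a - b) for a, b in zip(post_a, post_b)))   # ~0: the two posteriors are identical
```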
This is why Birnbaum's theorem is often cited as foundational support for Bayesian methods. It shows that the Bayesian approach is not just one valid framework among many — it is the framework that naturally emerges when two basic evidential principles are taken seriously.
The theorem does not, by itself, establish Bayesian inference as the only valid approach. One could accept the likelihood principle and use pure likelihood methods (maximum likelihood estimation, likelihood ratios) without introducing priors. But the theorem narrows the field considerably, excluding the vast majority of frequentist procedures from contention.
Allan Birnbaum
Allan Birnbaum (1923–1976) was a mathematical statistician at New York University's Courant Institute. His foundational work was driven by a desire to find a rigorous, unified theory of statistical evidence that could bridge the Bayesian-frequentist divide. The theorem that bears his name was his most influential contribution, though he also made important contributions to the theory of statistical confidence and to the concept of evidential interpretation.
Birnbaum was a careful and philosophically sophisticated thinker. His 1962 paper is notable not only for the theorem itself but for the clarity with which he articulated the foundational issues at stake. He understood that the theorem posed a genuine dilemma for mainstream statistics, and he spent the remaining years of his career exploring its consequences without ever fully resolving the tensions it exposed.
"The likelihood principle is not a rule telling us how to use likelihood functions in practice. It is a principle about what constitutes statistical evidence — a principle that places severe constraints on any theory of inference that claims to be based on evidence." — Allan Birnbaum, "On the Foundations of Statistical Inference" (1962)
Modern Reassessments
The debate over Birnbaum's theorem remains active. Recent work has focused on several questions:
Is the proof valid? Mayo (2010, 2014) argues that Birnbaum's proof uses a strong version of sufficiency (applying across mixture experiments) that goes beyond what is ordinarily accepted. Evans (2013) and others have defended the proof, arguing that the standard formulation of the sufficiency principle is all the argument requires.
Are the principles correctly formulated? Some authors distinguish between "evidential equivalence" (same evidential import) and "inferential equivalence" (leads to the same inferences). The proof may trade on ambiguity between these concepts.
What is the scope of the conditionality principle? Some statisticians accept conditionality for "natural" ancillary statistics but not for the artificial mixture experiments Birnbaum constructs. This distinction, however, is difficult to formalize.
Whether one accepts or rejects Birnbaum's theorem, it has permanently sharpened the foundational debate. It forces every statistician to confront a question that cannot be avoided: what, precisely, do we mean by "evidence"? Any answer that accepts both sufficiency and conditionality leads inexorably to the likelihood principle — and from there, the path to Bayesian inference is short.
Example: Clinical Trial Design — Two Different Stopping Rules
A pharmaceutical company tests a new drug for migraine relief. Two biostatisticians design the same study differently:
Statistician A plans a fixed-sample trial: enroll exactly 100 patients and observe how many experience relief. She observes 65 successes out of 100.
Statistician B uses a sequential design: keep enrolling patients until 35 patients have failed to respond. She happens to observe the 35th non-response at exactly the 100th patient, having seen 65 successes along the way.
The Same Data, Different Conclusions?
Both statisticians observed the identical outcome: 65 successes in 100 patients. Yet under frequentist analysis, the p-values differ because the sampling distributions differ — one is binomial, the other negative binomial.
Both designs yield the same likelihood, proportional to θ⁶⁵(1 − θ)³⁵ — all information about θ is captured here.
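A quick check of the two analyses, again in Python with exact binomial tail sums (the patient counts are the hypothetical ones from this example): the same 65-out-of-100 outcome produces different one-sided p-values under the two designs, while the likelihood, and hence any Bayesian posterior, is identical.

```python
from math import comb

# One-sided p-values for H0: theta = 0.5 against H1: theta > 0.5,
# given 65 responders and 35 non-responders (hypothetical numbers from the example).

# Statistician A: fixed n = 100; "at least as extreme" = 65 or more responders.
p_fixed = sum(comb(100, k) for k in range(65, 101)) / 2**100

# Statistician B: enroll until the 35th non-responder; 65 or more responders means
# the 35th non-response arrives at patient 100 or later, i.e. at most 34
# non-responses (at least 65 responders) among the first 99 patients.
p_sequential = sum(comb(99, k) for k in range(65, 100)) / 2**99

print(f"fixed design:      p = {p_fixed:.5f}")
print(f"sequential design: p = {p_sequential:.5f}")
# Different p-values from the same data; the likelihood theta^65 * (1 - theta)^35
# is the same under both designs.
```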
What Birnbaum's Theorem Says
Birnbaum's theorem states that if you accept two seemingly innocent principles — sufficiency (don't throw away relevant information) and conditionality (evaluate the experiment that was actually performed) — then you must accept the likelihood principle: the evidence about θ depends only on the likelihood function, not the stopping rule.
This result strikes at the heart of frequentist methods. If the stopping rule doesn't matter for evidence, then p-values — which depend on what could have happened but didn't — are incorporating irrelevant information. A Bayesian who uses the likelihood function directly gets the same posterior regardless of which stopping rule generated the data. Birnbaum's theorem says this isn't just a Bayesian preference; it's a logical consequence of principles that most statisticians already accept.