Calibrated probability assessment is the property that an agent's stated probabilities align with empirical frequencies over repeated trials. If a weather forecaster assigns 80% probability to rain on 100 different occasions, calibration means it should rain on approximately 80 of those days. Calibration is a necessary condition for good probability assessment — but not a sufficient one, since a forecaster who always says "the climatological base rate" is perfectly calibrated but provides no useful information.
The concept is central to Bayesian epistemology, where degrees of belief are meant to function as genuine probabilities. If an agent's credences are not calibrated, they fail as a guide to action: decisions based on miscalibrated probabilities will systematically misallocate resources, misjudge risks, and lead to poor expected outcomes.
Calibration Error: CE = Σₖ (nₖ/N) · (p̄ₖ − ōₖ)²
where p̄ₖ is the mean predicted probability in bin k, ōₖ is the observed frequency in bin k, and nₖ is the number of predictions in bin k.
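As a concrete illustration, the Python sketch below computes this binned calibration error from a list of stated probabilities and 0/1 outcomes. The function name and the choice of ten equal-width bins are illustrative assumptions, not part of any standard library.

```python
import numpy as np

def calibration_error(probs, outcomes, n_bins=10):
    """Binned calibration error: sum over bins of (n_k / N) * (mean prob - observed freq)^2."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; predictions of exactly 1.0 fall in the last bin.
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ce = 0.0
    for k in range(n_bins):
        mask = bin_ids == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_bar = probs[mask].mean()     # mean predicted probability in bin k
        o_bar = outcomes[mask].mean()  # observed frequency in bin k
        ce += (n_k / len(probs)) * (p_bar - o_bar) ** 2
    return ce
```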
The Brier Score and Proper Scoring Rules
Calibration is one component of the Brier score, which can be decomposed into calibration, resolution, and uncertainty (the Murphy decomposition). The Brier score is a strictly proper scoring rule: its expected value is minimized precisely when the forecaster reports their true beliefs, so an agent evaluated by it, Bayesian or otherwise, has no incentive to misreport probabilities.
Murphy Decomposition: BS = Calibration − Resolution + Uncertainty
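The identity can be checked numerically. The sketch below groups forecasts by their distinct stated values, the grouping that makes the decomposition exact for binary outcomes; the function name is illustrative rather than from the text.

```python
import numpy as np

def murphy_decomposition(probs, outcomes):
    """Return (Brier, calibration, resolution, uncertainty) with BS = CAL - RES + UNC."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n, base_rate = len(probs), outcomes.mean()

    calibration = resolution = 0.0
    for p in np.unique(probs):               # one group per distinct forecast value
        mask = probs == p
        n_k, o_bar = mask.sum(), outcomes[mask].mean()
        calibration += (n_k / n) * (p - o_bar) ** 2          # penalty when p differs from o_bar
        resolution += (n_k / n) * (o_bar - base_rate) ** 2   # reward for discriminating cases
    uncertainty = base_rate * (1.0 - base_rate)              # set by the events, not the forecaster

    brier = np.mean((probs - outcomes) ** 2)
    assert np.isclose(brier, calibration - resolution + uncertainty)
    return brier, calibration, resolution, uncertainty
```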
Other proper scoring rules include the logarithmic score (which penalizes confident wrong predictions more severely) and the continuous ranked probability score (CRPS) for distributional forecasts. All proper scoring rules incentivize calibration, but they differ in how they trade off calibration against sharpness (the concentration of predictive distributions).
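For comparison, here is a minimal sketch of two of these alternatives: the logarithmic score for a binary forecast and the closed-form CRPS for a Gaussian predictive distribution. The function names are my own; the CRPS expression is the standard closed form for a normal forecast.

```python
import numpy as np
from scipy.stats import norm

def log_score(p, outcome):
    """Logarithmic score (negative log-likelihood) for a binary forecast."""
    return -np.log(p if outcome == 1 else 1.0 - p)

def crps_gaussian(mu, sigma, y):
    """CRPS of a Gaussian predictive distribution N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# A confident miss is punished far more harshly by the log score than by the Brier score:
print(log_score(0.99, 0))   # about 4.6
print((0.99 - 0) ** 2)      # 0.98, the corresponding Brier contribution
```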
Calibration in Human Judgment
Research in cognitive psychology, beginning with the landmark studies of Lichtenstein, Fischhoff, and Phillips in the 1970s and 1980s, has consistently found that human probability assessments are overconfident. Events assigned 90% confidence occur only about 75% of the time. This miscalibration is robust across domains and expertise levels, though it can be reduced with training.
Philip Tetlock's Good Judgment Project (2011–2015) demonstrated that calibration is a trainable skill. The best forecasters — "superforecasters" — achieved near-perfect calibration through techniques including: frequent updating (Bayesian revision), granular probability scales (distinguishing 60% from 65%), actively seeking disconfirming evidence, and tracking their own accuracy. Their calibration curves closely followed the 45-degree line, and they outperformed intelligence analysts with access to classified data.
Bayesian Coherence and Calibration
From a Bayesian perspective, calibration is closely related to coherence. De Finetti's Dutch book theorem shows that a coherent agent, one whose beliefs satisfy the axioms of probability, cannot be made a sure loser in any system of bets, whereas an incoherent agent can. An agent who is perfectly coherent and updates via Bayes' theorem will, in the long run, be well-calibrated by its own lights; this follows from the martingale convergence theorem applied to Bayesian posterior probabilities.
However, calibration and coherence are not identical. An agent can be perfectly calibrated without being coherent (e.g., a forecaster who uses different models for different events). And an agent can be coherent but poorly calibrated in finite samples if their prior is badly misspecified. The relationship is asymptotic: Bayesian coherence implies calibration in the limit under mild regularity conditions.
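The asymptotic relationship can be illustrated by simulation. In the sketch below, an agent with a Beta(1, 1) prior observes ten Bernoulli outcomes and then states a posterior predictive probability for the next one; because each trial's true frequency is drawn from the agent's own prior (a well-specified model, matching the caveat above), the aggregated forecasts come out calibrated. The setup and all numbers are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

forecasts, outcomes = [], []
for _ in range(100_000):
    theta = rng.uniform()                       # true rain rate, drawn from the agent's prior
    data = rng.random(10) < theta               # ten observed days
    a, b = 1 + data.sum(), 1 + (~data).sum()    # Beta(1,1) prior updated to Beta(a, b)
    forecasts.append(a / (a + b))               # posterior predictive P(rain on day 11)
    outcomes.append(rng.uniform() < theta)      # what actually happens on day 11

forecasts = np.array(forecasts)
outcomes = np.array(outcomes, dtype=float)
for lo in np.arange(0.0, 1.0, 0.1):             # crude reliability table
    mask = (forecasts >= lo) & (forecasts < lo + 0.1)
    if mask.any():
        print(f"{lo:.1f}-{lo + 0.1:.1f}: forecast {forecasts[mask].mean():.2f}, "
              f"observed {outcomes[mask].mean():.2f}")
```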
Calibration Plots and Reliability Diagrams
The standard diagnostic tool for assessing calibration is the reliability diagram (or calibration plot). Predictions are binned by their stated probability, and the observed frequency within each bin is plotted against the bin's average predicted probability. A perfectly calibrated forecaster traces the 45-degree diagonal. Points above the diagonal indicate underconfidence (events happen more often than predicted); points below indicate overconfidence.
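A minimal plotting sketch along these lines, assuming NumPy and matplotlib are available; the bin edges and function name are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, outcomes, n_bins=10):
    """Plot observed frequency against mean predicted probability for each bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    xs, ys = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo) & (probs <= hi)
        if mask.any():
            xs.append(probs[mask].mean())      # bin's average stated probability
            ys.append(outcomes[mask].mean())   # observed frequency in the bin
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(xs, ys, "o-", label="forecaster")
    plt.xlabel("mean predicted probability")
    plt.ylabel("observed frequency")
    plt.legend()
    plt.show()
```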
In machine learning, calibration of neural networks has become a major concern. Guo et al. (2017) showed that modern deep networks, despite high accuracy, are poorly calibrated — they tend to be overconfident. Bayesian neural networks and post-hoc calibration methods (temperature scaling, Platt scaling) attempt to restore calibration, echoing the classical Bayesian argument that averaging over parameter uncertainty produces better-calibrated predictions.
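A sketch of post-hoc temperature scaling under the usual setup: a single scalar T > 0 is fitted on held-out validation logits by minimizing the negative log-likelihood, and test-time probabilities are softmax(logits / T). The helper name and optimizer settings are illustrative, not code from Guo et al.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def fit_temperature(val_logits, val_labels):
    """Fit a single temperature T that minimizes NLL on a validation set.

    val_logits: (n, n_classes) array of pre-softmax scores; val_labels: (n,) integer labels.
    """
    def nll(t):
        log_probs = np.log(softmax(val_logits / t, axis=1) + 1e-12)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

# Calibrated test-time probabilities are softmax(test_logits / T).  Dividing logits by a
# positive constant never changes the argmax, so accuracy is preserved; T > 1 merely softens
# the overconfident probabilities typical of modern networks.
```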
Allan Murphy and colleagues begin systematic study of weather forecast calibration, establishing the framework of proper scoring rules for probability assessment.
Lichtenstein, Fischhoff, and Phillips publish influential studies demonstrating systematic overconfidence in human probability judgments.
Daniel Kahneman and Amos Tversky's work on heuristics and biases highlights miscalibration as a key feature of human reasoning under uncertainty.
Philip Tetlock's Superforecasting demonstrates that calibration is a learnable skill and that the best human forecasters rival statistical models.
"The ideal forecaster is one who, whenever they say 70%, it happens 70% of the time. Calibration is the minimal requirement for rational probability assessment." — Allan H. Murphy, Weather and Forecasting (1993)
Worked Example: Evaluating a Weather Forecaster's Calibration
A weather forecaster makes 20 predictions about whether it will rain, each with a stated credence (probability). We evaluate their calibration using the Brier score and calibration bins.
(0.90,1), (0.80,1), (0.70,1), (0.60,0), (0.50,1),
(0.90,1), (0.80,0), (0.70,1), (0.60,1), (0.50,0),
(0.30,0), (0.20,0), (0.10,0), (0.40,1), (0.85,1),
(0.75,1), (0.65,0), (0.55,1), (0.45,0), (0.35,0)
Step 1: Brier Score
Brier = (1/n) Σ (credence − outcome)²
= (1/20)[(0.9−1)² + (0.8−1)² + ⋯ + (0.35−0)²]
= (1/20)(0.01 + 0.04 + 0.09 + 0.36 + 0.25 + ⋯)
= (1/20)(3.435) ≈ 0.172
Step 2: Calibration Bins
80–100% bin: avg credence 0.85, observed rate 4/5 = 0.80 ✓
60–80% bin: avg credence 0.67, observed rate 4/6 = 0.67 ✓
40–60% bin: avg credence 0.48, observed rate 3/5 = 0.60 ≈
0–40% bin: avg credence 0.24, observed rate 0/4 = 0.00 ✓
The forecaster achieves a Brier score of about 0.172, better than the 0.25 scored by an uninformative forecaster who always says 50%. Calibration is close in the high- and low-confidence bins (80–100%, 60–80%, and 0–40% all match well) but shows mild underconfidence in the 40–60% range, where events occurred 60% of the time against an average credence of 0.48. A perfectly calibrated forecaster would see events occur at exactly the stated rate in every bin, though with only a handful of predictions per bin some deviation is expected from sampling noise alone.
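The figures above can be reproduced directly; the short script below, written for this example, recomputes the Brier score and the four calibration bins from the listed predictions.

```python
import numpy as np

preds = np.array([0.90, 0.80, 0.70, 0.60, 0.50, 0.90, 0.80, 0.70, 0.60, 0.50,
                  0.30, 0.20, 0.10, 0.40, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35])
obs = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
                0, 0, 0, 1, 1, 1, 0, 1, 0, 0])

print("Brier score:", np.mean((preds - obs) ** 2))     # ~0.172

# Bin edges chosen to match the worked example (lower edge inclusive, upper exclusive).
for lo, hi in [(0.8, 1.01), (0.6, 0.8), (0.4, 0.6), (0.0, 0.4)]:
    mask = (preds >= lo) & (preds < hi)
    print(f"{lo:.0%}-{min(hi, 1):.0%}: avg credence {preds[mask].mean():.2f}, "
          f"observed {obs[mask].mean():.2f} ({int(obs[mask].sum())}/{mask.sum()})")
```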