
Bayesian History Matching

Bayesian history matching is an iterative strategy for calibrating complex computer simulators by ruling out regions of input parameter space where the simulator output is implausible given observed data, using emulators and implausibility measures rather than attempting full Bayesian inversion.

I(x) = |z − E[f(x)]| / √(Var[f(x)] + Var[e] + Var[δ]) ≤ c

Many scientific and engineering problems involve complex computer simulators — climate models, reservoir simulators, cosmological codes — with numerous uncertain input parameters. The goal is to identify which parameter settings are consistent with observed real-world data. Full Bayesian calibration, which requires evaluating the posterior distribution over the entire parameter space, is often computationally infeasible when each simulator run takes hours or days.

Bayesian history matching, developed primarily by Ian Vernon, Michael Goldstein, and colleagues at Durham University, takes a different approach. Instead of finding the best parameter values, it systematically eliminates the worst ones. Through successive waves of emulation and implausibility testing, the method progressively shrinks the parameter space to a "not ruled out yet" (NROY) region — the set of inputs that could plausibly have produced the observed data.

The Implausibility Measure

The core tool is the implausibility measure, which quantifies how far the simulator's expected output at a candidate input point is from the observed data, scaled by all relevant sources of uncertainty.

I(x) = |z − E[f(x)]| / √( Var[f(x)] + Var[e] + Var[δ] )

where:

z         →  Observed data
E[f(x)]   →  Emulator expectation at input x (approximation of simulator output)
Var[f(x)] →  Emulator uncertainty (how well the emulator knows the simulator output)
Var[e]    →  Observation error variance
Var[δ]    →  Model discrepancy variance (structural difference between simulator and reality)

If I(x) exceeds a threshold c (typically c = 3, corresponding roughly to a 3-sigma criterion), the input x is deemed implausible and ruled out. The remaining inputs form the NROY space. Crucially, the method does not claim that the NROY inputs are correct — only that they have not been shown to be wrong given the available evidence and uncertainty.
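As a toy illustration (not taken from any particular history matching package), the calculation fits in a few lines of Python; the observed value, emulator outputs, and variances below are made-up numbers:

import numpy as np

def implausibility(z, emu_mean, emu_var, obs_var, disc_var):
    # I(x) = |z - E[f(x)]| / sqrt(Var[f(x)] + Var[e] + Var[delta])
    return np.abs(z - emu_mean) / np.sqrt(emu_var + obs_var + disc_var)

# Made-up values: one observed datum, emulator predictions at three inputs.
z = 4.2
emu_mean = np.array([3.9, 6.1, 4.8])     # E[f(x)] at each candidate input
emu_var  = np.array([0.10, 0.05, 0.40])  # Var[f(x)]: emulator uncertainty
obs_var  = 0.04                          # Var[e]: observation error variance
disc_var = 0.09                          # Var[delta]: model discrepancy variance

I = implausibility(z, emu_mean, emu_var, obs_var, disc_var)
print(np.round(I, 2), I <= 3.0)          # inputs with I(x) > 3 are ruled out

Here the second candidate input gives I ≈ 3.96 and is ruled out; the other two remain in the NROY space.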

The Role of Emulators

Because running the full simulator at every candidate point is prohibitively expensive, history matching relies on emulators — fast statistical surrogates that approximate the simulator's input-output relationship. Gaussian process emulators are the most common choice: they provide both a predicted output and a quantification of prediction uncertainty at any input point.

Emulator vs. Simulator

A simulator might take 12 hours to evaluate a single input configuration. A Gaussian process emulator, trained on a few hundred carefully chosen simulator runs, can predict the output at a new input in milliseconds — along with a credible interval reflecting how much the prediction should be trusted. History matching exploits these fast approximate evaluations to explore vast parameter spaces that would be inaccessible by direct simulation.
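As a sketch of what this looks like in code, assuming scikit-learn and a cheap toy function standing in for the expensive simulator:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def toy_simulator(x):
    # Cheap stand-in for a simulator that would really take hours per run.
    return np.sin(3 * x).ravel() + 0.5 * x.ravel()

# A small space-filling design of "expensive" runs.
X_train = np.linspace(0.0, 2.0, 12).reshape(-1, 1)
y_train = toy_simulator(X_train)

kernel = ConstantKernel(1.0) * RBF(length_scale=0.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

# Near-instant predictions, with uncertainty, at thousands of new inputs.
X_new = np.linspace(0.0, 2.0, 2000).reshape(-1, 1)
emu_mean, emu_sd = gp.predict(X_new, return_std=True)
# emu_mean estimates E[f(x)]; emu_sd**2 estimates Var[f(x)] in I(x).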

Iterative Waves

History matching proceeds in waves, each refining the NROY space.

Wave Structure

Wave 1: Train emulator on initial design → compute I(x) → cut implausible regions
Wave 2: Generate new design points in NROY space → retrain emulator → cut further
Wave 3: Repeat with improved emulator in reduced space
  ⋮
Wave k: NROY space is small enough for full Bayesian calibration or direct sampling

Each wave uses the emulator to evaluate the implausibility across the current NROY space, removes implausible regions, and then generates new simulator runs within the reduced space to build a more accurate emulator for the next wave. Because the emulator is being trained on a progressively smaller and more homogeneous region, its accuracy improves wave by wave, enabling finer discrimination between plausible and implausible inputs.
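A self-contained toy version of this loop, reusing the Gaussian process emulator sketch above; the stand-in simulator, design sizes, variances, and threshold are all illustrative assumptions rather than recommended settings:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def toy_simulator(x):
    return np.sin(3 * x).ravel() + 0.5 * x.ravel()

z, obs_var, disc_var, c = 1.0, 0.01, 0.01, 3.0
rng = np.random.default_rng(1)

# Wave 0: a dense candidate set over the full input space.
nroy = rng.uniform(0.0, 2.0, 5000).reshape(-1, 1)

for wave in range(1, 4):
    # New "expensive" runs inside the current NROY space.
    idx = rng.choice(len(nroy), size=min(20, len(nroy)), replace=False)
    design = nroy[idx]
    gp = GaussianProcessRegressor(normalize_y=True).fit(design, toy_simulator(design))
    emu_mean, emu_sd = gp.predict(nroy, return_std=True)
    I = np.abs(z - emu_mean) / np.sqrt(emu_sd**2 + obs_var + disc_var)
    nroy = nroy[I <= c]                  # cut the implausible inputs
    print(f"wave {wave}: {len(nroy)} candidates not ruled out yet")

Because each wave's design points are drawn only from the surviving region, the emulator's training data concentrate exactly where accuracy is still needed.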

Model Discrepancy

A distinctive feature of history matching is its explicit treatment of model discrepancy — the structural difference between the simulator and reality. No simulator perfectly represents the real system, and ignoring this mismatch leads to overconfident conclusions. The discrepancy term Var[δ] accounts for the fact that even the "true" parameter values might not make the simulator exactly reproduce the data.
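For illustration, with made-up numbers: suppose |z − E[f(x)]| = 4.5, Var[f(x)] = 1, and Var[e] = 1. Ignoring discrepancy gives I(x) = 4.5/√2 ≈ 3.2, so x is ruled out; acknowledging a structural discrepancy of Var[δ] = 1 gives I(x) = 4.5/√3 ≈ 2.6, and x survives. Omitting the discrepancy term can discard inputs that are in fact consistent with the data once the simulator's own imperfection is taken into account.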

This is a crucial philosophical point. Full Bayesian calibration without a discrepancy term implicitly assumes the simulator is a perfect model of reality — an assumption that is almost always false. History matching's explicit inclusion of model discrepancy makes it more robust and less prone to confidently wrong conclusions.

"We do not seek to find the best input. We seek to rule out the worst. The history matching philosophy is one of structured skepticism." — Michael Goldstein and Ian Vernon, "Bayes Linear Emulation and History Matching" (2009)

Bayes Linear Approach

Much history matching work uses Bayes linear methods rather than full Bayesian inference. The Bayes linear approach specifies only means, variances, and covariances — not full probability distributions — and updates these moments using adjusted expectations rather than Bayes' theorem. This makes the analysis tractable for high-dimensional problems where specifying and computing with full joint distributions would be impractical.
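The underlying update is the Bayes linear adjusted expectation and variance. A minimal numpy sketch, with made-up prior moments for a scalar quantity B and a two-dimensional data vector D:

import numpy as np

def bayes_linear_adjust(E_B, Var_B, Cov_BD, E_D, Var_D, D_obs):
    # Adjusted expectation: E_D[B]  = E[B] + Cov[B,D] Var[D]^(-1) (D - E[D])
    # Adjusted variance:   Var_D[B] = Var[B] - Cov[B,D] Var[D]^(-1) Cov[D,B]
    gain = np.linalg.solve(Var_D, Cov_BD.T).T
    return E_B + gain @ (D_obs - E_D), Var_B - gain @ Cov_BD.T

# Made-up prior moments and an observed data vector.
E_B, Var_B = np.array([0.0]), np.array([[1.0]])
Cov_BD = np.array([[0.6, 0.3]])
E_D = np.array([0.0, 0.0])
Var_D = np.array([[1.0, 0.2], [0.2, 1.0]])
D_obs = np.array([1.1, -0.4])

adj_E, adj_Var = bayes_linear_adjust(E_B, Var_B, Cov_BD, E_D, Var_D, D_obs)
print(adj_E, adj_Var)   # moments updated without specifying any full distributions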

Why Rule Out Rather Than Fit?

History matching adopts a falsificationist philosophy: it is often easier to demonstrate that a parameter setting is inconsistent with data than to determine that it is correct. By focusing on ruling out implausible regions first, the method avoids wasting computational effort on detailed inference in parts of parameter space that are clearly wrong. The full Bayesian posterior, if desired, can be computed in the final small NROY region where emulation is most accurate.

Applications

Climate Modeling

Climate models have dozens of uncertain parameters governing cloud physics, aerosol interactions, and ocean mixing. History matching has been applied at the UK Met Office and elsewhere to identify which parameter configurations are consistent with observed temperature records, top-of-atmosphere radiation budgets, and other climate constraints.

Reservoir Engineering

Oil and gas reservoir simulators predict fluid flow through porous rock. History matching calibrates the simulator to observed production data (pressures, flow rates) to forecast future reservoir behavior and guide extraction strategies.

Cosmology

Galaxy formation simulations such as GALFORM involve complex subgrid physics with many free parameters. History matching has been used to identify the region of parameter space consistent with observed galaxy luminosity functions, color distributions, and morphological statistics.

Systems Biology

Complex biological models with many rate constants and interaction parameters can be calibrated using history matching. Applications include immune system modeling, epidemiological models, and gene regulatory networks.

Relationship to Full Bayesian Calibration

History matching is not a competitor to full Bayesian calibration — it is a complement and often a precursor. By efficiently reducing the parameter space to the NROY region, history matching makes subsequent full Bayesian analysis computationally feasible. The NROY region also serves as a diagnostic: if it is empty, the simulator cannot match the data under any input setting, suggesting model misspecification.

