Bayesian Statistics

Speed Prior

The Speed Prior, introduced by Juergen Schmidhuber in 2002, assigns higher probability to hypotheses that can be computed quickly, penalizing not only description length but also computational time.

P(x) ∝ 2^(−K_t(x)), where K_t(x) is the time-bounded Kolmogorov complexity

The Speed Prior addresses a fundamental question at the intersection of algorithmic information theory, Bayesian inference, and computational learning theory: among all computable hypotheses consistent with observed data, how should we weight them? Solomonoff's universal prior assigns probability proportional to 2^(−K(x)), where K(x) is the Kolmogorov complexity (shortest program length) of x. Schmidhuber's Speed Prior refines this by additionally penalizing hypotheses that require more computation time to generate, yielding a prior that favors hypotheses that are both simple and fast.

The motivation is both philosophical and practical. The Solomonoff prior, while theoretically elegant, is uncomputable. Moreover, it assigns substantial probability to programs that produce the observed data only after astronomically long computations. The Speed Prior offers a computable (or at least more tractable) alternative that is better aligned with the resource constraints of physical systems — including, Schmidhuber argues, the universe itself.

Speed Prior S(x)  ∝  Σ_p 2^(−|p|) · 2^(−log t(p, x))

Simplified Form S(x)  ∝  2^(−K_t(x))

Where   |p|      →  length of program p (in bits)
        t(p, x)  →  runtime of program p to produce output x
        K_t(x)   →  time-bounded Kolmogorov complexity
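
To make the sum concrete, here is a minimal Python sketch that evaluates the contribution 2^(−|p|) · 2^(−log t) = 2^(−|p|)/t for a handful of hypothetical programs assumed to print the same string x, and adds them up. The program lengths and runtimes are invented for illustration; the true set of programs (and hence the true prior) cannot actually be enumerated this way.

# Toy evaluation of S(x) ∝ Σ_p 2^(−|p|) / t(p, x).
# Each pair is (length in bits, runtime in steps) for a hypothetical
# program that outputs x; the numbers are made up for illustration.
programs_for_x = [
    (10, 8),          # short and fast
    (10, 1_000_000),  # equally short, but very slow
    (14, 4),          # longer, even faster
]

def contribution(length_bits, runtime_steps):
    # 2^(−|p|) · 2^(−log2 t)  =  2^(−|p|) / t
    return 2.0 ** (-length_bits) / runtime_steps

weights = [contribution(l, t) for l, t in programs_for_x]
for (l, t), w in zip(programs_for_x, weights):
    print(f"|p| = {l:2d} bits, t = {t:>9,} steps  ->  contribution {w:.3e}")
print(f"unnormalized S(x) ≈ {sum(weights):.3e}")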

Relationship to Solomonoff Induction

Solomonoff's universal prior (1964) is the foundational prior in algorithmic information theory. It assigns to any binary string x a probability M(x) = Σ_p 2^(−|p|), summed over all programs p on a universal Turing machine that output x (and halt). This prior is universal — it dominates any computable probability measure — but it is also uncomputable and gives no weight to computational efficiency.

The Speed Prior modifies the Solomonoff prior by down-weighting programs proportional to their runtime. A program that takes time t to produce x contributes not 2^(−|p|) but approximately 2^(−|p|) / t to the prior probability. The effect is that fast-generating hypotheses dominate. If two programs produce the same data, and one is short but slow while the other is slightly longer but much faster, the Speed Prior may prefer the faster one.
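
A quick numerical sketch (program lengths and runtimes invented for illustration) shows how the 1/t penalty can flip the ranking relative to the Solomonoff prior:

# Program A: short but slow. Program B: longer but fast. Same output x.
len_A, time_A = 10, 2 ** 20   # 10 bits, about a million steps
len_B, time_B = 20, 2 ** 4    # 20 bits, 16 steps

solomonoff_A = 2.0 ** -len_A        # 2^(-10): only length matters
solomonoff_B = 2.0 ** -len_B        # 2^(-20)
speed_A = 2.0 ** -len_A / time_A    # 2^(-30): runtime penalty applied
speed_B = 2.0 ** -len_B / time_B    # 2^(-24)

print("Solomonoff prefers A:", solomonoff_A > solomonoff_B)   # True
print("Speed Prior prefers B:", speed_B > speed_A)            # True

The Solomonoff weight depends only on length, so the 10-bit program wins; the Speed Prior's factor of 1/t hands the advantage to the 20-bit program that runs 2^16 times faster.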

The Lazy Universe Hypothesis

Schmidhuber has argued that if the universe is a computation, then the Speed Prior is its natural measure. The idea is that a computationally generated universe would tend to minimize resources — producing computable regularities that are efficient to generate. This "lazy universe" perspective suggests that the laws of physics should be expressible by fast programs, and that apparent randomness might reflect computational shortcuts rather than genuine stochasticity. While speculative, this view connects algorithmic probability theory to cosmology and the philosophy of physics.

Technical Properties

The Speed Prior has several notable properties. First, it is enumerable from above — there exists an algorithm that produces a sequence of decreasing upper bounds converging to S(x). This is a weaker property than computability but stronger than the Solomonoff prior's semi-computability. Second, the Speed Prior induces a bias toward the types of patterns that are discoverable by polynomial-time algorithms, making it more relevant to machine learning practice than the Solomonoff prior. Third, the total predictive error of the Speed Prior, measured in terms of expected log loss, is bounded by O(K_t(μ)), where μ is the true data-generating distribution and K_t(μ) is its time-bounded complexity.
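
The third property can be written schematically as a bound on the cumulative expected log loss of predicting with S instead of the true distribution μ. The LaTeX display below is only a gloss of that claim, with constants and the exact conditional setup of Schmidhuber (2002) omitted:

% Schematic restatement of the error bound described above;
% constants and the precise measure-theoretic setup are omitted.
\sum_{n=1}^{\infty} \mathbb{E}_{\mu}\!\left[
    \ln \frac{\mu(x_n \mid x_{<n})}{S(x_n \mid x_{<n})}
\right] \;\le\; O\!\bigl(K_t(\mu)\bigr)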

Connections to Minimum Description Length

The Minimum Description Length (MDL) principle of Rissanen selects the hypothesis that minimizes the total code length: the description length of the model plus the description length of the data given the model. The Speed Prior can be viewed as an extension of MDL that also penalizes the computational cost of the encoding and decoding process. In this sense, the Speed Prior bridges the gap between Bayesian model selection (which uses Solomonoff-style priors) and computational complexity theory (which cares about runtime).
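
The toy comparison below (all numbers invented) contrasts a plain two-part MDL score with a speed-aware variant that adds a log2(runtime) term, the extra cost that a 2^(−|p|)/t style weighting would impose. It is a sketch in the spirit of the Speed Prior, not Schmidhuber's formalism.

import math

# Toy two-part-code comparison (all numbers invented for illustration).
# MDL score: bits to describe the model plus bits to describe the data
# given the model. The speed-aware score adds log2(decode steps).
candidates = {
    #                 model_bits  data_bits  decode_steps
    "small formula":  (40,        25,        200),
    "slow simulator": (30,        20,        2 ** 30),
    "lookup table":   (800,        0,        10),
}

def mdl_score(model_bits, data_bits, _steps):
    return model_bits + data_bits

def speed_aware_score(model_bits, data_bits, steps):
    return model_bits + data_bits + math.log2(steps)

for name, (m, d, t) in candidates.items():
    print(f"{name:14s}  MDL = {mdl_score(m, d, t):6.1f}   "
          f"speed-aware = {speed_aware_score(m, d, t):6.1f}")

Under plain MDL the slow simulator wins on description length alone (50 bits versus 65); once the log-runtime term is added, the slightly longer but far cheaper formula comes out ahead (roughly 72.6 versus 80).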

Implications for Machine Learning

Modern deep learning implicitly implements a version of the Speed Prior through its reliance on gradient descent and fixed computational budgets. Models that require excessive computation to train or evaluate are effectively discarded. Stochastic gradient descent, early stopping, and architecture search all introduce biases toward computationally efficient hypotheses. The Speed Prior provides a theoretical framework for understanding these implicit biases and their consequences for generalization.
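
As a loose illustration of the fixed-budget point, the sketch below selects among candidate configurations but silently drops any whose training cannot finish within a wall-clock budget; train_and_evaluate is a hypothetical stand-in assumed to return a validation loss, or None when it hits the deadline.

import time

# Hedged sketch: a fixed compute budget acts like a crude runtime prior.
# Candidates that cannot finish within the budget get weight zero,
# regardless of how well they might eventually fit the data.
# `train_and_evaluate` is a hypothetical callable, assumed to return a
# validation loss, or None if the deadline was exceeded.
def budgeted_search(candidates, train_and_evaluate, budget_seconds=60.0):
    best_name, best_loss = None, float("inf")
    for name, config in candidates:
        deadline = time.monotonic() + budget_seconds
        loss = train_and_evaluate(config, deadline=deadline)
        if loss is None:      # over budget: effectively discarded
            continue
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name, best_loss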

"The simplest explanation of the data is not just the shortest one, but the one that can be computed most quickly. Efficiency is not separate from simplicity — it is a deeper form of it." — Juergen Schmidhuber, The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions (2002)

Timeline

1964: Ray Solomonoff introduces the universal prior based on Kolmogorov complexity, establishing the foundation for algorithmic probability.

1978: Levin introduces the concept of time-bounded Kolmogorov complexity, foreshadowing the Speed Prior's emphasis on computational cost.

2002: Schmidhuber publishes the Speed Prior, combining Solomonoff's universal prior with penalties for computational time to yield near-optimal computable predictions.

Critiques and Open Questions

The Speed Prior inherits several open questions from algorithmic information theory. The choice of reference universal Turing machine affects the prior's values, though only up to a machine-dependent constant in code length, so its relative impact diminishes for longer strings. The precise runtime penalty function is also somewhat arbitrary: linear, logarithmic, and polynomial penalties all yield different Speed Priors with different properties. Furthermore, the relationship between the Speed Prior and physical law remains speculative, lacking empirical tests that could distinguish it from alternatives.
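
The sketch below applies three toy penalty variants to a single hypothetical program (12 bits, 10,000 steps) to show how much the choice matters; the parameterizations are illustrative, not the exact definitions used in the literature.

import math

# One hypothetical program: |p| = 12 bits, runtime t = 10,000 steps.
length_bits, t = 12, 10_000
base = 2.0 ** (-length_bits)        # Solomonoff-style weight, no penalty

weights = {
    "no penalty":              base,
    "logarithmic (1/log2 t)":  base / math.log2(t),
    "linear (1/t)":            base / t,
    "quadratic (1/t**2)":      base / t ** 2,
}
for name, w in weights.items():
    print(f"{name:24s} {w:.3e}")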

Example: Ranking Binary Sequences by Complexity

Consider four 8-bit binary sequences. We use run-length encoding as a proxy for Kolmogorov complexity, taking the number of runs as the compressed length, and assign each sequence a Speed Prior weight proportional to 2^(−compressed_length).

Sequence      Runs   Prior weight
"00000000"    1      2⁻¹ = 0.5000
"00001111"    2      2⁻² = 0.2500
"00110011"    4      2⁻⁴ = 0.0625
"01001011"    6      2⁻⁶ = 0.0156

The constant sequence "00000000" has only 1 run — it is maximally compressible and receives the highest Speed Prior probability. The alternating-block sequence "00001111" has 2 runs and receives half the weight. The complex sequence "01001011" has 6 runs and receives 32 times less weight than the constant sequence. After normalization, the Speed Prior assigns 60% probability to the constant sequence, 30% to the two-block sequence, and less than 2% to the complex sequence. This captures the Speed Prior's key insight: among hypotheses consistent with observations, simpler (faster-to-compute) ones should be preferred.
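
The whole toy calculation can be reproduced in a few lines of Python; the run-length proxy is, of course, only a stand-in for true (uncomputable) Kolmogorov complexity.

def run_count(bits):
    """Number of maximal runs of identical symbols (run-length proxy)."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

sequences = ["00000000", "00001111", "00110011", "01001011"]
weights = {s: 2.0 ** -run_count(s) for s in sequences}
total = sum(weights.values())

for s, w in weights.items():
    print(f"{s}: {run_count(s)} runs, weight {w:.4f}, "
          f"normalized {w / total:.3f}")
# prints normalized probabilities of roughly 0.60, 0.30, 0.08, 0.02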
