In 1964, Ray Solomonoff published what may be the most profound result in the theory of inductive inference: a construction of a universal prior distribution that assigns higher probability to simpler hypotheses and whose predictions provably converge to those of the true data-generating distribution whenever that distribution is computable. Solomonoff's theory unifies Bayesian probability, Kolmogorov complexity, and computability theory into a single framework, providing the deepest known justification for Occam's razor and establishing an absolute standard against which all learning algorithms can be measured.
The Universal Prior
Solomonoff's key construction is the universal prior (also called the Solomonoff prior or universal semimeasure). Consider a universal Turing machine U. The prior probability of a binary string x is defined as a sum over all programs whose output begins with x, with each program p contributing a weight of 2^(−|p|):

M(x) = Σ_{p : U(p) begins with x} 2^(−|p|)

Here, the sum ranges over all programs p (viewed as binary strings) whose output on the universal Turing machine U begins with x, and |p| is the length of program p in bits. The factor 2^(−|p|) assigns each program a probability that reflects its simplicity — shorter programs receive exponentially higher weight.
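The true prior M is incomputable, but the shape of this sum can be made concrete on a deliberately simple toy machine. The sketch below (in Python; toy_run and toy_prior are illustrative names, not from any standard library) replaces the universal machine with a toy machine whose programs simply print themselves on repeat, and computes the truncated sum over all programs up to a fixed length. Since the toy machine is not universal, this only illustrates the structure of the sum, not M itself.

```python
from itertools import product

def toy_run(program, n):
    """Toy stand-in for the machine U: a 'program' (a binary string) simply
    outputs itself repeated forever; we take the first n output bits.
    This machine is NOT universal -- it only makes the sum computable."""
    return (program * (n // len(program) + 1))[:n]

def toy_prior(x, max_len=12):
    """Truncated analogue of M(x): sum 2^(-|p|) over every program p of at
    most max_len bits whose output begins with the string x."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if toy_run(p, len(x)) == x:
                total += 2.0 ** (-length)
    return total

# Regular strings are reproduced by short programs and accumulate large weight;
# strings with no short description pick up only tiny contributions.
print(toy_prior("01010101"))  # about 0.35: the 2-bit program "01" already explains it
print(toy_prior("00010111"))  # about 0.02: no program shorter than the string works
```

Even in this toy setting the exponential weighting dominates: the single two-bit program "01" contributes more to the first sum than all longer programs combined.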
This prior has a remarkable property: it dominates every computable probability distribution. For any computable measure μ, there exists a constant c_μ > 0 such that M(x) ≥ c_μ · μ(x) for all strings x; intuitively, this holds because μ itself can be computed by some program of length about K(μ) bits, so c_μ can be taken to be roughly 2^(−K(μ)). This means that no computable pattern in the data can escape the universal prior — it assigns nonzero probability to every computable regularity.
Optimal Prediction
Using the universal prior for Bayesian prediction yields the Solomonoff predictor. Given an observed sequence x_1, x_2, …, x_n, the predictive probability of the next symbol is the ratio of prior values:

M(x_{n+1} | x_1 x_2 … x_n) = M(x_1 x_2 … x_n x_{n+1}) / M(x_1 x_2 … x_n)
Solomonoff proved that this predictor converges to the true conditional probabilities at a rate that is optimal in a precise sense: the total expected divergence between the Solomonoff predictions and the true distribution is bounded by a constant (related to the Kolmogorov complexity of the true distribution), regardless of how long the sequence is. This is an extraordinarily strong convergence guarantee — it means the Solomonoff predictor makes only a bounded total number of "significant" prediction errors over an infinite sequence.
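As a rough illustration of this predictive ratio (not of the incomputable predictor itself), the sketch below reuses the toy machine from the previous sketch, redefining toy_prior so the snippet runs on its own; predict_next is an illustrative name. Next-bit probabilities are computed as ratios of truncated prior values.

```python
from itertools import product

def toy_prior(x, max_len=12):
    """Truncated toy prior from the previous sketch (redefined so this snippet
    is self-contained): programs are binary strings that the toy machine
    outputs on repeat, each weighted by 2^(-length)."""
    total = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if (p * (len(x) // length + 1))[:len(x)] == x:
                total += 2.0 ** (-length)
    return total

def predict_next(x, max_len=12):
    """Next-bit probabilities as ratios of prior values, mirroring
    M(x a) / M(x) with the truncated toy prior standing in for M."""
    m_x = toy_prior(x, max_len)
    return {a: toy_prior(x + a, max_len) / m_x for a in "01"}

# After an alternating prefix, almost all weight lies on programs that
# continue the alternation, so the predictor strongly expects a "1" next.
print(predict_next("0101010"))  # roughly {"0": 0.07, "1": 0.93}
```

Because most of the weight on "0101010" comes from programs that continue the alternation, the toy predictor assigns the next "1" a probability close to 0.93.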
Despite its theoretical optimality, Solomonoff's predictor is incomputable. Computing the universal prior requires summing over all programs that produce a given output — a task that involves solving the halting problem. The theory therefore serves not as a practical algorithm but as a gold standard: a theoretical ideal that practical methods approximate. Just as the Bayes error rate bounds what any classifier can achieve, Solomonoff induction bounds what any predictor can achieve. Practical approximations include the Minimum Description Length (MDL) principle, the Context Tree Weighting algorithm, and certain compression-based approaches to classification.
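Of these, the simplest piece to show in code is the Krichevsky-Trofimov (KT) estimator, the computable base predictor that Context Tree Weighting mixes across contexts. The sketch below is a minimal, self-contained version of the KT rule only (kt_predict is an illustrative name), not the full CTW algorithm.

```python
def kt_predict(counts):
    """Krichevsky-Trofimov estimator: probability that the next bit is 1,
    given counts = [zeros seen so far, ones seen so far]."""
    zeros, ones = counts
    return (ones + 0.5) / (zeros + ones + 1.0)

# Sequential prediction of a biased bit stream; the estimate adapts as data arrives.
counts = [0, 0]
for bit in [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]:
    p_one = kt_predict(counts)
    p_bit = p_one if bit == 1 else 1.0 - p_one
    print(f"predicted P(next=1) = {p_one:.3f}, observed {bit} (assigned {p_bit:.3f})")
    counts[bit] += 1
```

Multiplying the assigned probabilities gives the KT estimate of the whole sequence's probability, which is the quantity that CTW combines over tree contexts.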
Formalization of Occam's Razor
Solomonoff's theory provides the first rigorous mathematical formalization of Occam's razor — the principle that simpler explanations should be preferred. The universal prior assigns probability 2^(−|p|) to each program, so a hypothesis that can be described in k bits receives prior probability at least 2^(−k). Simpler hypotheses (shorter programs) automatically receive higher prior weight. The convergence theorem then shows that this simplicity bias is not just aesthetically appealing but provably optimal for prediction.
Historical Development
Ray Solomonoff presents the first version of his theory at a conference, proposing to use algorithmic probability as a universal prior for induction. This predates both Kolmogorov's and Chaitin's independent work on algorithmic complexity.
Solomonoff publishes the two-part paper "A Formal Theory of Inductive Inference" in Information and Control, establishing the mathematical foundations including the universal prior and convergence results.
Kolmogorov and Chaitin independently develop algorithmic information theory (Kolmogorov complexity), providing the complexity-theoretic foundations that complement Solomonoff's probabilistic framework.
Solomonoff extends his theory to address the problem of total evidence and develops improved bounds on the rate of convergence.
Marcus Hutter develops AIXI, combining Solomonoff induction with sequential decision theory to define a theoretically optimal universal artificial intelligence.
Connection to Kolmogorov Complexity
The universal prior is intimately related to Kolmogorov complexity K(x) — the length of the shortest program that produces x. The relationship is approximately:

M(x) ≈ 2^(−K(x))

This means the universal prior assigns each string a probability roughly equal to 2^(−K(x)): strings with short descriptions (low Kolmogorov complexity) receive high prior probability, while random or patternless strings — which have high Kolmogorov complexity — receive low prior probability. Since Kolmogorov complexity is itself incomputable, this connection reinforces the incomputability of the universal prior while deepening our understanding of its structure.
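The same toy machine from the earlier sketches can show this correspondence in miniature: for each string, -log2 of the truncated toy prior stays close to the length of the shortest toy program that reproduces it. This is only an analogue under a non-universal machine, not a computation of K(x) itself; code_length_vs_complexity is an illustrative name.

```python
from itertools import product
from math import log2

def toy_run(p, n):
    # Same toy (non-universal) machine as in the earlier sketches: output p on repeat.
    return (p * (n // len(p) + 1))[:n]

def code_length_vs_complexity(x, max_len=12):
    """Return (-log2 of the truncated toy prior of x, length of the shortest
    toy program whose output begins with x)."""
    total, shortest = 0.0, None
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if toy_run(p, len(x)) == x:
                total += 2.0 ** (-length)
                if shortest is None:
                    shortest = length
    return -log2(total), shortest

# The prior's code length -log2(prior) tracks the shortest-program length:
# regular strings score low on both, irregular strings high on both.
for x in ["0101010101", "0010010010", "0001011101"]:
    print(x, code_length_vs_complexity(x))
```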
Philosophical Significance
Solomonoff's theory addresses foundational questions in the philosophy of science. The problem of induction — how can we justify inferring general laws from finite observations? — receives a Bayesian answer: the universal prior provides a uniquely privileged starting point that is, in a precise sense, maximally open-minded about the possible structure of the world, while still favoring simpler explanations. The convergence theorem then guarantees that predictions will eventually become accurate regardless of the true underlying process (provided it is computable).
"Solomonoff induction is the theoretical solution to the problem of induction. All other methods of inference are either approximations to it or are provably suboptimal in comparison."— Marcus Hutter
Impact on Artificial Intelligence
Solomonoff's work is increasingly recognized as a cornerstone of AI theory. Marcus Hutter's AIXI agent — which combines Solomonoff induction for prediction with sequential decision theory for action selection — defines a theoretical upper bound on the intelligence of any computable agent. While AIXI is itself incomputable, it has inspired practical approximations and provided a formal framework for thinking about artificial general intelligence. The theory also connects to the speed prior proposed by Jürgen Schmidhuber, which modifies the universal prior to additionally favor computationally efficient programs, trading some theoretical generality for greater practical relevance.