
Bayesian Program Synthesis

Bayesian program synthesis learns programs from data by placing prior distributions over program spaces and performing Bayesian inference, framing the search for explanatory code as a problem of posterior computation over structured symbolic representations.

P(program | data) ∝ P(data | program) · P(program)

Bayesian program synthesis addresses one of the most ambitious goals in artificial intelligence: automatically writing programs that explain observed data. Instead of learning statistical parameters within a fixed model, it searches over a combinatorial space of programs — symbolic, compositional, executable objects — using Bayesian inference to balance the complexity of each candidate program against how well it accounts for the evidence. This approach connects machine learning to the foundations of induction, offering a principled framework for learning structured, interpretable, and generalizable knowledge from sparse data.

The Inference Framework

In Bayesian program synthesis, we define a space of candidate programs, a prior distribution over that space, and a likelihood model relating programs to observations. The posterior probability of a program given data follows from Bayes' theorem:

Program posterior: P(program | data) ∝ P(data | program) · P(program)

The likelihood P(data | program) measures how well the program accounts for the observed data. For a deterministic program it might be 1 if the program's output matches the data and 0 otherwise; for a probabilistic program it is the probability the program assigns to the data.

The prior P(program) encodes a preference for simpler, shorter, or more natural programs. This prior is essential: without it, overly complex programs that memorize the data would dominate, undermining generalization.
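As a concrete illustration, here is a minimal sketch of this scoring rule in Python, assuming a toy setting in which candidate programs are plain Python callables, the likelihood is the deterministic 0/1 form described above, and the prior is a simple length penalty. The function names are illustrative rather than taken from any particular system.

```python
def log_prior(source: str) -> float:
    """Illustrative length-based prior: log P(program) = -len(source) + const."""
    return -float(len(source))    # shorter programs get higher (less negative) scores

def log_likelihood(program, examples) -> float:
    """Deterministic likelihood: P(data | program) is 1 if every output matches, else 0."""
    for x, y in examples:
        if program(x) != y:
            return float("-inf")  # log 0: program is inconsistent with the data
    return 0.0                    # log 1: program reproduces all examples

def log_posterior(program, source, examples) -> float:
    """Unnormalized log posterior: log P(data | program) + log P(program)."""
    return log_likelihood(program, examples) + log_prior(source)

# Two candidate explanations of the data [(1, 2), (3, 6)]
examples = [(1, 2), (3, 6)]
candidates = {
    "lambda x: x * 2": lambda x: x * 2,   # consistent with both examples
    "lambda x: x + 1": lambda x: x + 1,   # ruled out by the second example
}
for source, fn in candidates.items():
    print(source, log_posterior(fn, source, examples))
```

In this setting the doubling program is consistent with both examples and receives a finite score, while the increment program is assigned zero likelihood.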

The Role of the Prior

The choice of prior over programs is perhaps the most consequential design decision. Several approaches have been explored:

Description length priors assign higher probability to shorter programs, following the minimum description length (MDL) principle. A program's prior probability decreases exponentially with its length, measured in some reference language. This connects directly to Solomonoff's theory of inductive inference and Kolmogorov complexity.
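To make the exponential penalty concrete, here is a tiny sketch (with an assumed cost of one bit per symbol, purely for illustration) showing how quickly prior mass falls off with length:

```python
def mdl_prior(length_in_symbols: int) -> float:
    """Unnormalized description-length prior: P(program) ∝ 2^(-length)."""
    return 2.0 ** -length_in_symbols

# A program 10 symbols longer starts with ~1000x less prior mass, so it must
# account for the data ~1000x better to win the posterior comparison.
print(mdl_prior(8) / mdl_prior(18))   # 1024.0
```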

Grammar-based priors define a probabilistic context-free grammar (PCFG) over the programming language, assigning probabilities to each production rule. Programs generated by common patterns receive higher prior probability, encoding a form of naturalness prior.
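The sketch below illustrates the idea with a small hypothetical PCFG over arithmetic expressions; the grammar and its rule probabilities are invented for this example. A program's prior probability is the product of the probabilities of the production rules used in its derivation, so expressions built from common patterns receive more prior mass.

```python
import math
import random

# Hypothetical PCFG over a tiny expression language; rule probabilities are
# invented for illustration. Each nonterminal maps to (expansion, probability)
# pairs, and items that appear as grammar keys are expanded recursively.
GRAMMAR = {
    "EXPR": [(["TERM"], 0.6),
             (["(", "EXPR", "OP", "EXPR", ")"], 0.4)],
    "TERM": [(["x"], 0.5), (["1"], 0.25), (["2"], 0.25)],
    "OP":   [(["+"], 0.5), (["*"], 0.5)],
}

def sample(symbol="EXPR"):
    """Sample an expression from the PCFG and return (tokens, log prior probability)."""
    rules = GRAMMAR[symbol]
    rhs, prob = random.choices(rules, weights=[p for _, p in rules])[0]
    tokens, log_p = [], math.log(prob)
    for item in rhs:
        if item in GRAMMAR:                  # nonterminal: expand recursively
            sub_tokens, sub_log_p = sample(item)
            tokens.extend(sub_tokens)
            log_p += sub_log_p
        else:                                # terminal: emit as-is
            tokens.append(item)
    return tokens, log_p

tokens, log_p = sample()
print("".join(tokens), "log P(program) =", round(log_p, 3))
```

Short expressions such as x or (x+1) receive far more prior probability than deeply nested ones, which plays the same simplicity-favoring role as a description-length prior.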

Neural-guided priors use neural networks to learn a distribution over programs from a corpus of existing code, concentrating search on regions of program space that resemble human-written programs.

Learning a Number Concept: A Canonical Example

Consider learning the concept "even number" from the examples {2, 8, 14, 6}. A Bayesian program synthesizer might consider programs like "x mod 2 = 0" (even), "2 ≤ x ≤ 14" (range), and "x ∈ {2, 6, 8, 14}" (memorization). The first program is short and explains the data perfectly. The second also covers all four examples but admits values (e.g., 3, 5) that the data give no reason to include. The third memorizes the data and generalizes to nothing beyond it. The Bayesian posterior — combining fit with simplicity — overwhelmingly favors the first program, matching human intuition about concept learning.
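A sketch of the posterior comparison for this example, using the 0/1 likelihood and a character-length prior as stand-ins for "fit" and "simplicity"; the normalization is over just these three candidates, so the exact numbers are only illustrative.

```python
import math

data = [2, 8, 14, 6]

# The three candidate programs from the text, written as Python predicates.
candidates = {
    "x % 2 == 0":         lambda x: x % 2 == 0,          # "even number"
    "2 <= x <= 14":       lambda x: 2 <= x <= 14,        # range
    "x in {2, 6, 8, 14}": lambda x: x in {2, 6, 8, 14},  # memorization
}

def log_score(source, predicate, data):
    """0/1 likelihood (does the program accept every example?) plus a
    description-length prior, log P(program) = -len(source) * log 2."""
    if not all(predicate(x) for x in data):
        return float("-inf")
    return -len(source) * math.log(2)

scores = {src: log_score(src, pred, data) for src, pred in candidates.items()}
total = sum(math.exp(s) for s in scores.values())
for src, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{src:20s} posterior = {math.exp(s) / total:.4f}")
```

Under this toy scoring the even-number rule receives the most posterior mass, the range rule comes second, and memorization trails far behind.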

Computational Challenges

The central computational challenge is that program spaces are vast, discrete, and combinatorially structured. Unlike continuous parameter spaces where gradient methods can guide search, the space of programs requires specialized exploration strategies:

Enumeration exhaustively generates programs in order of increasing size or decreasing prior probability. It is complete but only feasible for small program spaces or very simple target programs.
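A sketch of enumerative synthesis over a tiny hypothetical expression language (variable x, constants 1 and 2, operators + and *): candidate expressions are generated in order of increasing size, and the first one consistent with every example is returned, which roughly corresponds to a maximum a posteriori choice under a size-based prior.

```python
LEAVES = ["x", "1", "2"]
OPS = ["+", "*"]

def expressions(size):
    """Yield every expression string that uses exactly `size` leaves."""
    if size == 1:
        yield from LEAVES
        return
    for left_size in range(1, size):
        for left in expressions(left_size):
            for right in expressions(size - left_size):
                for op in OPS:
                    yield f"({left} {op} {right})"

def synthesize(examples, max_size=5):
    """Return the smallest expression consistent with every (input, output) pair."""
    for size in range(1, max_size + 1):        # increasing size = decreasing prior
        for expr in expressions(size):
            # eval is fine for a toy sketch; a real synthesizer would use an interpreter.
            if all(eval(expr, {"x": x}) == y for x, y in examples):
                return expr
    return None

# Learn a program for f(x) = 2x + 1 from three input-output examples.
print(synthesize([(1, 3), (2, 5), (5, 11)]))   # prints an expression equal to 2x + 1,
                                               # e.g. "(x + (x + 1))"
```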

Stochastic search methods — including MCMC over program trees, evolutionary algorithms, and simulated annealing — explore the space by making local modifications to candidate programs. These methods can handle larger spaces but may get stuck in local optima.
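A sketch of one such stochastic search, Metropolis-Hastings over expression trees in a similar toy language, with trees represented as nested tuples. The proposal regenerates a random subtree, and the acceptance rule below uses only the posterior ratio, omitting the proposal-probability correction that a fully correct sampler would include; it is a bare-bones illustration, not a tuned implementation.

```python
import math
import random

# Toy expression language, with trees represented as nested tuples.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
LEAVES = ["x", 1, 2]

def sample_tree(depth=0):
    """Grow a random expression tree; deeper nodes are forced to be leaves."""
    if depth >= 2 or random.random() < 0.5:
        return random.choice(LEAVES)
    return (random.choice(list(OPS)), sample_tree(depth + 1), sample_tree(depth + 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(tree):
    return 1 if not isinstance(tree, tuple) else 1 + size(tree[1]) + size(tree[2])

def log_posterior(tree, examples):
    """0/1 likelihood over the examples plus a size-based (description-length) prior."""
    if any(evaluate(tree, x) != y for x, y in examples):
        return float("-inf")
    return -size(tree) * math.log(2)

def propose(tree):
    """Proposal: replace a randomly chosen subtree with a freshly sampled one."""
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return sample_tree()
    op, left, right = tree
    return (op, propose(left), right) if random.random() < 0.5 else (op, left, propose(right))

def mcmc(examples, steps=20000):
    current = sample_tree()
    current_lp = log_posterior(current, examples)
    best, best_lp = current, current_lp
    for _ in range(steps):
        candidate = propose(current)
        candidate_lp = log_posterior(candidate, examples)
        if candidate_lp == float("-inf"):
            continue                          # inconsistent candidate: always reject
        # Simplified acceptance rule: posterior ratio only (proposal correction omitted).
        if current_lp == float("-inf") or math.log(random.random() + 1e-300) < candidate_lp - current_lp:
            current, current_lp = candidate, candidate_lp
            if current_lp > best_lp:
                best, best_lp = current, current_lp
    return best, best_lp

best, best_lp = mcmc([(1, 3), (2, 5), (5, 11)])   # target behaviour: f(x) = 2x + 1
print(best, best_lp)
```

With these examples the chain typically finds a tree equivalent to 2x + 1 within a few thousand steps; on harder problems, smarter proposals, restarts, or tempering help escape local optima.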

Neural-guided search uses a learned neural model to propose likely program components, dramatically narrowing the search space. The DreamCoder system, for example, iterates between synthesizing programs and learning a neural recognition model that accelerates future synthesis.

Key Systems and Results

2010

Liang et al. develop a Bayesian framework for learning simple programs from input-output examples, demonstrating that hierarchical priors enable learning from very few examples.

2015

Lake, Salakhutdinov, and Tenenbaum publish the Bayesian Program Learning (BPL) model for one-shot handwritten character recognition, matching human performance by learning generative programs for each character.

2021

The DreamCoder system (Ellis et al.) combines neural-guided search with library learning, discovering reusable subroutines across multiple domains including list processing, geometry, and physics.

The BPL work was particularly influential because it demonstrated that Bayesian program synthesis could capture the richness of human concept learning. By learning a motor program for each character — specifying strokes, their order, and spatial relations — BPL could generate new examples of learned characters, recognize characters from a single example, and even produce novel characters in the style of an alphabet, all from a principled Bayesian framework.

Connection to Human Cognition

Bayesian program synthesis has deep connections to cognitive science. Researchers like Joshua Tenenbaum and colleagues have argued that human concept learning is best understood as a form of probabilistic inference over structured, compositional representations — essentially programs. Children's ability to learn new concepts from one or two examples, generalize appropriately, and compose known concepts into novel ones mirrors the capabilities of Bayesian program synthesis systems.

"People learn richer representations than standard pattern recognition — they learn something more like a program, a structured description that captures the causal and compositional nature of the world."— Joshua Tenenbaum

Future Directions

The field continues to grapple with scaling program synthesis to richer languages and larger datasets. Hybrid approaches that combine the structured priors of Bayesian synthesis with the pattern recognition capabilities of large neural networks represent a promising frontier. The ultimate vision — systems that learn not just to classify or predict but to understand, by discovering the programs that generate the data they observe — remains one of AI's deepest aspirations.

Interactive Calculator

Each row is an integer input-output pair. The calculator evaluates a set of candidate programs — +1, *2, ^2, +3, *3, and identity — by computing P(program | I/O data) using Bayes' theorem. Each program has a uniform prior, and the likelihood of each example is 1.0 if the program reproduces the exact output and 0.01 otherwise (a small allowance for noise). The posterior ranking reveals which program best explains the observed I/O examples.
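The sketch below reproduces the calculation the calculator performs, under the stated assumptions (uniform prior over the six candidates, per-example likelihood of 1.0 for an exact match and 0.01 otherwise); the candidate names are just display labels.

```python
CANDIDATES = {
    "+1":       lambda x: x + 1,
    "*2":       lambda x: x * 2,
    "^2":       lambda x: x ** 2,
    "+3":       lambda x: x + 3,
    "*3":       lambda x: x * 3,
    "identity": lambda x: x,
}

def posterior_ranking(io_pairs):
    """Rank candidates by P(program | I/O data): uniform prior, and a per-example
    likelihood of 1.0 for an exact output match, 0.01 otherwise."""
    prior = 1.0 / len(CANDIDATES)
    scores = {}
    for name, program in CANDIDATES.items():
        likelihood = 1.0
        for x, y in io_pairs:
            likelihood *= 1.0 if program(x) == y else 0.01
        scores[name] = prior * likelihood
    total = sum(scores.values())
    return sorted(((name, s / total) for name, s in scores.items()),
                  key=lambda pair: -pair[1])

for name, p in posterior_ranking([(1, 2), (2, 4), (3, 6)]):
    print(f"{name:8s} {p:.4f}")              # "*2" explains every pair and dominates
```

For the pairs (1, 2), (2, 4), (3, 6), only the doubling program matches every output, so it takes essentially all of the posterior mass.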

