Bayesian Statistics

Neural Network Gaussian Process

The neural network Gaussian process (NNGP) result establishes that a neural network with i.i.d. random weights converges in distribution to a Gaussian process as the number of hidden units tends to infinity, revealing a deep connection between deep learning and Bayesian nonparametrics.

f(x) → GP(0, K(x, x′)) as width → ∞

One of the most striking theoretical results at the intersection of deep learning and Bayesian statistics is that a single-hidden-layer neural network with random weights, as its width tends to infinity, defines a function drawn from a Gaussian process. This observation, first made by Radford Neal in 1994, transforms the study of neural networks from a purely optimization-driven enterprise into one amenable to the full toolkit of Bayesian inference — priors, posteriors, marginal likelihoods, and principled uncertainty quantification.

Neal's Foundational Insight

Consider a single-hidden-layer network with H hidden units:

Single-Hidden-Layer Network
f(x) = b + Σₕ₌₁ᴴ vₕ · φ(wₕᵀx + bₕ)

Conditions
Hidden-layer weights wₕ and biases bₕ drawn i.i.d. from suitable priors; output weights vₕ ~ N(0, σ²ᵥ / H)

Central Limit Argument
As H → ∞, f(x) → GP(0, K(x, x′)) by the central limit theorem

Each hidden unit contributes an independent random term to the sum. By the central limit theorem, the finite-dimensional distributions of f(x) become jointly Gaussian as the number of hidden units grows. The resulting GP kernel K(x, x′) depends on the activation function φ and the variance of the weight prior. For the error-function activation, Williams later derived the kernel in closed form; for ReLU, the closed-form arc-cosine kernel was obtained by Cho and Saul.
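
The central limit argument is easy to check empirically. The following NumPy sketch (the function name, the ReLU choice, and the variance conventions are illustrative, not part of the original result) samples many random single-hidden-layer networks and prints the empirical covariance of their outputs at a few fixed inputs as the width H grows:

```python
import numpy as np

def sample_network_outputs(X, H, n_samples=2000, sigma_w2=1.0, sigma_v2=1.0, seed=0):
    """Sample f(x) = sum_h v_h * relu(w_h^T x + b_h) for many random networks.

    Returns an (n_samples, n_inputs) array; the output bias b is omitted
    for simplicity. Output weights follow v_h ~ N(0, sigma_v2 / H).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    outs = np.empty((n_samples, n))
    for s in range(n_samples):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / d), size=(d, H))  # input weights w_h
        b = rng.normal(0.0, np.sqrt(sigma_w2), size=H)           # hidden biases b_h
        v = rng.normal(0.0, np.sqrt(sigma_v2 / H), size=H)       # output weights v_h
        outs[s] = np.maximum(X @ W + b, 0.0) @ v                 # ReLU hidden layer
    return outs

X = np.random.default_rng(1).normal(size=(3, 5))  # three inputs in 5 dimensions
for H in (10, 100, 1000):
    cov = np.cov(sample_network_outputs(X, H), rowvar=False)
    print(H, np.round(cov, 3))  # empirical covariance of (f(x1), f(x2), f(x3))
```

The covariance matrices change less and less as H increases, which is the finite-dimensional Gaussian convergence the theorem describes.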

1994

Radford Neal, in his doctoral work at the University of Toronto, proves that Bayesian neural networks with one hidden layer converge to GPs in the infinite-width limit and advocates MCMC inference for finite networks.

1998

Christopher Williams derives the explicit GP kernel for networks with sigmoidal activations, making the NNGP correspondence computationally usable.

2018

Lee, Bahri, Novak, Schoenholz, Pennington, and Sohl-Dickstein extend the correspondence to deep networks, showing that the GP kernel can be computed layer by layer through a recursive formula. The Neural Tangents library, released by an overlapping group of authors shortly afterward, makes NNGP inference practical.

2018–2020

The Neural Tangent Kernel (NTK) theory of Jacot, Gabriel, and Hongler shows that gradient-descent training of infinitely wide networks is described by a related kernel, connecting optimization dynamics to GP regression.

Deep NNGP Kernels

For deep networks with L layers, the NNGP kernel is computed recursively. Let K⁽⁰⁾(x, x′) be the input inner product (plus bias variance). At each layer l, the kernel is obtained by passing the previous layer's kernel through an "expected activation" integral that depends on the nonlinearity. For ReLU networks, this yields:

ReLU NNGP Kernel Recursion
K⁽ˡ⁾(x, x′) = (σ²_w / 2π) · √(K⁽ˡ⁻¹⁾(x,x) · K⁽ˡ⁻¹⁾(x′,x′)) · (sin θ + (π − θ) cos θ)
where θ = arccos( K⁽ˡ⁻¹⁾(x,x′) / √(K⁽ˡ⁻¹⁾(x,x) · K⁽ˡ⁻¹⁾(x′,x′)) )
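
A minimal NumPy sketch of this recursion follows. The function name, the layer-0 scaling, and the bias-variance term σ²_b added at each layer are illustrative conventions, not fixed by the formula above:

```python
import numpy as np

def relu_nngp_kernel(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """Deep ReLU NNGP cross-kernel K^(depth)(x, x') via the arc-cosine recursion."""
    d = X1.shape[1]
    # Layer-0 kernel: scaled input inner products plus bias variance.
    K12 = sigma_w2 * (X1 @ X2.T) / d + sigma_b2
    K11 = sigma_w2 * np.sum(X1**2, axis=1) / d + sigma_b2   # diagonal K(x, x)
    K22 = sigma_w2 * np.sum(X2**2, axis=1) / d + sigma_b2   # diagonal K(x', x')

    for _ in range(depth):
        norm = np.sqrt(np.outer(K11, K22))
        cos_theta = np.clip(K12 / norm, -1.0, 1.0)          # guard against round-off
        theta = np.arccos(cos_theta)
        K12 = (sigma_w2 / (2 * np.pi)) * norm * (np.sin(theta) + (np.pi - theta) * cos_theta) + sigma_b2
        # Setting theta = 0 in the same formula gives the diagonal recursion.
        K11 = (sigma_w2 / 2) * K11 + sigma_b2
        K22 = (sigma_w2 / 2) * K22 + sigma_b2
    return K12
```

With σ²_w = 2 and small σ²_b, the diagonal K⁽ˡ⁾(x,x) stays roughly constant across layers; other choices make it grow or shrink geometrically with depth.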

As depth increases, these kernels tend to become degenerate: the correlation between any two distinct inputs is driven toward a fixed point, so the kernel retains little information beyond whether two inputs are identical. This pathology, related to the so-called "edge of chaos" phenomenon in deep network initialization, motivates careful tuning of weight and bias variances to maintain expressive kernels at depth.
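
The flattening is visible in a simple consequence of the bias-free recursion above: normalizing the kernel by its diagonal cancels σ²_w and turns the layer update into a scalar map on the correlation ρ between two inputs. The short sketch below (the function name is illustrative) iterates that map and shows every starting correlation creeping toward the fixed point at 1:

```python
import numpy as np

def relu_correlation_map(rho, depth):
    """Iterate the depth-wise correlation update implied by the bias-free ReLU recursion."""
    for _ in range(depth):
        theta = np.arccos(np.clip(rho, -1.0, 1.0))
        # Dividing K^(l)(x, x') by its diagonal leaves this sigma_w^2-free update.
        rho = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    return rho

print([round(relu_correlation_map(0.2, L), 3) for L in (1, 5, 20, 100)])
# the iterates approach 1, so very deep kernels barely distinguish distinct inputs
```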

NNGP vs. Neural Tangent Kernel

The NNGP describes the distribution over functions of an infinitely wide network at initialization, and exact Bayesian inference under this prior yields the NNGP posterior. The Neural Tangent Kernel (NTK) instead describes the function obtained by training all layers with gradient descent at an infinitesimal learning rate; the result is kernel regression with the NTK rather than a Bayesian posterior. When only the output layer is trained, the two kernels coincide and gradient descent recovers the NNGP posterior mean. For finite-width networks trained with SGD, however, the two can diverge, and neither fully captures the rich "feature learning" regime that makes deep learning powerful in practice.

Practical Implications

The NNGP correspondence provides several practical benefits. First, it offers a baseline: if an infinitely wide network performs similarly to a finite one, the finite network may not be leveraging its depth effectively. Second, NNGP inference is exact — no gradient descent, no hyperparameter tuning of learning rates — making it a useful sanity check. Third, the theory guides network initialization: weight variances that produce well-behaved GP kernels also produce well-behaved gradient flows during training.
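
The "exact inference" point amounts to standard GP regression with the NNGP kernel. A minimal sketch follows, assuming the kernel matrices are built with a recursion like the one above; the function name and the noise parameter are illustrative:

```python
import numpy as np

def nngp_regression(K_train, K_cross, K_test_diag, y_train, noise_var=1e-2):
    """Closed-form GP posterior mean/variance from a precomputed NNGP kernel.

    K_train:     (n, n) kernel between training inputs
    K_cross:     (n, m) kernel between training and test inputs
    K_test_diag: (m,)   kernel diagonal at the test inputs
    """
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))      # stable solve
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_cross.T @ alpha                                     # posterior mean
    v = np.linalg.solve(L, K_cross)
    var = K_test_diag - np.sum(v**2, axis=0)                     # posterior variance
    return mean, var
```

No gradient descent is involved: a single linear solve yields both predictions and calibrated uncertainties, which is what makes the NNGP a convenient baseline.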

Empirically, NNGP and NTK models are competitive with trained neural networks on small-to-medium datasets but fall behind on large-scale tasks like ImageNet, suggesting that finite-width feature learning provides benefits beyond what the infinite-width limit captures.

"The infinite-width limit of a neural network is a Gaussian process. This is either the most beautiful or the most damning thing you can say about neural networks, depending on your perspective." — Radford Neal, reflecting on his 1994 result

Connections to Bayesian Deep Learning

The NNGP theory provides a rigorous foundation for Bayesian deep learning. It shows that placing priors on network weights and biases induces a well-defined function-space prior — a GP — which can be reasoned about independently of the parameterization. This function-space perspective has influenced the development of practical approximate inference methods, including variational inference for deep GPs and the use of GP priors as regularizers in modern architectures.
