A kernel function k(x, x′) takes two inputs and returns a real number measuring their "similarity" in a sense relevant to the problem at hand. In the context of Bayesian nonparametric modeling — particularly Gaussian processes — the kernel serves as the covariance function of a prior distribution over functions. The choice of kernel encodes all prior assumptions about the function to be learned: its smoothness, its typical amplitude, its characteristic length-scales, whether it exhibits periodicity, and how it behaves at large distances.
Properties of Valid Kernels
A function k is a valid kernel if and only if the Gram matrix K with entries Kᵢⱼ = k(xᵢ, xⱼ) is symmetric and positive semi-definite for every finite set of inputs. Mercer's theorem establishes that any such kernel corresponds to an inner product in some (possibly infinite-dimensional) feature space: k(x, x′) = ⟨φ(x), φ(x′)⟩. This is the "kernel trick": computations that depend only on inner products can be carried out as if in the high-dimensional feature space without ever computing the feature mapping φ explicitly.
Common Kernel Functions
Squared Exponential: k(x,x′) = σ² exp(−‖x−x′‖² / (2ℓ²))
Matérn-ν: k(r) = σ² · (2^{1−ν}/Γ(ν)) · (√(2ν)r/ℓ)^ν · K_ν(√(2ν)r/ℓ)
Periodic: k(x,x′) = σ² exp(−2 sin²(π|x−x′|/p) / ℓ²)
Linear: k(x,x′) = σ² · (x−c)ᵀ(x′−c)
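As a concrete illustration, here is a minimal NumPy/SciPy sketch of the four kernels listed above together with the finite-Gram-matrix check of validity. The function and parameter names (sigma2, ell, p, c, nu) are chosen here for readability and do not come from any particular library.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv   # kv: modified Bessel function of the second kind

def squared_exponential(X, Z, sigma2=1.0, ell=1.0):
    return sigma2 * np.exp(-cdist(X, Z, "sqeuclidean") / (2 * ell**2))

def matern(X, Z, sigma2=1.0, ell=1.0, nu=1.5):
    r = cdist(X, Z, "euclidean")
    z = np.sqrt(2 * nu) * np.maximum(r, 1e-12) / ell     # avoid evaluating K_nu at 0
    K = sigma2 * (2 ** (1 - nu) / gamma(nu)) * z**nu * kv(nu, z)
    K[r == 0.0] = sigma2                                 # limiting value as r -> 0
    return K

def periodic(X, Z, sigma2=1.0, ell=1.0, p=1.0):
    r = cdist(X, Z, "euclidean")                         # |x - x'| for 1-D inputs
    return sigma2 * np.exp(-2 * np.sin(np.pi * r / p) ** 2 / ell**2)

def linear(X, Z, sigma2=1.0, c=0.0):
    return sigma2 * (X - c) @ (Z - c).T

# Validity: the Gram matrix on any finite set of inputs must be positive
# semi-definite (up to round-off), which we can spot-check via its eigenvalues.
X = np.random.default_rng(0).uniform(-3.0, 3.0, size=(50, 1))
for k in (squared_exponential, matern, periodic, linear):
    smallest = np.linalg.eigvalsh(k(X, X)).min()
    print(f"{k.__name__:20s} smallest eigenvalue: {smallest:.2e}")
```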
A Brief History
1909: James Mercer proves his theorem on the eigenfunction expansion of positive-definite kernels, laying the mathematical foundation for kernel methods.
1950–1962: Aronszajn develops the theory of reproducing kernel Hilbert spaces (RKHS); Parzen connects kernels to density estimation. The mathematical infrastructure for kernel methods matures.
1992–1996: Boser, Guyon, and Vapnik introduce the support vector machine (SVM), popularizing the kernel trick for classification. In the years that follow, Neal and Williams develop the GP perspective, connecting kernels to Bayesian priors.
2006: Rasmussen and Williams publish Gaussian Processes for Machine Learning, providing a comprehensive treatment of kernels as covariance functions in the Bayesian nonparametric setting.
Kernels as Bayesian Priors
In GP regression, placing a GP prior f ~ GP(0, k) over an unknown function is equivalent to specifying a prior distribution over functions whose sample paths have regularity determined by k. The squared exponential kernel produces infinitely differentiable sample paths. The Matérn-ν kernel produces paths that are ⌈ν⌉−1 times differentiable — the Matérn-1/2 gives Ornstein-Uhlenbeck (continuous but rough) paths, Matérn-3/2 gives once-differentiable paths, and as ν → ∞ the Matérn converges to the squared exponential. The choice of kernel thus encodes a meaningful scientific prior about the expected regularity of the underlying process.
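A short sketch of how the kernel controls sample-path roughness: it draws prior samples f ~ N(0, K) for the squared exponential and the Matérn-1/2 (Ornstein-Uhlenbeck) kernels, whose closed forms are written out directly below. The length-scale and jitter values are illustrative choices, not canonical settings.

```python
import numpy as np

# Compare prior sample paths under two kernels with simple closed forms:
# squared exponential (infinitely differentiable draws) versus
# Matern-1/2, the Ornstein-Uhlenbeck kernel (continuous but rough draws).
x = np.linspace(0.0, 5.0, 200)
r = np.abs(x[:, None] - x[None, :])          # pairwise distances |x - x'|
ell = 0.5                                    # illustrative length-scale
kernels = {
    "squared exponential": np.exp(-r**2 / (2 * ell**2)),
    "Matern-1/2 (Ornstein-Uhlenbeck)": np.exp(-r / ell),
}

rng = np.random.default_rng(1)
for name, K in kernels.items():
    L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # jitter for numerical stability
    draws = L @ rng.standard_normal((len(x), 3))        # three draws from GP(0, k)
    print(name, draws.shape)
```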
Kernels can be combined to build more expressive covariance functions. The sum of two valid kernels is valid (modeling additive contributions). The product of two valid kernels is valid (modeling interactions). A kernel applied to a transformed input is valid. This compositional algebra enables practitioners to construct kernels encoding complex prior knowledge: a sum of a periodic kernel and a squared exponential captures a signal with both periodic and smooth aperiodic components; a product of a linear kernel and a periodic kernel captures periodic patterns whose amplitude grows linearly with time. The "automatic statistician" project of Duvenaud, Lloyd, Grosse, Tenenbaum, and Ghahramani uses this compositionality to search over kernel structures, interpreting the resulting models in natural language.
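The compositional algebra translates directly into code. The sketch below, with helper names invented for this example and restricted to one-dimensional inputs, builds the two composite kernels just described as sums and products of closures.

```python
import numpy as np

# Sums and products of valid kernels are valid, so simple closures over
# 1-D base kernels are enough to build structured covariances.
def se(ell=1.0):
    return lambda X, Z: np.exp(-(X[:, None] - Z[None, :])**2 / (2 * ell**2))

def periodic(p=1.0, ell=1.0):
    return lambda X, Z: np.exp(-2 * np.sin(np.pi * np.abs(X[:, None] - Z[None, :]) / p)**2 / ell**2)

def linear(c=0.0):
    return lambda X, Z: (X[:, None] - c) * (Z[None, :] - c)

def add(k1, k2):
    return lambda X, Z: k1(X, Z) + k2(X, Z)

def mul(k1, k2):
    return lambda X, Z: k1(X, Z) * k2(X, Z)

# Periodic signal riding on a smooth aperiodic trend.
quasi_periodic = add(periodic(p=1.0, ell=1.0), se(ell=2.0))
# Periodic pattern whose amplitude grows linearly with the input.
growing_seasonal = mul(linear(c=0.0), periodic(p=1.0, ell=1.0))

X = np.linspace(0.0, 4.0, 50)
K = quasi_periodic(X, X)                     # still a valid (PSD) Gram matrix
print(np.linalg.eigvalsh(K).min() > -1e-8)
```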
Hyperparameter Learning
The kernel hyperparameters — length-scales ℓ, signal variance σ², periodicity p, and smoothness ν — control the prior over functions. In the Bayesian framework, these can be set by maximizing the marginal likelihood (type-II maximum likelihood or empirical Bayes), which automatically balances data fit against model complexity. Alternatively, a fully Bayesian treatment places priors on hyperparameters and marginalizes over them using MCMC or variational inference, producing richer uncertainty estimates.
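A minimal sketch of type-II maximum likelihood for a one-dimensional GP with a squared exponential kernel and Gaussian observation noise. Hyperparameters are optimized in log-space to keep them positive; the synthetic data, initial values, and jitter are illustrative assumptions rather than recommended defaults.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """-log p(y | X, theta) for a zero-mean GP with an SE kernel plus noise."""
    ell, sigma2, noise = np.exp(log_theta)                 # log-space keeps them positive
    r2 = (X[:, None] - X[None, :]) ** 2
    K = sigma2 * np.exp(-r2 / (2 * ell**2)) + (noise + 1e-8) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y via Cholesky
    # 0.5 y^T K^{-1} y + 0.5 log|K| + (n/2) log(2*pi)
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(-3.0, 3.0, 40))
y = np.sin(X) + 0.1 * rng.standard_normal(40)

# Type-II maximum likelihood: optimize the hyperparameters directly.
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
ell_hat, sigma2_hat, noise_hat = np.exp(result.x)
print(ell_hat, sigma2_hat, noise_hat)
```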
Beyond Euclidean Inputs
Kernels can be defined on structured inputs: strings (string kernels), graphs (graph kernels), sets, distributions, and manifolds. This flexibility allows GP models to be applied to molecules (predicting drug activity from molecular graphs), text (modeling documents as distributions over words), and spatial data on the sphere (climate modeling on Earth's surface). Each kernel encodes a domain-specific notion of similarity, and the GP posterior provides calibrated uncertainty regardless of the input type.
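As one concrete example of a kernel on structured inputs, the sketch below implements a simple n-gram (spectrum) string kernel: an inner product of n-gram count vectors, and therefore positive semi-definite by construction. The function name and example strings are illustrative, not drawn from a particular library.

```python
from collections import Counter

def spectrum_kernel(s, t, n=3):
    """Count-based n-gram (spectrum) string kernel: k(s, t) = <phi(s), phi(t)>,
    where phi maps a string to its vector of n-gram counts."""
    grams_s = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    grams_t = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(grams_s[g] * grams_t[g] for g in grams_s.keys() & grams_t.keys())

print(spectrum_kernel("gaussian process", "gaussian prior"))
```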
"The kernel is the soul of a Gaussian process. Choose the kernel well and the model captures the structure of the problem; choose poorly and no amount of data will save you." — Carl Edward Rasmussen, Gaussian Processes for Machine Learning (2006)