Arnold Zellner introduced the g-prior in 1986 as an elegant solution to a persistent problem in Bayesian linear regression: how to specify a prior covariance matrix for the regression coefficients β when little is known about their joint structure. The key insight is to borrow the covariance structure from the data itself — specifically, from the matrix (XᵀX)⁻¹ that appears in the ordinary least squares estimator — and control the prior's strength through a single scalar parameter g.
The result is a prior that respects the correlation structure inherent in the predictors, scales naturally with the measurement units, and reduces the entire prior specification to a single number. This combination of parsimony and structural coherence has made the g-prior one of the most widely used priors in Bayesian model selection and variable selection for linear regression.
In its most common form, the prior centers the coefficients at zero and takes

β | σ² ~ N(0, g σ² (XᵀX)⁻¹)

Where X → n × p design matrix
g → scalar shrinkage parameter (g > 0)
σ² → error variance (often given its own prior, typically π(σ²) ∝ 1/σ²)
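To make the construction concrete, here is a minimal NumPy sketch that builds this prior covariance from a simulated design matrix and draws coefficient vectors from it; the dimensions and the g and σ² values are purely illustrative.

```python
import numpy as np

# Sketch only: simulated design matrix and illustrative g, sigma^2 values.
rng = np.random.default_rng(0)
n, p, g, sigma2 = 50, 3, 50, 1.0                  # g = n is the "unit information" choice
X = rng.normal(size=(n, p))                       # stand-in for a real design matrix

prior_cov = g * sigma2 * np.linalg.inv(X.T @ X)   # g * sigma^2 * (X'X)^(-1)
prior_draws = rng.multivariate_normal(np.zeros(p), prior_cov, size=2000)

# The empirical covariance of the draws should approximate prior_cov.
print(np.cov(prior_draws, rowvar=False).round(2))
print(prior_cov.round(2))
```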
The Role of g
The parameter g controls the trade-off between the prior and the data. The posterior mean under the g-prior is a shrinkage estimator:

E[β | y] = [g / (1 + g)] × β̂_OLS

Where β̂_OLS = (XᵀX)⁻¹ Xᵀy (the ordinary least squares estimate)
When g → ∞, the prior becomes diffuse and the posterior mean converges to the OLS estimate — the data speak for themselves. When g → 0, the posterior mean shrinks toward zero — the prior dominates. The factor g/(g+1) is the shrinkage coefficient, and it appears throughout the g-prior's analytics. In model selection, g also governs the Bayes factor's automatic penalty for model complexity.
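The shrinkage formula is easy to check numerically. The sketch below (simulated data, zero-centered prior, σ² held fixed) computes the conditional posterior mean and covariance and shows the posterior mean approaching the OLS estimate as g grows.

```python
import numpy as np

def g_prior_posterior(X, y, g, sigma2):
    """Posterior of beta under a zero-mean g-prior, conditional on sigma^2:
    mean = [g/(1+g)] * beta_OLS, covariance = [g/(1+g)] * sigma^2 * (X'X)^(-1)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y          # ordinary least squares estimate
    shrink = g / (1.0 + g)                # shrinkage factor g/(g+1)
    return shrink * beta_ols, shrink * sigma2 * XtX_inv

# Simulated example: as g grows, the posterior mean approaches beta_OLS.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(size=40)
for g in (1, 40, 1000):
    mean, _ = g_prior_posterior(X, y, g, sigma2=1.0)
    print(f"g = {g:4d}: posterior mean = {np.round(mean, 3)}")
```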
Popular choices include g = n (the "unit information prior," which concentrates prior information equivalent to one observation), g = p² (recommended by some for large model spaces), and g = max(n, p²). The hyper-g prior of Liang et al. (2008) places a Beta prior on g/(g+1), yielding a mixture that is robust to the choice of g. The empirical Bayes approach sets g to maximize the marginal likelihood. Each choice implies different behavior in model selection consistency and prediction.
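As a rough illustration, the snippet below tabulates the shrinkage factor implied by the fixed-g rules above for a hypothetical problem size, and adds the closed-form empirical Bayes estimate ĝ = max(F − 1, 0), which maximizes the marginal likelihood given in the next section (F is the regression's overall F-statistic); the n, p, and R² values are hypothetical.

```python
# Hypothetical problem size and fit; replace with real values.
n, p, R2 = 100, 5, 0.45

for name, g in [("unit information, g = n", n),
                ("g = p^2", p**2),
                ("g = max(n, p^2)", max(n, p**2))]:
    print(f"{name:26s}  g = {g:5d}  shrinkage g/(g+1) = {g / (g + 1):.3f}")

# Empirical Bayes choice (Liang et al. 2008, "local" version): g-hat = max(F - 1, 0),
# where F is the overall F-statistic of the regression.
F = (R2 / p) / ((1 - R2) / (n - 1 - p))
print(f"empirical Bayes g-hat = {max(F - 1.0, 0.0):.2f}")
```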
Marginal Likelihood and Model Selection
One of the g-prior's greatest strengths is that the marginal likelihood — the quantity needed for Bayes factors and posterior model probabilities — is available in closed form. After integrating out β and σ², the marginal likelihood for a model Mₖ with pₖ predictors and design matrix Xₖ takes a compact expression involving only g, n, pₖ, and the coefficient of determination R²ₖ:

p(y | Mₖ) ∝ (1 + g)^(−pₖ/2) × [1 − (g/(1 + g)) R²ₖ]^(−(n−1)/2)

(up to a factor common to all models under comparison).
This expression reveals the Bayesian Occam's razor at work. The first factor (1 + g)^(−pₖ/2) penalizes model complexity — models with more parameters (larger pₖ) are penalized more heavily. The second factor rewards models that fit the data well (high R²ₖ). The balance between these two forces is controlled by g.
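Because the expression is closed form, comparing models reduces to a few lines of code. A sketch using the formula above (equivalently, the Bayes factor against the intercept-only model) with hypothetical n, pₖ, and R²ₖ values:

```python
import numpy as np

def log_marginal(R2, n, p, g):
    """Log of (1 + g)^(-p/2) * [1 - (g/(1+g)) * R2]^(-(n-1)/2), i.e. the
    g-prior marginal likelihood up to a factor common to all models."""
    return -0.5 * p * np.log1p(g) - 0.5 * (n - 1) * np.log(1.0 - (g / (1.0 + g)) * R2)

# Hypothetical comparison with n = 50 and the unit information choice g = n:
# a 3-predictor model fits slightly worse than a 6-predictor model but wins,
# because the complexity penalty outweighs the small gain in R^2.
n, g = 50, 50
print(log_marginal(0.60, n, p=3, g=g))   # ≈ 15.8
print(log_marginal(0.62, n, p=6, g=g))   # ≈ 11.1
```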
Extensions and Variants
The basic g-prior has inspired numerous extensions. The mixtures of g-priors framework places a hyperprior on g, avoiding the need to fix a single value. Liang, Paulo, Molina, Clyde, and Berger (2008) showed that several popular hyperpriors — including the hyper-g, hyper-g/n, and Zellner-Siow priors — lead to closed-form or easily computable marginal likelihoods while satisfying desirable theoretical properties like model selection consistency.
The Zellner-Siow prior uses an inverse-gamma hyperprior on g, g ~ InvGamma(1/2, n/2), and is considered one of the best default priors for Bayesian variable selection. It does not have a fully closed-form marginal likelihood, but the one-dimensional integral over g is easily computed numerically.
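A sketch of that computation, assuming the same R²-based Bayes factor as above and using scipy for the quadrature; production code would work on the log scale for numerical stability.

```python
import numpy as np
from scipy import integrate, stats

def zellner_siow_log_bf(R2, n, p):
    """Log Bayes factor against the intercept-only model under the
    Zellner-Siow prior: the fixed-g Bayes factor averaged over
    g ~ InvGamma(1/2, n/2), computed by one-dimensional quadrature."""
    def integrand(g):
        log_bf_g = (-0.5 * p * np.log1p(g)
                    - 0.5 * (n - 1) * np.log(1.0 - (g / (1.0 + g)) * R2))
        return np.exp(log_bf_g) * stats.invgamma.pdf(g, a=0.5, scale=n / 2.0)
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return np.log(value)

# Same hypothetical models as above: the smaller model again comes out ahead.
print(zellner_siow_log_bf(0.60, n=50, p=3))
print(zellner_siow_log_bf(0.62, n=50, p=6))
```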
Why (XᵀX)⁻¹?
Using (XᵀX)⁻¹ as the prior covariance means that the prior "knows about" the correlations among predictors. If two predictors are highly correlated, the prior accounts for this, spreading uncertainty along the direction of collinearity. This avoids the pathologies that arise from using a diagonal prior (like ridge regression's identity matrix), which ignores predictor correlations and can produce implausible prior predictions when predictors are measured on very different scales.
Geometrically, (XᵀX)⁻¹ is proportional to the sampling covariance of the least squares estimator, so the prior's uncertainty ellipsoid has the same shape and orientation as the likelihood's contours, differing only in scale. This alignment is what produces the clean shrinkage formula and the closed-form marginal likelihood.
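A small simulated illustration of this point: with two nearly collinear predictors, the g-prior's largest prior variance falls along the contrast β₁ − β₂, the direction the data identify worst, which an identity (ridge-style) prior would treat no differently from any other direction.

```python
import numpy as np

# Two nearly collinear predictors (simulated, illustrative only).
rng = np.random.default_rng(2)
n, g, sigma2 = 200, 200, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)           # x2 is almost a copy of x1
X = np.column_stack([x1, x2])

prior_cov = g * sigma2 * np.linalg.inv(X.T @ X)
eigvals, eigvecs = np.linalg.eigh(prior_cov)  # eigenvalues in ascending order

# The eigenvector with the largest prior variance is roughly proportional to
# (1, -1): the prior is most diffuse along beta1 - beta2, exactly where the
# data are least informative. A diagonal ridge-style prior ignores this.
print(np.round(eigvecs[:, -1], 3), np.round(eigvals, 1))
```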
"By using the data matrix in the prior, we incorporate the experimental design information and achieve a prior that adapts automatically to the geometry of the problem." — Arnold Zellner, On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions (1986)
Historical Note
1986: Arnold Zellner introduces the g-prior in a paper that emphasizes its computational tractability and natural scaling properties.
1995: Kass and Wasserman study the "unit information prior" (g = n) as a general-purpose default prior approximation.
2008: Liang, Paulo, Molina, Clyde, and Berger systematically study mixtures of g-priors, establishing theoretical properties and computational methods for Bayesian variable selection.
Example: Simple Regression with Three g Values
Consider 20 data points from a study of advertising spend (x, in thousands) versus monthly revenue (y, in thousands). The OLS regression gives slope β̂ = 2.0 and R² = 0.95. How does the choice of g affect posterior inference?
Posterior slope = [g / (1 + g)] × β̂_OLS
g = 20 (= n): Shrinkage = 20/21 ≈ 0.952 → Posterior slope ≈ 1.905
g = 100: Shrinkage = 100/101 ≈ 0.990 → Posterior slope ≈ 1.980
g = 1000: Shrinkage = 1000/1001 ≈ 0.999 → Posterior slope ≈ 1.998
The "unit information prior" (g = n = 20) applies the most shrinkage, pulling the slope about 5% toward zero. Larger g values produce posteriors closer to the OLS estimate. The marginal likelihood (Bayes factor) automatically penalizes overly large g — with strong data (high R²), g = n is often preferred because it represents the information in a single observation, balancing parsimony with fit. This single-parameter control over regularization strength is the g-prior's central appeal.