
Bayesian Regret

Bayesian regret measures the excess integrated risk of a decision rule compared to the Bayes-optimal rule, quantifying the price paid for using a suboptimal procedure when the parameter is drawn from a specified prior distribution.

BayesRegret(δ) = R(δ, π) − R(δ_Bayes, π) = ∫ R(θ, δ) π(θ) dθ − inf_δ′ ∫ R(θ, δ′) π(θ) dθ

Bayesian regret is the difference between the integrated risk of a decision rule δ and the minimum possible integrated risk (the Bayes risk) achievable by the optimal Bayes rule under a given prior π. It answers the question: how much worse is my procedure than the best I could possibly do if I committed to full Bayesian analysis with prior π?

A Bayesian regret of zero means the rule is Bayes-optimal. Positive Bayesian regret measures the cost of suboptimality, and rules with smaller regret are preferred. The concept is fundamental to decision theory, bandit problems, online learning, and the evaluation of statistical procedures.

Bayesian Regret:   BayesRegret(δ, π)  =  R(δ, π) − R(δ_Bayes, π)

where R(δ, π)  =  ∫ R(θ, δ) · π(θ) dθ   (integrated risk of δ)
      R(δ_Bayes, π)  =  inf_δ′ R(δ′, π)   (Bayes risk = minimum integrated risk)
      R(θ, δ)  =  E_{Y|θ}[L(θ, δ(Y))]   (frequentist risk)
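To make the definitions concrete, here is a minimal Monte Carlo sketch in a hypothetical normal-normal model (prior θ ~ N(0, τ²), data Y | θ ~ N(θ, σ²), squared-error loss; these modeling choices are illustrative and not taken from the article). The rule that reports Y unchanged pays a Bayesian regret equal to the gap between its integrated risk and the Bayes risk of the posterior-mean rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conjugate setup: theta ~ N(0, tau^2), Y | theta ~ N(theta, sigma^2),
# squared-error loss L(theta, d) = (theta - d)^2.
tau2, sigma2 = 1.0, 1.0
n_draws = 1_000_000

theta = rng.normal(0.0, np.sqrt(tau2), n_draws)         # theta drawn from the prior
y = theta + rng.normal(0.0, np.sqrt(sigma2), n_draws)   # data drawn given theta

def integrated_risk(delta):
    """Monte Carlo estimate of R(delta, pi) = E_pi E_{Y|theta} L(theta, delta(Y))."""
    return float(np.mean((delta(y) - theta) ** 2))

shrink = tau2 / (tau2 + sigma2)               # posterior-mean shrinkage factor
bayes_rule = lambda obs: shrink * obs         # Bayes rule: posterior mean of theta
naive_rule = lambda obs: obs                  # suboptimal rule: report Y directly

bayes_risk = tau2 * sigma2 / (tau2 + sigma2)  # closed-form Bayes risk (= 0.5 here)
print("integrated risk of naive rule:", integrated_risk(naive_rule))   # ~1.0
print("integrated risk of Bayes rule:", integrated_risk(bayes_rule))   # ~0.5
print("Bayesian regret of naive rule:", integrated_risk(naive_rule) - bayes_risk)
```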

Connection to Admissibility and Efficiency

Bayesian regret is closely tied to Bayesian efficiency: a rule with zero Bayesian regret is Bayes-optimal and has efficiency 1, while a rule with large Bayesian regret has low Bayesian efficiency. An inadmissible rule is dominated by some other rule, so for every prior its Bayesian regret is at least that of the dominating rule, and strictly larger whenever the prior puts mass where the improvement occurs. By Wald's complete class theorem, under regularity conditions every admissible rule is a Bayes rule or a limit of Bayes rules.

For admissible rules that are not Bayes with respect to any proper prior (such as minimax rules obtained as limits of Bayes rules), the Bayesian regret may be positive for every specific prior yet tend to zero along a suitable sequence of priors. When a least favorable prior exists, one that maximizes the Bayes risk, the minimax rule is the Bayes rule against it; otherwise the minimax rule is characterized as a limit of Bayes rules along a sequence of increasingly unfavorable priors.

Bayesian Regret in Multi-Armed Bandits

The concept of Bayesian regret has found its most extensive modern application in the multi-armed bandit problem. An agent must repeatedly choose among K actions (arms), each with an unknown reward distribution. The Bayesian regret is the expected cumulative loss compared to always pulling the best arm, where the expectation is taken over both the randomness of rewards and the prior distribution of arm parameters.

Bandit Bayesian Regret:   BR(T)  =  E_π E_{rewards}[ Σₜ₌₁ᵀ (μ* − μ_{Aₜ}) ]

where μ* = max_k μₖ is the best arm's mean, Aₜ is the arm chosen at time t, and the outer expectation is over the prior π.

Thompson sampling (posterior sampling), proposed by William R. Thompson in 1933, achieves near-optimal Bayesian regret, BR(T) = O(√(KT log K)), within a logarithmic factor of the minimax rate √(KT). This makes it one of the most efficient bandit algorithms, and its Bayesian foundation provides a natural uncertainty-directed exploration strategy. At each round, Thompson sampling draws a parameter from the posterior of each arm and plays the arm with the highest sampled value, a beautifully simple procedure that automatically balances exploration and exploitation.
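The procedure just described is short enough to state as code. Below is a minimal sketch for Bernoulli rewards with Beta(1, 1) priors; the arm means, horizon, and function name are made up for illustration. Averaging the returned regret over arm means drawn from the (uniform) prior approximates the Bayesian regret BR(T).

```python
import numpy as np

def thompson_sampling(true_means, T, rng):
    """Beta-Bernoulli Thompson sampling; returns cumulative (pseudo-)regret over T rounds."""
    K = len(true_means)
    alpha = np.ones(K)                          # Beta(1, 1) priors: 1 + successes
    beta = np.ones(K)                           #                    1 + failures
    best = max(true_means)
    regret = 0.0
    for _ in range(T):
        draws = rng.beta(alpha, beta)           # one posterior sample per arm
        a = int(np.argmax(draws))               # play the arm with the largest sample
        reward = rng.random() < true_means[a]   # Bernoulli reward
        alpha[a] += reward
        beta[a] += 1 - reward
        regret += best - true_means[a]          # gap paid for this round's choice
    return regret

rng = np.random.default_rng(0)
# One run with hypothetical arm means; averaging over arm means drawn from the
# uniform prior (the outer expectation E_pi) approximates BR(T).
print(thompson_sampling([0.78, 0.38, 0.88], T=25, rng=rng))
print(np.mean([thompson_sampling(rng.random(3), T=25, rng=rng) for _ in range(2000)]))
```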

Bayesian vs. Frequentist Regret

Bayesian regret averages over the prior, while frequentist regret (or worst-case regret) takes the supremum over all parameter values. The minimax approach minimizes the worst-case regret; the Bayesian approach minimizes the average-case regret. These lead to different optimal strategies: minimax strategies tend to be more conservative (exploring more), while Bayesian strategies can be more aggressive when the prior is informative. The distinction matters most when the prior is a poor model of the true environment. In practice, Thompson sampling often outperforms algorithms designed for worst-case guarantees, such as UCB, suggesting that Bayesian priors, even when approximate, capture useful structure.
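To see the two criteria pull in different directions, consider the same hypothetical normal model as in the sketch above: the shrinkage (Bayes) rule c·Y has smaller prior-averaged risk, but its frequentist risk grows with θ², while the unbiased rule d(Y) = Y has constant risk σ² and is the worst-case-safe choice. A small sketch, with values chosen for illustration:

```python
# Frequentist risks in the hypothetical normal model above (sigma^2 = tau^2 = 1):
# the shrinkage rule c*Y beats d(Y) = Y on average under the prior, but is worse
# in the worst case because its risk grows with theta^2.
sigma2, tau2 = 1.0, 1.0
c = tau2 / (tau2 + sigma2)

def risk_shrink(theta):
    """Exact frequentist risk of the rule c*Y at a fixed theta."""
    return c**2 * sigma2 + (1 - c)**2 * theta**2

for theta in (0.0, 1.0, 2.0, 4.0):
    print(theta, risk_shrink(theta))            # 0.25, 0.5, 1.25, 4.25 ...
print("constant risk of d(Y) = Y:", sigma2)     # 1.0 for every theta
```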

Regret Decomposition

Bayesian regret can often be decomposed into interpretable components. In the estimation setting:

(1) Excess bias risk: The cost of systematic errors introduced by the rule (e.g., using an estimator with bias).

(2) Excess variance risk: The cost of additional variability beyond the Bayes-optimal variance.

(3) Structural cost: The cost of using a restricted class of rules (e.g., linear rules when the optimal rule is nonlinear).

In the bandit setting, regret decomposes into an exploration cost (information gathering) and an exploitation loss (suboptimal actions taken). The Bayesian framework provides the natural allocation: explore more when the posterior is uncertain, exploit when it is concentrated.
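One standard way to make this concrete is the per-arm identity: cumulative regret equals the sum over arms of the gap Δₖ = μ* − μₖ times the number of pulls Nₖ(T), so pulls of suboptimal arms are exactly the exploration cost being paid. A minimal sketch with hypothetical means and pull counts:

```python
# Per-arm regret decomposition: regret = sum_k (gap of arm k) x (pulls of arm k).
# Arm means and pull counts are made up for illustration.
true_means = {"A": 0.70, "B": 0.40, "C": 0.90}
pulls      = {"A": 6,    "B": 3,    "C": 16}    # N_k(T) after T = 25 rounds

best = max(true_means.values())
per_arm = {k: (best - true_means[k]) * n for k, n in pulls.items()}
print(per_arm)                # contribution of each arm to the total regret
print(sum(per_arm.values()))  # total cumulative regret over the 25 rounds
```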

Applications Beyond Bandits

Bayesian regret arises in clinical trial design (the expected number of patients assigned to inferior treatments), portfolio allocation (excess return lost compared to the optimal portfolio), Bayesian experimental design (the expected information loss from a suboptimal design), and reinforcement learning (the cumulative cost of learning a policy). In each case, minimizing Bayesian regret is equivalent to finding the Bayes-optimal strategy.

"Regret is the appropriate measure of the cost of learning. The Bayesian formulation of regret is natural because it averages over the unknown environment, reflecting the agent's genuine uncertainty." — Shipra Agrawal and Navin Goyal, Advances in Neural Information Processing Systems (2012)

Worked Example: Three-Armed Bandit with Thompson Sampling

A website tests three button colors (A, B, C) for click-through rate. After 25 impressions allocated by Thompson Sampling, we compute the realized cumulative regret, using the observed click-through rates as stand-ins for the unknown true means.

Given:
Arm A: 9 pulls, 7 successes → observed mean = 0.778
Arm B: 8 pulls, 3 successes → observed mean = 0.375
Arm C: 8 pulls, 7 successes → observed mean = 0.875

Step 1: Posteriors (Beta-Bernoulli), using a Beta(1, 1) prior for every arm
A: Beta(1+7, 1+2) = Beta(8, 3), E[θ_A] = 0.727
B: Beta(1+3, 1+5) = Beta(4, 6), E[θ_B] = 0.400
C: Beta(1+7, 1+1) = Beta(8, 2), E[θ_C] = 0.800

Step 2: Cumulative regret, taking the best arm as C (observed mean 0.875)
Arm A: (0.875 − 0.778) × 9 pulls ≈ 0.875
Arm B: (0.875 − 0.375) × 8 pulls = 4.000
Arm C: (0.875 − 0.875) × 8 pulls = 0.000
Total cumulative regret = 0.875 + 4.000 + 0.000 = 4.875
Average regret per pull = 4.875/25 = 0.195
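The arithmetic above can be reproduced with a few lines of code, assuming the same Beta(1, 1) priors and, as in the example, treating the observed click-through rates as stand-ins for the unknown true means:

```python
# Reproduces the worked example: Beta(1, 1) priors, observed means as proxies.
pulls     = {"A": 9, "B": 8, "C": 8}
successes = {"A": 7, "B": 3, "C": 7}

# Step 1: Beta-Bernoulli posteriors and their means.
post = {k: (1 + successes[k], 1 + pulls[k] - successes[k]) for k in pulls}
post_mean = {k: a / (a + b) for k, (a, b) in post.items()}
print(post)       # {'A': (8, 3), 'B': (4, 6), 'C': (8, 2)}
print(post_mean)  # {'A': ≈0.727, 'B': 0.4, 'C': 0.8}

# Step 2: cumulative regret against the best observed click-through rate.
obs_mean = {k: successes[k] / pulls[k] for k in pulls}
best = max(obs_mean.values())                       # arm C: 0.875
regret = {k: (best - obs_mean[k]) * pulls[k] for k in pulls}
total = sum(regret.values())
print(total, total / sum(pulls.values()))           # ≈4.875, ≈0.195
```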

The cumulative regret of 4.875 over 25 pulls reflects the exploration cost, most of it coming from the 8 pulls on arm B (the worst arm). Thompson Sampling's Bayesian updating naturally reduces pulls on B as its posterior concentrates on lower values. For comparison, a theoretical Bayesian regret bound for Thompson Sampling is of order √(K·n·log n); ignoring constants, K = 3 and n = 25 give roughly √(3·25·3.2) ≈ 15.5, comfortably above the regret actually incurred here (the logarithmic factor differs across analyses, as with the √(KT log K) rate quoted earlier).

Interactive Calculator

Each row has an arm (label) and a reward (numeric). The calculator computes the cumulative regret: the difference between the reward obtained and what would have been obtained by always pulling the best arm. It also reports the Thompson Sampling posterior (Beta-Bernoulli) alongside the observed per-arm means.

