The Principle of Maximum Entropy (MaxEnt) provides a systematic method for constructing prior distributions and probability assignments from incomplete information. Given a set of constraints — typically expectations of known functions — MaxEnt selects the unique distribution that satisfies those constraints while maximizing the Shannon entropy. The resulting distribution is, in a precise sense, the least informative distribution consistent with what is known. It makes no assumptions beyond the stated constraints and is therefore maximally honest about the limits of one's knowledge.
Jaynes developed this principle throughout the 1950s and 1960s, drawing on Shannon's information theory and the statistical mechanics tradition of Gibbs and Boltzmann. He argued that MaxEnt is not merely a useful heuristic but the uniquely rational method for probability assignment in the face of incomplete information — a claim that remains both influential and debated.
Maximum Entropy Problem
Maximize: H(p) = −Σᵢ p(xᵢ) log p(xᵢ)
Subject to: Σᵢ p(xᵢ) = 1
Σᵢ p(xᵢ) · fₖ(xᵢ) = Fₖ for k = 1, …, m
Solution (Gibbs Form)
p*(xᵢ) = (1/Z) · exp(−Σₖ λₖ · fₖ(xᵢ))
where Z = Σᵢ exp(−Σₖ λₖ · fₖ(xᵢ)) is the partition function
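To make the Gibbs form concrete, here is a minimal numerical sketch in Python (NumPy and SciPy assumed) of a single-constraint case, a version of Jaynes' Brandeis dice problem: the outcomes {1, …, 6} and the target mean of 4.5 are illustrative choices, and the code solves for the Lagrange multiplier λ that makes the Gibbs-form distribution reproduce the prescribed mean.

import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)   # outcomes: the faces of a die
F = 4.5               # prescribed mean E[x] (illustrative constraint value)

def gibbs(lam):
    """Gibbs-form distribution p(x) = exp(-lam * x) / Z for the constraint f(x) = x."""
    w = np.exp(-lam * x)
    return w / w.sum()

def mean_gap(lam):
    """Difference between the mean under gibbs(lam) and the target F."""
    return gibbs(lam) @ x - F

# Find the multiplier that satisfies the mean constraint (it is negative here,
# since F = 4.5 exceeds the unconstrained uniform mean of 3.5)
lam = brentq(mean_gap, -5.0, 5.0)
p = gibbs(lam)

print("lambda  =", round(lam, 4))
print("p       =", np.round(p, 4))                          # tilted toward high faces
print("mean    =", round(float(p @ x), 4))                  # ~4.5
print("entropy =", round(float(-(p * np.log(p)).sum()), 4))

Any distribution over {1, …, 6} with mean 4.5 has entropy no greater than the value printed here; that is the sense in which the Gibbs form is maximally non-committal.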
Key Results
The MaxEnt principle yields many familiar distributions as special cases. With no constraints beyond normalization, it gives the uniform distribution — reflecting complete ignorance. Constraining only the mean E[x] = μ yields the exponential distribution (for continuous non-negative x) or the geometric distribution (for non-negative integer x). Constraining both the mean and variance gives the Normal (Gaussian) distribution (for real-valued x). Constraining the mean of the logarithm gives the power-law (Pareto) distribution (for x bounded below by some xₘᵢₙ > 0). Each of these well-known distributions is thus revealed as the maximally non-committal distribution consistent with specific knowledge.
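As a rough numerical illustration of the exponential case (SciPy assumed; the common mean of 2.5 and the comparison families are arbitrary choices), the sketch below tunes several non-negative distributions to the same mean and compares their differential entropies; the exponential comes out highest, as the MaxEnt result predicts.

import math
from scipy import stats

mean = 2.5  # common mean shared by all candidates (illustrative value)

# A few distributions supported on [0, inf), each parameterized to have mean 2.5
candidates = {
    "Exponential": stats.expon(scale=mean),
    "Gamma(shape=2)": stats.gamma(a=2.0, scale=mean / 2.0),
    "Lognormal(sigma=0.5)": stats.lognorm(s=0.5, scale=mean * math.exp(-0.125)),
}

# The exponential should report the largest differential entropy: it is the
# MaxEnt distribution on the non-negative reals for a fixed mean.
for name, dist in candidates.items():
    print(f"{name:20s} mean = {float(dist.mean()):.3f}   entropy = {float(dist.entropy()):.3f} nats")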
Jaynes showed that the Boltzmann distribution of statistical mechanics — p(state) ∝ exp(−E/kT) — is precisely the MaxEnt distribution subject to a single constraint fixing the average energy, ⟨E⟩ = U. The temperature enters through the Lagrange multiplier β = 1/kT attached to that constraint. This derivation makes no assumptions about ergodicity, ensembles, or equal a priori probabilities. It replaces the traditional foundations of statistical mechanics with a purely inferential argument, reinterpreting the Boltzmann distribution as a statement about knowledge rather than physical dynamics.
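In outline, the derivation is a two-constraint instance of the problem stated above (a sketch, with β the multiplier attached to the energy constraint and Eᵢ the energy of state i):

\[
\begin{aligned}
\max_{p}\; & -\sum_i p_i \ln p_i
\quad\text{subject to}\quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = U \\
\Rightarrow\; & p_i^{*} = \frac{e^{-\beta E_i}}{Z(\beta)}, \qquad
Z(\beta) = \sum_i e^{-\beta E_i}, \qquad
U = -\frac{\partial \ln Z}{\partial \beta}.
\end{aligned}
\]

Setting β = 1/kT recovers p(state) ∝ exp(−E/kT); the value of β, and hence the temperature, is fixed entirely by the requirement that the average energy equal U.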
Axiomatic Foundations
Shore and Johnson (1980) provided an axiomatic foundation for MaxEnt. They showed that any method of probability assignment satisfying four basic consistency axioms — subset independence, system independence, coordinate invariance, and a uniqueness requirement — must be the MaxEnt method. This result gives MaxEnt the status of a theorem rather than a heuristic: if you accept the axioms, you are logically committed to maximum entropy.
The axioms ensure that the method gives consistent results regardless of how the problem is decomposed, that constraints on independent systems factor correctly, and that the result does not depend on how the outcome space is labeled. These are minimal requirements for any rational method of probability assignment, and MaxEnt is the unique method satisfying them.
Continuous Maximum Entropy
For continuous distributions, Shannon entropy is replaced by differential entropy, and the MaxEnt principle requires a reference measure m(x) to avoid dependence on the choice of coordinates:
Continuous MaxEnt as Minimum KL Divergence
p* = argmin_p D_KL(p ‖ m) subject to constraints
The reference measure m(x) plays the role of a "default" prior in the absence of any constraints. When m is uniform and the constraints specify moments, the solution reduces to the exponential family. The continuous MaxEnt framework thus provides a bridge between information theory and exponential family theory, unifying two major pillars of statistical methodology.
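Written out (a sketch, with the constraint functions fₖ and values Fₖ as in the discrete statement above), the continuous problem and its solution keep the same Gibbs form, now weighted by the reference measure:

\[
p^{*} = \arg\max_{p}\; -\int p(x)\,\ln\frac{p(x)}{m(x)}\,dx
\quad\text{subject to}\quad \int p(x)\,dx = 1, \qquad \int p(x)\,f_k(x)\,dx = F_k ,
\]
\[
p^{*}(x) = \frac{m(x)\,\exp\!\left(-\sum_k \lambda_k f_k(x)\right)}{Z},
\qquad Z = \int m(x)\,\exp\!\left(-\sum_k \lambda_k f_k(x)\right) dx .
\]

Maximizing this relative entropy is exactly the minimization of D_KL(p ‖ m) stated above.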
Applications in Bayesian Statistics
MaxEnt provides a principled method for constructing prior distributions. When an analyst knows certain properties of a parameter — its range, mean, variance, or other moments — but lacks specific knowledge of its distribution, the MaxEnt prior is the rational default. For example, if a parameter is known to be positive with mean μ, the MaxEnt prior is Exponential(1/μ). If it is real-valued with known mean and variance, the MaxEnt prior is Normal.
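As a small sketch of this recipe (SciPy assumed; the parameter names and moment values below are invented for illustration), the two priors just mentioned can be written down directly from the stated moments:

from scipy import stats

# Stated information about two hypothetical parameters
rate_mean = 2.5                       # theta1: positive, mean known, nothing else
effect_mean, effect_sd = 0.0, 1.5     # theta2: real-valued, mean and variance known

# MaxEnt prior for a positive parameter with known mean: Exponential with rate 1/mean
prior_theta1 = stats.expon(scale=rate_mean)

# MaxEnt prior for a real-valued parameter with known mean and variance: Normal
prior_theta2 = stats.norm(loc=effect_mean, scale=effect_sd)

# Sanity check: each prior reproduces exactly the stated moments and nothing more
print(prior_theta1.mean(), prior_theta1.std())   # 2.5, 2.5
print(prior_theta2.mean(), prior_theta2.std())   # 0.0, 1.5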
This approach has been advocated by Jaynes and his followers as a systematic alternative to ad hoc prior selection. It provides a rigorous answer to the question "What prior should I use?" that depends only on explicitly stated information and a transparent optimization principle.
Criticisms and Limitations
Critics of MaxEnt argue that the choice of constraints is itself subjective — different analysts may identify different relevant constraints, leading to different MaxEnt distributions. Seidenfeld (1986) argued that MaxEnt can violate conditionalization in some settings, producing priors that are not coherently updatable. Others have noted that MaxEnt is sensitive to the choice of reference measure in the continuous case, reintroducing a form of the arbitrariness it was designed to eliminate.
Defenders respond that these criticisms apply to any method of prior selection, and that MaxEnt at least makes the constraints and assumptions transparent. The principle does not claim to eliminate judgment — it claims to discipline it, ensuring that no information beyond the stated constraints is smuggled in.
"The maximum entropy distribution is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have." — E. T. Jaynes, Information Theory and Statistical Mechanics (1957)
Timeline
1948: Claude Shannon publishes "A Mathematical Theory of Communication," defining information entropy and establishing the foundation for MaxEnt.
1957: E. T. Jaynes publishes two landmark papers connecting maximum entropy to statistical mechanics and probability theory, launching the MaxEnt program.
1980: Shore and Johnson prove that MaxEnt is the unique probability assignment method satisfying basic consistency axioms.
2003: Jaynes' posthumous Probability Theory: The Logic of Science is published, presenting his mature synthesis of MaxEnt, Bayesian inference, and the foundations of probability.
Example: Deriving the MaxEnt Distribution
You measure the waiting times (in minutes) of 25 customers at a service desk. The sample mean is μ = 2.5 and the sample variance is σ² = 0.36. Which distribution should you use as a model?
(a) Constraint: X known only to lie in an interval of width 2.0: MaxEnt → Uniform on that interval
Entropy = ln(2.0) ≈ 0.693 nats
(b) Constraint E[X] = 2.5: MaxEnt → Exponential(λ = 1/2.5 = 0.4)
Entropy = 1 + ln(2.5) ≈ 1.916 nats
(c) Constraints E[X] = 2.5, Var[X] = 0.36: MaxEnt → Normal(μ = 2.5, σ² = 0.36)
Entropy = ½ ln(2πe × 0.36) ≈ 0.908 nats
Each additional constraint reduces the set of compatible distributions. The Uniform is maximally uncertain given only a range. The Exponential is the least informative distribution given a known mean (over non-negative reals). The Normal is the least informative distribution given both a mean and variance. MaxEnt does not "choose" the Normal because it fits the data best — it chooses the Normal because, among all distributions with mean 2.5 and variance 0.36, the Normal makes the fewest additional assumptions. This is precisely Jaynes' insight: maximum entropy is maximum honesty about what you don't know.
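The three entropy values can be checked numerically (SciPy assumed; the Uniform case is given an interval of width 2.0 to match ln 2 above, with an arbitrary location):

from scipy import stats

# (a) Uniform over an interval of width 2.0 (entropy depends only on the width)
h_uniform = stats.uniform(loc=1.5, scale=2.0).entropy()

# (b) Exponential with mean 2.5
h_expon = stats.expon(scale=2.5).entropy()

# (c) Normal with mean 2.5 and variance 0.36 (standard deviation 0.6)
h_normal = stats.norm(loc=2.5, scale=0.6).entropy()

print(f"Uniform:     {float(h_uniform):.3f} nats")   # ln(2)               ~ 0.693
print(f"Exponential: {float(h_expon):.3f} nats")     # 1 + ln(2.5)         ~ 1.916
print(f"Normal:      {float(h_normal):.3f} nats")    # 0.5*ln(2*pi*e*0.36) ~ 0.908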