The Amelioration of Uncertainty

The Intellectual Oubliette of p-values and Confidence Intervals.

Criticizing p-values and Confidence Intervals (CI’s) is nothing new, but my aim is different. My goal is to tear down, shred, burn, and destroy them using arguments you probably haven’t seen before, aided by Entropy and its various mathematical properties. Let the fun begin.

The usual procedure for modeling the “data generation mechanism” is to observe some data $x_1, \dots, x_n$ and match their histogram to a common distribution like the Uniform, Normal, Binomial, Gamma, Exponential, Chi-squared, Beta, Dirichlet, Bernoulli, and so on. These are Exponential Family distributions, which have sufficient statistics and take a form similar to this single-parameter example:

(1)   $p(x \mid \lambda) = \dfrac{e^{-\lambda G(x)}}{Z(\lambda)}$

Assuming IID data, the Maximum Likelihood Principle implies $\hat{\lambda}$ satisfies:

(2)   $E_{\hat{\lambda}}[G(X)] = \dfrac{1}{n} \sum_{i=1}^{n} G(x_i)$

If there’s a good enough fit between the data and $p(x \mid \hat{\lambda})$, statisticians believe they’ve captured the “data generation mechanism”.
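To make (1) and (2) concrete, here’s a small numerical sketch of my own (the Exponential distribution, the sample size, and every variable name are my choices, not the post’s): maximizing the likelihood of $p(x \mid \lambda) = \lambda e^{-\lambda x}$, whose sufficient statistic is $G(x) = x$, lands exactly where the model’s expected $G$ matches the data’s average $G$.

```python
import math
import random

random.seed(0)

# Hypothetical example (not from the post): fit p(x|lam) = lam * exp(-lam*x),
# an Exponential Family member with sufficient statistic G(x) = x.
data = [random.expovariate(2.0) for _ in range(10_000)]
n, S = len(data), sum(data)  # S = sum of G(x_i), the sufficient statistic

def log_likelihood(lam):
    # log prod_i lam*exp(-lam*x_i) = n*log(lam) - lam*S
    return n * math.log(lam) - lam * S

# Crude grid search for the maximum-likelihood value of lam.
grid = [i / 1000 for i in range(1, 10_000)]
lam_hat = max(grid, key=log_likelihood)

# The moment condition: at the maximum, the model's expected G
# equals the data average of G, to within grid precision.
model_mean = 1.0 / lam_hat  # E[G(X)] under p(x|lam_hat)
sample_mean = S / n         # (1/n) * sum_i G(x_i)
print(round(model_mean, 3), round(sample_mean, 3))
```

The grid search is deliberately naive; the point is only that the maximizer is characterized by moment matching, not by any particular optimizer.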

Now forget your training for a moment and look at the problem anew. Given a function $G$ and data, we can compute the value

(3)   $\bar{G} = \dfrac{1}{n} \sum_{i=1}^{n} G(x_i)$

which can then be used to maximize the entropy $H[p] = -\int p(x) \ln p(x)\, dx$ subject to the constraint

$\displaystyle \int p(x)\, G(x)\, dx = \bar{G}$

The result is the same $p(x \mid \lambda) = e^{-\lambda G(x)}/Z(\lambda)$, with $\lambda$ being a Lagrange multiplier that turns out to satisfy the same equation (2)!
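Here’s a quick numerical check of the max-entropy route, using Jaynes-style dice as an assumed example (the support $\{1,\dots,6\}$, the constraint value 4.5, and the bisection solver are my choices): maximizing entropy subject to a mean constraint yields the Gibbs form $e^{-\lambda x}/Z(\lambda)$, and the multiplier $\lambda$ is pinned down by exactly the moment-matching condition (2).

```python
import math

# Assumed setup: maximize H[p] over the support {1,...,6} with G(x) = x,
# subject to sum_x p(x)*G(x) = G_bar. The solution is p(x) = exp(-lam*x)/Z,
# and lam is found by matching the constraint -- the same condition as (2).
support = [1, 2, 3, 4, 5, 6]
G_bar = 4.5  # a constraint value computed from data, as in (3)

def mean_under(lam):
    w = [math.exp(-lam * x) for x in support]
    Z = sum(w)
    return sum(x * wx for x, wx in zip(support, w)) / Z

# mean_under is strictly decreasing in lam, so bisection finds the multiplier.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) > G_bar:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
print(round(lam, 4), round(mean_under(lam), 4))
```

Note that nothing “random” was assumed anywhere: the distribution falls out of a constraint plus counting.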

This puts everything in a new light. Those assumptions (distributional form + IID + Maximum Likelihood) aren’t needed and don’t embody some mystical properties of “randomness”. The Entropy Concentration Theorem shows that most new data satisfying the same constraint (3) will have a histogram that looks like $p(x \mid \lambda)$. So the coverage properties CI’s supposedly have depend almost entirely on whether new data satisfy the same concrete relationship (3) among themselves. This observation brings up three points:

First, new data rarely satisfy the same conditions again. Unless something forces (3) to hold, it’s very unlikely to remain true in the future. There are examples of such physical forcing: conservation of energy provides one, and simulations provide another. At my son’s school they evenly apportion abilities to each class, so test averages come out roughly the same for every class. But generally new data satisfy different constraints, or different $\bar{G}$’s, and CI’s have significantly greater or smaller coverage than advertised. An example comes from financial markets, where price correlations, which amount to constraints of the form (3), stubbornly refuse to have the same “$\bar{G}$” this year as last(*).
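A small simulation of my own devising illustrates the coverage point (the Normal model, sample size, and size of the shift are all assumptions for illustration): a textbook 95% interval is calibrated only against future data obeying the same constraint; shift the constraint by a quarter of a standard deviation and the advertised coverage evaporates.

```python
import random
import statistics

random.seed(2)

# Hypothetical illustration: a 95% CI for a mean, checked against replications
# that either do or do not satisfy the same constraint (the same mean).
TRUE_MEAN = 10.0

def ci95(xs):
    half = 1.96 * statistics.stdev(xs) / len(xs) ** 0.5
    m = statistics.mean(xs)
    return m - half, m + half

def coverage(data_mean, trials=2000, n=50):
    hits = 0
    for _ in range(trials):
        xs = [random.gauss(data_mean, 2.0) for _ in range(n)]
        lo, hi = ci95(xs)
        hits += lo <= TRUE_MEAN <= hi
    return hits / trials

c_same = coverage(10.0)     # future data satisfy the old constraint
c_shifted = coverage(10.5)  # constraint drifted by a quarter of a sd
print(c_same, c_shifted)
```

With the constraint intact, coverage sits near the advertised 0.95; with the modest drift, it falls far below it.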

Second, you learn very little by modeling the “data generation mechanism” this way. Data can always be explained by some $p(x \mid \lambda)$, since you can make $p(x \mid \lambda)$ fit the data as well as you’d like by enlarging the set of constraint functions $\{G_1, \dots, G_k\}$. That set isn’t even close to being unique either. There’s no telling from this process which constraints, if any, will hold in the future. Even when new data satisfy the same constraints, you haven’t learned much, since any cause leading to (3) will likely produce the same $p(x \mid \lambda)$, preventing you from distinguishing among the possible causes.
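To see the non-uniqueness concretely, here’s a sketch under my own assumed setup (dice support, skewed data, two candidate constraint functions): fitting $e^{-\lambda G(x)}/Z(\lambda)$ with $G(x) = x$ versus $G(x) = \ln x$ gives two different “explanations” of the same data, each reproducing its own statistic (3) exactly, with nothing in the procedure saying which constraint, if either, will hold in the future.

```python
import math
import random

random.seed(3)

# Assumed example: the same data, two different constraint functions G.
# Each max-entropy fit exp(-lam*G(x))/Z matches its own constraint exactly,
# yet the two fitted distributions differ.
data = [random.choice((1, 6)) for _ in range(5000)]
support = list(range(1, 7))

def maxent_fit(G):
    G_bar = sum(G(x) for x in data) / len(data)  # the constraint value (3)
    def dist(lam):
        w = [math.exp(-lam * G(x)) for x in support]
        Z = sum(w)
        return [wx / Z for wx in w]
    # E[G] under dist(lam) is strictly decreasing in lam, so bisect.
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        m = sum(G(x) * p for x, p in zip(support, dist(mid)))
        lo, hi = (mid, hi) if m > G_bar else (lo, mid)
    return dist((lo + hi) / 2), G_bar

p1, g1 = maxent_fit(lambda x: x)            # constrain the mean of x
p2, g2 = maxent_fit(lambda x: math.log(x))  # constrain the mean of log x
print([round(p, 3) for p in p1])
print([round(p, 3) for p in p2])
```

Both fits are flawless by their own standard, and they disagree, which is exactly the point.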

Third, CI coverage/calibration isn’t what’s needed. The unshakable Frequentist belief to the contrary reminds me of the (fabled?) priest who refused to view Jupiter’s moons through Galileo’s telescope because his philosophy told him they were impossible. If you look through the Bayesian telescope you’ll see examples like this one, or the one given toward the bottom of this post. The real goal is to make best guesses by counting over states compatible with our knowledge. Even when real frequencies $f$ are your concern, the best way to handle them is with Bayesian distributions $P(f)$.

Statistics is easier than statisticians are making it. Given a generic function $G(x_1, \dots, x_n)$ which is largely insensitive to the particular $x_i$’s, you can accurately guess $G$ in most cases just by knowing the region in which the $x_i$’s lie. The successful use of a $P(x)$ covering the actual $x$’s to get a good interval estimate of $G$ doesn’t mean future $x$’s will have a frequency distribution similar to $P(x)$. They don’t even have to be remotely related. Hell, future $x$’s needn’t even be possible.
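A minimal sketch of this insensitivity, with the cube $[0,1]^n$ and the mean as my assumed example: the aggregation $G(x_1, \dots, x_n) = \frac{1}{n}\sum x_i$ takes essentially the same value at almost every point of the cube, so knowing only the region pins down $G$ without any distributional story about future $x$’s.

```python
import random

random.seed(4)

# Assumed example: G = mean of n numbers from [0,1]. G is largely insensitive
# to WHICH point of the cube [0,1]^n you are at, so you can guess G(x) ~ 0.5
# accurately knowing only that x lies somewhere in the cube.
n = 100_000
guess = 0.5  # the value G takes at "almost all" points of the cube
means = []
for _ in range(50):
    x = [random.random() for _ in range(n)]  # 50 arbitrary points of the cube
    means.append(sum(x) / n)
spread = max(means) - min(means)
print(round(spread, 5))
```

Every sampled point of the cube yields nearly the same $G$, which is why the interval guess succeeds; no claim about the frequency distribution of future points is involved.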

It’s not properties of “random processes” we’re exploiting; it’s properties of “aggregation” functions like $G$. The Entropy Concentration Theorem shows the mapping from data to frequency distributions is one such aggregation, but it’s just a special case. Statisticians misunderstood this and then tried to force-fit all inference into that one example, thereby trapping themselves in an intellectual oubliette of their own making.
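The concentration claim itself can be checked numerically. This simulation is my own construction (sequence length, acceptance window, and solver are all assumptions): among uniformly generated dice sequences that happen to satisfy the same constraint (3), the pooled histogram sits near the max-entropy distribution rather than near the uniform distribution the sequences were drawn from.

```python
import math
import random

random.seed(1)

# Generate dice sequences uniformly and keep only those whose sample mean
# lands near 4.5, i.e. those satisfying a common constraint of form (3).
n, trials = 20, 200_000
accepted = []
for _ in range(trials):
    seq = random.choices((1, 2, 3, 4, 5, 6), k=n)
    if 88 <= sum(seq) <= 92:  # sample mean between 4.4 and 4.6
        accepted.extend(seq)

counts = [accepted.count(x) / len(accepted) for x in range(1, 7)]

# Max-entropy distribution exp(-lam*x)/Z matched to the accepted data's mean.
G_bar = sum(accepted) / len(accepted)

def tilted(lam):
    w = [math.exp(-lam * x) for x in range(1, 7)]
    Z = sum(w)
    return [wx / Z for wx in w]

lo, hi = -10.0, 10.0  # E[x] under tilted(lam) decreases in lam, so bisect
for _ in range(100):
    mid = (lo + hi) / 2
    m = sum(x * p for x, p in zip(range(1, 7), tilted(mid)))
    lo, hi = (mid, hi) if m > G_bar else (lo, mid)
maxent = tilted((lo + hi) / 2)

print([round(c, 3) for c in counts])
print([round(p, 3) for p in maxent])
```

The two printed histograms nearly coincide, even though nothing about the generating process was “really” the tilted distribution; the constraint did all the work.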


(*) Of course you can model changing constraint values $\bar{G}$. Unfortunately, Frequentists do so in practice by repeating the same mistake on $\bar{G}$ that they’re making on the $x_i$’s.

UPDATE: If the facts at the beginning of this post are unfamiliar, then a good place to start is Information Theory and Statistics, written by Solomon Kullback in the 1950s. Kullback was a Frequentist working at a time when Bayesian statistics was at a low point, and he wasn’t aware of the Entropy Concentration Theorem. It’s crazy just how big a chunk of statistics Kullback was able to unify and motivate using the mathematics of entropy. That should give any Frequentist pause for thought.

August 27, 2013
  • August 30, 2013, Brendon J. Brewer:

    “Dramatization of a statistician trapped in an oubliette”

    LOL! That made my day.
