The Amelioration of Uncertainty

## The Intellectual Oubliette of p-values and Confidence Intervals

Criticizing p-values and Confidence Intervals (CI’s) is nothing new, but my aim is different. My goal is to tear down, shred, burn, and destroy them using arguments you probably haven’t seen before, aided by Entropy and its various mathematical properties. Let the fun begin.

The usual procedure for modeling the “data generation mechanism” is to observe some data and match their histogram to a common distribution like the Uniform, Normal, Binomial, Gamma, Exponential, Chi-squared, Beta, Dirichlet, Bernoulli, and so on. These are exponential-family distributions, which have sufficient statistics and a form similar to this single-parameter example:

$$p(x \mid \lambda) \;=\; \frac{e^{-\lambda\, G(x)}}{Z(\lambda)} \tag{1}$$

Assuming IID data $x_1, \dots, x_n$, the Maximum Likelihood Principle implies $\lambda$ satisfies:

$$-\frac{\partial}{\partial \lambda} \ln Z(\lambda) \;=\; \frac{1}{n}\sum_{i=1}^{n} G(x_i) \tag{2}$$

If there’s a good enough fit between the data and $p(x \mid \lambda)$, statisticians believe they’ve captured the “data generation mechanism”.
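
As a concrete sketch of that fitting procedure (my example, not from the post): an Exponential model is a one-parameter exponential family with $G(x) = x$, and for it the maximum likelihood condition reduces to matching the sample mean.

```python
import random

# Hypothetical example: fit an Exponential(lam) model by maximum likelihood.
# For p(x|lam) = lam * exp(-lam * x), i.e. G(x) = x and Z(lam) = 1/lam,
# the MLE condition (2) says E_lam[X] = sample mean, so lam_hat = 1 / mean.
random.seed(0)
data = [random.expovariate(2.0) for _ in range(10_000)]

g_bar = sum(data) / len(data)   # the sufficient statistic: average of G(x) = x
lam_hat = 1.0 / g_bar

print(lam_hat)                  # close to the true rate 2.0 used above
```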

Now forget your training for a moment and look at the problem anew. Given a function $G$ and data $x_1, \dots, x_n$, we can compute a value

$$\bar{G} \;=\; \frac{1}{n}\sum_{i=1}^{n} G(x_i) \tag{3}$$

which can then be used to maximize the entropy subject to the constraint $E[G(X)] = \bar{G}$.

The result is the same distribution (1), with $\lambda$ being a Lagrange multiplier that turns out to satisfy the same equation (2)!
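
A minimal numerical sketch of that coincidence, assuming the same $G(x) = x$ on $x \ge 0$ as in the example above: solving the maximum entropy constraint for the Lagrange multiplier lands on exactly the value the MLE gives.

```python
# For G(x) = x on x >= 0, the maximum entropy distribution is
# p(x) = exp(-lam*x) / Z(lam) with Z(lam) = 1/lam.  The Lagrange
# multiplier solves -d/dlam ln Z(lam) = G_bar, which is equation (2).

def mean_under_lam(lam):
    # -d/dlam ln Z(lam) = d/dlam ln(lam) = 1/lam
    return 1.0 / lam

def solve_multiplier(g_bar, lo=1e-6, hi=1e6, iters=100):
    # Bisection on lam; mean_under_lam is decreasing in lam.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_under_lam(mid) > g_bar:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

g_bar = 0.5                  # observed average of G(x) = x
lam = solve_multiplier(g_bar)
print(lam)                   # agrees with the MLE value 1 / g_bar = 2.0
```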

This puts everything in a new light. Those assumptions (distributional form + IID + Maximum Likelihood) aren’t needed and don’t embody some mystical properties of “randomness”. The Entropy Concentration Theorem shows that most new data satisfying the same constraint (3) will look like draws from $p(x \mid \lambda)$. So those coverage properties CI’s supposedly have depend almost entirely on whether new data have the same concrete relationship (3) between themselves. This observation brings up three points:

First, new data rarely satisfy the same conditions again. Unless something is forcing (3) to hold, it’s very unlikely to be true in the future. There are examples of such physical forcing: conservation of energy provides one, and simulations provide another. At my son’s school they evenly apportion abilities to each class, so test averages come out roughly the same for each. But generally new data satisfy different constraints, or the same constraints with different $\bar{G}$’s, and CI’s have significantly greater or smaller coverage than advertised. An example comes from financial markets, where price correlations, which amount to constraints of the form (3), stubbornly refuse to have the same “$\bar{G}$” this year as last(*).
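
A toy simulation of this first point (my construction, not the post’s): a textbook 95% CI for a mean holds its advertised coverage only while new data keep satisfying the old constraint; once the constraint value drifts, coverage collapses.

```python
import random
import statistics

def ci_covers(claimed_mean, draw, n=50, z=1.96):
    # Standard z-based 95% CI for a mean from n samples.
    xs = [draw() for _ in range(n)]
    m = sum(xs) / n
    half = z * statistics.stdev(xs) / n ** 0.5
    return m - half <= claimed_mean <= m + half

random.seed(1)
trials = 2000

# Case 1: future data obey the same constraint (average 0).
same = sum(ci_covers(0.0, lambda: random.gauss(0.0, 1.0))
           for _ in range(trials)) / trials

# Case 2: we still assert "mean 0" but the constraint value has
# drifted -- new data now average 0.5.
drift = sum(ci_covers(0.0, lambda: random.gauss(0.5, 1.0))
            for _ in range(trials)) / trials

print(same, drift)   # roughly 0.95 versus far below the advertised 0.95
```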

Second, you learn very little by modeling the “data generation mechanism” this way. Data can always be explained by some $p(x \mid \lambda)$, since you can make it fit the data as well as you’d like by enlarging the set of constraint functions $\{G_k\}$. That set isn’t even close to being unique either. There’s no telling from this process which constraints, if any, will hold in the future. Even when new data satisfy the same constraints, you haven’t learned much, since any of the causes leading to (3) will likely produce the same $p(x \mid \lambda)$, preventing you from distinguishing among them.
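
One way to see the “fit anything” claim (again a toy construction of mine): on a six-sided die, enlarging the constraint set to normalization plus the first five moments $G_k(x) = x^k$ pins down any observed frequency distribution exactly, so a perfect fit teaches you nothing.

```python
# Observed frequencies for faces 1..6 (made-up numbers for illustration).
freqs = [0.05, 0.10, 0.20, 0.25, 0.15, 0.25]

# Constraint values: moments[k] = average of G_k(x) = x^k, k = 0..5
# (k = 0 is just normalization).
moments = [sum(f * x ** k for x, f in zip(range(1, 7), freqs))
           for k in range(6)]

# Recover the frequencies from the moments alone by solving the
# 6x6 Vandermonde system A p = moments with Gauss-Jordan elimination.
A = [[x ** k for x in range(1, 7)] for k in range(6)]

def solve(A, b):
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

recovered = solve(A, moments)
print(recovered)   # reproduces freqs up to floating-point error
```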

Third, CI coverage/calibration isn’t what’s needed. The unshakable Frequentist belief to the contrary reminds me of the (fabled?) priest who refused to look at Jupiter’s moons through Galileo’s telescope because his philosophy told him they’re impossible. If you look through the Bayesian telescope you’ll see examples like this one, or the one given toward the bottom of this post. The real goal is to make best guesses by counting over states compatible with our knowledge. Even when real frequencies $f$ are your concern, the best way to handle them is with Bayesian distributions $P(f)$.
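
As a sketch of what a Bayesian distribution over a real frequency looks like (my example, using the standard Beta-Binomial setup, which the post doesn’t spell out):

```python
# With a uniform prior over a real frequency f and k successes observed
# in n trials, the Bayesian answer is the whole distribution
# P(f | data) = Beta(k + 1, n - k + 1), not a point estimate plus a
# coverage claim.

k, n = 7, 20
a, b = k + 1, n - k + 1

post_mean = a / (a + b)            # Laplace's rule of succession: 8/22
post_mode = (a - 1) / (a + b - 2)  # the observed frequency k/n = 0.35
print(post_mean, post_mode)
```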

Statistics is easier than statisticians are making it. Given a generic function $H(x_1, \dots, x_n)$ which is largely insensitive to the individual $x_i$’s, you can accurately guess $H$’s value in most cases just by knowing roughly where the $x_i$ lie. The successful use of a covering region to get a good interval estimate of $H$ doesn’t mean future $x$’s will have a frequency distribution similar to the old data’s. They don’t even have to be remotely related. Hell, future $x$’s needn’t even be possible.
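
A quick toy demonstration of that insensitivity (my construction): the average is one such function, since perturbing any single input to an average of $n$ values moves the result by at most the perturbation divided by $n$.

```python
import random

# The average of n inputs is an "aggregation" function: changing one
# x_i by delta changes the average by only delta / n, so its value is
# predictable without any appeal to a "random process".

random.seed(2)
n = 10_000
xs = [random.uniform(0.0, 1.0) for _ in range(n)]
avg = sum(xs) / n

ys = list(xs)
ys[0] = xs[0] + 1.0        # shove one input far outside [0, 1]
avg2 = sum(ys) / n

print(abs(avg2 - avg))     # about 1/n = 0.0001: barely moved
```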

It’s not properties of “random processes” we’re exploiting, it’s properties of “aggregation” functions like $H$. The Entropy Concentration Theorem shows the mapping from data to frequency distributions is one such aggregation, but it’s just a special case. Statisticians misunderstood this and then tried to force-fit all inference into that one example, thereby trapping themselves in an intellectual oubliette of their own making.

(*) Of course you can model changing constraint values $\bar{G}$. Unfortunately, Frequentists do so in practice by repeating the same mistake on the $\bar{G}$’s as they’re making on the $x$’s.

UPDATE: If the facts at the beginning of this post are unfamiliar, then a good place to start is Information Theory and Statistics, written by Solomon Kullback in the 1950s. Kullback was a Frequentist working at a time when Bayesian statistics was at a low point, and he wasn’t aware of the Entropy Concentration Theorem. It’s crazy just how big a chunk of statistics Kullback was able to unify and motivate using the mathematics of entropy. That should give any Frequentist pause for thought.

August 27, 2013