## IID doesn’t mean what you think it does

The post “What do we need to Model?” showed what our goal in modeling errors should be. This one shows how it’s achieved. Assigning a distribution to the fixed parameters (the errors) is like finding a prior; it’s successful whenever the distribution’s high probability region accurately describes where the errors are located in error space.

Fortunately it’s not our job to model the frequency of measurement errors. Statisticians usually conjure up error distributions without physically checking anything, but if they did check, they’d discover error patterns aren’t stable. Errors depend on factors outside of the measuring instrument itself, and those factors change. Even in processes designed to yield stable patterns, they don’t remain stable – just ask anyone making a living in Quality Control.

Our job is merely to pin down the fixed parameters as much as possible. But this brings us to the crux of the problem: What do we really know about the errors?

When last I taught physics lab, the students were told how accurate wooden rulers like the one above were. Better rulers were reasonably accurate to about 1/5 to 1/10 their smallest division. Digital calipers came with a known accuracy from the factory, which was usually the smallest significant figure on display. So given the accuracy and the knowledge that we took careful measurements using calibrated instruments, the one true thing we know is:

(1)

The error vector lies in a hypersphere in error space, for some reasonable radius proportional to the accuracy. Any distribution whose high probability manifold corresponds to this hypersphere would do in practice, but it’s worth considering a maxent approach. Given the state of knowledge (1) we’ll use a distribution which maximizes the entropy constrained by:

(2)

The result is the IID normal (NIID) distribution. The “IID” results from the symmetry evident in (1) and (2). Maximizing the entropy expands the high probability region as much as possible, so even if (2) weren’t symmetric the maxent solution would try to get as close to IID as it could. The form of this distribution in no way implies causal or frequency “independence”; it just provides a big symmetric region needed to locate the errors.
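As a sketch of why a constraint of this kind produces an IID normal, here is the standard maxent calculation. The notation is an assumption, not the post's own: errors $\epsilon_1,\dots,\epsilon_n$, accuracy $\sigma$, and a constraint that fixes the expected mean squared error at $\sigma^2$ (the form the comments below suggest for (2)).

```latex
% Assumed notation: errors \epsilon_1,\dots,\epsilon_n and accuracy \sigma.
% Assumed constraint: the expected mean squared error equals \sigma^2.
\begin{align*}
\text{maximize}\quad & H[P] = -\int P(\epsilon)\,\ln P(\epsilon)\,d^n\epsilon \\
\text{subject to}\quad & \int P(\epsilon)\,d^n\epsilon = 1, \qquad
  \int P(\epsilon)\,\frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2\,d^n\epsilon = \sigma^2 .
\end{align*}
% Setting the variation of the Lagrangian (multiplier \lambda on the
% second constraint) to zero gives
\begin{align*}
P(\epsilon) &\propto \exp\!\Big(-\frac{\lambda}{n}\sum_{i=1}^{n}\epsilon_i^2\Big),
\intertext{a product of identical zero-mean Gaussians, each with variance $n/(2\lambda)$.
The constraint then forces $n/(2\lambda)=\sigma^2$, so}
P(\epsilon) &= \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\Big(-\frac{\epsilon_i^2}{2\sigma^2}\Big).
\end{align*}
```

The product form falls out of the symmetry of the constraint in the $\epsilon_i$; nothing about causal or frequency independence was assumed along the way.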

Since the IID normal model has nothing to do with being “IID” as commonly conceived, it’s easy to find examples that’ll leave a Frequentist’s head spinning. Take these current and future errors provided by a rogue laboratory ruler:

(3)

Their sampling distribution definitely isn’t IID normal, and yet if you use this model you’ll get a 95% Bayesian interval which correctly contains the true value. This success shouldn’t be surprising. However horrendous that model is as a frequency distribution, it is a good Bayesian “prior” which accurately locates the errors. The interval calculation shows that 19 out of every 20 potential error vectors lead to an interval containing the true value. Far from being surprising that it worked, it would only be exceptional if it didn’t!
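As a hedged illustration of this point, the sketch below feeds a deliberately patterned error sequence into the usual interval. Everything concrete in it is made up for the example and is not the post's data in (3): the true value `mu`, the accuracy `sigma`, the sinusoidal drift, and the helper name `interval_95`. The interval is the standard known-sigma one, sample mean plus or minus 1.96 sigma / sqrt(n).

```python
import math

def interval_95(measurements, sigma):
    """95% interval for the true value under the IID normal error model
    with known sigma and a flat prior: mean +/- 1.96 * sigma / sqrt(n)."""
    n = len(measurements)
    center = sum(measurements) / n
    half_width = 1.96 * sigma / math.sqrt(n)
    return center - half_width, center + half_width

mu = 100.0     # hypothetical true length (mm)
sigma = 1.0    # hypothetical ruler accuracy (mm)

# A smooth drift: strongly patterned, nothing like independent normal draws,
# yet every error stays within the stated accuracy.
errors = [0.9 * sigma * math.sin(0.7 * i + 0.3) for i in range(10)]
measurements = [mu + e for e in errors]

a, b = interval_95(measurements, sigma)
covers = a < mu < b
print(f"interval = ({a:.3f}, {b:.3f}), contains mu: {covers}")
```

The interval succeeds here because the drifting errors largely cancel in the average, not because they resemble normal draws.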

This also puts to rest a mystery noticed by every thoughtful statistician since Laplace. To quote from Jaynes (page 198):

In the middle 1950s the writer [Jaynes] heard an after-dinner speech by Professor Willy Feller [Frequentist], in which he roundly denounced the practice of using Gaussian probability distributions for errors, on the grounds that the frequency distributions of real errors are almost never Gaussian. Yet in spite of Feller’s disapproval, we continued to use them, and their ubiquitous success in parameter estimation continued.

NIID assumptions work better than Frequentists expect because the state of knowledge in (1) is far more realistic, modest, knowable, and just plain true than any fantasies about long range frequencies. Anyone who understands that probabilities aren’t frequencies can use (1) to get results inexplicable to those not in the know.

August 15, 2013, Brendon J. Brewer

I completely agree with you about the epistemological status of the probability distribution for the errors.

However, I have two technical issues I’d like to raise. The MaxEnt distribution corresponding to the constraint of (1) (actually, to the constraint of P(the proposition of equation 1) = 1, since MaxEnt applies to constraints on probability distributions) is a uniform distribution inside a 10-dimensional box, not a Gaussian. Also, this all comes from maximising the entropy with respect to a uniform prior (which Jaynes called a “measure”). I think this is often a reasonable starting point but we should acknowledge that it’s there.

BTW are you familiar with the work of Ariel Caticha from SUNY? Some of his stuff is a bit out there but in my opinion he’s more coherent than Jaynes was about the status of MaxEnt. My only complaint about Ariel is that he loves Jeffreys priors. I think they’re cool to think about but not some kind of objective magic.

August 15, 2013, Joseph (author)

Yes, as noted you could use any distribution which captures the same region, including a uniform one over a hypercube. Note those are not strict inequalities in (1), so you might want to use a cube whose side length is some reasonable multiple of the accuracy.

Most people think that assuming a uniform distribution is a statement about frequencies of some kind. In truth assuming a uniform distribution on x means “do a count over x”. That is to say, once you throw a uniform distribution into the mix to be manipulated using the sum rule, product rule, Bayes Theorem, and so on, the effect it has is to do a count over x.

It is always better to think of uniform distributions this way unless you’re dealing with actual frequency distributions. Here we’re trying to assign a distribution to fixed parameters so frequencies don’t come into it.

Thus when we go to maximize the entropy we use a uniform distribution for the measure M because our ultimate goal is to count the number of possibilities for the errors which lead to conditions of the form “the true value lies in (a,b)”. In other words, we use a uniform distribution over error space because we’re interested in counting possibilities over this space. If I had chosen something wildly different for M, I wouldn’t have been able to say:

“The interval calculation shows that 19 out of every 20 potential error vectors lead to an interval containing the true value. Far from being surprising that it worked, it would only be exceptional if it didn’t!”

That assumption is really just a way of telling the mathematical machinery that we want to count error vectors and don’t want to count either functions of those, or count elements of some deeper state space.
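The “count over possibilities” reading can be tried directly. The sketch below draws error vectors uniformly from a hypercube and counts the fraction for which the usual 95% interval contains the true value. The dimension `n`, accuracy `sigma`, cube multiplier `k`, and the function name `fraction_covering` are all assumptions chosen for illustration; the count is uniform over the cube, so the fraction need not come out at exactly 95%.

```python
import math
import random

def fraction_covering(n=10, sigma=1.0, k=1.0, trials=20000, seed=0):
    """Fraction of error vectors, counted uniformly over the cube
    [-k*sigma, k*sigma]^n, for which mean(x) +/- 1.96*sigma/sqrt(n)
    contains the true value."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        errors = [rng.uniform(-k * sigma, k * sigma) for _ in range(n)]
        # The interval contains the true value exactly when the
        # average error is smaller in size than the half-width.
        if abs(sum(errors) / n) < half_width:
            hits += 1
    return hits / trials

frac = fraction_covering()
print(f"fraction of cube leading to a correct interval: {frac:.4f}")
```

With these assumed settings the overwhelming majority of the cube leads to a correct interval; widening the cube through `k` lowers the fraction, which is one way to see what the choice of region is doing.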

August 15, 2013, Daniel Lakeland

There is, in principle, a big difference between a strict inequality and a less strict (approximate) one. If our bound is not quite big enough under the strict inequality, then we eliminate from consideration things that might actually occur. Often there’s some multiplier of the bound that is going to be big enough, at least in practical terms. We know, for example, that measurements coming from a wooden ruler are only ever off by so much.

The normal distribution is a way to maximize entropy while retaining both a scale and an unbounded support. I haven’t thought about it, but I wonder if there’s some elegant way to derive a Gaussian distribution as the standardization of a nonstandard mixture of uniform distributions by putting an elegant mixture distribution over different widths. That would be a very powerful intuitive reason for Gaussians in this framework.

August 15, 2013, Daniel Lakeland

In other words, suppose $w$ is a distance from zero, so that we can put a uniform distribution on $[-w, w]$ whose height is then $1/(2w)$. Let $p(w)$ be some mixture weighting distribution. We seek $p(w)$ so that

$$\int_{|x|}^{\infty} \frac{p(w)}{2w}\,dw = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}.$$

Then $p(w)$ is the mixture distribution for different widths of uniform distributions. It turns out via playing around with the integral that the distribution is

$$p(w) = \sqrt{\frac{2}{\pi}}\,w^2 e^{-w^2/2}.$$

So, in other words, a standard Gaussian is a mixture of uniform distributions centered on 0 where the half-width parameter is uncertain, and chi-distributed with 3 degrees of freedom. We know it’s not zero, and we think it’s most likely about $\sqrt{2} \approx 1.414$, with an expected value of about 1.596. Somehow that’s interesting to me.
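The claim is easy to check numerically. The sketch below weights uniform densities of half-width `w` by a chi density with 3 degrees of freedom and compares the result with the standard normal density. The function names and the integration cutoff `upper` are choices made for this illustration.

```python
import math

def chi3_pdf(w):
    """Chi density with 3 degrees of freedom, for w >= 0."""
    return math.sqrt(2.0 / math.pi) * w * w * math.exp(-0.5 * w * w)

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid(f, lo, hi, steps=20000):
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

def mixture_pdf(x, upper=12.0):
    """Density at x of a uniform on [-w, w] with w chi(3)-distributed.
    Only widths w >= |x| contribute, each with height 1/(2w). The
    integrand chi3_pdf(w) / (2w) is written in simplified form so that
    it is well defined at w = 0."""
    def integrand(w):
        return 0.5 * math.sqrt(2.0 / math.pi) * w * math.exp(-0.5 * w * w)
    return trapezoid(integrand, abs(x), upper)

for x in (0.0, 0.7, 2.0):
    print(x, mixture_pdf(x), normal_pdf(x))

# Mean half-width, to compare with the closed form 2 * sqrt(2 / pi).
chi3_mean = trapezoid(lambda w: w * chi3_pdf(w), 0.0, 12.0)
print(chi3_mean, 2.0 * math.sqrt(2.0 / math.pi))
```

The mode of the chi density with 3 degrees of freedom is at the square root of 2, and the mean computed here agrees with the 1.596 quoted above.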

August 15, 2013, konrad

Is your idea that eq 1 is an approximate statement of prior information, made precise in eq 2? What’s unclear to me is why one would ever (and particularly in this example) have prior information of exactly the form in eq 2 – Jaynes gives a different and more detailed justification, but still doesn’t quite explain (at least to my satisfaction) why we should expect to often have exactly the information under which the solution is Gaussian.

The way you present it (especially given your later comment that “you could use any distribution which captures the same region”), it looks as if you just chose a constraint that will give a convenient answer.

August 15, 2013, Joseph (author)

Konrad,

No, it definitely isn’t an explanation for Maxent. For the post it’s sufficient that after the fact the resultant NIID can be seen to work by inspection of its high probability manifold. I only mentioned it here because of the connection between the IID property, symmetry, and having large high probability manifolds.

There is definitely a missing piece to the maxent puzzle which hasn’t been discovered yet. Probably everyone who’s thought about it seriously had that intuition. Most who use maxent significantly eventually run into trouble. Jaynes seems to be an exception here. My experience with Jaynes is almost entirely from his papers not his book, and after spending a long time with them I’ve noticed that over and over again he just misses pitfalls when using entropy which trip up others. Over and over again, he confidently uses entropy in ways that don’t seem right at first, but after much work, do appear to be right after all.

I believe I know what the missing piece is and that it clears up a large number of mysteries in both statistics and maxent in a simple down to earth fashion. I don’t intend to publish it though. I have no academic career to worry about and frankly, given how difficult it’s been to communicate trivial stuff to statisticians, I doubt it would do any good. I do however enjoy using these ideas on real problems that interest me. I’ll just stick with that.

August 16, 2013, Brendon J. Brewer

konrad,

You’ve hit on what I think is the main weakness of MaxEnt. I think MaxEnt is a great method for updating probability distributions when the new information is a constraint about your probabilities. But in practice when does that ever happen? It’s pretty rare.

I wrote a blog post on this a few years back that may or may not be interesting.

http://letterstonature.wordpress.com/2008/12/29/where-do-i-stand-on-maximum-entropy/

Also, the true error vector being in the “high probability manifold” of the prior isn’t a sufficient condition for “success”. E.g., for an iid standard normal prior, (1,1,…,1) is in the “high probability manifold” but will result in misleading answers.

August 16, 2013, Brendon J. Brewer

“For an iid standard normal prior, (1,1,…,1) is in the “high probability manifold” but will result in misleading answers.”

Of course this isn’t a real problem. The probability of a data set that would result in a “misleading” posterior distribution is very small. But if you think error sequences that “look correlated” are fairly probable, you shouldn’t use an IID prior that implies they’re improbable.

August 16, 2013, Joseph (author)

Given a high probability region, the purpose of the probability manipulations is to get a kind of “majority vote”. In this case we’re saying 95% of the possibilities in that region lead to an interval that contains the true value.

For this example, the “majority” is basically any set of errors where there is some “cancellation”. The only ones that fail to be in the majority are the ones with extreme values all in the same direction. This is the key quality the errors need to have and it has nothing to do with whether they are IID or Normal as Frequentists understand it.
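A small sketch makes the “cancellation” condition concrete. It uses the standard known-sigma interval, with ten errors and sigma equal to 1 as assumed values; `covers` is a hypothetical helper that reports whether the interval built from those errors contains the true value.

```python
import math

def covers(errors, sigma):
    """True when mean(x) +/- 1.96*sigma/sqrt(n) contains the true value,
    which happens exactly when the average error is smaller in size
    than the interval's half-width."""
    n = len(errors)
    return abs(sum(errors) / n) < 1.96 * sigma / math.sqrt(n)

sigma = 1.0

same_direction = [1.0] * 10                       # no cancellation at all
alternating = [1.0 if i % 2 == 0 else -1.0 for i in range(10)]
mostly_positive = [1.0] * 7 + [-1.0] * 3          # partial cancellation

for name, errs in [("same direction", same_direction),
                   ("alternating", alternating),
                   ("mostly positive", mostly_positive)]:
    print(f"{name:16s} interval correct: {covers(errs, sigma)}")
```

All three vectors are patterned in ways no one would call independent draws; only the one with every error pushed the same way defeats the interval.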

Of course, if the errors had a normal histogram, then the errors wouldn’t be all extreme and there would be some cancellation. So this in a sense is a sufficient condition, but what most everyone fails to see is that it isn’t even close to being necessary. This wouldn’t be a problem except that the Frequentist’s “sufficient condition” is such a strong physical assumption it doesn’t really hold much in practice. And even if it did, there would still be plenty of problems where we’d want to exploit the added freedom we get from only relying on the far weaker “necessary” condition.

It’s also worth thinking about why we bother to take multiple measurements at all. If we just wanted a correct statement of the form “the true value lies in (a,b)” we could take one measurement, get bounds for the errors and produce a 100% interval.

Taking multiple measurements and considering a 95% vote is useful because those “minority” possibilities in the 5% are all spread out in the tails of the interval. So by excluding them we can shrink the interval (a,b) by far more than non-statisticians would naively think.
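To put rough numbers on the shrinkage, the sketch below compares the width of a 100% interval from a single measurement with the width of the 95% interval from n measurements. The hard bound of `k` times sigma with `k = 3` is an assumption made for illustration, as are the function names.

```python
import math

def certain_width(sigma, k=3.0):
    """Width of a 100% interval from one measurement, assuming the
    error can never exceed k * sigma in size (k is hypothetical)."""
    return 2.0 * k * sigma

def vote_width(sigma, n):
    """Width of the 95% interval from n measurements under the
    IID normal model with known sigma."""
    return 2.0 * 1.96 * sigma / math.sqrt(n)

sigma = 1.0
for n in (10, 100):
    ratio = certain_width(sigma) / vote_width(sigma, n)
    print(f"n = {n:3d}: 95% interval is {ratio:.1f} times narrower")
```

How large the gain looks depends on the assumed bound; the point of the comparison is only that dropping a thin slice of “minority” possibilities buys a disproportionately narrower interval.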

In other words, we’re specifically exploiting the insensitivity of the mean error to its argument (see the post on noninformative priors), which results from the fact that it bunches up near zero over the region we care about. Recall that the point estimate is the true value plus the mean error.

Also note that the entire point of assigning the error distribution is to define “the region we care about” and has nothing to do with the shape of error histograms out to infinity.

August 16, 2013, Joseph (author)

Brendon,

In regards to maxent, you had a nice summary which accurately characterizes the current state of the subject. As explained to Konrad though, that’s not the whole story. I don’t want to get into this, but I’ll make two points.

First, to really get maxent you have to completely drop any frequentist intuition about probability distributions. Think about them purely in the way they’re used in this post.

Second, consider the following scenario. Suppose we have a single set of information which can naturally be expressed using multiple constraints. Call them A and B. Our ultimate goal in finding a maxent distribution is to get an interval estimate for some function g.

Now suppose that for whatever reason it’s difficult to carry out the maximization subject to both A and B. One way of dealing with this is to just maximize with respect to A first. Then check to see if this already is enough of a constraint to make the variance of g small. If it is, then don’t bother maximizing with respect to B at all. This works great, but unfortunately both the constraint A and the resulting distribution are going to seem strange, especially if you’re thinking of them as frequencies.

Most real applications of maxent involve just such a reduction (including most statistical mechanics), which is why the constraints can be so mysterious. So while I understand where the view that we can’t use maxent much comes from, I believe this is very mistaken. You can use it endlessly, but you need a missing piece of the puzzle which makes sense out of what’s going on.

August 16, 2013, Corey

Daniel Lakeland,

Many distributions can be expressed as a scale mixture of Gaussians (e.g., t distribution, Laplace distribution, logistic distribution, symmetric stable distribution), and lots of Gibbs sampling algorithms use these decompositions. By expressing the Gaussian as a scale mixture of uniforms, you’ve made it easy to express all such distributions as scale mixtures of uniforms.

August 16, 2013, konrad

I still don’t see the justification for Eq 2: we have a set of unknown errors, perhaps with an upper bound on their size – why would the expectation of the MSE be known in such a situation? If instead we were working with an inequality constraint the bounded errors could give a bound on the MSE, which we could use as a constraint even if it’s not all that is known – but where does an equality constraint on expected MSE come from?

August 16, 2013, Joseph (author)

Konrad,

One was never given. As I said, I didn’t really want to go into it. Especially since it’s not needed for the post. I’ll say a little something though. This goes beyond anything I’ve seen published anywhere.

Consider the function . We don’t know , but our state of knowledge might allow us to create a which is good in the sense that is in it’s high probability manifold.

Now it’s reasonable to require that . Note that in general. In fact they can be quite different.

That’s all I’m going to say about it.