## Hidden dangers of a sloppy understanding of probabilities

Andrew Gelman’s new favorite example of the hidden dangers of noninformative priors is the following. If we observe data y ~ N(theta,1) and get y=1, then this is consistent with being pure noise, but the posterior probability for theta>0 is .84. Gelman thinks this is an example of priors gone wild, but I claim this prior works perfectly. Here’s why.

Suppose we use the uninformative prior. What will we get?

The prior’s indicating the true value is somewhere between -300 and +300, while the data is saying it’s within a few units of y = 1. The posterior combines these two pieces of information and says roughly,

99% of all possible values consistent with the evidence are in roughly (-1.6, 3.6).

Which would seem to be reason enough to guess theta ≈ 1.

But what if we ask the question: is theta > 0? With a constant loss function, our best guess would be that it is. This is because most possibilities consistent with the evidence are greater than zero. Given y=1 there’s no basis to guess the opposite.
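For concreteness, here’s a quick stdlib-Python check of these numbers (the posterior N(1,1) follows from the flat prior; 2.576 is the standard normal quantile for a central 99% interval):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function -- no scipy needed."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Flat prior with y ~ N(theta, 1) and y = 1 gives posterior theta | y ~ N(1, 1).
p_positive = 1.0 - normal_cdf(0.0, mu=1.0, sigma=1.0)  # P(theta > 0 | y = 1)

z99 = 2.576  # standard normal 0.995 quantile -> central 99% interval
interval = (1.0 - z99, 1.0 + z99)

print(round(p_positive, 2))                          # 0.84
print(round(interval[0], 2), round(interval[1], 2))  # -1.58 3.58
```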

But even here the uninformative prior isn’t misleading us. We can think of the decision problem as estimating the value of the indicator function

H(theta) = 1 if theta > 0, and 0 otherwise,

and like all point estimates, the spread of the distribution matters a great deal. For example, in the normal distribution we may make the point estimate theta ≈ mu, but we’re unlikely to observe this unless the spread is small. In general the natural way to measure spread is the entropy, which following Boltzmann is S = -sum_i p_i ln(p_i). For the normal distribution S = ln(sigma sqrt(2 pi e)), so entropy is a generalization of the standard deviation of sorts. Thus consider the entropy for this case.

Then S(theta) = ln(sqrt(2 pi e)) ≈ 1.42, which is close to the minimum, indicating the guess theta ≈ 1 is a reliable one. But S(H) = -.84 ln(.84) - .16 ln(.16) ≈ .44, which is much closer to the maximum possible value ln(2) ≈ .69. This is a clear warning that while our best guess is theta > 0, there’s plenty of reason to think it may not be true in practice.
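Here’s the entropy calculation spelled out in plain Python (the .84 is the posterior probability from above):

```python
import math

def entropy(probs):
    """Shannon/Boltzmann entropy S = -sum_i p_i ln(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

s_H = entropy([0.84, 0.16])  # entropy of the indicator H
s_max = math.log(2)          # maximum possible for a yes/no question

print(round(s_H, 2))    # 0.44
print(round(s_max, 2))  # 0.69
```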

As long as we heed the warning, we’re in good shape. If we don’t use a point estimate for theta but rather average over the posterior p(theta | y), this warning will automatically be considered. That’s the magic of real Bayes rather than ad-hockeries like hypothesis testing.
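A minimal Monte Carlo sketch of what “averaging over the posterior” buys you here, versus plugging in a point estimate:

```python
import random

random.seed(0)

# Posterior with the flat prior: theta | y = 1 ~ N(1, 1).
samples = [random.gauss(1.0, 1.0) for _ in range(100_000)]

# Plugging in the point estimate theta = 1 collapses H to certainty...
plug_in = 1.0  # H(1) = 1, since 1 > 0
# ...while averaging the indicator over the posterior keeps the uncertainty.
averaged = sum(1 for t in samples if t > 0) / len(samples)

print(plug_in)             # 1.0
print(round(averaged, 2))  # 0.84
```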

Now consider a highly informative prior tightly concentrated near theta = 0. The Bayesian posterior implies a much narrower 99% interval, and our best guess would still be that theta > 0. Since that interval sits inside the uninformative one, the informative answer is completely consistent with the less informative one.
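The exact informative prior didn’t survive into this copy of the post, so purely as an illustration, here’s the standard normal-normal conjugate update with an assumed prior theta ~ N(0, 1); the qualitative point (the best guess is still theta > 0) is what matters:

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed, hypothetical informative prior: theta ~ N(0, 1); likelihood y ~ N(theta, 1).
prior_mu, prior_var = 0.0, 1.0
y, lik_var = 1.0, 1.0

# Standard normal-normal conjugate update.
post_var = 1.0 / (1.0 / prior_var + 1.0 / lik_var)
post_mu = post_var * (prior_mu / prior_var + y / lik_var)

p_positive = 1.0 - normal_cdf((0.0 - post_mu) / math.sqrt(post_var))

print(round(post_mu, 2), round(post_var, 2))  # 0.5 0.5
print(round(p_positive, 2))                   # 0.76 -- best guess is still theta > 0
```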

So let’s review the performance of this wild and crazy prior:

- It pinned theta down to roughly (-1.6, 3.6) and suggested the guess theta ≈ 1.
- Its entropy for H warned us that the guess theta > 0 is far from certain.
- Its conclusions are consistent with those of the highly informative prior.

What more could you ask of a distribution? That it bring you tea and crumpets every morning? The real danger here wasn’t the prior, but statisticians who retain too much Frequentist intuition about the nature of probabilities. Unlike a frequency, the statement “P = .84” is not a claim about the real world. Rather it’s a statement of how well certain information pins down the location of theta. Use it as such and you’ll be fine.

UPDATE: See the update in the third comment below. I can summarize my point this way. Gelman thinks .84 is too big. If you’re using that number in a classical/Frequentist way then maybe it is. If, however, you do the correct Bayesian thing and average anything of interest over the posterior p(theta | y), then you’ll be perfectly fine. It’s correctly considering all possible values of theta consistent with the evidence.

November 25, 2013 | Brendon J. Brewer


Interesting post as usual, Joseph. I enjoyed the joke about tea and crumpets.

“If we observe data y ~ N(theta,1) and get y=1, then this is consistent with being pure noise, but the posterior probability for theta>0 is .84.”

I hate the phrase “consistent with”. For me, it’s up there with “random” as one of the most useless phrases in science. If you think there’s a special value such as theta=0 which is extra plausible, then sure, analysis based on the flat prior might disagree a bit with your intuition which is not using a flat prior. I’m rather stunned that anyone thinks this is a problem.

A plausible interpretation is that Gelman is using this example to get beginners to think, rather than because he really thinks there’s a problem.

November 25, 2013 | Joseph


Brendon,

I’m open to a better word than “consistent”.

So what’s a better way to describe the fact that if the key consequences of the highly informative prior are true, then they imply the key consequences of the uninformative prior are true?

Specifically, both priors would guess theta > 0, so they’re either both right or both wrong about that.

November 25, 2013 | Joseph


UPDATE: If the entropy stuff above is too cryptic, here it is reworded.

Let H be the statement “theta > 0” and suppose our real goal is to say something about some other parameter of interest phi.

Then if S(H) is large, that’s telling us we shouldn’t use P(phi | H) alone to make inferences about phi. Rather we should use the correct expression:

P(phi) = P(phi | H) P(H) + P(phi | not H) P(not H)

In other words, the uninformative prior is warning us that we can’t trust the statement “H is true”.

But the high value of S(H) is actually giving us a much, much stronger warning. It’s warning us that we shouldn’t be using the intermediary H at all. We should be using:

P(phi | y) = integral of P(phi | theta) p(theta | y) d(theta)
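A sketch of that last expression by Monte Carlo, taking phi to be a future observation y' ~ N(theta, 1) purely as an example (the choice of phi is mine, not from the post):

```python
import random

random.seed(1)

# Posterior theta | y = 1 ~ N(1, 1); example phi: a future draw y' ~ N(theta, 1).
# P(phi > 0 | y) = integral of P(phi > 0 | theta) p(theta | y) d(theta),
# estimated by simulating theta from its posterior, then phi given theta.
n = 200_000
hits = 0
for _ in range(n):
    theta = random.gauss(1.0, 1.0)    # draw theta from the posterior
    y_new = random.gauss(theta, 1.0)  # draw phi given that theta
    hits += y_new > 0

p_phi = hits / n
print(round(p_phi, 2))  # 0.76 -- note: not the same as P(theta > 0) = .84
```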

November 25, 2013 | konrad


I think it’s fairly clear from the comment thread on Gelman’s blog what the issue is:

Gelman has extra information about the applications he is interested in, namely that theta is likely to be close to zero. His complaint is that the uninformative prior doesn’t capture this obvious (to him) information. Unfortunately the information is not so obvious to the rest of us, because we don’t have the same applications in mind. I don’t think frequentist intuition comes into it.

There may be a secondary issue, in that Gelman seems to think that the probability .84 is very(?) close to certainty. But as I pointed out on that thread, it’s the sort of probability that is routinely beaten at the poker table – the only sensible conclusion on obtaining a posterior of .84 is that you’re in a condition of uncertainty – theta>0 is likely but far from certain. Again, frequentist intuition is on the same page: a p-value of .16 is not considered significant.

November 25, 2013 | Brendon J. Brewer


Spot on, Konrad.

November 25, 2013 | Joseph


I agree, Konrad: from a Bayesian point of view, .84 is indicating that not only should we be wary of accepting H, we shouldn’t use H as an intermediary at all. We should average over the posterior.

My point is that if you do that you won’t be led astray, even if you have substantial prior information and choose not to use it out of convenience or something (unless of course that prior info is so extreme the data’s irrelevant).