Andrew Gelman’s new favorite example of the hidden dangers of noninformative priors is the following. If we observe data y ~ N(theta, 1) and get y = 1, then this is consistent with theta = 0 (pure noise), yet the posterior probability that theta > 0 is .84. Gelman thinks this is an example of priors gone wild, but I claim this prior works perfectly. Here’s why.
Suppose we use the uninformative prior theta ~ N(0, 100^2). What’ll we get?
The prior’s indicating the true value is somewhere between -300 and +300, while the data is saying it’s within about 1 +/- 3. The posterior combines these two pieces of information and says roughly,
99% of all possible values consistent with the evidence are in (-1.6, 3.6),
which would seem to be reason enough to guess theta is about 1.
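For the record, the arithmetic here is just a normal posterior with mean 1 and standard deviation 1. A quick check, in plain Python using the error function for the normal CDF:

```python
from math import sqrt, erf

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# y ~ N(theta, 1) with y = 1 and a flat prior gives posterior theta ~ N(1, 1).
y, sigma = 1.0, 1.0
z99 = 2.576  # two-sided 99% normal quantile
print(f"99% posterior interval: ({y - z99 * sigma:.2f}, {y + z99 * sigma:.2f})")
# -> 99% posterior interval: (-1.58, 3.58)
print(f"P(theta > 0 | y) = {1.0 - norm_cdf(0.0, y, sigma):.2f}")
# -> P(theta > 0 | y) = 0.84
```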
But what if we ask the question: is theta > 0? With a constant loss function (equal penalty for any wrong answer), our best guess would be that it is. This is because most possibilities consistent with the evidence are greater than zero. Given y = 1 there’s no basis to guess the opposite.
But even here the uninformative prior isn’t misleading us. We can think of the decision problem as estimating the value of the indicator function

I(theta > 0) = 1 if theta > 0, and 0 otherwise,
and like all point estimates, the spread of the distribution matters a great deal. For example, in the normal distribution we may make the point estimate theta-hat = mu, but we’re unlikely to observe a value near it unless the spread is small. In general the natural way to measure spread is the entropy, which following Boltzmann is S = -sum_i p_i ln p_i. For the normal distribution S = ln(sigma sqrt(2 pi e)), so entropy is a generalization of the standard deviation of sorts. Thus consider the entropies for this case.
Then for the guess “theta is in (-1.6, 3.6)” we get S = -.99 ln .99 - .01 ln .01, about .06, which is close to the minimum of 0, indicating the guess is a reliable one. But for the guess “theta > 0” we get S = -.84 ln .84 - .16 ln .16, about .44, which is much closer to the maximum possible value ln 2 = .69. This is a clear warning that while our best guess is theta > 0, there’s plenty of reason to think it may not be true in practice.
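The two entropies are easy to verify. A minimal sketch, treating “theta is in the 99% interval” as a Bernoulli(.99) event and “theta > 0” as a Bernoulli(.84) event:

```python
from math import log

def bernoulli_entropy(p):
    """Boltzmann/Shannon entropy -sum p ln p of a yes/no question, in nats."""
    return -p * log(p) - (1.0 - p) * log(1.0 - p)

print(round(bernoulli_entropy(0.99), 2))  # 0.06 -- near the minimum of 0
print(round(bernoulli_entropy(0.84), 2))  # 0.44 -- near the maximum ln 2 = 0.69
```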
As long as we heed the warning, we’re in good shape. If we don’t use a point estimate for theta but rather average over the posterior p(theta | y), this warning will automatically be accounted for. That’s the magic of real Bayes rather than ad-hockeries like hypothesis testing.
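That averaging step can be sketched by Monte Carlo: draw theta from the posterior and average whatever function you care about, here the indicator of theta > 0. (A toy sketch; the posterior is taken as N(1, 1), per the flat-prior calculation.)

```python
import random

random.seed(1)
# Average the indicator I(theta > 0) over posterior draws theta ~ N(1, 1):
draws = [random.gauss(1.0, 1.0) for _ in range(200_000)]
p_positive = sum(t > 0.0 for t in draws) / len(draws)
print(round(p_positive, 2))  # ~0.84, matching the posterior probability
```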
Now consider a highly informative prior centered at zero. The Bayesian posterior’s 99% interval shrinks toward zero, and our best guess would still be that theta > 0. Since that interval sits inside the uninformative one, the informative answer is completely consistent with the less informative one.
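The conjugate-normal update makes the comparison concrete. Taking theta ~ N(0, 1) as an illustrative informative prior (that specific choice is an assumption on my part), the posterior comes out N(0.5, 0.5):

```python
from math import sqrt

# Conjugate update: prior theta ~ N(m0, s0^2), likelihood y ~ N(theta, s^2).
m0, s0 = 0.0, 1.0   # illustrative informative prior (an assumption)
y, s = 1.0, 1.0
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s**2)
post_mean = post_var * (m0 / s0**2 + y / s**2)
z99 = 2.576
print(post_mean, round(sqrt(post_var), 3))            # 0.5 0.707
print(round(post_mean - z99 * sqrt(post_var), 2),
      round(post_mean + z99 * sqrt(post_var), 2))     # -1.32 2.32
```

The 99% interval (-1.32, 2.32) sits comfortably inside the uninformative one, (-1.58, 3.58).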
So let’s review the performance of this wild and crazy prior:
What more could you ask of a distribution? That it bring you tea and crumpets every morning? The real danger here wasn’t the prior, but statisticians who retain too much Frequentist intuition about the nature of probabilities. Unlike a frequency, the statement “P = .84” is not a claim about the real world. Rather it’s a statement of how well certain information pins down the location of theta. Use it as such and you’ll be fine.
UPDATE: See the update in the third comment below. I can summarize my point this way. Gelman thinks .84 is too big. If you’re using that number in a classical/Frequentist way, then maybe it is. If, however, you do the correct Bayesian thing and average anything of interest over the posterior p(theta | y), then you’ll be perfectly fine. It correctly considers all possible values of theta consistent with the evidence.