The Amelioration of Uncertainty

Hidden dangers of a sloppy understanding of probabilities

Andrew Gelman’s new favorite example of the hidden dangers of noninformative priors is the following. If we observe data y ~ N(theta,1) and get y=1, then this is consistent with being pure noise, but the posterior probability for theta>0 is .84. Gelman thinks this is an example of priors gone wild, but I claim this prior works perfectly. Here’s why.

Suppose we use the uninformative prior theta ~ N(0, 100^2). What'll we get?

The prior's indicating the true value is somewhere between -300 and +300, while the data is saying it's within 1 ± 3. The posterior combines these two pieces of information and says, roughly,

99% of all possible values consistent with the evidence are in (-1.6, 3.6)

Which would seem to be reason enough to guess theta is in (-1.6, 3.6).
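To make the arithmetic concrete, here's a minimal sketch in Python (scipy assumed available; the prior N(0, 100^2) is the near-flat one used above) of the conjugate normal update:

    # Conjugate normal update: prior theta ~ N(0, 100^2), data y ~ N(theta, 1), y = 1.
    from scipy import stats

    prior_mu, prior_sd = 0.0, 100.0   # "somewhere between -300 and +300"
    y, data_sd = 1.0, 1.0

    # Posterior precision = sum of precisions; posterior mean = precision-weighted average.
    prior_prec, data_prec = 1 / prior_sd**2, 1 / data_sd**2
    post_var = 1 / (prior_prec + data_prec)
    post_mu = post_var * (prior_prec * prior_mu + data_prec * y)

    post = stats.norm(post_mu, post_var**0.5)
    print(post.interval(0.99))   # roughly (-1.6, 3.6)
    print(1 - post.cdf(0.0))     # P(theta > 0 | y) ~ 0.84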

But what if we ask the question: is theta > 0? With a constant loss function, our best guess would be that it is. This is because most possibilities consistent with the evidence are greater than zero. Given y=1 there's no basis to guess the opposite.

But even here the uninformative prior isn't misleading us. We can think of the "theta > 0" decision problem as estimating the value of the indicator function

    I(theta > 0) = 1 if theta > 0, and 0 otherwise

and like all point estimates, the spread of the distribution matters a great deal. For example, for the normal distribution we may make the point estimate x = mu, but we're unlikely to observe this unless the spread sigma is small. In general the natural way to measure spread is the entropy, which following Boltzmann is S = -sum_i p_i ln(p_i). For the normal distribution S = ln(sigma sqrt(2 pi e)), so entropy is a generalization of the standard deviation of sorts. Thus consider the entropy for this case,

    S(P) = -P ln(P) - (1 - P) ln(1 - P)

Then S(.99) ≈ .06, which is close to the minimum of 0, indicating the theta in (-1.6, 3.6) guess is a reliable one. But S(.84) ≈ .44, which is much closer to the maximum possible value ln(2) ≈ .69. This is a clear warning that while our best guess is theta > 0, there's plenty of reason to think it may not be true in practice.
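If the entropy numbers seem to come out of nowhere, they're just this binary entropy evaluated at the two probabilities above; a quick sketch:

    import math

    def binary_entropy(p):
        # S(P) = -P ln(P) - (1 - P) ln(1 - P)
        return -p * math.log(p) - (1 - p) * math.log(1 - p)

    print(binary_entropy(0.99))   # ~0.06: near the minimum 0, a reliable guess
    print(binary_entropy(0.84))   # ~0.44: much closer to the maximum, a shaky guess
    print(math.log(2))            # ~0.69: the maximum possible value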

As long as we heed the warning, we're in good shape. If we don't use a point estimate for theta but rather average over the posterior P(theta | y), this warning will automatically be taken into account. That's the magic of real Bayes, rather than ad hockeries like hypothesis testing.
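Here's a toy illustration of that averaging (the function f is purely hypothetical; the point is just that a plug-in value and a posterior average can disagree when the spread is large):

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.normal(1.0, 1.0, size=100_000)   # draws from the posterior N(1, 1)

    def f(t):
        # stand-in for any downstream quantity that depends on theta
        return np.exp(-t**2)

    plug_in = f(theta.mean())    # uses only the point estimate theta ~ 1
    averaged = f(theta).mean()   # averages f over the whole posterior

    print(plug_in, averaged)     # ~0.37 vs ~0.41: the spread matters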

Now consider the highly informative prior theta ~ N(0, 1). The Bayesian posterior implies a 99% interval of roughly (-1.3, 2.3) and our best guess would still be that theta > 0. Since (-1.3, 2.3) sits inside (-1.6, 3.6), the informative answer is completely consistent with the less informative one.
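The same conjugate update sketch as before, now with the informative prior (the specific choice N(0, 1) is an assumption; the argument only requires something much tighter than the flat prior):

    from scipy import stats

    # Prior theta ~ N(0, 1), data y = 1 with sd 1: posterior is N(0.5, 0.5).
    post_var = 1 / (1 / 1.0**2 + 1 / 1.0**2)
    post_mu = post_var * (0.0 / 1.0**2 + 1.0 / 1.0**2)
    post = stats.norm(post_mu, post_var**0.5)

    print(post.interval(0.99))   # roughly (-1.3, 2.3), inside (-1.6, 3.6)
    print(1 - post.cdf(0.0))     # ~0.76: the best guess is still theta > 0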

So let's review the performance of this wild and crazy prior:

  • It gives a good interval for theta, one consistent with theta = 0.
  • If you have to guess whether theta > 0, it will make the best guess possible, but will also warn you this guess is uncertain.
  • If you use an informative prior in the future, the results will be consistent with the old ones.
What more could you ask of a distribution? That it bring you tea and crumpets every morning?

The real danger here wasn't the prior, but statisticians who retain too much Frequentist intuition about the nature of probabilities. Unlike "y = 1", the statement "P = .84" is not a claim about the real world. Rather it's a statement of how well certain information pins down the location of theta. Use it as such and you'll be fine.

    UPDATE: See the update in the third comment below. I can summarize my point this way. Gelman thinks .84 is too big. If you're using that number in a classical/Frequentist way then maybe it is. If, however, you do the correct Bayesian thing and average anything of interest over the posterior P(theta | y), then you'll be perfectly fine. It's correctly considering all possible values of theta consistent with the evidence.

    November 24, 2013
    • Brendon J. Brewer, November 25, 2013

      Interesting post as usual, Joseph. I enjoyed the joke about tea and crumpets.

      “If we observe data y ~ N(theta,1) and get y=1, then this is consistent with being pure noise, but the posterior probability for theta>0 is .84.”

      I hate the phrase “consistent with”. For me, it’s up there with “random” as one of the most useless phrases in science. If you think there’s a special value such as theta=0 which is extra plausible, then sure, analysis based on the flat prior might disagree a bit with your intuition which is not using a flat prior. I’m rather stunned that anyone thinks this is a problem.

      A plausible interpretation is that Gelman is using this example to get beginners to think, rather than because he really thinks there’s a problem.

    • Joseph, November 25, 2013

      Brendon,

      I’m open to a better word than “consistent”.

      So what’s a better way to describe the fact that if the key consequences of the highly informative prior are true, then they imply the key consequences of the uninformative prior are true?

      Specifically, (-1.3, 2.3) is contained in (-1.6, 3.6), and both priors would guess theta > 0, so they're either both right or both wrong about that.

    • Joseph, November 25, 2013

      UPDATE: If the entropy stuff above is too cryptic, here's a rewording.

      Let H = "theta > 0" and suppose our real goal is to say something about some other parameter of interest alpha.

      Then if the entropy S is large, that's telling us we shouldn't use P(alpha | H) to make inferences about alpha. Rather we should use the correct expression:

          P(alpha | y) = P(alpha | H, y) P(H | y) + P(alpha | not H, y) P(not H | y)

      In other words, the uninformative prior is warning us that we can’t trust the statement “H is true”.

      But the high value of S(.84) is actually giving us a much, much stronger warning. It's warning us that we shouldn't be using the intermediary H at all. We should be using:

          P(alpha | y) = Integral P(alpha | theta, y) P(theta | y) dtheta
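      A numerical check of both expressions (a Monte Carlo sketch over the posterior N(1,1) from the post; g here is a hypothetical stand-in for the parameter of interest alpha):

          import numpy as np

          rng = np.random.default_rng(1)
          theta = rng.normal(1.0, 1.0, size=200_000)   # posterior draws
          g = np.tanh                                  # hypothetical alpha = g(theta)

          H = theta > 0
          p_H = H.mean()                               # P(H | y) ~ 0.84

          direct = g(theta).mean()                     # average over the full posterior
          two_term = g(theta[H]).mean() * p_H + g(theta[~H]).mean() * (1 - p_H)
          pretend_H = g(theta[H]).mean()               # acts as if "H is true"

          print(direct, two_term)   # agree: the decomposition is exact
          print(pretend_H)          # differs: conditioning on H loses information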

    • konrad, November 25, 2013

      I think it’s fairly clear from the comment thread on Gelman’s blog what the issue is:

      Gelman has extra information about the applications he is interested in, namely that theta is likely to be close to zero. His complaint is that the uninformative prior doesn’t capture this obvious (to him) information. Unfortunately the information is not so obvious to the rest of us, because we don’t have the same applications in mind. I don’t think frequentist intuition comes into it.

      There may be a secondary issue, in that Gelman seems to think that the probability .84 is very(?) close to certainty. But as I pointed out on that thread, it’s the sort of probability that is routinely beaten at the poker table – the only sensible conclusion on obtaining a posterior of .84 is that you’re in a condition of uncertainty – theta>0 is likely but far from certain. Again, frequentist intuition is on the same page: a p-value of .16 is not considered significant.

    • Brendon J. Brewer, November 25, 2013

      Spot on, Konrad.

    • Joseph, November 25, 2013

      I agree, Konrad. From a Bayesian point of view, .84 is indicating that not only should we be wary of accepting H, we shouldn't use H as an intermediary at all. We should average over the posterior.

      My point is that if you do that you won't be led astray, even if you have substantial prior information and chose not to use it out of convenience or something (unless of course that prior info is so extreme the data's irrelevant).
