The Amelioration of Uncertainty

## The Definition of a Frequentist

The illustrious Dr. Mayo recently reminded me of Larry Wasserman’s take on what a Frequentist is. In Larry’s words:

[Nate Silver] “One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated.”

[Wasserman] It does not get much more frequentist than that. And if using Bayes’ theorem helps you achieve long run frequency calibration, great. If it didn’t, I have no doubt he would have used something else.

Well, I have major doubts, especially since non-calibrated forecasts are often far superior to calibrated ones.

Let’s take a look at some 10-day rain forecasts. Suppose over the next 10 days we have five days of rain and five of sunshine. So the actual weather is:

    Day:     1  2  3  4  5  6  7  8  9  10
    Weather: R  R  R  R  R  S  S  S  S  S

Now a Frequentist produces a forecast P_F, giving the probability of rain on each day:

    P_F:     .6 .6 .6 .4 .4 .6 .6 .4 .4 .4

This is indeed calibrated the way Wasserman would like. If you look at the days where the forecast was Pr(rain) = .4, it rained on 40% of them. On the days where it was Pr(rain) = .6, it rained on 60% of them.

Now here’s my Bayesian forecast P_B:

    P_B:     .6 .6 .6 .6 .6 .4 .4 .4 .4 .4

Well, this isn’t calibrated at all (on the days I said Pr(rain) = .6 it rained 100% of the time), and a Frequentist would say I had better adopt Frequentist principles or risk looking the fool.

But if we used these two distributions to predict the weather, we’d naturally go with the rule “predict rain on days when the odds favor rain”. In this case the Frequentist would get 40% of the predictions wrong while the Bayesian would get 100% of the predictions right!
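The comparison is easy to check mechanically. Below is a minimal Python sketch; the weather sequence and the two forecast vectors are the ones from the example above (reconstructed for illustration, with 1 meaning rain):

```python
# Actual weather and the two forecasts from the example (1 = rain, 0 = sun).
rain = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
P_F  = [.6, .6, .6, .4, .4, .6, .6, .4, .4, .4]   # calibrated Frequentist forecast
P_B  = [.6, .6, .6, .6, .6, .4, .4, .4, .4, .4]   # uncalibrated Bayesian forecast

def calibration(forecast, outcomes):
    """Observed rain frequency among days grouped by forecast value."""
    groups = {}
    for p, y in zip(forecast, outcomes):
        groups.setdefault(p, []).append(y)
    return {p: sum(ys) / len(ys) for p, ys in groups.items()}

def accuracy(forecast, outcomes):
    """Fraction correct under the rule 'predict rain iff Pr(rain) > .5'."""
    preds = [1 if p > .5 else 0 for p in forecast]
    return sum(pr == y for pr, y in zip(preds, outcomes)) / len(outcomes)

print(calibration(P_F, rain))   # {0.6: 0.6, 0.4: 0.4} -- perfectly calibrated
print(calibration(P_B, rain))   # {0.6: 1.0, 0.4: 0.0} -- not calibrated at all
print(accuracy(P_F, rain))      # 0.6
print(accuracy(P_B, rain))      # 1.0
```

Note that any forecast exactly calibrated at the values .4 and .6 is forced to 60% accuracy under the threshold rule, no matter how the days are arranged; only abandoning calibration lets P_B reach 100%.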

So I propose the following definition of a Frequentist:

A Frequentist is someone who can’t wrap their brain around the fact that P_B is predictively better than P_F, even though the latter satisfies Pr(rain) = “frequency of rain” while the former violates it completely.

This isn’t a minor point either. As I tried to explain here (but no one seems to have understood), everyone in Finance is trying to find calibrated distributions like P_F to predict stock prices. What they should be striving for is distributions like P_B. If you succeed you’ll make a lot more money.

August 5, 2013

LOL, but the problem is you’re not calibrating on long run frequencies. You haven’t set up a sensible problem (we’re predicting known outcomes from no data), but given such a scenario and given that you’re calibrating only on these outcomes, the best solution is 1,1,1,1,1,0,0,0,0,0 – this gives perfect prediction _and_ perfect calibration.

• August 5, 2013 · Joseph

Just pretend P_F, P_B come from some models based on something or other and we don’t know the actual outcome ahead of time.

In the long run the Sun goes supernova, so we’re only required to predict the weather for a finite number of days. For convenience I made n = 10 and not n = 1,825,000,000,000, but I could have chosen any n.

• August 5, 2013 · Joseph

Incidentally it’s worth looking at the entropies involved. The relevant space is W = {rain, sun}^10, which contains 2^10 = 1024 possible weather sequences. The entropy assuming we know nothing about the sequence is S = log2(1024) = 10 bits. This is the max possible entropy.

The entropies of P_F and P_B are identical, equal to S = 10·h(.6) ≈ 9.71 bits, where h(p) = -p·log2(p) - (1-p)·log2(1-p) is the binary entropy function. This is not dramatically lower than the maximum, indicating these distributions are not massively informative.

A simple calculation using 2^S indicates these distributions, thought of as distributions on W, have high probability manifolds containing about 80% of the full space W (since 2^9.71 / 2^10 ≈ .82). In other words, these distributions eliminate only about 20% of the possible sequences.
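The entropy bookkeeping above can be reproduced in a few lines of Python (everything in bits, i.e. log base 2):

```python
import math

n = 10

def h(p):
    """Binary entropy in bits of a probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

S_max = n * h(0.5)        # know nothing about the sequence: 10 bits, the maximum
S_forecast = n * h(0.6)   # entropy of P_F or P_B: about 9.71 bits

# Asymptotic-equipartition heuristic: the high probability manifold contains
# roughly 2**S sequences, out of 2**n sequences in the full space W.
fraction = 2**S_forecast / 2**n

print(S_max, round(S_forecast, 2), round(fraction, 2))   # 10.0 9.71 0.82
```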

The reason P_B works better is that the true sequence sits in a higher part of its high probability manifold than is the case for P_F.

Your distribution mentioned at the end of your comment only has a single point in its high probability manifold. The entropy is S = 0, which is as low as it can go.

I get your point that the actual outcome validates P_B as a better set of predictions (in this particular example). But if, for large n, P_B continues to give 100% prediction accuracy while predicting with only 60% certainty there is clearly something wrong – in the financial setting you’ll be making way less money than you could if you had a better calibrated model with the same predictive ability (or in the weather example, you might have gone to a lot of trouble carrying an umbrella on those 40% forecast days – unnecessary with better calibration). And even if obtaining better calibration is only possible by sacrificing some prediction accuracy, it’s probably worth it.

It’s like the bias-variance debate where (some) frequentists used to call for unbiased estimates no matter what. This is wrong, but the fact that biased estimates can outperform unbiased ones does not mean we should neglect bias entirely.

Re the entropy argument: P_B may be the best model with that particular entropy, but a model with lower entropy may be much better, even if it has worse prediction accuracy. I agree it is silly to calibrate by keeping entropy fixed and making prediction worse; usually the effect of better calibration would be to decrease entropy.

• August 5, 2013 · Joseph

There will be a whole continuum of distributions with entropies lower than that of P_B. As long as the true weather sequence remains in the high probability region, everything is good.

For example, if you let P′ be just like P_B except with .6 replaced by .9 and .4 by .1, then you’d get a distribution with entropy S = 10·h(.9) ≈ 4.7 bits, which is quite a bit closer to zero and will obviously be more useful in general.
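Using the same binary entropy function as before, the sharper hypothetical forecast P′ from this comment works out as:

```python
import math

def h(p):
    """Binary entropy in bits of a probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# P' replaces the .6/.4 forecasts of P_B with .9/.1 on the same days.
S_prime = 10 * h(0.9)
print(round(S_prime, 2))   # about 4.69 bits, well below the ~9.71 bits of P_B
```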

But while P′ is getting closer to being “calibrated”, in the sense that “Pr(rain) = .9 and then it rains 100% of the time” is closer to calibration than “Pr(rain) = .6 and then it rains 100% of the time”, it’s not the calibration which makes it more useful.

What makes it useful are two properties. First, the distribution is “truthful” in the sense that the actual weather is in the high probability manifold. Second, the distribution is “informative” in the sense that the high probability manifold is small (i.e. the entropy is small).

If you can find improved distributions which maintain the “truthfulness” condition, but are more “informative” then they’ll be even more useful. Eventually, you’ll get the best possible distribution which has zero entropy. It’s the best possible one since it’s a perfect and certain forecast.

In general, your modeling goal is to find a distribution which is “truthful” but makes the entropy as small as you can get it. In practice you’ll never get down to S = 0 though, so your best distribution for prediction at that smallest practical entropy will not be the one calibrated so that “whenever Pr(rain) = p it rains a fraction p of the time”.

• August 5, 2013 · Brendon J. Brewer

Yeah, calibration is bollocks. Nice example.

• August 5, 2013 · Brendon J. Brewer

If there is one “long run” criterion of “performance” that is relevant, it is this (IMO). Over my life I will learn the truth of some propositions. I want my probabilities for these (before I found out the answer) to have been high, i.e. I don’t want to be surprised.

This can be made formal using things like the logarithmic scoring rule. If H turns out to be true and my probability was P(H), then I get log P(H) “utility dollars”. Of course, my probability then goes to P(H) = 1. If you optimise this criterion then you just end up with Bayesian inference.

• August 6, 2013 · Joseph

No doubt there are instances when something which we’d naturally call “calibration” makes perfect sense. Anyone who’s drunk the Frequentist Kool-Aid, however, is liable to uncritically accept “calibration” without even thinking about it enough to realize it’s often nonsense.

Wasserman is a prime example. If he bumped his head, he’d forget more mathematical statistics than most people know, but that didn’t stop him from saying something very silly in that post.

• August 6, 2013 · Corey

I’m really enjoying your most recent series of posts — especially the exchanges among Jaynes enthusiasts that they inspire. It’s really nice to see the Jaynesian school of thought develop in the informal yet intellectually rigorous format of blog comment threads.

• October 20, 2013 · Kevin

You keep on referring to the “high probability manifold” of a distribution. Can you point me to an explanation of what that means?

• October 20, 2013 · Joseph

Kevin,

It’s loosely referred to by different people in different ways. I forget how Jaynes referred to it. In addition to “high probability manifold” you might see “high probability region” or “highest posterior density intervals” (HPD) or who knows what else. But basically it’s: