The Amelioration of Uncertainty

## Mother Nature makes fools of Statisticians

The most dispiriting thing I’ve read in a while was this post by John Cook involving coin flips. The discussants are certain they have all the puzzle pieces and all that remains is to arrange them with the right prose. There’s not even a hint of awareness that they might be missing something.

Statisticians talk as though Nature generates coin flips from a model but she does no such thing. Real coin flips are about atoms and electrons. They’re electromagnetic forces, gravity, stray background fields. They’re wind, and sound, and percussion, impulse and impact. Muscle flexing, ATP-ADP reactions, and nervous system signals. Real coin flips involve all kinds of things, but what they don’t involve is “binomial models”. Those are a figment of the Statistician’s mind.

Since there is no such model in the real world, the key question is:

How easy is it for Mother Nature to fool the Statistician into believing a given model?

To make things precise, we can say data from n flips will fool the Statistician whenever they would fail to reject the null at the alpha = .01 level.

Now comes the interesting part. If the model is a strongly biased one (p_0 far from .5) then it’s very difficult for Mother Nature to bamboozle anyone, since only about .6% of the sequences in S_n would do so. The model p_0 = .5, on the other hand, is a different story. No matter how big n gets, 99% of the sequences in S_n would, if observed, fool the Statistician.
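These fractions are easy to sanity-check in Python. The test below is a minimal construction of my own (reject whenever the observed head count falls outside the smallest region holding at least 1 - alpha of the Binomial(n, p_0) mass), so the exact percentages depend on that choice:

```python
from math import comb

def fooled_fraction(n, p0, alpha=0.01):
    """Fraction of the 2^n sequences in S_n (all counted equally) whose
    head count lands inside an acceptance region for H0: p = p0 built
    from the most probable head counts -- i.e. the fraction of S_n
    that would fail to reject, "fooling" the Statistician."""
    # probability of exactly k heads under the hypothesized model
    pk = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    acc, mass = [], 0.0
    for k in sorted(range(n + 1), key=lambda k: -pk[k]):
        acc.append(k)
        mass += pk[k]
        if mass >= 1 - alpha:   # keep the smallest region with >= 99% coverage
            break
    return sum(comb(n, k) for k in acc) / 2**n

# p0 = .5: ~99% of all sequences fool the test, no matter how big n gets
for n in (100, 500, 1000):
    print(n, fooled_fraction(n, 0.5))

# a strongly biased model is nearly impossible to "confirm" by accident
print(fooled_fraction(100, 0.9))
```

For p_0 = .5 the binomial weights every sequence equally, so the acceptance region automatically contains ~99% of S_n at every n; for p_0 = .9 it contains only a vanishing fraction of S_n.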

In other words, there are countless knuckleheads running around thinking they’ve verified the “fairness” of a coin, when all they did was choose a model which virtually guarantees they’ll be conned by Mother Nature.

It’s at this point that Statistics becomes sublimely absurd. Flush with pride for having “objectively modeled” the “data generation mechanism” they conclude that after enough flips each of the 2^100 possible n=100 sequences would come up equally often. Stop for a moment, dear reader, and savor the Alice-in-Wonderland-like quality of this. They’re using an unphysical model, which will appear correct almost no matter what’s actually happening, to confidently predict the outcome of a trial that would take longer to perform than our solar system will exist.

There is an alternative to this pablum. Suppose I’d like to predict the frequency f of heads in the 100 flips I’m about to make, and I have no idea what’ll happen because I never measured anything (Statisticians never do!). Since 99.99% of all sequences in S_100 have a frequency of heads between .31 and .69, I’ll predict the observed sequence is one of those 99.99% and that f will lie in the same interval.

The fraction .9999 is a reasonable measure of how strongly my state of knowledge is consistent with the prediction .31 < f < .69.
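The 99.99% figure is nothing but a count of binomial coefficients, which you can check directly:

```python
from math import comb

total = 2 ** 100                                  # size of S_100
good = sum(comb(100, k) for k in range(31, 70))   # sequences with 31..69 heads
print(good / total)                               # ~0.9999
```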

This is the best, and most robust, guess I can make under those circumstances (*). It could be wrong, but inventing fantastical claims about hypothetical flips isn’t going to magically make it more right. If the observed sequence happens to be one of those .01% exceptions, the only way I’ll know ahead of time is by doing some real physics.

Simple, clear, true and focused entirely on real things that actually exist. That’s the alternative.

(*) It’s hard to explain this robustness because most Statisticians are lost-in-the-sauce, but here is a nonsense explanation which will nevertheless convince most scholars. Let F be the set of all frequency distributions (for a fixed total number of flips) on S_100 and imagine the observed sequence is a single draw from some f(x) in F. Then the statement .31 < f < .69 is extremely likely to hold almost no matter which distribution in F is used. This remains true even for the vast (vast!) majority of f(x)’s which differ radically from the uniform distribution.

(I just know some knucklehead will respond by claiming Statistics can be put on a sure foundation by imagining that f(x) is drawn at random from F. This just goes to prove there’s nothing quite as dumb as a smart person.)

October 28, 2013
• October 28, 2013 Joseph

Just to add to that nonsense robustness explanation:

For the vast majority of elements of F, the statement .31 < f < .69 will always be true without exception.

That’s because the vast majority of possible f(x)’s have support entirely contained within the "good" 99.99% part of S_100.
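That claim can be spot-checked by simulation. The sketch below is my own construction, not anything canonical: it stands in for “a randomly chosen element of F” by putting random Dirichlet(1,…,1)-style weights on a sample of sequences drawn uniformly from S_100:

```python
import random

random.seed(0)
GOOD = set(range(31, 70))   # head counts giving .31 <= f <= .69

def random_distribution(num_atoms=10_000):
    """Random weights on num_atoms sequences sampled uniformly from
    S_100 (each sequence is summarized by its head count)."""
    atoms = [bin(random.getrandbits(100)).count("1") for _ in range(num_atoms)]
    # normalized exponential weights ~ a flat Dirichlet over the atoms
    weights = [random.expovariate(1.0) for _ in range(num_atoms)]
    z = sum(weights)
    return atoms, [w / z for w in weights]

trials, hits = 100, 0
for _ in range(trials):
    atoms, probs = random_distribution()
    mass_good = sum(p for a, p in zip(atoms, probs) if a in GOOD)
    hits += mass_good > 0.99   # prediction nearly certain under this draw
print(hits / trials)
```

Almost every randomly drawn “distribution” puts essentially all of its mass on the good part of S_100, so the prediction .31 < f < .69 survives nearly every draw.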

You’re attacking a straw statistician. Sure, I imagine there are any number of knuckleheads “thinking they’ve verified the “fairness” of a coin”, but no statistician who actually understands what a hypothesis test is would think that. Nor would they “conclude that…each n=100 sequence would come up equally often” or “confidently predict the outcome of a trial”. The idea that absence of evidence is not evidence of absence (i.e. that many tests lack power) is something that every vaguely competent statistician understands.

• October 28, 2013 Joseph

There are literally hundreds of thousands of examples of real statisticians (Ph.D. types) doing exactly that on Google. I’ve had Frequentists throw this example in my face many dozens of times as evidence of their supposed ability to objectively verify the fairness of a coin. In fact, it’s hard to find even Bayesians who don’t think like this, however carefully they word their pronouncements for public consumption.

This is an extremely common demonstration in the classroom, and the frequency f drifting closer to .5 as n grows is taken as the definitive, objective “hard fact” proof of randomness in the real world. I don’t care if some people carefully caveat their hypothesis testing blah blah. I’m definitely not attacking any strawmen.

I’m attacking the way people really think, and because they think like this they make a huge mistake. The point of this post was to make explicit what that mistake is.

The fact is that the p_0=.5 model appears to be true not because it is, but because almost no matter what those atoms, electrons, sound waves, muscles, and so on do, it will almost always appear to “confirm” the model. Very few people really get this. It’s very close to saying “The Tooth Fairy model implies my next flip will either be a head or tail” and then after observing a head claiming “Prediction confirmed, so we have objective proof the Tooth Fairy exists”.

• October 28, 2013 Joseph

Seriously Konrad, almost all statisticians believe that f drifting closer to .5 as n increases is proof that nature is selecting sequences out of S_n “at random”, i.e. from a uniform distribution which is envisioned as a real frequency distribution.

Do you really think I’m just making this up? If so I encourage you to read Cook’s post and the comments again.

• October 28, 2013 Joseph

Let me rephrase it like this. You think I’m attacking people who misinterpret a failure to reject the null. But that’s not the target of the post.

I’m attacking people who think that the p_0=.5 “model” is something other than a slick and indirect way to count up how many elements of S_n lead to a < f < b.

Every Statistician I've ever met thought that the p_0=.5 model was, in fact, far more than just a trick for counting elements of S_n.

• October 28, 2013 Brendon J. Brewer

When a statistician talks about coin flips they are not talking about coin flips. They are talking about probability distributions. Making a model with a “success probability” which may not be 0.5 is a way of making the probability distribution over sequences non-uniform, which is useful because the uniform prior over sequences often implies absurd conclusions.

• October 28, 2013 Joseph

Exactly. The assumption p_0=.5 really means “I know the next sequence will be somewhere in S_n”. If you assume some other p_0 then that really means “I know the next sequence will be somewhere in S” for some appropriate subset S of S_n.

This is not a simple reinterpretation of the formalism. It’s easily possible for this to be true without any histogram of outcomes ever looking anything like IID draws from a binomial model.

There is a flexibility to Statistics which most people don’t know is there because virtually everyone’s intuition is completely dominated by frequentist ideas.

Ah, ok – then I’m with the Statisticians here: I do think the “p=.5 model” is far more than just a trick for counting elements of S_n. This is because in interesting applications we often have information to the effect that some elements of S_n are far more likely to be observed than others, so straight counting can be misleading. The counting idea only works in the absence of strong bias in favour of particular subsets of S_n.

The generative modelling “trick” allows us to make progress in cases where such a strong bias may be present but where we are willing to make an exchangeability assumption. The claim is that it is useful (for purposes of prediction and/or explanation) to model the coin tosses as exchangeable in the sense that our best prediction for the outcome of an as-yet-unobserved toss does not depend on any information (such as toss number) that may be different for different tosses – an assumption of this sort underlies _every_ generative model. Once we make this assumption we get the useful fiction of p (the limit, as the data set D goes to infinity, of the predictive probability P(H|I_prior,D)), for free. It is a fiction because D never really goes to infinity, but it is useful because in many applications D can be large enough to result in P(H|I_prior,D) (which is sometimes substantially different from 0.5) not being sensitive to small changes in the size of D – unfortunately this leads many people to think p is just a property of nature and not a function of information at all, and this common misconception persists because it does not really render generative models less useful in practice.

Of course any model assumption can be poor when it does not apply to the phenomenon being described. But what makes this type of assumption attractive is that it allows us to ignore a lot of typically-not-very-useful information (the fact that the experimenter coughed before toss nr 273; the fact that voter nr 1004657 was wearing faded jeans). The problem in real-world situations is not that we know too little, but that we know too much, with most of what we know being irrelevant (or hard to make use of). Generative models are a way of specifying hypotheses about what information is irrelevant for inference without assuming that S_n is free of strong bias.

• October 28, 2013 Joseph

“The counting idea only works in the absence of strong bias in favour of particular subsets of S_n”

Absolutely untrue!!! In fact, the exact opposite is true to an extraordinary degree. Look at the robustness stuff and my first comment again. The vast, vast, vast, vast, vast majority of “strong biases” will make .31 < f < .69 true, and I know this precisely because this inequality holds for 99.99% of all elements of S_n. In most cases, a “biased” distribution f(x) on S_n will never make that inequality false because most such distributions put all their mass inside that 99.99%.

Your second and third paragraphs are at best unnecessary. If 99.99% of all elements of S_n have the property that .31 < f < .69, then it’s no big surprise when the observed f satisfies it too. What more do you need?

“Exchangeability assumptions” and “generative models” don’t make this one bit more credible. They do, however, obscure a number of trivial facts to such an extent that even top-notch mathematicians can’t tell up from down anymore.

One example of this phenomenon is given in the post. Some statisticians believe a “failure to reject p=.5 at the alpha=.01 level” means that p does equal .5. Other statisticians caveat this by saying effectively, “we have evidence for p=.5” but we haven’t actually proved it and it could be wrong.

The truth is that the p=.5 model is consistent with 99% of everything that could be true no matter how many data points are collected. This model isn’t “right” or “wrong” in the sense that statisticians might argue about it as in the previous paragraph, it’s just vacuous (or very nearly so).

It’s like me coming up with the “Wilson Weather Model” which always predicts “it will either rain or not rain tomorrow” and then bragging about how my predictions always come true.

“The vast, vast, vast, vast, vast majority of “strong biases” will make .31 < f < .69 true" – er, yes, but those are uninteresting cases. Say we are analysing voting behaviour – we don't care nearly as much about the sort of bias that will cause votes to cluster by time of day as we do about the sort of bias that will influence the result of the election (if we do care about the former we can expand the model accordingly). It doesn't matter that the cases you are excluding are a small minority when those are the cases of actual interest. Also, specifically in the election example, we can have cases where we reliably infer, say, .51<f_true<.55 – and this is hugely informative despite knowing that the model assumptions are quite far from being true in the application under study.

"Some statisticians believe a “failure to reject p=.5 at the alpha=.01 level” means that p does equal .5. Other statisticians caveat this by saying effectively, “we have evidence for p=.5” " – only if they are incompetent. Without denying that incompetent statisticians exist, I don't think we need to be overly concerned about claims that are plainly false – it's more important to focus on the many statistical claims that are true but unhelpful.

• October 29, 2013 Joseph

I’ll take these separately beginning with the second paragraph.

“only if they are incompetent”

Every statistician I’ve ever met took a failure to reject the null as partial evidence for the null. Every textbook I’ve ever seen did the same. Every paper that mentioned hypothesis testing, whether theoretical or applied, did the same.

But whatever. You’re missing my real point. If you don’t like the way I worded it then substitute:

“failure to reject the null means XXXXX”

and put whatever your favorite formulation is in for XXXXX.

Now my point is that the p=.5 model doesn’t “mean” anything. If interpreted as a meaningful model, it’ll be consistent with almost anything that could physically be happening. It will appear predictively valid almost no matter what.

It isn’t even really a model. It’s actually just an indirect way to count states in S_n. Once you realize that, a great deal of statistical insanity just evaporates. The subject becomes a good deal simpler, and a host of new opportunities present themselves.

“It will appear predictively valid almost no matter what.”

But that’s just not true. In the voting example, it would imply that elections are uncallable almost no matter what. But in reality, election results are predictable more often than not. In almost all cases, p=.5 is a silly model that can be rejected a priori, but if we agree to nonetheless take it seriously (e.g. as an approximation for .499<p<.501) the data often force its rejection anyway. Are you denying that countless examples exist where p=.5 can be rejected?

• October 29, 2013 Brendon J. Brewer

“Every statistician I’ve ever met took a failure to reject the null as partial evidence for the null.”

That’s actually correct, if the data is more probable if the null is true than it is if the null is false (to do this you need a model that includes both the null and the alternative in a bigger hypothesis space). Bayesian inference 101.
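As a quick sketch of this “Bayesian inference 101” point (the uniform prior on p is just an illustrative choice of alternative, not the only one): compare P(data | p=.5) against the data’s marginal probability when p is free to roam over [0,1].

```python
from math import comb

def bayes_factor_null(n, k):
    """Bayes factor for H0: p = .5 against H1: p ~ Uniform(0, 1).
    Under H1 the marginal probability of k heads in n flips integrates
    to exactly 1/(n+1), independent of k."""
    like_null = comb(n, k) * 0.5 ** n
    like_alt = 1 / (n + 1)
    return like_null / like_alt

print(bayes_factor_null(100, 50))   # > 1: 50/100 heads favours the null
print(bayes_factor_null(100, 30))   # << 1: 30/100 heads favours the alternative
```

A different prior on p would give a different Bayes factor, which is exactly why the alternative has to be specified before “evidence for the null” means anything.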

Not entirely correct, because we have no calculation for deciding how probable the data are if the null is false – we can only calculate this for specified alternative models. So we can have evidence favouring the null over a particular alternative, but the same data will also favour many other models over that alternative.

• October 29, 2013 Brendon J. Brewer

“we have no calculation for deciding how probable the data are if the null is false”

Yes, that needs to be defined. I thought I made that clear in my comment.

• October 30, 2013 Vilnis

Guys, guys…
I’m not a statistician but let me say that all of you are too clever to accept that you are way too clever.

• October 30, 2013 Joseph

You’re not getting what I’m saying. Rather than say it all over again, I’ll try some different angles in following posts, beginning with the post “Max Planck and the Foundations of Statistics”. I think though you’re underestimating how radically you’ll need to rethink the half Bayesian/half Frequentists amalgam of crap that’s taught in stat departments.

• October 30, 2013 Joseph

Although Konrad, maybe I’ve got an idea which may clarify things. Suppose that in those 100 flips there is something, either manmade or natural, constraining the flips to be:

First 30: heads
Last 70: tails

or variations which differ by no more than three flips from this one.

I think we’d all agree this in no way physically resembles what Statisticians have in mind when they think of p=.5 as a physical model.

And yet, if this is the physical mechanism generating coin flips, then we would NEVER reject the null p_0=.5 at the alpha =.05 level.

Moreover, if we make a prediction using, say, a 95% confidence interval for f, then the true f will be in that interval. In other words, the p_0=.5 model will always appear to be predictively successful.

You might object that we could easily see something fishy was going on here. But I can easily find an equally small subset of S_n, whose elements have flips in a different order that looks kind of “jumbled”, for which we’d never know.

I wouldn’t make that objection, because I don’t think the ordering issue is what we’re discussing here – if order matters, models that make an explicit assumption to the contrary (e.g. as I formulated above) are clearly not appropriate.

You are describing a situation with a fixed small sample size (100). It’s not at all surprising that p=.5 cannot be rejected with this sample size; equally important is that an infinite number of alternative models (including p=.3, which is the best-fitting one among those that ignore order) cannot be rejected either. That’s why it makes no sense to say that the data support p=.5: there are infinitely many alternative models that are supported even better. The only sensible conclusion is that any of a large number of models remain viable – in this example, the data are uninformative – a state of affairs all (competent) statisticians are comfortable with.

The standard Bayesian approach to prediction in this context is to average over all models under consideration, using a prior on model space – my impression is that you would consider this part of the “half Bayesian/half Frequentists amalgam of crap” – could you clarify?

• October 31, 2013 Joseph

I think your second paragraph gets at the crux of where we disagree.

“You are describing a situation with a fixed small sample size (100). It’s not at all surprising that p=.5 cannot be rejected with this sample size;”

The sample size is irrelevant. Replace 100 with N=”every coin toss that will ever be made under given conditions”. N could be 1 billion. Let the first 30% be heads and last 70% be tails. Repeat everything except now using this N. Everything I said still holds true.

“equally important is that an infinite number of alternative models”

You’re still not getting me. You want to interpret these as models and you propose your favorite interpretation. I don’t care what your favorite interpretation of them as a physical model is, because I claim it isn’t a model at all!

“That’s why it makes no sense to say that the data support p=.5: there are infinitely many alternative models that are supported even better.”

No. Surprisingly it DOES make sense to say the data support p=.5! What happens if you do so? If you did, and then used p=.5 to create 95% CI’s (or Bayesian intervals), you would get intervals which always contained the true frequency and hence always appeared to be predictively accurate.

So how can I say p=.5 is predictively accurate and also claim it isn’t a model at all? Well, because p=.5 means nothing physically. It’s simply a way to tell the machinery what subset of S_n to count over (loosely speaking).

In this case p=.5 is saying “count over all of S_n”. If you do this and then only make predictions which are compatible with the vast majority of S_n, then it’s no surprise those predictions turn out to be true even if the sequences actually come from a tiny subset of S_n.

“The only sensible conclusion is that any of a large number of models remain viable – in this example, the data are uninformative – a state of affairs all (competent) statisticians are comfortable with.”

If you say the data is uninformative then that’s a very interesting conclusion. As stated at the beginning, I could make N=1,000,000,000 or N=”every flip that will ever be made” and repeat the whole exercise and get exactly the same result. Are you saying that no amount of coin flip data could ever be informative?

Ah, ok, I hadn’t realized you’re using that poor a test. I guess I assumed you would use something more sensible, such as a likelihood ratio test with p=.5 as null and p a free parameter as alternative – in that case, extra data would eventually lead you to reject the null. But instead you are testing a model on its own rather than against a null – the silliness of this is discussed by Jaynes (e.g. section 5.5. of PTTLOS). What’s relevant here is that this test is so underpowered that (as you point out) it never rejects the model in the limit of an infinitely large data set. With such an underpowered test it’s even more ridiculous to say that failing to reject the model counts as supporting it – such a claim could only be true in a very loose (and not very useful) sense.

You go on to give a sense in which the model _can_ be said to be supported. To paraphrase: if the model produces predictions that pass a specific not-very-stringent quality test, we say it is supported.

I suppose that’s fair enough (you’re free to define the notion of support any way you like), but it’s not a comforting interpretation of “supported”. If quality of predictions is what you care about, it should bother you that the predictions are severely suboptimal in many senses. Using this model for virtually any sort of prediction other than specifying a 95% interval will be a disaster. And no competent statistician would do so.

Re whether it is a model: clearly (even according to your own description) the statisticians you are criticising interpret it as a model. Your argument is that there exists an alternative interpretation in which it is not a model – but here you are being a bit loose with the “it” under discussion – it’s something nebulous which you are referring to by the label “p=.5″, but it’s unclear whether the thing you name this way is the same as the thing statisticians name this way. Exactly which conceptual object is it that you claim is not a model? If you are referring to a specific prediction methodology, everyone will agree because a prediction methodology is not a model. (Different people will define “model” differently, but if we need a concrete definition we could say a model is a function mapping potential future observations to probabilities. Or that it is a mechanism for simulating future observations. Or that it is a fictitious simplified description of how observed phenomena come about. Or something else. Take your pick.)

• November 2, 2013 Joseph

I screwed up the previous explanation, which may have derailed the conversation a bit.

The p=.5 model is very special. For any 5% test, only 5% of S_n, if observed, would cause you to reject H_0: p=.5 (this isn’t true for other values of p!). This fact is independent of how big n gets. Think about that for a moment.

People think choosing p=.5 in the binomial model is some kind of physical statement. They think the truth of this physical statement is then verified when they make predictions using 95% intervals and those predictions turn out to be true.

What’s actually driving everything here is a simple counting argument of the form “if almost all of S_n leads to X, then predict X”. That’s all there is to it. Nothing more. There’s no physical content to this. It’s equivalent to saying “we have no idea where in S_n the next sequence will be, so we’re only going to make predictions which hold true for almost everything in S_n”.

The fact that this is only a simple counting argument is totally obscured by all this talk of probability models. While using p=.5 in the binomial model may seem like a probability statement, it’s actually just a slick way of performing the counts needed for the counting argument. It’s just an indirect way of calculating which predictions are going to be true for almost all elements of S_n. That’s why the choice of .5 is so crucial, because if you use other values for p in the binomial model it won’t count the elements of S_n correctly.
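The “slick way of counting” can be seen in a few lines of Python: under p=.5 every sequence gets weight 2^-n, so the binomial probability of any event is exactly the fraction of S_n inside that event (the particular event below is an arbitrary choice of mine for illustration):

```python
from math import comb

n = 100
event = range(40, 61)   # an arbitrary event: sequences with 40..60 heads

# direct count of the event's share of S_n
count_fraction = sum(comb(n, k) for k in event) / 2 ** n

def binom_prob(p):
    """Probability of the event under a binomial model with parameter p."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in event)

print(count_fraction, binom_prob(0.5))  # identical: p=.5 just counts S_n
print(binom_prob(0.6))                  # different: no longer a counting measure
```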

This might seem like a mere reinterpretation, but it’s not. That “physical statement” which most people have in their mind when they use p=.5 is something like “the coin is fair and there is something called randomness in nature which makes each element of S_n come up equally often if you were to flip it long enough”.

To see the difference between these two interpretations, suppose that physical picture is radically false. Suppose there are unknown laws of physics confining those coin flips to a tiny subset of S_n. Most statisticians imagine this would totally screw things up because the true frequency distribution is radically different from the assumed binomial model with p=.5. But – and this is key – it DOESN’T screw up the counting argument.

Why? Because we only made predictions of the form “if almost all of S_n leads to X, then predict X”. If that tiny subset is in the majority part of S_n compatible with X, which is what usually happens, then our prediction will turn out to be a good one. In fact the prediction will be correct 100% of the time.

I agree with pretty much all of that. You are describing a baseline (uninformed) information state with no physical content and pointing out that even this can lead to reasonable predictions. This is what Jaynes called the “poorly informed robot” (Chapter 9 of PTTLOS). I agree that what one gets out of this is _surprisingly_ good.

But that is not to say we can’t do _even better_ by incorporating additional information, in the form of physical claims about the system under investigation, when such information can be obtained. In most coin-tossing applications, it makes sense to add a simple exchangeability assumption which (with sufficient data and if the observations are split 30-70) will _further improve_ the quality of the predictions that can be made. If your prediction takes the form of an interval, the improvements will typically be in the form of making the interval tighter while keeping its reliability constant. Face it: your prediction that f_true will lie between .31 and .69 is not very tight.

“People think choosing p=.5 in the binomial model is some kind of physical statement.” – I agree: this corresponds to a physical assumption. That it happens to lead to the same calculations as one would perform in the uninformed information state is a coincidence that can lead to much confusion.

“They think the truth of this physical statement is then verified when they make predictions using 95% intervals and those predictions turn out to be true.” – That would be a glaring error in reasoning, and I still don’t believe it is representative of competent statisticians. We are talking about a case where multiple competing hypotheses can be constructed, all of which would make predictions that turn out to be true. Why would a statistician think that one of these should be preferred over the others? I’m more inclined to think that they interpret the data as validating their prediction _methodology_ – this is a much weaker claim, because it doesn’t require them to think that their methodology is optimal or “correct” in any sense, just that it is useful. In my experience, statisticians often take a relativist perspective in these situations: “your methodology is useful for your applications, my methodology is useful for my applications, let’s leave it at that” (e.g. Andrew Gelman often makes this kind of comment). I am not a big fan of this strategy, but it is hard to criticise.

• November 3, 2013 Joseph

““They think the truth of this physical statement is then verified when they make predictions using 95% intervals and those predictions turn out to be true.” – That would be a glaring error in reasoning, and I still don’t believe it is representative of competent statisticians.”

I think, Konrad, we’re much closer to being on the same page than before, but you still don’t quite get the point of what I’m saying.

Suppose someone does fail to reject the p=.5 null and makes the “glaring error in reasoning” that you’re talking about (which, I stand by my claim, is exactly what everyone does: namely, when they fail to reject this null, they’ll use .5 to construct 95% CI’s for future frequencies).

You say this is a “glaring error of reasoning”. On one level that’s true. If anything it’s an understatement. To the extent that they think this is confirming their mental physical picture of the situation it could so easily be wrong that it’s almost certainly an error.

But that doesn’t mean it’s wrong for them to do this. Exactly the opposite in fact. Assuming p=.5 is exactly what they should be doing. Why? Because underneath all the metaphysical clutter there is a very simple counting argument driving these inferences. And if you look at this whole situation from the point of view of that counting argument, then using anything else other than p=.5 is far more likely to lead you astray than to improve your predictions (unless of course your data is extreme: an f near 0 or 1, or something like that).

So what’s the point of me saying all this? It’s that you’re far better dropping all the model talk and simply thinking of everything in terms of that very simple, concrete counting argument. It clears up all the mysteries and never leads you astray, and never involves any metaphysical angst.

If you do have some evidence that only a subset of S_n is possible or likely, then that simple counting argument can be used on that subset without any major differences. In practice, carrying this out will require a “distribution” on S_n, but this “distribution” is conceptually and numerically so different from a frequency distribution that it’ll confuse the hell out of anyone who retains too much Frequentist intuition and doesn’t understand the point I’m trying to make here.

I think you’re missing my point about the counting argument being different from p=.5 even though it leads to the same calculations. Instead, the counting argument corresponds to working with a uniform (or other uninformative but symmetric) probability distribution on p. Working with p=.5 is based on the assumption that we know something strong and specific (our information state is peaked). Working with an uninformative distribution is based on _not_ knowing anything specific (our information state is flat). But it so happens that 0.5 is the mean of any of these symmetric distributions, and because of this many calculations end up being the same.

I seem to remember a thread sometime this year on Andrew Gelman’s blog where he linked to one of his papers that got all confusing because these two situations (knowing that p=.5 vs not knowing p and averaging over the ignorance to get .5) were conflated. Can’t find it now, unfortunately.

• November 4, 2013 Joseph

I’ve been talking this entire time about p=.5 in the binomial model. Sometimes I said it explicitly. I wasn’t referring to other models where Pr{heads} = .5.