The Amelioration of Uncertainty

## If we observed p_hat = .46, why do we use p=.5?

I aim to commit statistical sin. I’m going to accept the null hypothesis for no other reason than that I “failed to reject it”. Having tarnished my reputation with that, I’ll finish by ignoring the only data available and basing everything on non-informative priors, which prominent authorities assure us don’t even exist. Let the debauchery begin.

Consider a typical classroom coin flip experiment. The teacher does no physics and takes no measurements. Instead they flip a coin 100 times, observe the frequency of heads p_hat, and use it to create a 95% probability interval for the frequency in the next 10,000 coin flips. Barring some bad luck, the utility of Statistics will be confirmed for the students when they see their frequency fall in the predicted interval.
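Here is roughly what that demonstration looks like in code (a sketch; the fair coin, the seed, and the normal approximation for the interval are my own choices):

```python
import random

random.seed(1)

def flip(n, p=0.5):
    """Flip a simulated coin n times; return the observed frequency of heads."""
    return sum(random.random() < p for _ in range(n)) / n

# Step 1: the class flips 100 times and records the frequency p_hat.
p_hat = flip(100)

# Step 2: build a 95% probability interval for the frequency of heads in
# the next 10,000 flips using p=.5 (normal approximation: sd = .005).
half_width = 1.96 * (0.5 * 0.5 / 10_000) ** 0.5   # ≈ .0098
interval = (0.5 - half_width, 0.5 + half_width)

# Step 3: flip 10,000 more times and see whether the prediction held.
f = flip(10_000)
print(p_hat, interval, interval[0] <= f <= interval[1])
```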

After the first 100 flips we get p_hat = .46. Using the binomial model we’d fail to reject p = .5 with a p-value of .2. But as statistical harpies remind us constantly, failing to reject is not the same as accepting p=.5, and moreover, probabilities are to be equated with frequencies.

So should the observed frequency .46 be used to construct the 95% interval instead of .5?

Hell no! Remember we’ve done no physics here. So we have no idea which element of S (the set of all 2^10,000 possible sequences of heads and tails over the next ten thousand flips) we’ll see. What we do know is that the binomial calculation implies the following, writing f(s) for the frequency of heads in a sequence s:

    #{s in S : f(s) in the 95% interval built from p=.5} / 2^10,000 = 95%
    #{s in S : f(s) in the 95% interval built from p_hat=.46} / 2^10,000 ≈ 0.00000008%

Since the sequence the class is about to observe is going to be in S somewhere, using p_hat = .46 dramatically increases the opportunities for the demonstration to fail. Those impressionable students may conclude statistics is a waste of time after all.
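The coverage figure can be checked directly: build the 95% interval from Binomial(n=10,000, p=.46) and ask how much probability it captures under p=.5 (a pure-Python sketch; the helper names are mine):

```python
import math

N = 10_000

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) pmf, computed in log space for stability."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def equal_tail_95(n, p):
    """Equal-tail 95% interval [a, b] for the number of heads."""
    cdf, a, b = 0.0, None, None
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)
        if a is None and cdf > 0.025:
            a = k
        if b is None and cdf >= 0.975:
            b = k
    return a, b

# 95% interval built from the observed p_hat = .46 ...
a, b = equal_tail_95(N, 0.46)

# ... and the fraction of S whose frequency lands in it, i.e. the
# probability of that interval under Binomial(N, .5).
coverage = sum(binom_pmf(k, N, 0.5) for k in range(a, b + 1))
print(a / N, b / N, coverage)   # interval ≈ (.45, .47), coverage ~8e-10
```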

Frequentists think the cold hard facts of frequencies trump any Bayesian philosophical nuances. In their mind if frequencies f do exist in a problem, then p=f and it’s all over for Bayesians but the crying. Examples like this previous post showing otherwise seem to have no effect on them.

Here we have a different example. The only data on the coin is the frequency of heads. Yet we’d be damned fools to equate it with the probability of heads, and are better off using a value based on a data-free ignorance prior over S. Frequentists in practice aren’t unwise enough to follow their own philosophy or texts. They’d quietly drop the data and subversively accept the null like the rest of us rapscallions.

Perhaps their use of p=.5 comes from their magical ability to intuit that each element of S would come up equally often over many repeated runs of 10,000 flips. How they stumbled upon this curious “fact” remains a mystery. Maybe it was written on the back of the Ten Commandments.

Lucky for them they never try this legerdemain using p_hat = .48. They’d fail to reject the null with a p-value of .38, but since the 95% interval constructed from p=.48 is only consistent with 2% of S, they’d have some explaining to do unless they got very fortunate.
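Both the p-values and the 2% figure can be reproduced with the normal approximation (a sketch; the one-sided p-value with continuity correction roughly matches the numbers quoted):

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def p_value(heads, n=100):
    """One-sided p-value for testing p=.5, normal approximation with
    continuity correction."""
    return norm_cdf((heads + 0.5 - n / 2) / math.sqrt(n / 4))

def coverage_under_half(p_hat, n=10_000):
    """Probability, under p=.5, that the future frequency f lands in the
    95% interval built from p_hat (normal approximation throughout)."""
    sd_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = p_hat - 1.96 * sd_hat, p_hat + 1.96 * sd_hat
    sd_half = math.sqrt(0.25 / n)
    return norm_cdf((hi - 0.5) / sd_half) - norm_cdf((lo - 0.5) / sd_half)

print(p_value(48), coverage_under_half(0.48))   # ~.38 and ~2%
print(p_value(46), coverage_under_half(0.46))   # ~.24 and ~0.00000008%
```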

If Statistics can get this knotted up over the binomial distribution, then how could it ever be untangled in real applications? The mess is avoided entirely by understanding what probabilities really are. Not only are they conceptually different from frequencies, they’re usually unequal even when all we have to go on is frequencies.

If we are ignorant as to which element of S will show up, then we’d better only make predictions consistent with the vast majority of S. That’s all the Bayesian non-informative prior and procedure achieves. For the life of me I can’t figure out why this is so hard to understand or why so many think it’s metaphysical nonsense.

UPDATE: Just to emphasize, accepting p=.5 is an extremely good idea even though it has a p-value of .2. Accepting p=.48 is a very bad idea even though it has a p-value of .38.

November 6, 2013
• November 6, 2013 · Jake

With 46 successes in 100 observations, I get a 95% CI for p of [0.36,0.56]. Could you explain how you get the coverage result of only 0.00000008%?

• November 6, 2013 · Joseph

Sure, the problem wasn’t asking for a Confidence Interval for p.

Rather it was asking for an interval containing the future frequency f of heads with 95% probability, assuming a binomial distribution with p=.46. This is just a straightforward probability computation using the Binomial Distribution with known parameters. In other words we want a and b such that:

    P(a <= f <= b | p=.46, n=10,000) = .95

Given an interval (a,b) with 95% probability, we can then ask how many elements of S would actually give us an f in that interval. Denote a sequence in S by s and the frequency of heads in that sequence by f(s). Then:

    #{s in S : a <= f(s) <= b} / 2^10,000 ≈ 0.00000008%
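In code, the two computations look like this (a pure-Python sketch using normal approximations; the variable names are mine):

```python
import math

heads, n, N = 46, 100, 10_000
p_hat = heads / n

# Jake's computation: a 95% confidence interval for the parameter p,
# based on the 100 observed flips (Wald interval).
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci_p = (p_hat - 1.96 * se, p_hat + 1.96 * se)        # ≈ (0.36, 0.56)

# The post's computation: a 95% probability interval for the future
# frequency f of heads in N = 10,000 flips, assuming Binomial(N, .46).
sd_f = math.sqrt(p_hat * (1 - p_hat) / N)
ab = (p_hat - 1.96 * sd_f, p_hat + 1.96 * sd_f)      # ≈ (0.450, 0.470)

print(ci_p, ab)
```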

• November 6, 2013 · Joseph

Also, I didn’t mention it in the post, but I was using the usual equal-tail values for a and b. In other words:

    P(f < a | p=.46) = .025

and

    P(f > b | p=.46) = .025

• November 6, 2013 · Daniel Lakeland

Implicitly you’ve assumed that in actual fact all elements of the set of possible sequences are equally important when you divide by 2^10,000. If we had some physical knowledge of the coin, like say a certain bend to it, we would want to use a different measure. We may not be anywhere near as certain (prior probability) as we are in the “best case” fair coin.

In well-tossed coin flip experiments (newly minted coin tossed at least 4 feet high, rotating at least 3 times per second, angular momentum vector nearly in the plane of the coin, and allowed to bounce off a large flat unobstructed hard floor, for example) we know that Binomial with p = .5 + ε, with ε being “quite small”, is a good model for the heads counts of two-faced coins. Observing something else in a short run like 50 or 100 flips doesn’t convince us otherwise.

But if we knew the coin was bent, and Persi Diaconis was flipping the coin, we would have reason to believe something different, but wouldn’t know what; i.e., the prior on p would be much less sharply peaked.

It’s well known that many Frequentist results are equivalent to Bayesian results with some kind of flat prior. In many real world problems the Bayesian prior is pretty flat because we just don’t understand the problem as well as a physical coin flip. I still think the philosophical separation of probability and frequency is important in those cases, but I don’t think starting from the physical coin flip argument is as convincing for those other cases.

• November 6, 2013 · Joseph

“Implicitly you’ve assumed that in actual fact all elements of the set of possible sequences are equally important ”

No I’m not. I’m really not. I’m assuming that if we have no idea what the next sequence in S will be then we need to stick to predictions or inferences which are consistent with almost all of S.

That’s what the Bayesian blather does. That’s all it does. It really is that simple.

All you have to do to see it is just let go of the idea that probabilities have anything to do with frequencies.

• November 6, 2013 · Brendon J. Brewer

“If we observed p_hat = .46, why do we use p=.5?”

A certain kind of prior information (Jaynes’ poorly informed robot) implies that’s the right thing to do. If you had different prior information then continuing to use p=0.5 would not be correct.

• November 6, 2013 · Joseph

Brendon,

(1) The prior is so strong that even when (frequency) data is collected, we’d be extraordinarily wise to just use the prior going forward.

(2) I don’t think Frequentists get the meaning of your statement at all, and I think Bayesians only get about half of its meaning.

A more concrete and more usable version is to say this: if we have no idea what sequence in S will come up, we’d better stay with predictions which will be true for almost every sequence in S.

Or in a different context: if we have an element of S that we haven’t observed directly, and we have no idea which one it is, then we should only make inferences which will be true for almost every sequence in S.

• November 6, 2013 · Daniel Lakeland

“If we have no idea what sequence in S will come up, we’d better stay with predictions which will be true for almost every sequence in S.”

What I meant in my previous statement was that the “if we have no idea what sequence” part has to hold. If we are pretty confident that every 3rd time we flip we are going to get a heads due to some specifics of the flipping machine, then we could ignore huge swaths of S, and S is not an appropriate reference set.

• November 7, 2013 · Daniel Lakeland

The whole trick to modeling as I see it is to incorporate your knowledge into the model so that whatever is “left over” is largely ignorance, at which point simple uninformative priors and binomial likelihoods and similar things work because they are more or less an effective account of ignorance.

• November 7, 2013 · Joseph

Daniel,

I only think that’s half the story. Bayesians have been concentrating strongly on that half and missing the other half (which is more useful going forward).

The Bayesian machinery isn’t just taking account of ignorance. It’s doing something very specific with it. Namely it’s only making predictions/inferences which are very insensitive to what the specific truth actually is.

As evidence that Bayesians don’t really get this, I’ll cite the last paragraph of the previous comment:

“‘if we have no idea what sequence’ has to hold. If we are pretty confident that every 3rd time we flip we are going to get a heads due to some specifics of the flipping machine then we could ignore huge swaths of S and S is not an appropriate reference.”

Actually, most of the time it is. Suppose we predict the 95% interval (a,b) using p=.5. This will be consistent with 95% of S. Now suppose that we do know in fact that only some subset Q of S is actually possible.

Is that (a,b) prediction bad all of a sudden? Well, most of the time it’s perfectly good. If Q lies entirely inside that 95% of S consistent with (a,b), then this prediction will always seem like a good one.

The only scenario that changes this is if we know Q overlaps heavily with that 5% portion of S.

Knowledge of Q would allow us to predict other things better, but as long as the accuracy implied by the length of (a,b) was good enough for what we were doing, most of the time we could happily ignore Q even if we knew about it.

This directly corresponds to the fact that even if we know the position and momentum of every particle in an Ideal Gas, if all we wanted to know was how V related to T, we could just ignore that information and get pV=nRT using stat mech.

Because of comments like this one and a previous one from Shalizi, I’ve been thinking about a post on how effective a tool throwing away information is in practice. It’s badly underused because frequentist intuition makes people think there’s only one unique probability distribution.

• November 7, 2013 · Daniel Lakeland

Throwing away information can be useful to simplify things, but there are plenty of cases where we want to eke out every drop of information. From the terminology in your post a while back, the “truthfulness” criterion still holds in your example, but the “informativeness” is not necessarily sufficient.

I think it’s fine to throw away information when the resulting model is “informative enough for our purposes”. We may not really need to know, say, the concentration of a pollutant to 3 significant figures; 5% accuracy might be plenty good enough. On the other hand, if our simplified throw-away-the-info model only gives us accuracy to 30%, then we need to rethink.

In most practical problems I have been involved in the goal is to figure out how to incorporate more information so we can get more informative results.

On the other hand, I’ve seen people working on stochastic models designed to give information about things like the fluctuations in strength caused by heterogeneous materials like concrete, and they wind up with very very complex models that are probably too informative. They inform you about what your assumptions about the concrete imply but those assumptions (a generative model for heterogeneity) are themselves not terribly accurate.

• Konrad

The Bayesian answer is to use a posterior over p. With a uniform prior and 100 tosses the posterior will still be pretty flat: ever so slightly peaked at .46, but with .5 still in the high-probability region. Now consider three approximations to this answer:

1) delta function at .46
2) delta function at .5
3) uniform distribution

Of these, answer 3 is the best approximation to the true posterior. This is the maximum entropy solution, which makes sense if we assume that the data tell us nothing about the system (i.e. past observations are not predictive of future observations). Even when this assumption is false, it’s still a decent approximation whenever we have insufficient data to get a sharply peaked posterior.

Answers 1 and 2 are terrible, with 2 even (slightly) worse than 1. The post points out that 1 performs poorly for predicting future observations. But it also points out that (contrary to what one might expect, given that it is a terribly inaccurate description of the available information) 2 gives good prediction results (for this particular setup). But this is _not_ because 2 is a good model. Rather it is because, in this particular setup, the predictions of 2 are identical to those of 3 (even though the model is very different).

So, no, the correct answer is not to use p=.5. It just so happens that doing this gives us the right answer for the wrong reason.

• November 8, 2013 · Joseph

Konrad, we seem to have the same gift for making mountains out of molehills.

We start out not knowing which element of S we’ll see. Technically we do learn a little something from the 100 flips, but it’s so little we could ignore it in this case. So we’re basically left having no idea which element of S we’ll see.

If we have to pick an interval for that future frequency f, we’d better pick one that’s consistent with almost all the elements of S. So after doing some counting we find an interval which will be consistent with 95% of S. If f is in that 95% majority we’re going to look like Nostradamus.

That’s it. That’s all this problem requires. We don’t know what the future will be so we just predict things that will be true almost no matter what.

In practice, when we do this counting it can be tricky. Maybe not so much for this problem, but in general it is tricky to do these counts and make full allowance for all the evidence. So to get these counts we need a more general tool or mathematical machinery.

The trick works like this. First, since we intend to count elements of S, we put a uniform distribution on S. This distribution has none of the meanings that either Bayesians or Frequentists attach to it. It’s simply a way to count elements of S and nothing else. It has no other meaning or purpose.

That distribution on S then induces a Binomial(p=.5, n=10,000) distribution on f. Again, this distribution has none of the usual meanings or connotations. It’s simply a tool for indirectly counting elements in S. So to get an interval which is consistent with 95% of S, all I have to do is find an interval that contains 95% of the probability mass computed using Binomial(p=.5, n=10,000).
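If you doubt the counting claim, it can be verified by brute force for a small n (n=12 here, my choice, so that all 2^n sequences can actually be enumerated):

```python
import itertools
import math

n = 12  # small enough to enumerate every sequence in S exactly

# Count, for each possible number of heads k, how many of the 2^n
# sequences have exactly k heads.
counts = [0] * (n + 1)
for seq in itertools.product((0, 1), repeat=n):
    counts[sum(seq)] += 1

# The fraction of S with k heads equals the Binomial(n, p=.5) pmf:
# the uniform "distribution" on S is nothing but a counting device.
for k in range(n + 1):
    frac = counts[k] / 2 ** n
    pmf = math.comb(n, k) * 0.5 ** n
    assert math.isclose(frac, pmf)
```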

Note two things:

1: The p=.5 “model” doesn’t accidentally work. It works for a damned good reason (specifically, it’s a math trick for counting elements of S). The reason can be easily and massively exploited in other problems.

2: You can add all the Bayesian and Frequentist baggage to this you want, but the argument as written is still just as valid without it. You’ve achieved nothing by adding this baggage except to confuse something that’s actually incredibly simple.

So why add a bunch of crap that isn’t needed?

• November 9, 2013 · Brendon J. Brewer

“The p=.5 “model” doesn’t accidentally work. It works for a damned good reason ”

Except when it doesn’t, which is all the time.

• November 9, 2013 · Brendon J. Brewer

“The trick works like this. First since we intend to count elements of S we put a uniform distribution on S. This distribution has none of the meanings that either Bayesians or Frequentists attach to it.”

Attaching the meaning of prior beliefs to these prior probabilities helps because it tells you where you need to look if you want to use something non-uniform (which you should use in many many situations).

• November 9, 2013 · Joseph

Brendon,

“Work” has different senses. In that context it specifically meant “correctly counted elements of S”. Or you could even have taken it to mean “it works as well as anything can when all we know is that the sequence will be in S”.

Obviously it doesn’t mean “always makes accurate predictions”. Our state of knowledge precludes the possibility that we could guarantee that.

“Attaching the meaning of prior beliefs to these prior probabilities helps because it tells you where you need to look if you want to use something non-uniform”

Only sometimes in some contexts. Even in those contexts all Frequentists and most Bayesians are seriously confused about what it really means. What that “prior belief” really means is “we think the truth is in the high probability manifold of our prior”. In other words, it’s a conceptually identical generalization of what I said above.

• Konrad

Here’s a relevant sense of “work” (I’m restating my previous point): accepting p=.5 means we will be unduly surprised if we get a very high proportion of heads in the next 1000 tosses. If the experiment is rigged in such a way that heads are heavily favoured, we will be unable to detect this. Our methodology fails.

On the other hand, not accepting p=.5 but instead averaging over all possible values of p means we will _not_ be unduly surprised by a very high proportion of heads in the next 1000 tosses, and we _will_ be able to infer that the experiment is rigged. Our methodology works.

Here’s another relevant sense of “work”: the primary aim in most real applications is finding out whether (and by how much) the experiment is rigged. We can say that a methodology that doesn’t even attempt to address this question doesn’t “work”.

• November 10, 2013 · Joseph

I chose the problem in the hopes it would be simple enough to illustrate a point without getting bogged down in irrelevant (to the point) details.

That clearly didn’t work, since everyone complains mightily that I had the gall to solve the problem stated rather than their favorite problem.

I have zero problems with the Bayesian solution and wholeheartedly think it’s the way to go. In this case though, the data from the first 100 flips isn’t that informative, so to an approximation easily good enough to illustrate the point, it can be ignored.

I chose n=10,000 and f=.46 (rather than f=.01 or f=.99 or something) specifically for that reason. Frequentists will immediately dismiss the Bayesian solution, and Bayesians will misunderstand it. So I removed it from prominent view to drive the point home. Since this was done in the defense of Bayes, hopefully it will be forgiven.

• Konrad

I agree that the distinction between uniform p and the full Bayesian solution is not important (the data are uninformative, so we can ignore them). But I think it’s critically important that p=.5 is conceptually distinct from both of those. It is what most frequentists have in mind when they make the predictions in question, but not what most Bayesians have in mind when they make the same predictions.

• November 25, 2013 · Troll

“The Bayesian answer is to use a posterior over p. With a uniform prior and 100 tosses the posterior will still be pretty flat: ever so slightly peaked at .46, but with .5 still in the high-probability region.”

Check your intuitions! That description doesn’t seem to adequately describe this plot: https://dl.dropboxusercontent.com/u/17357243/Beta.png
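To put numbers on it: with a uniform prior and 46 heads in 100 tosses the posterior is Beta(47, 55), and we can measure how flat it actually is (a short sketch):

```python
import math

a, b = 47, 55   # Beta posterior: uniform prior + 46 heads, 54 tails

# Posterior standard deviation: about .05, nowhere near uniform
# (the sd of U(0,1) is ~.29), so "pretty flat" is a stretch...
sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# ...but the density at p=.5 is within a factor ~1.4 of the density
# at .46, so .5 really is in the high-probability region.
def log_density(p):
    """Unnormalized Beta(47, 55) log density."""
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

ratio = math.exp(log_density(0.46) - log_density(0.5))
print(sd, ratio)
```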

“Of these, answer 3 is the best approximation to the true posterior. This is the maximum entropy solution…”

It may be *a* max ent solution, but it is not Joseph’s max ent solution. See next point.

“But this is _not_ because 2 is a good model. Rather it is because, in this particular setup, the predictions of 2 are identical to those of 3 (even though the model is very different).”

How are the predictions identical? In 3, as I understand it, you’ll be drawing p uniformly, and then generating 10k flips using that p. That is going to look *nothing* like using p=0.5. If you don’t believe me, take a look at sims from such a process (prepending the sampled “p” to each Bernoulli sequence of N=10):
{0.11688, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
{0.580666, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0}
{0.956551, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0}
{0.383224, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0}
{0.158837, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0}
{0.932887, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
{0.000377, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
{0.401438, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0}
{0.359329, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1}
{0.308137, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0}

The number of “extreme” sequences with all 0s or all 1s is going to be vastly greater than the number of such sequences if you used a fixed p=0.5 (where I’d need a thousand simulated sequences with N=10 before I should expect to see all 0s or all 1s). Max ent applied to your model parameter p is very different from max ent applied to the sequences themselves (the latter being equivalent to using p=0.5).
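A sketch of the kind of simulation I mean (the seed and trial counts are arbitrary; the "theory" values in the comments are exact, since under a uniform p the number of heads is uniform on {0,…,n}):

```python
import random

random.seed(42)
n, trials = 10, 20_000

def extreme_fraction(draw_p):
    """Fraction of simulated length-n sequences that come out all heads
    or all tails, with p drawn fresh from draw_p for each sequence."""
    hits = 0
    for _ in range(trials):
        p = draw_p()
        heads = sum(random.random() < p for _ in range(n))
        hits += heads in (0, n)
    return hits / trials

frac_fixed = extreme_fraction(lambda: 0.5)      # theory: 2/2**10 ≈ .002
frac_uniform = extreme_fraction(random.random)  # theory: 2/(n+1) ≈ .18
print(frac_fixed, frac_uniform)
```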

“So, no, the correct answer is not to use p=.5. It just so happens that doing this gives us the right answer for the wrong reason.”

Ironically, you too have the right answer for the wrong reason. p=0.5 gives us the right answer because it was the right answer by construction. Joseph, what happens if the true value is p=0.46 and the initial trial came out as 50 heads and 50 tails?

• November 25, 2013 · Joseph

Dear Troll,

“Joseph, what happens if the true value is p=0.46 and the initial trial came out as 50 heads and 50 tails?”

If you follow the classical paradigm as I did in the post, then you’ll be hosed. You’ll accept p=.5 and after 10,000 flips your interval for f will almost certainly miss the actual frequency seen.

But as Konrad I think insisted somewhere, we really should be doing the Bayesian solution. That is, using the data from the first 100 flips to get a posterior for p, then using it to get predictions for the next 10,000.

If you do that in this case you should be fine. It will spread the high probability region out over an area of S more consistent with frequencies anywhere near .5. This should make for safer predictions as long as the frequencies aren’t wildly different from .5.
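For the record, that Bayesian solution can be sketched in a few lines (my own seed and draw counts; the binomial draw for f given p is replaced by its normal approximation to keep pure Python fast):

```python
import random

random.seed(7)
N, draws = 10_000, 50_000

# Posterior for p: uniform prior + 46 heads, 54 tails -> Beta(47, 55).
# Posterior predictive for the future frequency f of heads in N flips:
# draw p from the posterior, then f | p ~ Binomial(N, p)/N, here
# approximated by a normal with sd sqrt(p(1-p)/N).
fs = []
for _ in range(draws):
    p = random.betavariate(47, 55)
    fs.append(random.gauss(p, (p * (1 - p) / N) ** 0.5))
fs.sort()

lo, hi = fs[int(0.025 * draws)], fs[int(0.975 * draws)]
print(lo, hi)   # roughly (.36, .56): wide enough to cover both .46 and .5
```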

A more extreme example would be: what if it were a good idea to use p=.01 for the 10,000 flips, but we got a 50/50 split in the first 100 tosses? Well, then we’re hosed no matter what, because there’s nothing in our information to suggest using p=.01.

I take it as a basic principle that it’s not our goal in inference to get the right answer. That goal is impossible in general for anyone who isn’t the Oracle of Delphi. Rather our goal in inference is to do the best we can from the information provided. It’s a more modest goal, but it’s one achievable by real humans. Sometimes the information is crap and “the best” just won’t be very good.