The Amelioration of Uncertainty

If we observed $\hat{p} = .46$, why do we use $p = .5$?

I aim to commit statistical sin. I’m going to accept the null hypothesis for no other reason than that I “failed to reject it”. Having tarnished my reputation with that, I’ll finish by ignoring the only data available and basing everything on non-informative priors, which prominent authorities assure us don’t even exist. Let the debauchery begin.

Consider a typical classroom coin flip experiment. The teacher does no physics and takes no measurements. Instead they flip a coin 100 times, observe the frequency of heads $\hat{p}$, and use it to create a 95% probability interval for the frequency in the next 10,000 coin flips. Barring some bad luck, the utility of Statistics will be confirmed for the students when they see the observed frequency land in the predicted interval.

After the first 100 flips we get $\hat{p} = .46$. Using the binomial model we’d fail to reject $p = .5$ with a p-value of .2. But as statistical harpies remind us constantly, this is not the same as accepting $p = .5$, and moreover, probabilities are to be equated with frequencies.
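For anyone checking at home, here is a minimal sketch with scipy, assuming a one-sided exact binomial test (one reading that lands near the quoted p-value):

    from scipy.stats import binom

    # One-sided exact p-value for H0: p = .5 after seeing 46 heads in 100 flips
    print(binom.cdf(46, 100, 0.5))   # ~0.24: comfortably above .05, so no rejection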

So should the observed frequency .46 be used to construct the 95% interval instead of .5?

Hell no! Remember we’ve done no physics here. So we have no idea which element of $S$, the set of all $2^{10{,}000}$ possible sequences of the next ten thousand flips, we’ll see. What we do know is that the binomial calculation implies the following:

    $\frac{\#\{\, s \in S \,:\, f_s \in (a,b) \,\}}{2^{10{,}000}} \approx 8 \times 10^{-10} = 0.00000008\%$

where $f_s$ is the frequency of heads in sequence $s$ and $(a,b) \approx (.450,\ .470)$ is the 95% equal-tails interval computed from Binomial(p=.46, n=10,000).

Since the sequence the class is about to observe is going to be in $S$ somewhere, using $\hat{p} = .46$ dramatically increases the opportunities for the demonstration to fail. Those impressionable students may conclude statistics is a waste of time after all.
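That number is easy to reproduce (a sketch with scipy; the fraction of sequences is just Binomial(p=.5) mass over the interval built from $\hat{p} = .46$):

    from scipy.stats import binom

    n = 10_000
    # 95% equal-tails interval for the head count, built from p-hat = .46
    lo, hi = binom.interval(0.95, n, 0.46)
    print(lo / n, hi / n)   # roughly .450 and .470

    # Fraction of all 2^n sequences whose head count lands in [lo, hi]:
    # grouping sequences by head count, this is exactly Binomial(n, .5) mass
    print(binom.cdf(hi, n, 0.5) - binom.cdf(lo - 1, n, 0.5))   # ~8e-10 = 0.00000008%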

Frequentists think the cold hard facts of frequencies trump any Bayesian philosophical nuance. In their minds, if frequencies f do exist in a problem, then p = f and it’s all over for the Bayesians but the crying. Examples like this previous post showing otherwise seem to have no effect on them.

Here we have a different example. The only data on the coin is the frequency of heads. Yet we’d be damned fools to equate it with the probability of heads, and we’re better off using a value based on a data-free ignorance prior over $S$. Frequentists in practice aren’t unwise enough to follow their own philosophy or texts. They’d quietly drop the data and subversively accept the null like the rest of us rapscallions.

Perhaps their use of p=.5 comes from their magical ability to intuit that each element of $S$ would come up equally often over endlessly repeated runs of 10,000 flips. How they stumbled upon this curious “fact” remains a mystery. Maybe it was written on the back of the Ten Commandments.

Lucky for them they never try this legerdemain with a null of $p = .48$. They’d fail to reject it with a p-value of .38, but since the 95% interval constructed from p=.48 is only consistent with 2% of $S$, they’d have some explaining to do unless they got very fortunate.
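The same sketch as before, with .48 swapped in:

    from scipy.stats import binom

    n = 10_000
    # One-sided exact p-value for H0: p = .48 after 46 heads in 100 flips
    print(binom.cdf(46, 100, 0.48))   # ~0.38: fail to reject

    # ...yet the 95% interval built from p = .48 covers almost none of S
    lo, hi = binom.interval(0.95, n, 0.48)
    print(binom.cdf(hi, n, 0.5) - binom.cdf(lo - 1, n, 0.5))   # ~0.02, i.e. 2%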

If Statistics can get this knotted over the binomial distribution, then how could it ever be untangled in real applications? The mess is avoided entirely by understanding what probabilities really are. Not only are they conceptually different from frequencies, they’re usually unequal even when all we have to go on is frequencies.

If we are ignorant as to which element of $S$ will show up, then we’d better only make predictions consistent with the vast majority of $S$. That’s all the Bayesian non-informative prior and procedure achieve. For the life of me I can’t figure out why this is so hard to understand or why so many think it’s metaphysical nonsense.

UPDATE: Just to emphasize, accepting $p = .5$ is an extremely good idea even though it has a p-value of .2. Accepting $p = .48$ is a very bad idea even though it has a p-value of .38.

November 6, 2013
21 comments
  • November 6, 2013 · Jake

    With 46 successes in 100 observations, I get a 95% CI for p of [0.36,0.56]. Could you explain how you get the coverage result of only 0.00000008%?
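    (That’s the standard Wald interval, for anyone following along:)

        from math import sqrt

        p_hat, n = 0.46, 100
        half = 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # Wald half-width
        print(p_hat - half, p_hat + half)             # ~0.36 to ~0.56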

  • November 6, 2013 · Joseph

    Sure, the problem wasn’t asking for a Confidence Interval for p.

    Rather it was asking for an interval containing the future frequency f of heads with 95% probability, assuming a binomial distribution with p=.46. This is just a straightforward probability computation using the Binomial(p=.46, n=10,000) distribution with known parameters. In other words we want a and b such that:

    $P(a \le f \le b \mid p = .46,\ n = 10{,}000) = .95$

    Given an interval (a,b) with 95% probability, we can then ask how many elements of $S$ would actually give us an f in that interval. Denote a sequence in $S$ by $s$ and denote the frequency of heads in this sequence by $f_s$. Then:

        $\frac{\#\{\, s \in S \,:\, a \le f_s \le b \,\}}{2^{10{,}000}} \approx 8 \times 10^{-10} = 0.00000008\%$
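    The counting identity behind that last step is easy to verify by brute force for a small n, say n = 16 with an illustrative interval (a sketch):

        from itertools import product
        from scipy.stats import binom

        n, a, b = 16, 0.25, 0.75

        # Direct count over all 2^n head/tail sequences
        count = sum(a <= sum(seq) / n <= b for seq in product((0, 1), repeat=n))
        print(count / 2 ** n)

        # Same number via Binomial(n, .5) mass, no enumeration needed
        print(binom.cdf(int(b * n), n, 0.5) - binom.cdf(int(a * n) - 1, n, 0.5))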

  • November 6, 2013 · Joseph

    Also, I didn’t mention it in the post, but I was using the usual equal tails values for a and b. In other words:

    $P(f < a \mid p = .46,\ n = 10{,}000) = .025$

    and

    $P(f > b \mid p = .46,\ n = 10{,}000) = .025$
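    In scipy terms that’s just two quantiles (a sketch):

        from scipy.stats import binom

        n = 10_000
        a = binom.ppf(0.025, n, 0.46) / n   # lower equal-tails endpoint
        b = binom.ppf(0.975, n, 0.46) / n   # upper equal-tails endpoint
        print(a, b)                          # roughly .450 and .470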

  • November 6, 2013 · Daniel Lakeland

    Implicitly you’ve assumed that in actual fact all elements of the set of possible sequences are equally important when you divide by $2^{10{,}000}$. If we had some physical knowledge of the coin, like say a certain bend to it, we would want to use a different measure. We may not be anywhere near as certain (prior probability) as we are in the “best case” of a fair coin.

    In well tossed coin flip experiments (newly minted coin tossed at least 4 feet high, rotating at least 3 times per second, with the angular momentum vector nearly in the plane of the coin, and allowed to bounce off a large flat unobstructed hard floor, for example) we know that Binomial with $p = .5 + \epsilon$, with $\epsilon$ being “quite small”, is a good model for the heads counts of two faced coins. Observing something else in a short run like 50 or 100 flips doesn’t convince us otherwise.

    But if we knew the coin was bent, and Persi Diaconis was flipping the coin, we would have reason to believe something different, but wouldn’t know what; i.e. the prior on p would be much less sharply peaked.

    It’s well known that many Frequentist results are equivalent to Bayesian results with some kind of flat prior. In many real world problems the Bayesian prior is pretty flat because we just don’t understand the problem as well as a physical coin flip. I still think the philosophical separation of probability and frequency is important in those cases, but I don’t think starting from the physical coin flip argument is as convincing for those other cases.

  • November 6, 2013 · Joseph

    “Implicitly you’ve assumed that in actual fact all elements of the set of possible sequences are equally important ”

    No I’m not. I’m really not. I’m assuming that if we have no idea what the next sequence in $S$ will be, then we need to stick to predictions or inferences which are consistent with almost all of $S$.

    That’s what the Bayesian blather does. That’s all it does. It really is that simple.

    All you have to do to see it is just let go of the idea that probabilities have anything to do with frequencies.

  • November 6, 2013 · Brendon J. Brewer

    “If we observed $\hat{p} = .46$, why do we use $p = .5$?”

    A certain kind of prior information (Jaynes’ poorly informed robot) implies that’s the right thing to do. If you had different prior information then continuing to use p=0.5 would not be correct.

  • November 6, 2013 · Joseph

    Brendon,

    Sure but two additional points:

    (1) The prior is so strong that even when (frequency) data is collected, we’d be extraordinarily wise to just use the prior going forward.

    (2) I don’t think Frequentists get the meaning of your statement at all, and I think Bayesians only get about half of its meaning.

    A more concrete and more usable version is to say this: if we have no idea what sequence in $S$ will come up, we’d better stay with predictions which will be true for almost every sequence in $S$.

    Or in a different context: if we have an element of some set $S$ that we haven’t observed directly, and we have no idea which one it is, then we should only make inferences which will be true for almost every element of $S$.

  • November 6, 2013 · Daniel Lakeland

    “If we have no idea what sequence in $S$ will come up, we’d better stay with predictions which will be true for almost every sequence in $S$.”

    What I meant in my previous statement was that the “if we have no idea what sequence” part has to hold. If we are pretty confident that every 3rd flip is going to come up heads due to some specifics of the flipping machine, then we could ignore huge swaths of $S$, and $S$ is no longer the appropriate reference set.

  • November 7, 2013 · Daniel Lakeland

    The whole trick to modeling as I see it is to incorporate your knowledge into the model so that whatever is “left over” is largely ignorance, at which point simple uninformative priors and binomial likelihoods and similar things work because they are more or less an effective account of ignorance.

  • November 7, 2013 · Joseph

    Daniel,

    I only think that’s half the story. Bayesians have been concentrating strongly on that half and missing the other half (which is more useful going forward).

    The Bayesian machinery isn’t just taking account of ignorance. It’s doing something very specific with it. Namely it’s only making predictions/inferences which are very insensitive to what the specific truth actually is.

    As evidence that Bayesians don’t really get this, I’ll cite the last paragraph of the previous comment:

    “‘if we have no idea what sequence’ has to hold. If we are pretty confident that every 3rd flip is going to come up heads due to some specifics of the flipping machine, then we could ignore huge swaths of $S$, and $S$ is no longer the appropriate reference set.”

    Actually, most of the time it is. Suppose we construct that 95% interval (a,b) using p=.5. This will be consistent with 95% of $S$. Now suppose we do know that in fact only some subset $Q \subset S$ is actually possible.

    Is that (a,b) prediction bad all of a sudden? Well, most of the time it’s perfectly good. If Q lies entirely inside that 95% of $S$ consistent with (a,b), then this prediction will always seem like a good one.

    The only scenario that changes this is if we know Q overlaps heavily with that 5% portion of $S$.

    Knowledge of Q would allow us to predict other things better, but as long as the accuracy implied by the length of (a,b) was good enough for what we were doing, most of the time we could happily ignore Q even if we knew about it.

    This directly corresponds to the fact that even if we know the position and momentum of every particle in an ideal gas, if all we want to know is how V relates to T, we can just ignore that information and get pV=nRT from stat mech.

    Because of comments like this one and a previous one from Shalizi, I’ve been thinking about writing a post on how effective a tool throwing away information is in practice. It’s massively underused because frequentist intuition makes people think there’s only one unique probability distribution.

  • November 7, 2013 · Daniel Lakeland

    Throwing away information can be useful to simplify things, but there are plenty of cases where we want to eke out every drop of information. From the terminology in your post a while back the “truthfulness” criterion still holds in your example, but the “informativeness” is not necessarily sufficient.

    I think it’s fine to throw away information when the resulting model is “informative enough for our purposes”. We may not really need to know, say, the concentration of a pollutant to 3 significant figures; 5% accuracy might be plenty good enough. On the other hand, if our simplified, information-discarding model only gives us accuracy to 30%, then we need to rethink.

    In most practical problems I have been involved in the goal is to figure out how to incorporate more information so we can get more informative results.

    On the other hand, I’ve seen people working on stochastic models designed to give information about things like the fluctuations in strength caused by heterogeneous materials like concrete, and they wind up with very very complex models that are probably too informative. They inform you about what your assumptions about the concrete imply but those assumptions (a generative model for heterogeneity) are themselves not terribly accurate.

  • November 8, 2013 · konrad

    The Bayesian answer is to use a posterior over p. With a uniform prior and 100 tosses the posterior will still be pretty flat: ever so slightly peaked at .46, but with .5 still in the high-probability region. Now consider three approximations to this answer:

    1) delta function at .46
    2) delta function at .5
    3) uniform distribution

    Of these, answer 3 is the best approximation to the true posterior. This is the maximum entropy solution, which makes sense if we assume that the data tell us nothing about the system (i.e. past observations are not predictive of future observations). Even when this assumption is false, it’s still a decent approximation whenever we have insufficient data to get a sharply peaked posterior.

    Answers 1 and 2 are terrible, with 2 even (slightly) worse than 1. The post points out that 1 performs poorly for predicting future observations. But it also points out that (contrary to what one might expect, given that it is a terribly inaccurate description of the available information) 2 gives good prediction results (for this particular setup). But this is _not_ because 2 is a good model. Rather it is because, in this particular setup, the predictions of 2 are identical to those of 3 (even though the model is very different).

    So, no, the correct answer is not to use p=.5. It just so happens that doing this gives us the right answer for the wrong reason.

  • November 8, 2013 · Joseph

    Konrad, we seem to have the same gift for making mountains out of molehills.

    We start out not knowing which element of $S$ we’ll see. Technically we do learn a little something from the 100 flips, but it’s so little we could ignore it in this case. So we’re basically left having no idea which element of $S$ we’ll see.

    If we have to pick an interval for that future $f$, we’d better pick one that includes almost all the elements of $S$. So after doing some counting we find an interval $(a,b)$ which will be consistent with 95% of $S$. If the sequence $s$ we actually see is in that 95% majority, we’re going to look like Nostradamus.

    That’s it. That’s all this problem requires. We don’t know what the future will be so we just predict things that will be true almost no matter what.

    In practice, when we do this counting it can be tricky. Maybe not so much for this problem, but in general it is tricky to do these counts and make full allowance for all the evidence. So to get these counts we need a more general tool or mathematical machinery.

    The trick works like this. First, since we intend to count elements of $S$, we put a uniform distribution on $S$. This distribution has none of the meanings that either Bayesians or Frequentists attach to it. It’s simply a way to count elements of $S$ and nothing else. It has no other meaning or purpose.

    That distribution on $S$ then induces a Binomial(p=.5, n=10,000) distribution on f. Again, this distribution has none of the usual meanings or connotations. It’s simply a tool for indirectly counting elements of $S$. So to get an interval $(a,b)$ which is consistent with 95% of $S$, all I have to do is find an interval containing 95% of the Binomial(p=.5, n=10,000) mass.
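    Concretely, the whole procedure is a couple of lines (a sketch with scipy):

        from scipy.stats import binom

        n = 10_000
        # Central 95% of Binomial(p = .5, n = 10,000)
        lo, hi = binom.interval(0.95, n, 0.5)
        print(lo / n, hi / n)   # roughly .4902 and .5098

        # By the counting argument, this interval is consistent with ~95% of S
        print(binom.cdf(hi, n, 0.5) - binom.cdf(lo - 1, n, 0.5))   # ~0.95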

    Note two things:

    1: The p=.5 “model” doesn’t accidentally work. It works for a damned good reason (specifically, it’s a math trick for counting elements of $S$). The reason can be easily and massively exploited in other problems.

    2: You can add all the Bayesian and Frequentist baggage to this you want, but the argument as written is still just as valid without it. You’ve achieved nothing by adding this baggage except to confuse something that’s actually incredibly simple.

    So why add a bunch of crap that isn’t needed?

  • November 9, 2013 · Brendon J. Brewer

    “The p=.5 “model” doesn’t accidentally work. It works for a damned good reason ”

    Except when it doesn’t, which is all the time.

  • November 9, 2013 · Brendon J. Brewer

    “The trick works like this. First since we intend to count elements of equation we put a uniform distribution on equation. This distribution has none of the meanings that either Bayesians or Frequentists attach to it.”

    Attaching the meaning of prior beliefs to these prior probabilities helps because it tells you where you need to look if you want to use something non-uniform (which you should use in many many situations).

  • November 9, 2013 · Joseph

    Brendon,

    “Work” has different senses. In that context it specifically meant “correctly counted the elements of $S$”. Or you could even have taken it to mean “it works as well as anything can when all we know is that the sequence will be somewhere in $S$”.

    Obviously it doesn’t mean “always makes accurate predictions”. Our state of knowledge precludes the possibility that we could guarantee that.

    “Attaching the meaning of prior beliefs to these prior probabilities helps because it tells you where you need to look if you want to use something non-uniform”

    Only sometimes, in some contexts. Even in those contexts all Frequentists and most Bayesians are seriously confused about what it really means. What that “prior belief” really means is “we think the true value is in the high probability manifold of our prior”. In other words, it’s a conceptually identical generalization of what I said above.

  • November 10, 2013 · konrad

    Here’s a relevant sense of “work” (I’m restating my previous point): accepting p=.5 means we will be unduly surprised if we get a very high proportion of heads in the next 1000 tosses. If the experiment is rigged in such a way that heads are heavily favoured, we will be unable to detect this. Our methodology fails.

    On the other hand, not accepting p=.5 but instead averaging over all possible values of p means we will _not_ be unduly surprised by a very high proportion of heads in the next 1000 tosses, and we _will_ be able to infer that the experiment is rigged. Our methodology works.

    Here’s another relevant sense of “work”: the primary aim in most real applications is finding out whether (and by how much) the experiment is rigged. We can say that a methodology that doesn’t even attempt to address this question doesn’t “work”.
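    A toy version of that point (a sketch with scipy; suppose the next 1000 tosses come up heads 900 times):

        from scipy.stats import binom, betabinom

        n, k = 1000, 900
        # Fixed p = .5: 900 heads is essentially impossible, so the
        # methodology is simply blindsided
        print(binom.sf(k - 1, n, 0.5))        # astronomically small

        # Averaging over all p (uniform prior -> beta-binomial predictive):
        # 900 heads is not unduly surprising, and the rigging is detectable
        print(betabinom.sf(k - 1, n, 1, 1))   # ~0.10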

  • November 10, 2013 · Joseph

    I chose the problem in the hopes it would be simple enough to illustrate a point without getting bogged down in irrelevant (to the point) details.

    That clearly didn’t work, since everyone complains mightily that I had the gall to solve the problem stated rather than their favorite problem.

    I have zero problems with the Bayesian solution and wholeheartedly think it’s the way to go. In this case though, the data from the first 100 flips isn’t that informative, so to an approximation easily good enough to illustrate the point, it can be ignored.

    I chose n=10,000 and f=.46 (rather than f=.01 or f=.99 or something) specifically for that reason. Frequentists will immediately dismiss the Bayesian solution, and Bayesians will misunderstand it. So I removed it from prominent view to drive the point home. Since this was done in defense of Bayes, hopefully it will be forgiven.

  • November 10, 2013 · konrad

    I agree that the distinction between uniform p and the full Bayesian solution is not important (the data are uninformative, so we can ignore them). But I think it’s critically important that p=.5 is conceptually distinct from both of those. It is what most frequentists have in mind when they make the predictions in question, but not what most Bayesians have in mind when they make the same predictions.

  • November 25, 2013 · Troll

    @Konrad

    “The Bayesian answer is to use a posterior over p. With a uniform prior and 100 tosses the posterior will still be pretty flat: ever so slightly peaked at .46, but with .5 still in the high-probability region.”

    Check your intuitions! That description doesn’t seem to adequately describe this plot: https://dl.dropboxusercontent.com/u/17357243/Beta.png

    “Of these, answer 3 is the best approximation to the true posterior. This is the maximum entropy solution…”

    It may be *a* max ent solution, but it is not Joseph’s max ent solution. See next point.

    “But this is _not_ because 2 is a good model. Rather it is because, in this particular setup, the predictions of 2 are identical to those of 3 (even though the model is very different).”

    How are the predictions identical? In 3, as I understand it, you’ll be drawing p uniformly, and then generating 10k flips using that p. That is going to look *nothing* like using p=0.5. If you don’t believe me, take a look at sims from such a process (prepending the sampled “p” to each Bernoulli sequence of N=10):
    {0.11688, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    {0.580666, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0}
    {0.956551, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0}
    {0.383224, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0}
    {0.158837, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0}
    {0.932887, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
    {0.000377, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    {0.401438, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0}
    {0.359329, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1}
    {0.308137, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0}

    The number of “extreme” sequences with all 0s or all 1s is going to be vastly greater than the number of such sequences if you used a fixed p=0.5 (where I’d need a thousand simulated sequences with N=10 before I should expect to see all 0s or all 1s). Max ent applied to your model parameter p is very different from max ent applied to the sequences themselves (the latter being equivalent to using p=0.5).
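    You can check the rates directly rather than eyeballing sims (a numpy sketch):

        import numpy as np

        rng = np.random.default_rng(0)
        trials, N = 100_000, 10

        # Model 3: p ~ Uniform(0,1), then N Bernoulli(p) flips per sequence
        heads_mix = rng.binomial(N, rng.uniform(size=trials))
        # Model 2: fixed p = 0.5
        heads_fix = rng.binomial(N, 0.5, size=trials)

        for name, h in (("uniform p", heads_mix), ("p = 0.5", heads_fix)):
            print(name, np.mean((h == 0) | (h == N)))   # ~0.18 vs ~0.002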

    “So, no, the correct answer is not to use p=.5. It just so happens that doing this gives us the right answer for the wrong reason.”

    Ironically, you too have the right answer for the wrong reason. p=0.5 gives us the right answer because it was the right answer by construction. Joseph, what happens if the true value is p=0.46 and the initial trial came out as 50 heads and 50 tails?

  • November 25, 2013 · Joseph

    Dear Troll,

    “Joseph, what happens if the true value is p=0.46 and the initial trial came out as 50 heads and 50 tails?”

    If you follow the classical paradigm as I did in the post, then you’ll be hosed. You’ll accept p=.5 and after 10,000 flips your interval for f will almost certainly miss the actual frequency seen.

    But as Konrad I think insisted somewhere, we really should be doing the Bayesian solution. That is, using the data from the first 100 flips to get a posterior $P(p \mid \text{data})$ and then using it to get predictions for the next 10,000:

    $P(f \mid \text{data}) = \int_0^1 P(f \mid p)\, P(p \mid \text{data})\, dp$

    If you do that in this case you should be fine. It will spread the high probability region out over an area of $S$ more consistent with frequencies anywhere near .5. This should make for safer predictions as long as the frequencies aren’t wildly different from .5.
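    With the uniform prior that posterior predictive is just a beta-binomial, so the interval is a one-liner (a sketch; Beta(1,1) prior assumed):

        from scipy.stats import betabinom

        # 46 heads, 54 tails with a Beta(1,1) prior -> posterior Beta(47, 55)
        n = 10_000
        lo, hi = betabinom.interval(0.95, n, 47, 55)
        print(lo / n, hi / n)   # roughly .36 and .56

    Notice it lands near the (.36, .56) Jake computed as a confidence interval for p, which is what you’d expect when the predictive uncertainty is dominated by the uncertainty in p itself.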

    A more extreme example would be: what if it were a good idea to use p=.01 for the 10,000 flips, but we got a 50/50 split in the first 100 tosses? Well, then we’re hosed no matter what, because there’s nothing in our information to suggest using p=.01.

    I take it as a basic principle that it’s not our goal in inference to get the right answer. That goal is impossible in general to anyone who isn’t the Oracle of Delphi. Rather our goal in inference is to do the best we can from the information provided. It’s a more modest goal, but it’s one achievable by real humans. Sometimes the information is crap and “the best” just won’t be very good.
