The Amelioration of Uncertainty

Does Mayo’s “Error Statistics” fix classical statistics?

Taking a break from Statistical Mechanics I noticed Corey Yanofsky, whom I respect a great deal, is starting a blog. Corey plans to explore Dr. Mayo’s Severity Principle, which he describes as the “strongest defense of frequentism I’ve ever encountered.” A similarly great geometer, Cosma Shalizi, is even more effusive.

I believe Dr. Mayo misunderstands error distributions and the basic facts concerning them (see here and here), but philosophy can be argued endlessly. It’s more productive to examine Corey’s (and Mayo’s) claim that “the severity principle scotches many common criticisms of frequentism”.

The key paper is this one, which uses the Severity function SEV(H,T,X) to answer 13 common criticisms (“howlers”) of frequentist methods. Here H is a hypothesis, T a test, and X the data. SEV is often abbreviated SEV(H) when T and X are understood. Unfortunately Mayo based this entire discussion on the NIID example and its sufficient statistics. For this case,


Which is just the posterior probability! With a slight verbal change, Mayo’s paper is a more convincing defense of Bayesian posteriors than most Bayesians can muster. To illustrate consider howlers #2 and #3,

(#2) All statistically significant results are treated the same.

(#3) The p-value does not tell us how large a discrepancy is found.

Mayo considers equation and shows how equation gives a sense of the discrepancy equation. This is identical to observing equation, or equivalently equation, and using it to judge the size of the discrepancy, which is how Laplace did it two centuries ago.

So is SEV or Bayes fixing classical statistics?

To put SEV to a “severe” test we need an example where they differ. Since Bayes Theorem automatically uses any sufficient statistics if present, let’s use non-sufficient statistics and see what happens.

Suppose equation and we take two observations: one with equation and one very precise one with equation. The actual observations turn out to be equation and equation.

Intuitively, we’d drop the inaccurate data point and say equation is very close to zero to within a few parts in a billion. This is exactly what the Bayesian calculation does since the posterior distribution equation is Normal with


For example, the Bayesian gets equation to a ridiculous number of significant figures since .1 is a billion standard deviations away from the accurate measurement.
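The posterior described above can be sketched numerically. The post's displayed values did not survive, so the numbers here are assumptions picked only to agree with figures quoted in the comments (a raw mean of 1 whose variance is 0.25): standard deviations of 1 and 1e-9, observations of 2 and 0, and a flat prior on the mean. The function name `normal_posterior` is hypothetical as well.

```python
from math import sqrt
from statistics import NormalDist

def normal_posterior(ys, sigmas):
    """Flat-prior posterior for a common mean: precision-weighted mean and its sd."""
    precisions = [1.0 / s**2 for s in sigmas]
    total = sum(precisions)
    mean = sum(p * y for p, y in zip(precisions, ys)) / total
    return mean, 1.0 / sqrt(total)

# ASSUMED illustration values, not taken from the original post.
ys = [2.0, 0.0]         # one sloppy observation, one very precise one
sigmas = [1.0, 1e-9]    # their standard deviations

post_mean, post_sd = normal_posterior(ys, sigmas)
p_h = 1.0 - NormalDist(post_mean, post_sd).cdf(0.1)   # P(mu > 0.1 | y1, y2)
print(post_mean, post_sd, p_h)
```

Under these assumed numbers the precise observation dominates: the posterior sits at essentially zero with a spread of about 1e-9, and the probability of the mean exceeding 0.1 is numerically zero.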

Using the natural, but non-sufficient, test statistic equation and data equation, yields an entirely different outcome. With equation we get,


which implies, according to Mayo, the data is good evidence for $\mu > .1$.
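A minimal sketch of that severity calculation, assuming the usual one-sided form SEV(mu > mu1) = P(T <= t_obs; mu = mu1) and assuming standard deviations of 1 and 1e-9 for the two observations (the post's own values were lost; these are chosen so the raw mean has variance 0.25, the figure quoted in the comments). The helper name `severity_mu_greater` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def severity_mu_greater(mu1, t_obs, t_sd):
    """SEV(mu > mu1): chance of a statistic no larger than observed if mu were exactly mu1."""
    return NormalDist(mu1, t_sd).cdf(t_obs)

sigmas = [1.0, 1e-9]   # ASSUMED standard deviations of the two observations
ybar_obs = 1.0         # raw mean, as quoted in the comment thread

# The raw mean weights both observations equally, so its sd is sqrt(sum of variances) / n.
ybar_sd = sqrt(sum(s**2 for s in sigmas)) / len(sigmas)

sev = severity_mu_greater(0.1, ybar_obs, ybar_sd)
print(round(sev, 3))   # about 0.964
```

The raw mean's standard deviation is dominated by the sloppy observation, which is why the severity comes out high even though the precise observation points the other way.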

This is far from the only embarrassment for SEV, but I won’t run up the scoreboard by mentioning others. At this rate “Error Statistics” will turn everyone into Bayesians and where’s the fun in that? What’s weird though is that no one thought to do an elementary check on it. It’s almost as though “SEV” was accepted on religious grounds.

I don’t want to be entirely negative, so let me finish on a positive note by echoing Corey & Cosma: Dr. Mayo’s SEV is the strongest defense of frequentist statistics out there.

UPDATE: Mayo considers it a key selling point of Error Statistics over Bayesian methods that you can use any T to probe a hypothesis H. By probing H with multiple T’s you get a better sense of whether it’s true. Regardless of what T you use you should get results consistent with the truth or falsity of H for reasonable data consistent with H being true/false.

All I did was evaluate this claim. For some T, the SEV answer is identical to the Bayesian posterior. So I looked at the first T that makes SEV differ from the posterior. What I found is that this T gets it exactly wrong. Note: it doesn’t produce a weakened or less useful form of the intuitive conclusion; it completely contradicts the correct answer. It says the data does provide strong evidence for H, when in fact it doesn’t. This doesn’t mean SEV performs poorly compared to Bayes. It means SEV’s wrong regardless of what the Bayesian answer is.

September 12, 2013
  • September 12, 2013Corey

    Even for frequentists, the natural statistic is the precision-weighted mean, not the raw mean.

  • September 12, 2013Joseph

    That’s not the point. The point is SEV only works when it agrees with the Bayesian posterior. It’s the posterior that’s fixing classical statistics, not SEV.

    If I used a sufficient statistic then I just get the Bayesian answer. The fact is as soon as you move away from the Bayesian posterior, SEV doesn’t merely become inadequate, it’s extraordinarily wrong.

  • September 12, 2013Joseph

    Incidentally Corey, I took you to mean on your blog that you intend to show how “severely testing hypothesis” falls naturally out of probability theory to the extent that it makes sense at all, kind of in the style of Polya showing how probability theory naturally mimics the way humans think. Is that where you’re going?

  • September 12, 2013Corey

    I don’t find much value in criticisms of error statistics that don’t address how it would actually be practiced. In this specific case, I would expect Mayo to point out that your choice of statistic fails to satisfy the criterion laid out in E.S. Pearson’s Step 2.

    The challenge is to find a probability model (and a statistic satisfactory according to step 2) with the following properties:

    - the probability model is simple enough that a statistically literate person’s common sense can easily grasp it;
    - the severity analysis and the Bayesian analysis disagree;
    - one or the other of them offends common sense.

    I think a normal model with known variance and optional stopping is the way to go here, because it allows us to force severity and Bayes to disagree even when using the exact same information. Contrast this to your example above, in which it takes a straw statistic to make severity produce garbage.

  • September 12, 2013Corey

    Your prediction is close. Mayo has offered several phrasings of her Severity Principle. I seem to recall that they can be divided into two categories: the first implicitly assumes a dichotomous datum, and the second implicitly assumes a real-valued datum. The difference between the two is that the continuous versions use phrases like “more extreme” (i.e., tail areas); the dichotomous versions leave it out. I planned on identifying the dichotomous version as suitable for informal situations where black-and-white reasoning might be a useful quick-n-dirty heuristic; and of course, it accords with Bayes. The continuous version I would call “formal” in the sense of intended for use in actual statistical analyses that call on formal probability models.

  • September 12, 2013Joseph


    I don’t see the problem. There is nothing either in the explanation or mathematics of Mayo’s development which says we can’t use that estimator. Just the opposite actually. It should work without any problems according to Error Statistics and Mayo’s logic, but in reality is an incredible disaster.

    I’m well aware that in real life anyone applying SEV would spot the absurdity and would just restrict it to those cases where it agrees with the Bayesian posterior. That was kind of my point in fact.

    Bottom line: either Mayo’s reasoning is sound and this estimator should work, or her reasoning is unsound. It’s clearly the latter.

  • September 12, 2013Joseph

    Out of curiosity, what if I repeated this little exercise with the Cauchy distribution where there are no sufficient statistics?

    What estimator appears natural now? Which ones are approved by the Severity Principle? Which ones aren’t allowed all of a sudden?

  • September 12, 2013Corey

    My brain’s model of an error statistician asserts that the specification of a sensible statistic is logically prior to the use of severity in much the same way that the specification of the joint prior for data and hypotheses is logically prior to Bayesian updating. A prior distribution that is inappropriate on the prior information will lead to a posterior distribution inappropriate on the posterior information; in like fashion, a choice of statistic nonsensical according to Step 2 will lead to a nonsensical severity curve.

  • September 12, 2013Corey

    Cauchy! — you go right for the jugular don’tcha, you sonuvabitch. ;-)

    I’m assuming that the specified statistic needs to work for an arbitrary sample size (or equivalently, there’s a family of statistics, one for each possible sample size). I’d need to look at the sampling distribution of a bunch of estimators with a bunch of sample sizes to try to come up with a good one. (My gut suggests that posterior mean under a flat prior might be good.) I wonder what Spanos would say…

  • September 12, 2013Joseph

    “you go right for the jugular don’tcha, you sonuvabitch.”

    It’s a Marine thing.

  • September 13, 2013Antonio

    Some time ago I read Mayo’s rebuttal of Birnbaum’s theorem (SP+CP ~ LP) and I admit I understood nothing. It was too discursive and obscure for me. I would be glad to read something about it here.

  • September 13, 2013Anon


    What is the posterior

    P(mu>0.1 | given Ybar)

    If you calculate SEV with Ybar you have to calculate the posterior with Ybar – or else you are comparing judgments with different sets of information.

  • September 13, 2013Anon

    More clearly, suppose you only know the raw average of the two measurements, how would you calculate the posterior?

    Because that is what you did in the SEV. You calculated it supposing you only knew the raw measurement.

    But if you only knew the raw measurement, your posterior would also not be zero.

  • September 13, 2013Joseph


    I think you and some others are missing the logic here. I’m not comparing SEV to Bayes. I’m comparing SEV to the intuitively correct answer. Intuitively, there is no evidence at all for equation and SEV should have implied that for any test statistic used. There is absolutely nothing in the derivation or motivation of SEV to suggest it wouldn’t work for this statistic.

    The fact is SEV should have given results consistent with the intuitive answer, but it draws the exact opposite conclusion. It gets it exactly 180 degrees wrong. It’s just flat out not right. There’s no escaping it.

    The fact that Bayes automatically extracts the most it can from the data and automatically gets an answer consistent with the intuitively correct one is just gravy, but not essential. Even if that wasn’t the case, SEV is still wrong.

  • September 13, 2013Joseph


    I’m probably not the person to talk to about Mayo’s critique of Birnbaum’s result. I don’t think Bayesians really have a dog in that fight, but my current understanding is that Mayo is basically right. You need an additional assumption to get LP:

    SP+CP+??? ~ LP.

    Everything turns on ??? and whether it’s true or not. People can argue about that endlessly and it doesn’t seem very productive. I was thinking about doing something tangentially related though.

    Corey seems like he has much more to say about stopping rules, so maybe he’s planning on taking it up.

  • September 13, 2013Anon

    My point is this:

    We have to know how to best combine the evidence from both tests.

    The most efficient way to combine both tests, without losing information, is to come up with a sufficient statistic. In that case, the SEV will be in accord to the common sense.

    Now, you are saying that SEV will violate the common sense. It will when you ignore information. So will Bayes. If you ignore the information that you had tests with different precisions, the posterior will not be 0.

  • September 13, 2013Joseph

    “It will when you ignore information.”

    Throwing away info might limit the usefulness of SEV. For example, intuitively there may have been strong evidence for H, but since SEV is using less information it concludes “H hasn’t been well probed by T”.

    But that’s not what happened at all. What actually happened is that SEV concluded “There is strong evidence for H”. Regardless of whether you use all the info or part of it, this is wrong.

    “In that case, the SEV will be in accord to the common sense.”

    Yes, but it’s also in accord with the Bayesian posterior. The entire point was to see what happens when it disagrees with the Bayesian posterior. What happens is that it’s wrong.

    All you’re really saying is you shouldn’t use SEV unless it agrees with the Bayesian answer, because it’s nonsense otherwise. In that I think we are both in agreement.

    Also please note: Mayo brags about how SEV will work with any test statistic. In her mind it’s a key point in favor of Error Statistics over Bayesian methods that you can use different T’s to probe H in different ways. It should have been consistent with the truth for this T, but it isn’t. End of story.

  • September 13, 2013Joseph


    Based on our conversation, I added an update to the post.

  • September 13, 2013Anon

    “Throwing away info might limit the usefulness of SEV. For example, intuitively there may have been strong evidence for H, but since SEV is using less information it concludes “H hasn’t been well probed by T”. But that’s not what happened at all. What actually happened is that SEV concluded “There is strong evidence for H”. Regardless of whether you use all the info or part of it, this is wrong”.

    But if you throw away info the bayesian answer will not be humble either. If you feed the likelihood of Ybar, ignoring the info that Y1 and Y2 were measured with different precisions, it will say that mu> >.1 has high posterior probability.

    In your example, the SEV works how it is supposed to work with the info you are giving it.

    When you sum Y1+Y2 this is equivalent to saying that you have two observations with the same (in)precision. Given the evidence from two observations with the same (in)precision, it is not absurd to say that you have pretty good reason to believe that mu is higher than 0.1.

    Now, if someone tells you that each observation has a different precision, then you should take this into account. And your assessment of the evidence will change.

    How is this different from Bayes? If someone gives you Ybar, you will say that 0.1 is probable. But if afterwards someone says that Y1 and Y2 came from different distributions, you take this into account and change your beliefs.

  • September 13, 2013Joseph


    See the update to the post. But, you’re getting technical facts confused here. If you feed less info into the Bayesian machine it will increase the uncertainty of every hypothesis. In general it will move probabilities of any hypothesis closer to .5 (or whatever that hypothesis would have under the prior) which is the point of maximum uncertainty. Sometimes it will move them a lot, sometimes a little.

    This is all well and good. But it will NOT start assigning a very high probability to equation. You’re not going to get equation = .96, which implies there is high confidence the hypothesis is true.

    For Bayesians, throwing away data increases uncertainty, it doesn’t decrease it. I say again: SEV doesn’t give a weakened answer, but one still consistent with the truth. It gives the wrong answer.

  • September 13, 2013Anon

    You have:

    Ybar = mu1 + e,
    e~ norm(0,.25)

    And you observe ybar=1

    Assume a very uncertain normal prior on mu1:

    mu1~normal(0, 1000)

    then the posterior would be

    mu1|ybar ~ normal(1,0.25)


    P(mu>0.1 | Y=1)=0.964

  • September 13, 2013Anon

    I have ignored the term 1/1000, but you get the point. If you only have the observation ybar, the bayes analysis will say that you are pretty sure that the mean is above 0.1
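The calculation in the two comments above can be checked with a standard normal-normal conjugate update. This is only a sketch: it assumes the second argument in `normal(0, 1000)` and `norm(0, .25)` is a variance (which is what the quoted 1/1000 term and the 0.964 result suggest), and the function name `conjugate_update` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def conjugate_update(prior_mean, prior_var, obs, obs_var):
    """Normal prior combined with one normal observation of known variance."""
    post_precision = 1.0 / prior_var + 1.0 / obs_var
    post_mean = (prior_mean / prior_var + obs / obs_var) / post_precision
    return post_mean, 1.0 / post_precision

# Prior mu ~ N(0, 1000); observed ybar = 1 with sampling variance 0.25.
post_mean, post_var = conjugate_update(0.0, 1000.0, 1.0, 0.25)

# P(mu > 0.1 | ybar = 1), this time keeping the 1/1000 prior term.
p = 1.0 - NormalDist(post_mean, sqrt(post_var)).cdf(0.1)
print(round(post_mean, 4), round(post_var, 4), round(p, 3))
```

Keeping the prior term moves the posterior mean and variance only in the fourth decimal place, so the probability still rounds to 0.964.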

  • September 13, 2013Joseph


    That’s not the bayesian calculation. Given equation define two new variables equation and equation. Use it to get equation. Then,


  • September 13, 2013Anon

    What? We are assuming that you only know ybar. You do not know the individual variances, you only know the variance of ybar. How would you do the calculation with that information? Give a concrete example.

  • September 13, 2013Corey

    Anon, your calculation of the posterior given only equation is correct; it follows from equation .

    Anon, you seem to know a lot about it. Let’s clarify the key point: given the complete data, what stricture of the severity approach (or error statistics writ large) forbids the use of equation in the severity calculation?

  • September 13, 2013Corey

    Whoops! I used \approx ($\approx$) instead of \sim ($\sim$). Darn it.

  • September 13, 2013Corey

    Joseph, you don’t need a change of variables and integration (although it doesn’t do any harm). It follows straight from the stability property of the normal distribution that $(Y_1 + Y_2) \sim N(2 \mu, 1)$. If you actually work through your $\int dB P(A,B| \mu)$, you’ll arrive at the same place.
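The stability step can be written out as follows. It is a sketch that assumes the two observations are independent, with $\sigma_1^2$ and $\sigma_2^2$ standing for their variances (symbols introduced here for illustration); the condition $\sigma_1^2 + \sigma_2^2 = 1$ is read off from the $N(2\mu, 1)$ result quoted above.

$$
Y_1 \sim N(\mu, \sigma_1^2), \quad Y_2 \sim N(\mu, \sigma_2^2)
\;\Longrightarrow\;
Y_1 + Y_2 \sim N(2\mu,\ \sigma_1^2 + \sigma_2^2) = N(2\mu, 1)
$$

$$
\bar{Y} = \tfrac{1}{2}(Y_1 + Y_2) \sim N\!\left(\mu,\ \tfrac{1}{4}(\sigma_1^2 + \sigma_2^2)\right) = N\!\left(\mu, \tfrac{1}{4}\right)
$$

The variance of 1/4 is the 0.25 used in Anon's posterior calculation.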

  • September 13, 2013Anon

    Well, in this case it seems that it is the same justification for both Bayesian and frequentist.

    Given Y1 and Y2, no bayesian would use only ybar!

    How would they justify it?

    They would say that when you use ybar you lose important information.

    This is no different for the frequentist. Given that you know the variables had different precisions, you should use it to get the best inference possible.

  • September 13, 2013Corey

    Anon, Bayesians would say, “Condition on all of the available data.” This doesn’t seem to be a tenet of frequentism in cases with no sufficient statistic.

    In particular, it seems to me that severity requires a reduction of the data to a single real statistic. In the space of possible data, this reduction defines level sets of equal statistic value; Pearson would have us choose a statistic such that by moving across the level sets, we become “more and more inclined on the information available, to reject the hypothesis tested in favour of alternatives”. Is this always possible? If not, in cases where no acceptable reduction exists does error statistics simply balk?

  • September 13, 2013Joseph


    Actually, I take back the last couple of comments and apologize. You’re right that with the average statistic SEV is getting the Bayesian posterior conditional on part of the information.

    This really changes my intuition for what’s happening with SEV (as well as what Bayes does when you throw away information). It really improves my understanding at least. If SEV uses all the information it gets the Bayesian posterior. If it throws away some of the information it’s getting the Bayesian posterior conditional on that reduced information. Very interesting!

    But this is where philosophy matters. Suppose an Error Statistician does as Mayo advocates and conducts multiple probes of $H: \mu>.1$, one using the sufficient statistic and one using the average. They would have to conduct multiple tests to get a more complete picture if no sufficient statistic were available. What will they conclude?

    For the sufficient statistic they will get a very low SEV for H, indicating “H has not passed a severe test”. When they use the average they will get “H has passed a severe test”. According to Error Statistics the former doesn’t mean H is wrong, it only means it hasn’t passed a stringent test. Once they combine that ambiguous result with a strong pass, they’ll conclude H is true.

    The Bayesian looking at the two sets of numbers merely says “My best guess based on the best information available is that H is false”

    This means that if H is false and a Bayesian gets a small P(H|K), and there is any functional reduction of the data k=F(K) for which P(H|k) is large, then the Error Statistician is liable to conclude H is true from the same data K.

    Moreover, if the distribution has no sufficient statistic, there is no test statistic which uses all the information, so every test statistic will effectively be a P(H|k). Who knows what will happen, especially if the amount of data is large, so that any given statistic likely throws lots of the information away.

    They’re interpreting the numbers wrong, and getting the wrong final answer because of it. They could patch this up by requiring that only sufficient statistics should ever be used, but there’s no frequentist justification for doing that, and it would only sometimes work.

    All these problems are solved instantly by just using the Bayesian posterior and interpreting it as a probability. The Bayesian posterior really is what’s fixing those frequentist tests.

  • September 13, 2013Anon

    “Condition on all of the available data.”

    That is valid for all frequentists and bayesians. Actually, let’s rephrase that. That should be the case for all scientists.

    I’m not defending applied classical statistics as it is done today. What a lot of theoretical statisticians do is this: make up a statistic, do a Taylor expansion, figure out the asymptotic distribution; or make up a statistic, simulate it and figure out the distribution.

    Then applied researchers start using the statistics.

    Just because you have the distribution of a statistic, it does not mean it is useful. The measure of your statistic can make no sense at all to the problem you are facing.

    And, as far as I can see, this is a problem BOTH for bayesians and frequentists.

    If the only information you have available is a nonsensical statistic (like ybar), then you will get nonsensical results.

  • September 13, 2013Anon

    “The Bayesian looking at the two sets of numbers merely says “My best guess based on the best information available is that H is false””

    The scientist facing two tests should do the same. One of the tests treats all observations as if they came from a N(mu, 0.25). The other test treats each observation with its respective precision. Which one uses all the information available?

    If you only knew that both observations came from N(mu, 0.25), then your best guess would actually be that mu>0.1. But that is because you are constructing the evidence (the statistic) in this way.

    My guess (I say guess because I have not thought of all examples and have not demonstrated it) is that when there is no good information available (that is, when the statistic is not good), both frequentists and bayesians will very likely be wrong.

  • September 13, 2013Anon

    Now, just repeating, I’m not defending applied classical statistics as it is done today. It is a complete mess. I point to Gigerenzer or McCloskey surveys on that.

  • September 13, 2013Joseph


    That’s not all the information they have available. Behind the scenes (from a Bayesian perspective) we can see that they’re only using part of the information k=F(K), but they don’t say “H is warranted by k” they say “H is warranted by K” for the very good reason that they do actually have K.

    A Bayesian can see that they’re really doing the former and not get tripped up, but they will claim the latter and be wrong.

    Once SEV deviates from P(H|K), it gives nonsense. There is no getting around this. If you want to artificially restrict SEV to cases when it’s equal to P(H|K) then great! Me too!

    But frequentists don’t want to make this restriction.

  • September 13, 2013Anon

    “Once SEV deviates from P(H|K), it gives nonsense. There is no getting around this. If you want to artificially restrict SEV to cases when it’s equal to P(H|K) then great! Me too!”

    I don’t think your example proves your claim, for two reasons:

    i) when we use ybar, both SEV and P(H|K) – with uncertain prior – are giving nonsense (nonsense because we actually know the truth in the example).

    ii) we could put a strong prior on mu, and we could make P(H|K) as arbitrarily far from SEV as we want. For example, we can calculate SEV with the weighted mean, that would provide strong evidence for mu0.1)=90% even if we used all information available.

  • September 13, 2013Anon

    hat would provide strong evidence for mu0.1)=90% even if we used all information available.

    (don’t know what happened, but there was a problem in the comment above, some words are missing)

  • September 13, 2013Anon

    Ok, the same problem happened again!

  • September 13, 2013Corey

    Anon, if you use the < character, it appears raw in the HTML and your browser interprets it as the opening of an HTML tag. Words following the open tag disappear because your browser wants to treat them as markup, not text. If you write &lt; it will appear as <.
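For anyone preparing a comment programmatically, the escaping described above can be sketched with Python's standard `html` module. The sample string is made up for illustration.

```python
import html

raw = "P(mu<0.1)"            # hypothetical comment text containing a bare '<'
safe = html.escape(raw)      # replaces &, < and > (and quotes) with HTML entities
print(safe)                  # P(mu&lt;0.1)

# A browser renders the entity back as the original character.
assert html.unescape(safe) == raw
```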

  • September 13, 2013Joseph


    This is a serious stretch. If you use a prior for mu which contains equation in its high probability manifold (i.e. the prior is consistent with the data) then the P(H|K) is going to be an improvement over anything SEV can do. You could for example, have SEV(H) implying an H which we know from the prior information is impossible. That just makes the posterior look better and the fact that you can use a prior inconsistent with the truth to make things worse is irrelevant.

    Seriously, there is nothing, absolutely nothing, in the Severity concept which says we can’t use that test statistic to test how warranted H is by K. In fact, they insist that SEV has exactly this flexibility. It’s a major selling point for them. They don’t want SEV restricted in the way you or I would.

    Now after the fact, we can see it was a dumb statistic, and from a Bayesian perspective we see it’s not using all the information, but IT’S PERFECTLY OK BY THEIR OWN CRITERION.

  • September 13, 2013Anon

    We could see it was a dumb statistic before the bayesian calculation, since there was another statistic that used the full information set available.

    Now, maybe you should come up with a different example where both Bayesian and frequentist will use the same info (that is, the Bayesian likelihood and the frequentist likelihood will be the same) and while the frequentist gets nonsense, the Bayesian does not get a nonsensical result.

  • September 13, 2013Anon

    I have to go now, but I will come back later to read your thoughts on these topics, (Joseph and Corey). So please elaborate further, I’m enjoying our discussion. Best.

  • September 13, 2013konrad

    I agree with Joseph: if the SEV claim is that it works with any statistic, then he only needs to demonstrate one choice of statistic that breaks it. We can _imagine_ (without needing to demonstrate) scenarios where we don’t know a priori which statistics make “sense” and which don’t – this is why working on _any_ statistic is a selling point in the first place.

    Regarding the claim that “condition on all of the available data” is valid for all frequentists and Bayesians: no, false on both counts.

    First Bayesians: we can condition on whatever we like. Whenever we condition on something different, we are calculating a different quantity – this is fine provided we keep track of these different quantities and don’t start using them as if they were the same thing. Thus p(x|I_0) reflects what we would believe about x if the available information were I_0, while p(x|I_1) reflects what we would believe about x if the available information were I_1. There are many reasons we might want to calculate both of these quantities. E.g. we might be trying to judge what would be most useful to learn, so as to decide which future experiments to perform. Or we might be trying to judge how sensitive our model is to neglecting certain parts of the available information (perhaps to save on computational costs). But what we don’t do is to calculate p(x|I_0) and then proceed as if what we calculated is actually p(x|I_1).

    Next, frequentists: the whole hypothesis testing setup is based on the idea that you can choose whatever test statistic you like (where a test statistic is a summary of the data, i.e. a subset of the available information). A more skilled modeller may come up with a better statistic than a less skilled modeller, but the point of the framework is that it is supposed to safeguard even the less skilled modeller against incorrect conclusions. Thus a poor choice of test statistic may lead to an underpowered test, but should still provide a guarantee against false positives. When this is not the case, the whole foundation crumbles. The idea that frequentism does _not_ force you to use all the available information is pretty central.

  • September 13, 2013Corey

    konrad, I agree that we can condition on different things to see what can be inferred in different states of information, and that this can answer interesting and/or instrumentally important questions. My point is that if the question is, “What can be inferred from some specific set of data?” — as it almost always is, in science — then in general we need to condition on all of the data, not just a lower-dimensional function of it.

  • September 13, 2013konrad

    Agreed – if we condition on a different set of information, we are answering a different question (namely, what can be inferred from _that_ information?).

  • September 13, 2013Anon

    “First Bayesians: we can condition on whatever we like. Whenever we condition on something different, we are calculating a different quantity”

    You can condition your test on whatever you like too, and you are calculating a different quantity.

    If you assume that your measurements have the same variance, as Joseph did, then you will conclude that the data is good evidence for the discrepancy. Of course, that is a wrong assumption, but you can do it. Just like you can do the Bayesian posterior with the same assumption and get the same results. Both approaches will fail.

  • September 13, 2013Anon

    “The idea that frequentism does _not_ force you to use all the available information is pretty central.”

    If anyone has said that, that person is the one to blame. All relevant information must go to testing, including background information.

    Of course, there are a lot of theoretical statisticians developing tests that make no sense, both Bayesians and Frequentists. That does not mean you should use them.

    For example, Bayes Factors. Or “Full Bayesian Tests”. Or testing precise hypotheses with priors on point nulls. People develop this kind of stuff. And the results that come out of it can be as nonsensical as you want.

  • September 15, 2013konrad

    Anon, I think you are missing the point. Sure, if one makes an incorrect assumption one should not be surprised to get an incorrect answer. But the point is that the frequentist test described in the post _does not_ make the assumption that the measurements have the same variance. It just constructs a test based on a statistic, and one does not need to make any assumptions about measurement variances to construct a test based on a statistic. So the point is that one gets an incorrect answer _without_ making an incorrect assumption – this is why the methodology is problematic.

  • September 16, 2013Anon

    When you test with the statistic Ybar, you are making a wrong judgement, just as if you had updated your prior with Ybar.

    When you sum both variables without taking account of the different precisions, you are acting as if both of them have the same precision. You chose to ignore this information – so, yes, you are acting as if both observations have the same importance, and this is a wrong assumption.

    Now, if you think it is wrong updating your prior with Ybar, because you know you have more information than is contained in Ybar, you cannot justify testing with Ybar either. You are losing important information in both cases, so if you claim one methodology is wrong because it could use Ybar and get a nonsensical result, the other methodology is also wrong because mathematically it could also use Ybar to update the prior and also get a nonsensical result.

    But it is easy to see that the problem in both cases is not the methodology, but the wrong application of it.

  • September 16, 2013 Joseph



    You are seriously getting this wrong. It’s like looking at two brothers, one who saves his money and is rich and the other who spends his money and is poor, and then claiming “see they both have money problems because if the rich brother put his money in a pile and burned it he’d be poor too”.

    The Bayesian calculation, without effort or special notice, and without making any choices about estimators, automatically uses all the info. Even if there are no sufficient statistics.

    SEV does not. And if there aren’t any sufficient statistics, then SEV never can.

    But that’s not the worst of it for SEV, because they themselves don’t see their procedure as “throwing away information”. That understanding makes perfect sense to me from a Jaynesian perspective, but they want to expressly deny that perspective. They view SEV applied to that bad estimator as a perfectly legitimate way to probe the hypothesis. They believe the data (all of it) is showing that H passes a severe test. They are simply wrong in this.

    Moreover, I think you and most others are greatly underestimating how real a problem this would be in practice. In large scale simulations involving multiple complicated data sources, the sample average and sample variance are often the only statistics ever used. Nate Silver mentioned something about this in a recent talk when he basically said sample average is king. In that realistic setting it would be a highly non-trivial problem to identify when this is occurring and would be practically impossible in most cases. The fact that the Bayesian posterior doesn’t suffer from this problem would be a huge practical advantage.

    But there is no getting around the basic point. According to their Frequentist ideology SEV should work for this statistic. It doesn’t. Anyone who doesn’t like that is free to artificially (according to their frequentist principles) restrict it to cases where it matches the simple posterior, which is perfectly fine with me.

  • September 16, 2013 Anon

    Ok, I think you are right when you claim that people do not see this in general as a wrong application. I do, but in practice you are right that most people don’t. In this case in particular, since it is obvious, people would see it as wrong. But in other cases people wouldn’t.

    I have faced this problem with unit root “tests”. I have shown people that their tests are irrelevant, because the metric chosen is not appropriate for their problem, EVEN if the test has high severity and low type I error. The problem is that when testing, they are making “hidden” assumptions about the data and losing important information. They think that since they have the statistic’s sampling distribution, that is all that matters, but it is not. Most people don’t get this and do complete nonsense: for example, applying all different kinds of tests with no idea how to combine the evidence, usually saying, ok, this test was significant, this test was not, etc.

    Now, even with what I have laid out above, my take is that, as far as I can see, the Bayesian approach would suffer from the same problem. Maybe you should come up with a different example that illustrates the situation Nate Silver points out.

  • September 16, 2013 Joseph

    A straightforward mindless idiot application of Bayes’ theorem doesn’t suffer from this problem. It only suffers from it if the Bayesian goes way out of their way to screw it up.

    And incidentally, I don’t think the Bayesian’s answer with the reduced statistic is really “wrong”. It’s “right” in the sense that if all you really knew was the reduced statistic then the Bayesian answer is making a reasonable guess. The goal of Bayesian statistics, after all, is to make the best guess possible from a given state of information.

    A given state of knowledge doesn’t always contain enough information to really be useful. So sometimes when you make a best guess from uninformative information those guesses don’t agree with reality. There is no way to avoid this other than using more information.

    Of course, Mayo as a frequentist doesn’t think about the problem this way. She’s openly admitted she has no idea why Jaynes is concerned about “information” and views it as at best an unnecessary veneer laid over statistics and at worst utter nonsense. Which in retrospect is why SEV screws things up.

    So the Bayesian answer is making the best guess possible given the information fed into it. When you feed more information in, you get better guesses. I don’t see how this is a failure of Bayesian Statistics and all I can recommend is that if you have relevant information be sure to use it in your analysis.

    It would be nice if we could take “nothing” and make accurate inferences about the real world. Indeed I could think of all kinds of ways to use such an oracle if it were possible. But it isn’t. Inferences are based on information and the quality of inferences has to depend on that information somehow.

  • September 16, 2013 Anon


    The rationale for severity is that, were your hypothesis false, with high probability you would have had a statistic that fits the hypothesis less well than the one you actually have.

    In the present case:

    1) the measure of fit, of distance, is well defined – the bigger the sample mean, the less it fits with the hypothesis that mu = 0. The only problem is that we do know that measurements have different precisions, so we should take this into account (but I’ll get back to this later)

    2) your error probabilities are correct. The distribution of the sample statistic is correctly derived.

    So your example is correct, and we could correctly say that the hypothesis H (mu > 0.1) severely passes test T (96% severity) with outcome y (y1 = 2, y2 = 10^-10).
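
    As a hedged check of the 96% figure, the arithmetic can be written out. This assumes, as the numbers in this thread suggest but this comment does not state, that the first measurement has standard deviation 1 and the second a negligible one, so that Ybar has standard deviation 0.5. The symbols \bar{Y}, \bar{y} (the observed mean, about 1) and \Phi (the standard normal CDF) are notation introduced here for illustration.

```latex
\mathrm{SEV}(\mu > 0.1)
  = P\left(\bar{Y} \le \bar{y};\ \mu = 0.1\right)
  = \Phi\!\left(\frac{\bar{y} - 0.1}{\sigma_{\bar{Y}}}\right)
  \approx \Phi\!\left(\frac{1 - 0.1}{0.5}\right)
  = \Phi(1.8) \approx 0.964
```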

    Now, why could we say that? Because this test result is “reliable”, in the sense that only 4% of the time we would get it wrong if the truth were mu < 0.1. Unfortunately, in the present case, we are in that 4% of the time. Because the imprecise measurement gave us a result of 2, an outlier when mu = 0, and ~~ because we did not take into account the different precisions of the measurements ~~ the noisy measurement, the outlier, dominates our test.

    But does that mean that the rationale is nonsensical? No, it doesn't. If this were the only test available for us, it would indeed be rational to believe the result of the test, since it is a reliable procedure – only 4% of the time we would get it wrong. And this is a procedure that would lead us to discover the error, if we were in error. We could repeat our measurements and, even with this simple mean, we would see that this first result was an outlier, thus learning from error.

    But we do know another statistical test ~~far superior~~ to this one. For example, the power of the test with ybar is pretty low for ranges where the power of the test with the weighted average is 100%. Given this knowledge, one should use the more reliable test, in the same way that you should update the prior with Y1 and Y2, and not with Ybar.

    So my point here is that this example does not invalidate the rationale for severity assessment, even though it does warn people that the SEV, like the p-value, is not an absolute number that you can just calculate blindly without proper knowledge of the problem.

  • September 16, 2013 Joseph

    “But does that mean that the rationale is nonsensical? No, it doesn’t”

    Uh, yes it does. What you’re saying is almost right, but SEV gets it wrong for a very specific reason. It takes into account information which is basically irrelevant to the problem (the first measurement), which the Bayesian calculation is effectively ignoring.

    The error that SEV is making is the same as the one being joked about here:

    Which was ridiculed by Mayo on the grounds that no Frequentist would be stupid enough to make this mistake. The dice roll is clearly irrelevant to the Sun exploding. Yet that’s exactly the mistake SEV is making here! The first measurement is effectively irrelevant to the question but SEV is taking it into consideration anyway.

    SEV is getting this wrong specifically because it’s trying to judge things using just the sampling distribution, rather than keeping the data fixed and using the posterior. The rationale for using the sampling distribution in the way you describe is invalid except in special instances when it gives the same result as the posterior, in part because it’s liable to include information which is irrelevant to the question at hand. So yes, the reasoning is wrong.

  • September 16, 2013 konrad

    Anon, you claimed that using Ybar in a frequentist framework implies an assumption that all measurements have the same variance. On what do you base this claim? Even if I were to agree that it implies some assumption (which I don’t), why would it imply that _particular_ assumption rather than, say, the assumption that the measurement variances are different but unknown?

  • September 16, 2013 Anon


    I’m saying that choosing to ignore the known variances is equivalent to treating both observations as having the same precision.

    Imagine a situation where you do not know each variance in particular, only the variance of ybar. Then the test result is OK and it actually agrees with the Bayesian posterior with a very flat prior, as we have seen.

    But in this case you do know the variances. If you do use the variances, you have a more reliable test, with 100% power against very small discrepancies. So, you have two tests. One of them is equivalent to assuming equal variances for both observations (which you know is not true). The other one uses all the information you have and it is more reliable (it has more power for the same type I error). If you choose to use the first test, you are acting as if you did not know the variances were different, when you actually do. This is akin to choosing to update your prior with ybar when you do know y1 and y2.

  • September 16, 2013 Anon

    But I’m not saying that it implies a particular assumption, it implies all assumptions that are equivalent to treating both observations as having the same precision.

  • September 16, 2013 konrad

    I am unclear on which notion of equivalence you are using here – it seems to be one in which an assumption can be “equivalent” to a method (i.e. to “treating both observations as…”) and I’m not sure the issue can be fixed by simple rephrasing. There is clearly a difference between assuming the precisions are the same and assuming they are different but unknown – these assumptions cannot be equivalent to each other for any sensible definition of equivalence (e.g. in the Bayesian framework they would imply different models). So, which of these two assumptions is equivalent to “treating both observations as having the same precision”, and why? Why not the other assumption? Is it accurate to say that ignoring the precisions is treating them as being the same (and why)? More generally, how do we tell whether a given assumption is “equivalent” to a given method?

    (Avoiding questions of this type is exactly why the frequentist framework is set up so that a test statistic can be used _without_ committing to an assumption.)

  • September 17, 2013 Anon

    Imagine two worlds:

    A) You have only observations in which the error term is distributed N(0, 0.25).

    B) You have two sets of observations y1 and y2 (that is, you have two separate error terms with different precisions).

    In A you have only that information, that is, observations with error N(0, 0.25). So, if you want to do a Bayesian calculation, you can only update your prior with the observations with errors N(0, 0.25). If you want to test, you can only test with this precision. Your test is not very powerful: it has only 7% power against discrepancies as big as 0.1, for example, at a significance level of 5%.

    In B you have more information. Now the Bayesian analysis can update the prior considering the different precisions, which gives a more accurate answer. And you can also use a test that considers the different precisions, which is much more powerful than the other test (it has 100% power against a discrepancy as big as 0.1 at a significance level of 5%).

    Now, if you are in world B, you could also act as if you were in world A, that is, you could act as if you had only observations with error term N(0, 0.25). That does not mean you should do that, but you could, either by using inferior information to update your prior or by using an inferior test. That is the notion of equivalence. If you choose to ignore the information, you are acting as if you were in the world without that information.
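
    The Bayesian side of the two worlds can be sketched numerically. This is only an illustration under stated assumptions: a flat prior on mu, known normal errors, a first measurement with standard deviation 1 and a second with standard deviation 1e-10 (values inferred from the thread, not given in this comment). The helper name `posterior_prob_above` is hypothetical.

```python
from statistics import NormalDist


def posterior_prob_above(threshold, observations, sds):
    """P(mu > threshold | data) for independent normal measurements of mu
    with known standard deviations and a flat prior on mu."""
    precisions = [1 / s**2 for s in sds]
    total_precision = sum(precisions)
    # Posterior mean is the precision-weighted average of the observations.
    post_mean = sum(p * y for p, y in zip(precisions, observations)) / total_precision
    post_sd = total_precision ** -0.5
    return 1 - NormalDist(post_mean, post_sd).cdf(threshold)


y1, y2 = 2.0, 1e-10          # observations quoted in the thread
sigma1, sigma2 = 1.0, 1e-10  # ASSUMED standard deviations

# World A: only ybar is known, with error N(0, 0.25), i.e. sd 0.5.
world_a = posterior_prob_above(0.1, [(y1 + y2) / 2], [0.5])

# World B: both observations and both precisions are known.
world_b = posterior_prob_above(0.1, [y1, y2], [sigma1, sigma2])

print(f"World A: P(mu > 0.1 | ybar)   = {world_a:.3f}")
print(f"World B: P(mu > 0.1 | y1, y2) = {world_b:.3g}")
```

    Under these assumptions world A gives about 0.964, the same number as the severity computed from Ybar, while world B gives a probability that is zero to machine precision.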

  • September 17, 2013 konrad

    Ok, one more try before I give up:

    1) In frequentist analysis (specifically, hypothesis tests controlling false positive error rate), a test is either valid or it is not. If it is valid, it provides a guarantee against false positives. Among valid tests, some may be inferior to others – an inferior test is one which has weaker power (while still retaining the same guarantee against false positives). The point is that if a valid test gives a positive result you can believe it, and do not need to go in search of a more powerful test because the one you used is already powerful enough to detect the signal in your data set. The only way you would need to replace the test with a different one is if it is not valid – for this to be the case there needs to be an actual error in the methodology.

    2) You are not addressing my questions at all. Specifically, I raised the possibility where the precisions are unequal but unknown. So we are in your World B, but we cannot plug the precisions into the calculation because we don’t know them.

  • September 17, 2013 Corey

    @Joseph, in reply to “THEY DON’T SEE IT AS A WRONG APPLICATION!”

    <dons error statistician hat>

    You keep repeating this, and it keeps not being true. The very first thing I asked Mayo for was rules/guidelines for choosing statistics. She never really answered me, but she alluded to what the answer would look like, both on her blog and in the paper of hers you cite (see the first two occurrences of the word “agreement”).

    <doffs error statistician hat>

  • September 17, 2013 Joseph


    See page 164 of that paper for a definition of what passing a severe test means. S1 and S2 are both satisfied in this case, just as in the examples I took it from.

    The frequentist rationale would apply to any statistic. Some statistics may be more useful than others in the sense that some are a more complete probe than others, but there is absolutely nothing in the frequentist rationale which suggests you should get blatantly wrong answers that contradict what a simple look at the data implies. The examples she actually cites require the low p-value for H_0 and high SEV in order to justify the statement “H passes the test with high severity”. Those were provided in this case.

    See for example Anon’s comment above which includes:

    “But does that mean that the rationale is nonsensical? No, it doesn’t. If this were the only test available for us, it would indeed be rational to believe the result of the test, since it is reliable procedure ”

    He’s saying this because frequentist understanding of the problem leads them to believe the procedure would only fail 4% of the time. That’s the criterion Mayo and other frequentists want to use. They believe it’s perfectly legitimate and their belief causes them to get this one wrong. It may only fail 4% if their assumption about the frequency of future events actually holds true, but regardless of whether it does or not, it fails in a way that’s obvious from a simple intuitive look at the data.

    Note: it doesn’t just fail because they got unlucky. It fails because it trivially contradicts what a simple look at the data implies.

    If Mayo wants to partially reject this frequentist understanding and to restrict the range of test statistics down to those which imply the Bayesian result, then great. I’m all for it. But then how exactly does she claim SEV fixed the relevant howlers when Laplace was using mathematically identical procedures on the identical problem two centuries ago?

    Look, frequentists have a great advantage here. Every time a problem is found with their procedures they can patch it up with another intuitive ad-hockery. Then we find another problem, and they patch it up again. Always moving ever closer to the Bayesian result, but never acknowledging it. So I wouldn’t be surprised if Mayo wants to shift the goal posts in this way. But I’ve got plenty more examples where that came from.

    Incidentally, for the NIID case what do you suppose the equation equals?

  • September 17, 2013 Joseph


    Look at the definition for passing a severe test on 164 and then look at the example mid page on 169. Then think about the frequentist rationale for these procedures. Where in any of that do you see even a hint of the idea that if you choose the wrong T, you can get H passing a severe test even though H is over a billion standard deviations from the obviously correct area observed just by inspecting the data?

  • September 17, 2013 Joseph

    Or Corey, here’s another way to look at it. Where in the philosophical justification for severe tests on 164 does it explain that one of these test statistics is legitimate while the other shouldn’t be used?

    (a) equation
    (b) equation

  • September 17, 2013 Corey

    Joseph, it’s in (S-1): “for a suitable notion of accordance”. Annoyingly, what is and is not a suitable notion of accordance is never made explicit, although Pearson’s Step 2 and the “agreement” quotes I cited give an inkling. I feel pretty confident that something reasonable and objective involving first- and/or second-order stochastic dominance could be defined to give a partial preference order on test statistics.

  • September 17, 2013 Corey

    Regarding equation: I'm pretty sure that equation is only meant to be used with one-sided composite hypotheses — at least, I've never seen Mayo or Spanos use it with anything else. Once again, they've failed to fully explain an important point; in this case, they've failed to specify the domain of equation.

    In the NIID case? Well, you haven't specified the stopping rule, so the question is underspecified.

  • September 17, 2013 Anon


    You are mistaking your lack of knowledge about the method for a failure of the method. Not every test is equal, there are more powerful tests, there are measures that are relevant to the problem and measures that are not.

    You are also contradicting yourself. You say that when only Ybar is available it is not rational to believe the test result, when we have showed that it is with a flat prior and you have agreed with that. So if you think one method is flawed, the other is flawed too, because it would lead to the same conclusion in the same circumstances… We can easily see that the problem in your example is not the method, but the wrong application of it – I could use the same example to “prove” that Bayesian analysis is wrong when updating with Ybar. And you could easily see that it is blatantly wrong to criticize a method by wrongly applying it.

    Now, where is my answer to Konrad? I have answered him earlier and I can’t find my answer.

  • September 18, 2013 Joseph


    “it’s in (S-1): “for a suitable notion of accordance”. Annoyingly, what is and is not a suitable notion of accordance is never made explicit,”

    It is made explicit in the examples. See mid-down page 169. I met S-1 the same way Mayo did. Both S-1 and S-2 are explicitly satisfied.

    “I’m pretty sure that equation is only meant to be used with one-sided composite hypotheses”

    So SEV can’t even handle an absolutely simple and necessary generalization to trivial problems. How again does it fix classical statistics or serve as the foundation to applied statistics?

    This wasn’t an innocent question either, because as soon as you start to define things like that then Cox’s theorem starts exerting itself.


    “Not every test is equal,”

    I’ve said this over and over again: I know not every test is equal. Not every test is equally useful for SEV or other frequentists. But there is absolutely nothing in the philosophy behind SEV to suggest that it will explode with the wrong T. You should be able to use any T, it’s just that some will be deeper probes than others.

    “You say that when only Ybar is available it is not rational to believe the test result, when we have showed that it is with a flat prior and you have agreed with that.”

    I agreed with that from a Bayesian perspective. The creators of SEV don’t agree with that Bayesian perspective and think it’s completely wrong and nonsensical. Within the frequentist/SEV world that calculation shows H has passed a severe test, end of story. So they take data which is clearly showing H can’t be true, and use that data to conclude “H has passed a severe test”. How much more wrong would they have to be before you admit they get it wrong?

    Whether or not a Bayesian can patch SEV up enough to work in this problem is completely irrelevant.

  • September 18, 2013 Anon

    Reposting the answer to Konrad:

    “The point is that if a valid test gives a positive result you can believe it, and do not need to go in search of a more powerful test because the one you used is already powerful enough to detect the signal in your data set.”

    Konrad, what you have said above is not true, either from the mathematical point of view or from the methodological point of view.

    If a test gives a positive result that does not mean that “the one [test] you used is already powerful enough to detect the signal in your data set”. The power of a test to detect a discrepancy as big as 0.1 does not change whether you have a positive or negative result. In our example, the power of the test to detect a 0.1 discrepancy is ALWAYS 7% when alpha is 5%. So this test is always a poor test, in the sense that it is not reliable to correctly detect the 0.1 discrepancy, irrespective of what result it gives.
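
    The 7% and 100% power figures can be reproduced with a short sketch. It assumes a one-sided test of mu <= 0 at alpha = 5%, a first measurement with standard deviation 1 and a second with standard deviation 1e-10 (inferred from the thread, not stated in this comment), and a precision-weighted average as the "other" statistic. The helper `power_one_sided` is a hypothetical name.

```python
from statistics import NormalDist

std_normal = NormalDist()


def power_one_sided(sd_stat, discrepancy, alpha=0.05):
    """Power of the test 'reject mu <= 0 when T > z_alpha * sd_stat',
    evaluated at mu = discrepancy, for an unbiased normal statistic T
    with standard deviation sd_stat."""
    z_alpha = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_alpha - discrepancy / sd_stat)


sigma1, sigma2 = 1.0, 1e-10  # ASSUMED standard deviations

# Plain average Ybar = (Y1 + Y2) / 2.
sd_ybar = (sigma1**2 + sigma2**2) ** 0.5 / 2

# Precision-weighted average of Y1 and Y2.
sd_weighted = (1 / sigma1**2 + 1 / sigma2**2) ** -0.5

power_ybar = power_one_sided(sd_ybar, 0.1)
power_weighted = power_one_sided(sd_weighted, 0.1)

print(f"power with Ybar:             {power_ybar:.3f}")
print(f"power with weighted average: {power_weighted:.3f}")
```

    As the comment says, the power does not depend on the observed data at all: it is a property of the statistic and the cutoff, roughly 7% for Ybar and essentially 100% for the weighted average under these assumptions.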

    Second, the logic of tests and designing experiments is to: (i) find the most accurate test possible, always searching for where your experiment could be in error; and (ii) actually repeat your experiments and improve them whenever possible, controlling the sources of error (both systematic and non-systematic errors) so as to put your theories under stringent scrutiny.

    Let’s suppose the 0.1 discrepancy is relevant to find out, if it exists. Then if one uses Ybar, one should be seriously questioned why she is doing experiments that will have only 7% chance of finding an effect this big when it exists. And, more seriously, should be questioned why she isn’t using the other much more reliable test that has 100% chance to detect the same effect.

    When you have two results from two instruments with different reliabilities (power, measures), you will trust the more reliable one (the more powerful one, the more adequate measure to your inquiry).

    So this claim is incorrect: “The only way you would need to replace the test with a different one is if it is not valid”. As I have said in all my comments before, you can have formally valid tests that are useless for most practical problems (even Bayesian tests).

    “You are not addressing my questions at all. Specifically, I raised the possibility where the precisions are unequal but unknown.”

    Yes, I am. You asked about the concept of equivalence, and I have tried to make clear the concept of equivalence in general. In the specific case where you have unequal unknown variances, this would not be equivalent to equal known variances, because you would have to estimate the variance from the data – that is, you would have even less information.

  • September 18, 2013 Joseph


    If you consider equation for these two estimators, then one of them will give results consistent with what the data actually implies, and one of them will directly contradict the data.


    There is nothing in the philosophical justification for SEV to indicate one of these shouldn’t be used. They may not both be useful, but neither is disallowed.

    So what happens if we use both?

    Now if you do a simple mindless Bayesian calculation conditional on both of them you’ll get the truth.

    What happens if you looked at SEV for both of these? According to the official philosophy, a low value of SEV doesn’t mean H is wrong, it just means that it didn’t pass that particular test with high Severity. So an error statistician would look at both estimators and say,

    “H didn’t pass one test with high severity, but it did pass the other test with high severity, therefore taken together the data provides some decent evidence for H”

    A carpenter ignorant of any mathematics above arithmetic would have gotten this right, just like the Bayesian, but an Error Statistician gets it wrong.
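
    To make the contrast concrete, here is a rough sketch of the severity of H: mu > 0.1 under two different linear statistics. The two estimators referred to in this comment are not recoverable from the text, so the sketch ASSUMES they are the plain average and the precision-weighted average; the standard deviations (1 and 1e-10) are likewise assumptions, and `severity_mu_greater` is a hypothetical helper.

```python
from statistics import NormalDist


def severity_mu_greater(mu1, weights, observations, sds):
    """SEV(mu > mu1) = P(T <= t_obs; mu = mu1) for the linear statistic
    T = sum(w_i * Y_i), with weights summing to 1 and independent
    normal errors of known standard deviation."""
    t_obs = sum(w * y for w, y in zip(weights, observations))
    sd_t = sum((w * s) ** 2 for w, s in zip(weights, sds)) ** 0.5
    return NormalDist(mu1, sd_t).cdf(t_obs)


observations = [2.0, 1e-10]  # y1, y2 as quoted in the thread
sds = [1.0, 1e-10]           # ASSUMED standard deviations

equal_weights = [0.5, 0.5]
precisions = [1 / s**2 for s in sds]
precision_weights = [p / sum(precisions) for p in precisions]

sev_plain = severity_mu_greater(0.1, equal_weights, observations, sds)
sev_weighted = severity_mu_greater(0.1, precision_weights, observations, sds)

print(f"SEV(mu > 0.1) with plain average:    {sev_plain:.3f}")
print(f"SEV(mu > 0.1) with weighted average: {sev_weighted:.3g}")
```

    Under these assumptions the same data yield a severity of about 0.96 with one statistic and essentially 0 with the other, which is the situation the quoted error statistician would have to adjudicate.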

  • September 18, 2013 Joseph

    That last comment, by the way, is directly related to Cox’s theorem, since it is directly exploiting the fact that the Severity Principle isn’t combining evidence in a way consistent with the product rule.

  • September 18, 2013 Anon

    “So an error statistician would look at both estimators and say,

    “H didn’t pass one test with high severity, but it did pass the other test with high severity, therefore taken together the data provides some decent evidence for H””

    Look, I grant you this: people do that in practice. Not error statisticians, or statisticians, but scientists. They are social scientists who have no clue what they are doing, practicing a nonsense cult.

    So, when you say that people could do this you are right. If you take a sample of empirical papers in social science, you will see people doing 5 or 10 different tests and having no clue what to infer, just claiming “This was significant, this wasn’t”. And they think they are doing science.

    And my biggest problem with Mayo is that she does not fight against that. She actually fights against people who point out this problem, like Gigerenzer or McCloskey! What a contradiction! She prefers to argue against Bayesians, who are the least of our problems!!! Our biggest problem is that very important and intelligent researchers are doing nonsensical significance testing and, thus, giving a bad name to what could be a sensible approach.

    Now, back to the theory.

    “So what happens if we use both?”

    You have two tests, T1 and T2. The first one is very unreliable. It will give you the correct answer only 7% of the time when there is a discrepancy as big as 0.1. The second one is perfectly precise. It has 100% power to detect what you want to detect. You are going to choose the best instrument for your inference. Or, even better, you can combine both tests into a more powerful test. There is no mystery in this. The logic of error assessment is to analyse how you could be in error, and avoid it.

  • September 18, 2013 Joseph

    “Look, I grant you this: people do that in practice. Not error statisticians, or statisticians, but scientists.”

    Anon, this is not a mistake. It’s an official part of their philosophy. That’s the way you’re supposed to do it according to them. They brag about it even.

    Also, note it’s not a question of the results being consistent with the truth. Anyone can be fooled by misleading data. At issue is the consistency between their results and what a simple intuitive look at the data reveals.

    The data isn’t misleading here. They’re just processing it wrong.

    Once again, the fact that you or I can use our Bayesian understanding to patch up their methods is irrelevant to whether their methods work on their own. If you want to restrict SEV to instances in which it agrees with Bayes I can’t argue with that. If you say “it would be stupid not to restrict them in that way”, well I can’t argue with that either. If that’s the best defense of SEV that anyone can come up with, then why waste any more time with SEV?

  • September 18, 2013 Joseph


    Consider this quote from page 159:

    “This is an important source of objectivity that is open to the error statistician: choice of test may be a product of subjective whims, but the ability to critically evaluate which inferences are and are not warranted is not.”

    or on 160:

    “Standard statistical hypotheses, while seeming oversimple in and of themselves, are highly flexible and effective for the piece-meal probes the error statistician seeks.”

    In other words, Mayo is bragging about how we can conduct piece-meal probes (tests) of hypotheses and then evaluate the “severity” of those tests. When you actually carry out what she considers a major selling point of the Error Statistics program you get answers that directly contradict the data!

  • September 18, 2013 Corey

    Joseph wrote: “Consider this quote from page 159:

    “This is an important source of objectivity that is open to the error statistician: choice of test may be a product of subjective whims, but the ability to critically evaluate which inferences are and are not warranted is not.””

    I concede the point — especially given the parenthetical remark that precedes your quote: “(Even silly tests can warrant certain claims.)”. I thank you for correcting my mistake.


    If you’re interested in being on the right side of disputes, you will refute your opponents’ arguments. But if you’re interested in producing truth, you will fix your opponents’ arguments for them.

    Black Belt Bayesian

    I still maintain that severity can be, if not fixed, then improved, by considering the choice of the test statistic along the lines I’ve given above before any given “post-data” severity assessment is carried out.

  • September 18, 2013 Joseph


    I think they can be improved too. According to Cox’s theorem, any improvement either is equivalent to using probabilities for hypotheses or still has problems.

    So any Frequentist who’s dead set against assigning probabilities to hypotheses can put Bayesian critics in a kind of Zeno’s Paradox. Each time a problem is brought up they can fix it but without going full Bayesian. Then the Bayesian has to find where the new methods break. With each iteration it becomes harder and harder to find where they break down because each iteration brings it closer to the Bayesian result.

    Remember SEV is not the first iteration of this process. SEV is there to fix previous frequentist iterations. SEV is already much closer to Bayes than previous efforts like p-values. To see this note:



    So the only thing preventing equation from being something like equation is that it doesn’t satisfy the product rule. So that’s where all the action is from here on out.

    That’s one approach to fixing SEV. Another way is to go full Bayesian. So it’s not like I’m lacking ideas for fixing it.

  • September 18, 2013 Corey

    Joseph, severity aims to quantify how well a test has probed an hypothesis for a certain error. This gives it an out — it’s not aimed at being a generic measure of plausibility per se, so it’s an open question whether it will fall afoul of Cox’s theorem. You’re right that the product rule is where the action is, though — it’s severity’s dependence on tail areas derived from the sampling distribution (as opposed to likelihood) that gives me a wedge to distinguish Bayes and severity in optional stopping.

  • September 18, 2013 Joseph


    My initial thinking was that SEV could save itself and avoid Cox’s Theorem, by agreeing with Bayes sometimes and then when it disagreed, it could say “this case is ambiguous, no determination can be made” or some sort of subterfuge like that.

    But my thinking has changed quite a bit. Anon’s insight that when T wasn’t a sufficient statistic you’re basically getting equation, at least in some cases, clarified quite a bit (that’s not how an Error Statistician would interpret that number, but numerically that’s what it’s equal to).

    Now I don’t see how SEV avoids Cox’s Theorem. The Severity of a hypothesis like equation is already a problem. I guess an Error Statistician can just say “it’s undefined for this hypothesis”, but then the applicability of SEV in practice would be almost nothing. Especially if you restricted T to sufficient statistics.

    Presumably equation whenever equation and also equation as equation, so SEV(H) can’t be arbitrary and it has to philosophically accord with the intuitive Severity Principle. I don’t see how it does this.

    But even worse, it’s clear that SEV must have some sort of consistency property. Presumably, if equation are informationally equivalent to equation in the sense that given one set of numbers you can derive the other set of numbers, then surely you must have that the conclusions from equation shouldn’t contradict the conclusions from equation, otherwise your final conclusions would depend entirely on which test statistics the statistician chose to use. This will be especially important when there are no sufficient statistics, so that no single T captures everything in the data.

    Also, it looks like you can get equation to be high and equation to be high as well. Presumably this should never happen?

    In addition, you can’t simply restrict T to sufficient statistics, since they don’t always exist. So it looks like SEV(H,T,X) will have to be reinterpreted in some way, because under the current interpretation a non-sufficient statistic can reverse the judgment from a more informative statistic.

    Taken together, especially with that consistency requirement, I don’t see how SEV avoids the wrath of Cox’s Theorem.

  • September 18, 2013Anon

    I will elaborate further, trying to explain why testing does not need to be consistent in the sense Joseph has put it, and why this is ok IF you interpret tests as they should be interpreted.

    But before that, let me just say something. Mayo is right in saying that silly tests can warrant some claims. Some silly tests can surely warrant some claims with high accuracy.

    In our example, even our Ybar test can warrant us that mu<10 FOR SURE.

    So, even this silly statistic that nobody would use – given the information we have in the example- would provide us a very good test against mu<10.

    I’m saying this because sometimes I do not know what you are discussing. If you are trying to discuss what Mayo wanted to say in this or that passage, this is a silly discussion, and I do not want to go down that road.

    The interesting discussion is what frequentist reasoning, properly done, can accomplish, and in this case it is clear that frequentist reasoning would tackle the problem correctly, without any Bayesian aid.

  • September 18, 2013Anon

    Sorry for the typos.

  • September 18, 2013Joseph


    Mayo’s stance on this is clear. It’s repeated in several papers, and probably 10 times on her blog, that a big advantage of Error Statistics is that you test hypotheses piecemeal with lots of statistics and you examine those tests to see whether the hypothesis passes them severely. So this isn’t just a matter of interpreting a few words. This is a key part of the Error Statistics philosophy.

    As to the more important point about whether consistency is required, first let me say that “consistency” doesn’t mean they get the same answer. You could for example, have one test be inconclusive and another go in favor of H. I would consider those consistent with each other.

    So there’s plenty of wiggle room here. But how could you ever tolerate having equation as well as equation? That’s saying the data provides strong evidence for both a hypothesis and its negation.

    How is the statistician to know which one to use?

  • September 18, 2013Corey

    I like the consistency requirement approach — I’d be interested to see some examples. But for myself, I’m going to stick to the optional stopping approach to force a distinction between Bayes and severity.

  • September 18, 2013Anon

    Joseph, in our very example we have the violation of consistency, because

    SEV(mu>0.1, Ybar) = 96%

    SEV(mu<0.1, weighted mean)=100%

    And I have explained how the statistician can know how to choose one result over the other.

    Now, to generalize this explanation to more circumstances I have to come up with other examples (like embedded hypotheses) for which calculating the power function and so on is not trivial; this will take a while.

  • September 18, 2013Corey

    Anon, it would be more interesting to have a case where two apparently equally good statistics conflict. In the present case the “correct” statistic is too obvious.

  • September 19, 2013Joseph


    I was thinking about something simpler. Just take two data points from equation and suppose they’re equation and equation.

    Let the two estimators be equation, then these estimators are identical and everything is perfectly symmetric, but we get


    I don’t see how SEV avoids a reinterpretation.

  • September 19, 2013Anon

    This case is even simpler, since you can just combine both observations to get a most powerful test.

    Informally, the result of T1 is what the evidence would indicate so far given only the observation Y1; the result of T2 is what the evidence would indicate so far given only the observation Y2. And if you combine both observations, you have a more precise test that tells you what the evidence indicates so far given both Y1 and Y2.
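    The exchange above can be sketched numerically. The observed values did not survive in the post, so the numbers here (sigma = 1, Y1 = 2, Y2 = -2) are purely hypothetical, as are the function names. The sketch assumes the usual one-sided normal form of severity with known sigma: for a claim "mu > mu0" the severity is P(T <= t_obs; mu = mu0), and for "mu < mu0" it is P(T >= t_obs; mu = mu0).

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def sev_greater(t_obs, mu0, se):
    """Severity for the claim mu > mu0: P(T <= t_obs; mu = mu0)."""
    return PHI((t_obs - mu0) / se)


def sev_less(t_obs, mu0, se):
    """Severity for the claim mu < mu0: P(T >= t_obs; mu = mu0)."""
    return 1.0 - PHI((t_obs - mu0) / se)


# Hypothetical data, not taken from the post.
SIGMA = 1.0
Y1, Y2 = 2.0, -2.0

# T1 uses Y1 alone, T2 uses Y2 alone.
sev_t1 = sev_greater(Y1, 0.0, SIGMA)
sev_t2 = sev_less(Y2, 0.0, SIGMA)

# Combined statistic: the sample mean, with standard error sigma / sqrt(2).
ybar = (Y1 + Y2) / 2
se_bar = SIGMA / sqrt(2)
sev_bar_gt = sev_greater(ybar, 0.0, se_bar)
sev_bar_lt = sev_less(ybar, 0.0, se_bar)

if __name__ == "__main__":
    print(f"SEV(mu > 0; T1)   = {sev_t1:.4f}")
    print(f"SEV(mu < 0; T2)   = {sev_t2:.4f}")
    print(f"SEV(mu > 0; Ybar) = {sev_bar_gt:.4f}")
    print(f"SEV(mu < 0; Ybar) = {sev_bar_lt:.4f}")
```

    With these made-up numbers each single-observation statistic gives high severity to its own claim (about 0.977 each), while the combined statistic gives 0.5 to both, which is the pattern being argued over.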

  • September 19, 2013Joseph

    I agree, but SEV needs to be reinterpreted before Error Statisticians will agree.

    Here is the exact wording of the Severity Principle:

    Severity Principle (full). Data x0 (produced by process G) provides
    good evidence for hypothesis H (just) to the extent that test T severely
    passes H with x0.

    That most powerful test is just the Bayesian (automatically generated) sufficient statistic. So what you’re saying is that if you interpret statistics in terms of “information content” (which Mayo in particular has stated she thinks is a bizarre and unnecessary Jaynesian dead end), and you do the best job you can of extracting and using the information in the data, you get the Bayesian result. Plus you need to interpret it in the Bayesian way, because if you take the above Severity Principle as stated, without reinterpretation, it leads to nonsense.

    Sometimes Anon, I can’t tell whether you’re trying to defend Bayes or SEV.

  • September 19, 2013Anon

    I will try to elaborate on this further, but it will take a while. I have to finish a paper due next week, and I think I will write up my arguments as a longer PDF note, with simulations of some tricky examples that people see as classical statistics giving contradictory answers (and that are not trivial to solve; that is, common sense does not help at first, only after you analyze the problem carefully in terms of its frequentist properties). I think this is a worthy discussion, even though sometimes I think most of the problems in the discussion are communication problems (damn Wittgenstein!)

  • September 19, 2013Anon

    But just to defend Mayo (and my goal is not to defend her)

    Her words are ok, see

    Severity Principle (full). Data x0 (produced by process G) provides good evidence for hypothesis H (just) to the extent that test T severely passes H with x0.

    So what can we say to our problem?

    Data Y1 provides good evidence that for H:mu>0 for H mu>0 severely passes T1 with Y1.

    Data Y2 provides good evidence that for H:mu<0 for H mu0 if we only knew Y1. And Y2 is good evidence that mu<0 if we only knew Y2!

    Now, with Y1 and Y2

    Data (Y1,Y2) does not provide good evidence neither for H:mu0 because these do not severely pass T with (Y1, Y2).

    So severity is ok here. It tells you what it is supposed to tell you with the evidence you feed it. If you pretend you only have Y1, it is going to tell you (correctly) that Y1 is good evidence that mu>0 – and indeed it is. And so on.

  • September 19, 2013Anon

    ok, the text there is all messed up because it misunderstood the signs as HTML code.

    The thing is, in words.

    Data Y1 provides good evidence for the hypothesis that the mean is greater than zero because this hypothesis severely passes T1 with Y1.

    Data Y2 provides good evidence for the hypothesis that the mean is less than zero because this hypothesis severely passes T2 with Y2.

    And does this make sense? Yes, if we only knew Y1 it is ok to consider it evidence that the mean is greater than zero. And if we only knew Y2 it is ok to consider it evidence that the mean is less than zero.


    Data (Y1, Y2) does not provide good evidence for the hypothesis that the mean is different from zero because this hypothesis does not severely pass T with (Y1, Y2).

    And since you have the full data – and not just partial data – that is what the full evidence tells you, with the best test available.

    So severity is ok here. It tells you what it is supposed to tell you with the evidence you feed it. If you pretend you only have Y1, it is going to tell you (correctly) that Y1 is good evidence that mu>0 – and indeed it is. And so on.

  • September 20, 2013Joseph


    This is my last comment. See the latest post for more. I understood your point a long time ago, but what you don’t seem to understand is that this really does change SEV and what’s worse (from a frequentist perspective) it directly pushes SEV toward being a posterior probability. In particular, as soon as you start requiring things like:

    the same information can’t be used in two legitimate Severity analyses to draw contradictory conclusions

    then this drives you right towards digesting data using Bayes Theorem. For example, if you have two test statistics {T_1, T_2} which are informationally equivalent to {T_3, T_4} in the sense that either pair can be used to calculate the other, then we shouldn’t get that {SEV(H,T_1,X), SEV(H,T_2,X)} directly contradicts the results from {SEV(H,T_3,X), SEV(H,T_4,X)}.
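    As a hedged illustration of that requirement, the sketch below takes one hypothetical instance: two observations from N(mu, sigma^2) with sigma = 1, with {T_1, T_2} = {Y1, Y2} and {T_3, T_4} = {(Y1+Y2)/2, (Y1-Y2)/2}. The data values, the flat prior on a grid, and the function names are all assumptions for illustration. It checks that the posterior for mu comes out the same whichever pair the data are expressed in, so Bayes Theorem satisfies the consistency requirement automatically.

```python
from math import exp

# Hypothetical setup, not taken from the post.
SIGMA = 1.0
Y1, Y2 = 2.0, -2.0
GRID = [i / 100 for i in range(-500, 501)]  # candidate values of mu, flat prior


def normal_kernel(x, mean, var):
    """Unnormalized normal density; constants cancel after normalization."""
    return exp(-((x - mean) ** 2) / (2 * var))


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def posterior_from_raw(y1, y2):
    """Posterior on GRID using the raw pair (Y1, Y2), independent N(mu, sigma^2)."""
    var = SIGMA**2
    return normalize(
        [normal_kernel(y1, mu, var) * normal_kernel(y2, mu, var) for mu in GRID]
    )


def posterior_from_transformed(t3, t4):
    """Posterior on GRID using T3 = (Y1+Y2)/2 ~ N(mu, sigma^2/2) and
    T4 = (Y1-Y2)/2 ~ N(0, sigma^2/2), which are independent of each other."""
    var = SIGMA**2 / 2
    return normalize(
        [normal_kernel(t3, mu, var) * normal_kernel(t4, 0.0, var) for mu in GRID]
    )


T3, T4 = (Y1 + Y2) / 2, (Y1 - Y2) / 2
post_raw = posterior_from_raw(Y1, Y2)
post_transformed = posterior_from_transformed(T3, T4)
max_gap = max(abs(a - b) for a, b in zip(post_raw, post_transformed))

if __name__ == "__main__":
    print("largest difference between the two posteriors:", max_gap)
```

    The two posteriors agree up to floating-point error because the transformation is invertible and only rescales the likelihood by a constant.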

    Now you may not be bothered by equation, and I’m certainly not, but Error Statisticians are going to be all kinds of upset. You’re saving SEV by making it a posterior probability!
