Taking a break from Statistical Mechanics, I noticed that Corey Yanofsky, whom I respect a great deal, is starting a blog. Corey plans to explore Dr. Mayo’s Severity Principle, which he describes as the “strongest defense of frequentism I’ve ever encountered.” Cosma Shalizi, a similarly formidable statistician, is even more effusive.
I believe Dr. Mayo misunderstands error distributions and the basic facts concerning them (see here and here), but philosophy can be argued endlessly. It’s more productive to examine Corey’s (and Mayo’s) claim that “the severity principle scotches many common criticisms of frequentism”.
The key paper is this one, which uses the Severity function SEV(H, T, X) to answer 13 common criticisms (“howlers”) of frequentist methods. Here H is a hypothesis, T a test, and X the data. SEV is often abbreviated SEV(H) when T and X are understood. Unfortunately, Mayo based this entire discussion on the NIID example and its sufficient statistics. For this case, with $X_i \sim N(\mu, \sigma^2)$ and the one-sided test of $\mu \le \mu_0$ against $\mu > \mu_0$,

$$\text{SEV}(\mu > \mu_0) = P(\bar{X} \le \bar{x};\ \mu = \mu_0) = \Phi\!\left(\frac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma}\right)$$
Which is just the posterior probability $P(\mu > \mu_0 \mid x)$ under a flat prior! With a slight verbal change, Mayo’s paper is a more convincing defense of Bayesian posteriors than most Bayesians can muster. To illustrate, consider howlers #2 and #3:
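To see the agreement concretely, here is a quick numerical check. The sample size, $\sigma$, and observed $\bar{x}$ below are my own illustrative choices, not numbers from Mayo’s paper:

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical NIID setup (these numbers are mine, not from the paper):
# X_i ~ N(mu, sigma^2), one-sided test of mu <= mu0 against mu > mu0.
sigma, n, mu0 = 1.0, 25, 0.0
xbar = 0.3  # assumed observed sample mean

# Severity of the claim mu > mu0: P(Xbar <= xbar) computed at mu = mu0.
sev = phi(math.sqrt(n) * (xbar - mu0) / sigma)

# Flat-prior posterior: mu | x ~ N(xbar, sigma^2 / n),
# so P(mu > mu0 | x) is the same normal tail.
post = 1.0 - phi(math.sqrt(n) * (mu0 - xbar) / sigma)

print(sev, post)  # the two numbers agree exactly
```

Any other choice of $n$, $\sigma$, $\mu_0$, and $\bar{x}$ gives the same agreement, since both expressions are the identical normal tail.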
(#2) All statistically significant results are treated the same.
(#3) The p-value does not tell us how large a discrepancy is found.
Mayo considers the claim $\mu > \mu_1$ and shows how $\text{SEV}(\mu > \mu_1)$ gives a sense of the discrepancy $\mu_1 - \mu_0$. This is identical to observing $P(\mu > \mu_1 \mid x)$, or equivalently $\Phi(\sqrt{n}(\bar{x} - \mu_1)/\sigma)$, and using it to judge the size of the discrepancy, which is how Laplace did it two centuries ago.
So is SEV or Bayes fixing classical statistics?
To put SEV to a “severe” test we need an example where the two differ. Since Bayes’ Theorem automatically uses any sufficient statistic when one exists, let’s use non-sufficient statistics and see what happens.
Suppose $X_i \sim N(\mu, \sigma_i^2)$ and we take two observations: one noisy, with $\sigma_1 = 1$, and one very precise, with $\sigma_2 = 10^{-10}$. The actual observations turn out to be a noisy $x_1$ well above zero and a precise $x_2 \approx 0$.
Intuitively, we’d drop the inaccurate data point and say $\mu$ is very close to zero, to within a few parts in a billion. This is exactly what the Bayesian calculation does, since the posterior distribution is Normal with

$$\text{mean} = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} \approx x_2, \qquad \text{sd} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1/2} \approx \sigma_2$$
For example, the Bayesian gets $P(\mu > .1 \mid x) \approx 0$ to a ridiculous number of significant figures, since .1 is a billion standard deviations away from the accurate measurement.
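A numerical sketch of the Bayesian side: $\sigma_2 = 10^{-10}$ matches the parts-in-a-billion precision described above, while $\sigma_1 = 1$ and the noisy reading $x_1 = 2$ are illustrative choices of mine (any noisy reading well above zero behaves the same way):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# sigma2 = 1e-10 matches "a few parts in a billion"; sigma1 = 1 and the
# noisy reading x1 = 2.0 are illustrative assumptions.
sigma1, sigma2 = 1.0, 1e-10
x1, x2 = 2.0, 0.0

# Flat-prior posterior for mu is Normal with precision-weighted mean:
w1, w2 = 1.0 / sigma1**2, 1.0 / sigma2**2
post_mean = (w1 * x1 + w2 * x2) / (w1 + w2)  # ~0: dominated by x2
post_sd = (w1 + w2) ** -0.5                  # ~1e-10

# P(mu > .1 | x): the point .1 sits ~1e9 posterior sds above the mean.
tail = 1.0 - phi((0.1 - post_mean) / post_sd)
print(post_mean, post_sd, tail)  # tail is numerically 0
```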
Using the natural, but non-sufficient, test statistic $T = (X_1 + X_2)/2$ and data $x = (x_1, x_2)$ yields an entirely different outcome. Since $T \sim N(\mu, (\sigma_1^2 + \sigma_2^2)/4)$, with $H: \mu > .1$ we get

$$\text{SEV}(\mu > .1) = P(T \le t_{\text{obs}};\ \mu = .1) = \Phi\!\left(\frac{t_{\text{obs}} - .1}{\sqrt{(\sigma_1^2 + \sigma_2^2)/4}}\right)$$

which is close to 1 for a noisy $x_1$ well above zero, and so implies, according to Mayo, that the data is good evidence for $\mu > .1$.
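Here is the SEV side under the same assumed numbers (again $\sigma_1 = 1$, $x_1 = 2$, $x_2 = 0$ are my illustrative choices):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same illustrative data as the Bayesian sketch: x1 = 2.0 with sigma1 = 1,
# x2 = 0.0 with sigma2 = 1e-10 (x1 and sigma1 are assumptions).
sigma1, sigma2 = 1.0, 1e-10
x1, x2 = 2.0, 0.0

# Non-sufficient statistic: the unweighted average, with sampling
# distribution T ~ N(mu, (sigma1^2 + sigma2^2) / 4).
t_obs = (x1 + x2) / 2.0
sd_t = math.sqrt((sigma1**2 + sigma2**2) / 4.0)

# Severity of H: mu > .1, i.e. P(T <= t_obs) evaluated at mu = .1.
sev = phi((t_obs - 0.1) / sd_t)
print(sev)  # ~0.96: SEV scores this as strong evidence for mu > .1
```

So the same data that make the posterior probability of $\mu > .1$ vanishingly small earn a high severity score for $\mu > .1$ once a non-sufficient statistic is allowed, because the noisy observation dominates the unweighted average.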
This is far from the only embarrassment for SEV, but I won’t run up the score by mentioning the others. At this rate “Error Statistics” will turn everyone into Bayesians, and where’s the fun in that? What’s weird, though, is that no one thought to run an elementary check on it. It’s almost as though “SEV” were accepted on religious grounds.
I don’t want to be entirely negative, so let me finish on a positive note by echoing Corey & Cosma: Dr. Mayo’s SEV is the strongest defense of frequentist statistics out there.
UPDATE: Mayo considers it a key selling point of Error Statistics over Bayesian methods that you can use any T to probe a hypothesis H. By probing H with multiple T’s you get a better sense of whether it’s true. Whatever T you use, you should get results consistent with the truth or falsity of H, at least for any reasonable data.
All I did was evaluate this claim. For some T, the SEV answer is identical to the Bayesian posterior. So I looked at the first T that makes SEV differ from the posterior. What I found is that this T gets it exactly wrong. Note: it doesn’t produce a weakened or less useful form of the intuitive conclusion; it completely contradicts the correct answer. It says the data provides strong evidence for H, when in fact it doesn’t. This doesn’t mean SEV performs poorly compared to Bayes. It means SEV is wrong regardless of what the Bayesian answer is.