The Amelioration of Uncertainty

## Cox’s Theorem and Mayo’s Error Statistics

This post will take a different tack. Rather than criticize the Severity Principle, I will attempt to patch it up. But as we try to fix problems with SEV, we’ll run up against Cox’s Theorem:

A measure of evidence like SEV will either be equivalent to using probabilities or have serious problems.

The mathematics of Cox’s Theorem isn’t in doubt, but it’s not clear the conditions of the theorem apply to SEV. So this makes for an interesting struggle: two philosophies put to a mathematical test.

If Cox’s Theorem does apply, then as Frequentists patch up their methods, they should move ever closer to posterior probabilities. Each new proposal should produce new problems, which in turn require new fixes. And with each iteration Frequentist methods get closer to being probabilities for hypotheses.

SEV is not the first iteration. SEV was created to fix previous iterations like p-values and is much closer to posterior probabilities as a consequence. Specifically, we know that in the simplest one-sided tests of a normal mean, SEV already coincides numerically with a posterior probability under a flat prior.

Which brings me to the first fix. Suppose we have two data points $y_1, y_2$ from an IID Cauchy distribution with pdf:

$$p(y \mid \mu) = \frac{1}{\pi \left[ 1 + (y - \mu)^2 \right]} \qquad (1)$$

The realized values $y_1$ and $y_2$ are symmetric about zero. Given this data, what can we say about $\mu$? Care must be taken because there are no sufficient statistics for the Cauchy distribution. No matter what test statistic $T$ we use, there will be information in the data, not included in $T$, which may be relevant to $\mu$. Therefore if we don’t think carefully about how to combine results from different tests, there will always be data which makes the results from a single test look absurd.

So consider the following three test statistics:

$$T_1 = \max(y_1, y_2), \qquad T_2 = \tfrac{1}{2}(y_1 + y_2), \qquad T_3 = \min(y_1, y_2)$$

Each is a legitimate univariate summary with a well-defined sampling distribution, and none of their values is equivalent to knowing the full data $(y_1, y_2)$. With these we get, depending on the statistic, apparent strong evidence for $\mu > 0$ (from $T_1$), apparent strong evidence for $\mu < 0$ (from $T_3$), or support for neither (from $T_2$).
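To make the setup concrete, here is a minimal sketch (mine, not from the post) computing the three statistics identified in the comments below as max, mean, and min. The data values are illustrative stand-ins for a symmetric two-point sample:

```python
# Hypothetical symmetric two-point sample; -1 and +1 are illustrative
# stand-ins chosen symmetric about zero, as in the post.
y1, y2 = -1.0, 1.0

T1 = max(y1, y2)       # large positive value: seems to point towards mu > 0
T2 = (y1 + y2) / 2.0   # the sample mean: exactly zero for symmetric data
T3 = min(y1, y2)       # large negative value: seems to point towards mu < 0

print(T1, T2, T3)  # 1.0 0.0 -1.0
```

Each statistic taken alone tells a different story about the sign of $\mu$, which is exactly the conflict the post exploits.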

Let’s interpret these results according to the Severity Principle (page 162):

Severity Principle (full). Data x0 (produced by process G) provides good evidence for hypothesis H (just) to the extent that test T severely passes H with x0.

So depending on the test, we can support $\mu > 0$ or $\mu < 0$ or neither. If you performed all three you wouldn’t know what to think, but if you performed $T_2$ plus one of the others you’d wrongly say there’s strong evidence for either $\mu > 0$ or $\mu < 0$.

If you performed $T_1$ and $T_3$ you’re really confused, because they directly contradict each other, yet they’re perfectly symmetrical problems. Anything claimed about one test applies with equal force to the other. There’s no difference you could use to decide which is right.

To make the problem more acute, suppose someone looked at the two tests $T_1, T_3$ while someone else used $T_2$. Since $(T_1, T_3)$ can be used to calculate $T_2 = \tfrac{1}{2}(T_1 + T_3)$, they should arrive at consistent conclusions, but one is incomprehensible and the other is ambiguous. Even worse, the less complete data, $T_2$, is the one that gives better results.

As a way out, let’s introduce a notion of “informational content”. Instead of interpreting SEV as above, substitute:

Partial information t = T(x0) is supportive evidence for H just to the extent that, if we only knew t, then H passes a severe test based only on that information.

In this way those Severity results above aren’t in contradiction. They’re evaluating H using different states of information, so it’s no big surprise they’re sometimes inconsistent. If all you knew was $T_3$, then $\mu < 0$ is a reasonable inference to draw.

In this way we should replace $SEV(H)$ with $SEV(H; T)$ and understand that any conclusions drawn from it are actually conditional on $t = T(x_0)$, as though the rest of the data were unknown. We are still left with the problem of how to combine test results, which is essential in this case because no single test statistic uses everything of value in $(y_1, y_2)$. But it clears some things up.

This brings SEV one step closer to a conditional probability $P(H \mid t)$ and immediately drives us towards the question “how do we exploit all the information in the data?”, since you can’t easily compare results based on different information. This gives Bayesians home field advantage, but I don’t see any way around that.
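For comparison, here is a rough stdlib-only sketch (my own, not from the post) of what the full-data analysis looks like: the flat-prior posterior probability that $\mu > 0$ given both observations. The data values are illustrative, and the improper flat prior is simply the most neutral choice:

```python
import math

def cauchy_pdf(y, mu):
    # Cauchy density with location mu and unit scale
    return 1.0 / (math.pi * (1.0 + (y - mu) ** 2))

def prob_mu_positive(data, lo=-100.0, hi=100.0, n=200001):
    # Flat (improper) prior: posterior is proportional to the likelihood.
    # Crude rectangle-rule integration over a wide grid.
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        mu = lo + i * h
        lik = 1.0
        for y in data:
            lik *= cauchy_pdf(y, mu)
        den += lik * h
        if mu > 0.0:
            num += lik * h
    return num / den

print(prob_mu_positive([-1.0, 1.0]))  # approximately 0.5, by symmetry
```

For a sample symmetric about zero the answer is one half: equal, weak support for both signs, which is the “weak evidence for both” verdict the full data ought to deliver.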

If Error Statisticians find this acceptable, then maybe it’s also acceptable to introduce the following principle?

Two different legitimate Severity analyses, based on the same information, shouldn’t contradict each other.

Maybe that’s the subject of another post.

September 19, 2013
• September 20, 2013 Corey

I really like how this series of posts takes the statement of the severity principle at its word and explores the consequences. It’s possible that even Mayo would agree with your rephrasing, since she often insists that severity is to be evaluated relative to a test.

On this specific example, treating the data symmetrically basically demands the use of $(y_1 + y_2)/2$ as a univariate data summary. (Since $\mu$ is a location parameter, blah blah blah…) Even the posterior mean is equal to $(y_1 + y_2)/2$. It’s more interesting to try Cauchy with three data points. Sample median, sample mean(!), MLE, posterior mean or median under a flat prior… I’m sure you can get some contradictions out of that set of statistics, especially since you can make the data as extreme as you like without anyone blinking an eye.
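Corey’s posterior-mean claim can be checked numerically; a sketch under a flat improper prior (the asymmetric data values 0.3 and 2.7 are made up for illustration):

```python
import math

def cauchy_pdf(y, mu):
    return 1.0 / (math.pi * (1.0 + (y - mu) ** 2))

def flat_prior_posterior_mean(data, lo=-200.0, hi=200.0, n=400001):
    # Posterior proportional to the likelihood; crude grid integration.
    h = (hi - lo) / (n - 1)
    norm = mean = 0.0
    for i in range(n):
        mu = lo + i * h
        lik = 1.0
        for y in data:
            lik *= cauchy_pdf(y, mu)
        norm += lik * h
        mean += mu * lik * h
    return mean / norm

m = flat_prior_posterior_mean([0.3, 2.7])
print(m)  # close to the sample midpoint (0.3 + 2.7) / 2 = 1.5
```

The two-point Cauchy likelihood is symmetric about the midpoint of the data, so under a flat prior the posterior mean lands exactly there.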

• September 20, 2013 Anon

Just to make it clear, the statistics used were:

```
# sample (y1, y2) from a Cauchy
T1 = max(y1, y2)
T2 = mean(y1, y2)
T3 = min(y1, y2)
```

From what I have calculated here, if I got it right, all three tests have almost zero power to detect negative mu’s (it tends to zero as mu becomes more negative). So none of the three should be used to probe any hypothesis concerning mu<0.

From these three tests, you could say that they provide good evidence that mu<6.

• September 20, 2013 Joseph

Corey,

I think you could take that idea and really run with it, especially since there is no “mean”, only a median. So I bet you could get all kinds of anomalies by looking at sample means versus sample medians.

I think, though, I’m going to look at cases, possibly with a different distribution, in which t1,t2 are informationally equivalent to t3,t4 but SEV comes to different conclusions. This is going to be directly related to Bayes’ theorem, and it can’t be avoided because Mayo has made such a big philosophical deal out of piecemeal testing of hypotheses.

In a similar way, the Cauchy is ideal for examining SEV(H,T,X), since for any single statistic you can create data X which contains information highly relevant to H but which is left out of the statistic t=T(X). This becomes a bigger problem, as you note, when there are lots of data points. This will force Error Statisticians to squarely face how evidence from multiple tests is to be combined, and that drives everything towards Bayes’ theorem.

• September 20, 2013 Joseph

Anon,

I don’t understand your comment. SEV for the $T_3$ used strongly implies $\mu < 0$. So how can you say it has zero power to detect this, when it did actually detect this? $T_3$ tends to negative infinity as $\mu$ becomes more negative.

The right answer, according to the Severity Principle, would be for SEV to say “neither $\mu > 0$ nor $\mu < 0$ passed a severe test”. If SEV always said that, then it wouldn’t need reinterpreting. It only needs to be reinterpreted because it sometimes directly contradicts this.

• September 20, 2013 Anon

Sorry, you are right, there was a problem in my code. Let me run it again.

• September 20, 2013 Joseph

Anon,

I chose the pdf and data to both be symmetric about 0 for a reason. That way the evidence for $\mu > 0$ is exactly the same as for $\mu < 0$. So either SEV (or anything else, like a posterior probability) has to say “there’s strong evidence for both”, which is a direct contradiction, or “there is weak evidence for both”.

The only one that makes sense is “weak evidence for both”. So making everything symmetrical allows us to see intuitively what the correct answer should be.

• September 20, 2013 Anon

Ok,

The problem was that I was doing one-sided tests, so of course they would have no power in the other direction.

With the corrected code:

In this case the mean is actually the least powerful test when compared to the other two, in both directions. So we should not use it to detect discrepancies on either side.

Now, the min is the best test for negative discrepancies from H0, so we should use it to infer which H1′s have been probed and which ones have not in that direction.

And the max is the best test for positive discrepancies from H0, so we should use it to infer which H1′s have been probed and which ones have not in that direction.

Let’s do this:

To detect possible discrepancies in the negative side, given our options, we should use the min.

Depending on the threshold you use, you can be more severe or not about the evidence required. Using 0.9 as a threshold, we could say that the evidence is not sufficient to claim mu < -1, but we can’t rule out mu in the range [-1, 0].

Combining both tests we can say that we have evidence so far which is consistent with mu in the range [-1, 1].

Note that this process is consistent; that is, when n -> infinity, the power of the min and max tends to 1, the power of the mean stays close to zero (which shows that it is a bad test), and SEV for these discrepancies tends to zero, leading us to say that mu lies in an ever narrower range around zero.

• September 20, 2013 Anon

Ok, this HTML thing is annoying. I always forget, we can’t use the symbols.

Rewriting.

The min is the best test for negative discrepancies from H0, so we should use it to infer which H1′s have been probed and which ones have not in that direction.

And the max is the best test for positive discrepancies from H0, so we should use it to infer which H1′s have been probed and which ones have not in that direction.

The min does not provide good evidence for mu less than -1.

The max does not provide good evidence for mu greater than 1.

Combining both tests we can say that we have evidence so far which is consistent with mu in the range [-1, 1].

Note that this process is consistent; that is, when n -> infinity, the power of the min and max tends to 1, the power of the mean stays close to zero (which shows that it is a bad test), and SEV for these discrepancies tends to zero, leading us to say that mu lies in an ever narrower range around zero.

• September 20, 2013 Anon

Ok, this is embarrassing, I had another typo in my code. I can’t do this in a rush; I had better revise before writing!

• September 20, 2013 Anon

This is a good example Joseph.

It seems that all statistics (and tests) are pretty lousy and are not consistent when n -> infinity.

What would be the maximum likelihood statistic for mu? Do you have code for that?

• September 20, 2013 Anon

I liked this post, I think we are getting somewhere here.

I will have to work more on my code to understand the tests’ properties under H1 and as n increases.

But, before that, let me throw some food for thought.

We can see that when n=2, neither test dominates the others. The min is the most powerful for mu in [-infinity, 0], the max is the most powerful for mu in [0, +infinity].

So neither test is better to probe all H1′s, like in the other examples we were working on.

So we should not expect consistency of SEV and we should interpret each test in its best domain.

That says to us that [-1, 1] cannot be ruled out.

But that is considering that those were good tests to begin with.

Now, from what I have been working on here in the simulations, it seems that actually all 3 tests are pretty bad as n grows to infinity. If that is the case, we actually should not draw any inferences from them.

• September 20, 2013 Joseph

Anon,

Of course it’s not a problem. You can also use latex (with the dollar signs) just by beginning the post with [latexpage], including the brackets.

The key thing about not having sufficient statistics is that there’s always potentially relevant info in the data which isn’t being used by the statistic.

So for any T, there will be X and X’, such that they imply very different things for some H, but T(X)=T(X’).
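Joseph’s point can be illustrated with T = the sample mean (a sketch of mine; the two datasets are invented). X = (0, 0) and X′ = (−5, 5) have the same value of T, yet under a flat prior they support “mu is near 0” to very different degrees:

```python
import math

def cauchy_pdf(y, mu):
    return 1.0 / (math.pi * (1.0 + (y - mu) ** 2))

def prob_mu_between(data, a, b, lo=-100.0, hi=100.0, n=200001):
    # Flat-prior posterior probability that a < mu < b (grid sketch).
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        mu = lo + i * h
        lik = 1.0
        for y in data:
            lik *= cauchy_pdf(y, mu)
        den += lik * h
        if a < mu < b:
            num += lik * h
    return num / den

# Same value of T(X) = mean(X) = 0 for both, very different implications:
p_tight = prob_mu_between([0.0, 0.0], -1.0, 1.0)   # posterior piles up near 0
p_split = prob_mu_between([-5.0, 5.0], -1.0, 1.0)  # posterior bimodal near +-5
print(p_tight, p_split)
```

The first posterior concentrates around zero while the second is bimodal around ±5, so knowing only the mean throws away exactly the information that distinguishes the two datasets.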

That’s why I say, Error Statistics is forced to consider multiple tests and, more importantly, has to have a consistent way of combining results from those tests.

In Bayesian statistics, we don’t just say P(H|t1,t2) = F[ P(H|t1), P(H|t2) ] for some function F[]. And we don’t do this for a good reason: it’s easy to show that it leads to nonsense.

But that’s exactly the situation the Error Statistician is in. They’re trying to get an overall assessment from t1,t2 in terms of some F[ SEV(H,t1), SEV(H,t2) ]. F[] hasn’t been explicitly stated, and has only been used in a qualitative way, but it doesn’t matter: any F[] is wrong. And yet that’s the essence of Mayo’s piecemeal testing approach.

• September 20, 2013 Anon

Joseph, I think my code is ok now.

Well, our test statistic min(sample) is terrible.

It seems that power converges to the significance level of 5% (or less) for detecting a discrepancy as negative as you want.

(I have simulated up to n=10,000; my code is still too slow.)

So there is not much we can really do with this test, because it is not going to lead us anywhere.

• September 20, 2013 Anon

Severity calculation would be meaningless here.

• September 23, 2013 Anon

Joseph, could you state one Bayesian solution here? What prior would you use with the Cauchy?

I have not had time to work on this problem yet. I’m not used to working with the Cauchy, but I will as soon as I can.

• September 23, 2013 Daniel Lakeland

Anon: you have the likelihood for IID Cauchy data as the product $\prod_i p(y_i \mid \mu)$, where $p$ is the Cauchy pdf. You can use any prior that makes sense based on the information you have before you see the data. If you’re trying to calculate an example, try a normal prior for $\mu$ centered a bit off the right value but with a large standard deviation, to allow for a lot of uncertainty prior to seeing data.

• September 23, 2013 Daniel Lakeland

You’ll also need a prior for the scale parameter. You could make it something like exponential with mean 1/200 also. Stan will do the sampling for you with about a 10-line Stan model file.

• September 23, 2013 Daniel Lakeland

err, sorry exponential with mean 200, scale 1/200
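In place of Stan, here is a stdlib grid-approximation sketch along the lines Daniel describes. The flat prior on mu (standing in for his normal prior, whose numbers were lost), the grid ranges, and the data are all illustrative choices of mine:

```python
import math

def cauchy_pdf(y, mu, scale):
    # Cauchy density with location mu and scale parameter
    z = (y - mu) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def posterior_mean_mu(data, scale_prior_mean=200.0):
    # Flat prior on mu, exponential(mean = scale_prior_mean) prior on scale;
    # posterior mean of mu by brute-force summation over a 2-D grid.
    mu_grid = [i * 0.05 for i in range(-400, 401)]     # mu in [-20, 20]
    scale_grid = [0.05 * (i + 1) for i in range(200)]  # scale in (0, 10]
    total = weighted = 0.0
    for mu in mu_grid:
        for s in scale_grid:
            w = math.exp(-s / scale_prior_mean) / scale_prior_mean  # prior on s
            for y in data:
                w *= cauchy_pdf(y, mu, s)
            total += w
            weighted += mu * w
    return weighted / total

print(posterior_mean_mu([-1.0, 1.0]))  # approximately 0 for symmetric data
```

A grid this crude is only a sanity check; for anything serious the MCMC route Daniel suggests is the right tool.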

• October 12, 2013 Corey

Just FYI: Mayo’s stated several times in places I’m too lazy to link that $\mathrm{SEV}(\mu > 0)$ and $\mathrm{SEV}(\mu < 0)$ can both be low. That said, I conjecture that if $H$ and $H'$ exhaust the parameter space then $\mathrm{SEV}(H) + \mathrm{SEV}(H') \le 1$ does indeed follow.

• October 12, 2013 Joseph

I know she said that, but look at the equation right under the graph at the top of page 172 of her own paper:

http://www.phil.vt.edu/dmayo/personal_website/Error_Statistics_2011.pdf

Actually, Corey, I think we are overestimating how seriously Mayo takes her own mathematics. If you notice, she never makes any definite mathematical statement or comment on her blog above intro-to-stats/freshman-calculus level. The closest she comes to using any advanced math is propositional calculus, which I was able to teach to my 10 year old. Remember that time she had no idea what the delta function was on Wasserman’s blog? I believe she has a weak freshman-calc understanding of mathematics at best.

I don’t think that’s a disqualification. Mathematical Statisticians have been whacking away at the Bayesian/frequentist divide for a long time without making much progress, probably because the issues are almost entirely philosophical. So she has at least as good a chance as anyone else of making a contribution, in my opinion, maybe even a greater chance.

Nevertheless, it’s important to remember that her math background must be extraordinarily weak compared to, say, a typical graduate student in one of the hard sciences.

• October 12, 2013 Corey

Joseph,

I agree with your assessment. To judge Mayo’s math knowledge I go by the math content in her solo-authored work. It’s pretty clear that her co-authors are responsible for most of the explicit math in her jointly authored papers (particularly the ones with Spanos).