## Bayes is the Foundation of Reproducible Science

While working on the definition of probability post, I saw Gelman’s advertisement subtitled “Can we use Bayesian methods to resolve the current crisis of unreplicable research?” No doubt Gelman has reasonable and constructive points to make. So let me be unreasonable: Bayes has within it the capability to destroy Frequentist methods when it comes to reproducible science.

In the usual view a distribution is the shape of a histogram of many x’s observed for fixed θ. This restricts us to cases where many x’s are possible, but even worse, it presupposes there is a stable shape to the histogram. That is an extraordinarily strong physical assumption, and it is usually wrong.

Thus right from the outset Frequentism works against reproducible science. It’s likely most statisticians of Gelman’s type think Bayesians can tweak this picture a little, help out here and there, but not fundamentally challenge it. There is an alternative though.

A distribution P(x|θ) can instead be thought of as merely a way to locate those x’s compatible with θ. Loosely speaking, any x in the high probability manifold is consistent with the parameters. If we’ve modeled this right then the x_{obs} we actually see will be there as well.

This is meaningful even if only one x ever exists, which allows us to create distributions for one-off events. But it allows far more than that. Suppose we have a function f(x) which has very small variance over P(x|θ). Then the expected value of this function,

(1)   F(θ) = ∫ f(x) P(x|θ) dx

has an interesting meaning under this interpretation. The small variance means almost all x’s compatible with θ lead to f(x) ≈ F(θ). Since x_{obs} is one of those x’s compatible with the observed parameters, we’ll find that it satisfies this relation too.

In other words, there will appear in the laboratory to be a functional relation or law of nature connecting the inputs to the outputs of this analysis:

(2)   f(x_{obs}) ≈ F(θ)

But there’s more. The small variance is predicting this relationship will be highly reproducible. If we repeat this experiment using the same parameters then we’ll get a different x_{obs}. But this new value will still be in the high probability manifold, which means we’ll usually get f(x_{obs}) ≈ F(θ) again.

But there’s still more. This reproducibility usually holds even when the histogram of all those x_{obs}’s doesn’t resemble P(x|θ) at all. It can be wildly different and unstable in fact. The only requirement is that the x_{obs}’s stay inside the high probability manifold. Most such wild deviations of the histogram from the distribution will make (2) more reproducible, not less!
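A minimal simulation makes this concrete (the model, the function f, and Nature’s alternative distribution are all invented for illustration). Under an iid standard normal model in n dimensions, take f(x) to be the mean of the squared coordinates, so F = 1 with tiny variance. If Nature instead hands us ±1 coordinates, the histogram is two spikes and looks nothing like a bell curve, yet the data sit exactly on the model’s typical-set sphere of radius √n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Model: P(x) = iid standard normals in n dimensions.
# f(x) = mean of squared coordinates; E[f] = F = 1, Var[f] = 2/n (tiny).
def f(x):
    return np.mean(x ** 2)

# Repeated experiments under the model: f is highly reproducible.
model_runs = [f(rng.standard_normal(n)) for _ in range(5)]

# "Nature" actually produces ±1 coordinates -- a histogram of two spikes,
# nothing like a bell curve -- yet each such vector has squared norm
# exactly n, i.e. it sits on the model's typical-set sphere.
nature_runs = [f(rng.choice([-1.0, 1.0], size=n)) for _ in range(5)]

print(model_runs)   # each close to 1.0
print(nature_runs)  # each exactly 1.0
```

The ±1 runs give f(x_{obs}) = 1 exactly: a histogram wildly different from the model distribution that makes (2) more reproducible, not less.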

That’s the fundamental technical fact most statisticians just can’t seem to digest, and it has held back applied statistics more than any other single failing.

Strictly speaking though, (2) is just a prediction. The relation holds for most x’s in the high probability manifold but not quite all. You might even say the law it represents is “highly probable” rather than required. If you check it in the laboratory, you’ll either verify it or you’ll discover something even more important.

Any consistent failure of (2) means those x_{obs}’s are being confined by Mother Nature to a very small set of exceptional cases in the high probability manifold. So you’ve just discovered an important new physical effect. By incorporating this new effect you’ll get a still better P(x|θ), which allows accurate predictions of even more reproducible relationships like (2).

Bayes is the most powerful tool we have for the prediction and discovery of reproducible results. All you need do to exploit it is to forget you were ever told probabilities are frequencies.

December 13, 2013 · konrad


There is also a frequency-based route to discover additional constraints: whenever the probability distribution differs from the frequency distribution this difference already points at an additional constraint (assuming the histogram is based on enough points that you can’t explain the difference as a sampling effect). In applications where the frequency distribution can be constructed this should be a more powerful indicator than any specific F (F is less informative than the frequency distribution, so you need to be luckier to find a violation of F than to find a difference between the probability and frequency distributions).

This is really just what we do in ordinary likelihood methods:

1) construct a model P(X|theta) – this is the probability distribution

2) evaluate the probability of the observed data P(D|theta) – this is related to the frequency distribution, if D contains repeated observations of X

3) if the observed data fit poorly (e.g. because D does not span the high-likelihood region of X), expand the model.
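A toy sketch of these three steps, with all specifics (the normal model, the shifted data) invented for illustration: fit quality is judged by comparing the model’s average log-likelihood on the data to that of an empirically refitted version, and a large gap signals that the model should be expanded.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1) Model P(X|theta): a standard normal, mu=0, sigma=1.
# 2) Observed data D -- here drawn from a shifted distribution,
#    so the frequency distribution disagrees with the model.
D = rng.normal(loc=2.0, scale=1.0, size=500)

def avg_loglik(x, mu, sigma):
    # Average log-density of a normal(mu, sigma) over the data.
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# 3) Compare the fixed model to a refit on the data's own histogram.
model_ll = avg_loglik(D, 0.0, 1.0)
fitted_ll = avg_loglik(D, D.mean(), D.std())
print(model_ll, fitted_ll)  # a large gap says: expand the model
```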

December 13, 2013 · Joseph (author)


Konrad, f doesn’t have to be a scalar. It could be a vector, which means it could itself represent a histogram. The space of the X’s would then usually be a product space of some kind.

This is just a very special case of what I was talking about above, and to the extent it’s correct it isn’t something different. Note though, although it deals with frequencies, it’s not really Frequentist. That probability distribution on the product space ISN’T a frequency distribution and the comments in the paragraph beginning “But there’s still more …” apply in full force.

December 13, 2013 · konrad


Agreed – this is the opposite of frequentism because it emphasizes the difference between frequency and probability. But do you agree that a standard likelihood approach focussing on model expansion (the sort of methodology championed by Gelman) can be seen as fitting into your framework?

December 14, 2013 · Joseph (author)


(response 1)

Konrad,

My reply will be long winded, so I’ll split it up into a series of comments. I thought about turning it into a post, but even my general interest posts don’t draw much interest, so I’ll keep this in the comments.

The first problem is that there are several nearly mathematically identical procedures in stats, which actually have very different goals and justifications. Stat is plagued by this phenomenon like no other subject. One reason for this is that there are essentially an infinite number of tricks for constructing a P(x) which puts x_{obs} in the high probability manifold.

Obviously one strategy is to conjure a P(x) and check to see where x_{obs} is. If it’s not in the high probability region, then replace P(x) with a more diffuse P’(x) that has a bigger high probability region. Keep doing this until all the x_{obs} are in the high probability region. That’s one fast and loose strategy which will obviously work sometimes.
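As a toy sketch of this diffusing strategy (the zero-mean normal model and the 99% cutoff are my own illustrative choices):

```python
import numpy as np

def widen_until_covered(x_obs, sigma=1.0):
    # For a zero-mean normal model, the 99% high probability region is
    # roughly |x| < 2.576 * sigma. Keep replacing P(x) with a more
    # diffuse P'(x) until every observation falls inside that region.
    while np.max(np.abs(x_obs)) > 2.576 * sigma:
        sigma *= 1.5  # diffuse the model and check again
    return sigma

x_obs = np.array([0.3, -1.2, 8.0])  # one point far outside N(0, 1)
sigma = widen_until_covered(x_obs)
print(sigma)  # widened until 8.0 lies inside the 99% region
```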

In some instances you could cut to the end and simply expand the high probability region to the maximum amount possible. This strategy usually goes under the name “The Maximum Entropy Principle”.

A related strategy would be to simply observe a large number of x_{obs} and match the high probability region of P(x) to the high density region of the x’s. Then hope that new x’s will be in the same area as the old ones. Most statisticians think this will only work if the histogram of x’s remains approximately stable over time and matches P(x), but that’s not so. It’s sufficient that the x_{obs} stay in the same general area.

Unfortunately, to make this work reliably you have to know that the x_{obs} will stay in the same area, which usually requires significant background domain knowledge. But people are most tempted to apply this strategy precisely when they don’t have much domain knowledge. Hence the crises of non-reproducible results in life and social sciences.

The Maximum Entropy Principle can, in some instances, avoid this conundrum by increasing the high probability region so much the x_{obs} almost have to be located there, or even are guaranteed to be located there. Whenever you can pull this off, it’ll tend to give much better results than simply praying that new x_{obs} look like the old ones (which is what most applied statisticians do in effect).

There is a tradeoff though. The more spread out P(x) is the less likely it is that the variance of F(x) will be small. In other words, maxent distributions tend to make fewer definite predictions like (2), but those predictions tend to be more reliable in the laboratory.
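Jaynes’ loaded-die problem is a tiny concrete instance of the maxent strategy (the mean constraint of 4.5 is chosen purely for illustration): among all distributions on faces 1–6 with a given mean, the maximum entropy one is exponential in the face value, with the multiplier fixed numerically by the constraint.

```python
import numpy as np

faces = np.arange(1, 7)

def mean_for(lam):
    # Mean of the exponential-family distribution p_k proportional to
    # exp(lam * k); this is increasing in lam.
    w = np.exp(lam * faces)
    return np.sum(faces * w) / np.sum(w)

# Bisect for the multiplier that satisfies the mean-4.5 constraint.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) < 4.5:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p = np.exp(lam * faces)
p /= p.sum()
print(p)           # probabilities skewed toward the high faces
print(p @ faces)   # constrained mean, close to 4.5
```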

Sometimes though, our object of study is not the individual x’s, but rather the frequency histogram. So it’s worth looking at that special case. Being a frequency rather than a generic variable gives some definite structure to the problem, which can be exploited to draw some general conclusions.

That’s the subject of the following comments.

December 16, 2013 · Rasmus Bååth


Really looking forward to that article on probability you’re working on. Hoping it will be something I can refer people to when I make the claim that probability and relative frequency are never the same “thing”.

December 16, 2013 · konrad


Joseph: I don’t think you answered my question: do you think model expansion (e.g. Gelman-style) counts as a special case of the constraint-discovering process you described in the original post?

The idea is (a) that models are expanded by essentially comparing probability distributions to frequency distributions and adding model parameters when the two are appreciably different, and (b) an expanded model allows us to learn more constraints, of the type discussed in the original post.

Your overall framework will be palatable to a much larger audience if you can relate it to standard methodology.

December 16, 2013 · Joseph (author)


Konrad, no I haven’t. Christmas chaos has delayed my follow-on comments. They are coming. It’s embarrassing though to hear you call it “my” overall framework. I got this post directly from Jaynes.

December 17, 2013 · Joseph (author)


Konrad, I think I’ll put it out as a post. It’s definitely worth laying out in detail and it keeps getting longer and longer.

December 17, 2013 · Daniel Lakeland


Looking forward to it. Merry Christmas, keep sending out those blog post presents.

December 17, 2013 · konrad


Yep, looking forward to it. Joseph: it’s your framework in the sense that you are currently putting time and effort into promoting it.