## The Wages of Philosophical Sin are Research Death

The last post about Mayo’s Severity Principle got me thinking about that xkcd cartoon which generated so much hate-and-discontent among Frequentists. I didn’t care for it because all Frequentists can say in response is “we’re not that dumb”; which is a reasonable point and instantly sends the debate in worthless directions.

Nevertheless, I exploited the joke in my last post. In the cartoon the dice are obviously irrelevant to the sun exploding. So to confuse the issue I created an inaccurate data point that was relevant in principle, but by so little that it should be treated like those dice rolls.

Since Frequentist procedures use only the sampling distribution, I made the conclusions sensitive to sampling variations in that irrelevant data point. The Bayesian posterior avoids this by considering one outcome (the data) and looking at different . So without having to think about estimators at all, let alone the “best” estimator, the posterior somehow magically throws away the irrelevant data.

But this is still something of a mystery. Remember Bayesians didn’t invent posteriors to achieve this goal. Unlike ad-hoc Frequentist procedures which are designed for a specific purpose, the Bayesian procedure is a mindless application of the trivial product rule. So how did this property creep into Bayes if no one put it there?

To understand it better, consider maximum entropy with a single constraint, taken relative to a reference distribution. A Lagrange multiplier is introduced to satisfy the constraint. One consequence is that if the constraint is already satisfied by the reference distribution, then the multiplier is zero and the constraint drops from the problem.

This is handy, since you needn’t worry about whether a constraint is relevant or a repeat of knowledge already in the reference distribution. Just include it and it’ll drop out automatically if it’s not needed. The similarity between this phenomenon and that Bayesian posterior is obvious.
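A sketch of the standard maximum entropy result behind this drop-out property. The notation is an assumption made here for illustration: $m(x)$ is the reference distribution, $f(x)$ the constrained function, $F$ its required mean, and $\lambda$ the multiplier, with the sign convention $e^{-\lambda f}$.

```latex
% Maximize  -\int P(x) \ln\frac{P(x)}{m(x)}\,dx  subject to  \int P(x) f(x)\,dx = F
P(x) = \frac{m(x)\, e^{-\lambda f(x)}}{Z(\lambda)},
\qquad
Z(\lambda) = \int m(x)\, e^{-\lambda f(x)}\,dx,
\qquad
-\frac{\partial \ln Z}{\partial \lambda} = F .

% At \lambda = 0 the left side of the last equation is just the mean of f under m:
-\left.\frac{\partial \ln Z}{\partial \lambda}\right|_{\lambda=0} = \int m(x) f(x)\,dx .

% So if m already satisfies the constraint, \lambda = 0 solves it and P(x) = m(x).
```

Under these assumptions a redundant constraint contributes a factor $e^{0} = 1$ and leaves the reference distribution untouched.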

One way to interpret the Lagrange multiplier is as a “relevance measure”. The further away from zero it gets, the more relevant the constraint is for determining the distribution. This is nothing new, since multipliers are often interpreted as measuring the strength of the constraints. Note that the normal distributions used in the previous post maximize the entropy subject to a second moment constraint, where the multiplier is given by $\lambda = 1/(2\sigma^2)$.

If we rewrite that Bayesian posterior using these “relevance measures”, Bayes Theorem turns out to be weighting the data by “relevance”, and the new “total relevance” is their sum! It’s already clearer how Bayes gets this right, but more importantly, this hint is enough to create some very interesting theoretical and practical research ideas. The wages of philosophical virtue are research life.
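A minimal sketch of that weighting, under assumptions that are mine rather than quoted from the post: each data point is a normal measurement of the same location with a known standard deviation, the prior on the location is flat, and the "relevance" of a point is $1/(2\sigma^2)$. The function name `combine` and the numbers are purely illustrative.

```python
def combine(measurements):
    """Combine normal measurements of one location under a flat prior.

    measurements: list of (value, sigma) pairs (assumed model, see lead-in).
    Returns (posterior_mean, posterior_sigma, total_relevance).
    """
    # "Relevance" of each data point: the multiplier 1 / (2 sigma^2).
    relevances = [1.0 / (2.0 * sigma ** 2) for _, sigma in measurements]
    # Total relevance of the posterior is the sum of the individual ones.
    total = sum(relevances)
    # Posterior mean is the relevance-weighted average of the data.
    mean = sum(r * x for r, (x, _) in zip(relevances, measurements)) / total
    # Convert the total relevance back into a standard deviation.
    sigma_post = (1.0 / (2.0 * total)) ** 0.5
    return mean, sigma_post, total


# One accurate point, plus one point so inaccurate it is nearly irrelevant.
accurate_only = combine([(10.0, 1.0)])
with_junk = combine([(10.0, 1.0), (500.0, 1.0e6)])
print(accurate_only[0], with_junk[0])
```

With these inputs the wildly inaccurate point carries a relevance about twelve orders of magnitude smaller than the accurate one, so the posterior mean barely moves, with no estimator having been chosen by hand.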

Now back to the Frequentists as they try to patch SEV up enough to serve as the foundation to problems Laplace thought easy 200 years ago.

September 17, 2013 • Brendon J. Brewer

“Now back to the Frequentists as they try to patch SEV up enough to serve as the foundation to problems Laplace thought easy 200 years ago.”

One of the saddest things I’ve seen was a smart research student in statistics presenting a vast and detailed project on all the various methods that had been proposed for solving a particular problem. There were all sorts of elaborate methods presented, like using a p-value as a test statistic for calculating another p-value.

Jeffreys could have written the general solution in about 1/3 of a page.

September 18, 2013 • Joseph (author)

That’s certainly been my experience as well. The most glaring example happened in Iraq. Basically we had no access to any research libraries or any programming languages other than Excel VBA (which is almost unusable for any task other than simple Excel automation).

We had some stuff to do and the frequentists just threw up their hands saying “it’s a major Ph.D. research problem and we have neither the time nor the journals to do it”. I worked it out in a Bayesian way over a couple of evenings in my downtime and (just barely) got VBA to do it.

I’m really curious how often this happens, because I’m not at all convinced it’s a universal phenomenon. I wouldn’t be surprised if people had anecdotal evidence suggesting the opposite.

I was really more thinking about people who waste their entire careers creating new estimators and figuring out their properties. Wasserman mentioned an example the other day: what are the confidence intervals for kernel-density estimates? That’s about as pure an example as you can get of the wrong philosophy causing talented researchers to waste huge amounts of time on completely irrelevant problems. The CI intervals for kernel-density estimates are never never never never ever relevant or even meaningful.

September 18, 2013 • Daniel Lakeland

RE confidence on kernel density estimates… well I don’t know what to say about “confidence” intervals really, but I can say that I’ve seen people in seismology who were interested in the frequency with which certain sizes of earthquake events occur on a fault. I think it’s fair to say that most faults are in equilibrium on human observational timescales, at least with respect to small events and so this histogram is more or less a stable property of the fault (unlike in many of your examples where you rightly point out that frequencies are not in general stable physical properties).

They were looking at histograms of events on one fault vs another and trying to determine if there were real “gaps” in these histograms that would indicate that certain sized events are somehow precluded from occurring. There were physical reasons to believe in so called “characteristic sized events” which would soak up strain energy that might otherwise be released in somewhat larger events. One guy was looking at the ratio of histogram counts between two nearby faults, and it wasn’t at all convincing that the differences between faults were anything other than random noise. He himself basically agreed with this.

It seems reasonable to ask “are there differences in the frequency with which various sized events occur between these two faults” and to try to answer that using say (from a bayesian perspective) credible intervals on differences in log density.
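One hedged way to sketch such credible intervals, assuming the two faults are binned on the same equal-width magnitude bins (so a log ratio of bin probabilities equals a log ratio of densities) and assuming a symmetric Dirichlet prior on the bin probabilities. The function name `log_ratio_intervals`, the prior, and the counts are illustrative choices, not anything taken from the seismology work described.

```python
import numpy as np


def log_ratio_intervals(counts_a, counts_b, prior=1.0, draws=4000,
                        level=0.95, seed=0):
    """Per-bin credible intervals on log p_A - log p_B.

    counts_a, counts_b: event counts per bin, same bins for both faults.
    prior: symmetric Dirichlet pseudo-count added to every bin (assumption).
    """
    rng = np.random.default_rng(seed)
    alpha_a = np.asarray(counts_a, dtype=float) + prior
    alpha_b = np.asarray(counts_b, dtype=float) + prior
    # Posterior draws of the bin probabilities for each fault.
    p_a = rng.dirichlet(alpha_a, size=draws)
    p_b = rng.dirichlet(alpha_b, size=draws)
    # Difference in log probability, bin by bin, for every draw.
    diff = np.log(p_a) - np.log(p_b)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(diff, [tail, 100.0 - tail], axis=0)
    return lower, upper


# Hypothetical counts of events per magnitude bin on two faults.
lower, upper = log_ratio_intervals([120, 60, 25, 4], [110, 70, 9, 5])
for lo_k, hi_k in zip(lower, upper):
    print(f"[{lo_k:+.2f}, {hi_k:+.2f}]")
```

Bins whose interval excludes zero are candidates for a real difference between the faults; intervals that straddle zero are consistent with the "random noise" reading mentioned above.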

September 18, 2013 • Daniel Lakeland

Note, this is actually a really interesting problem: given a set of samples, construct a Bayesian estimate of the pdf of the generating process. I suppose we could put a Gaussian process prior on the log density with a “mean” function that captures our asymptotic expectation in the tails, then write a likelihood for the samples assuming the samples are IID from the density, and hand the whole thing to Stan. It would be an interesting comparison between that method and the kernel density estimate.

September 18, 2013 • Joseph (author)

Daniel,

Frequencies are always fine. If you’re trying to reason about frequency distributions then create a distribution which contains the true frequency distribution you care about in its high probability manifold. You’re off and running.

Kernel-density estimations don’t do that. But more importantly, the vast majority of the time they’re not even trying to do that. The actual goal of kernel-density estimation most of the time is to create a distribution P(x) which contains the true value of x in its high probability manifold. It’s not even trying, in truth, to mimic the frequency diagram of a string of future x’s. To the extent to which Frequentists confuse their goals with the real goals, it just hinders and confuses the problem. In practice, I’ve seen this confusion seriously derail uses of kernel density estimation.

From a Bayesian perspective, Frequentists are trying to hijack kernel-densities to create a which is very sharply peaked about and then to patch up this nonsense using confidence intervals. The best thing you can say about it is that it’s a complete waste of time. As Gelman mentioned once, a great deal of mathematical statistics is spent answering irrelevant questions.

September 18, 2013 • Joseph (author)

Incidentally, the disconnect between the real purpose of kernel density distributions and frequentists’ goals is a good example of why “machine learning” seems disconnected from statistics. Many people don’t consider kernel density estimation a part of statistics. Others think it’s part of statistics, but that it doesn’t fit comfortably with the rest of the subject.

From my Jaynesian view of the subject, they make perfect sense. The overall goal is to get distributions which contain the true value in their high probability manifold. ANY METHOD you can dream up for doing that is perfectly legitimate.

This is also related to Hierarchical modeling, which not only seems wrong from a frequentist perspective, but also seems not quite right from most Bayesian perspectives. In fact, it was controversial for a long time in both camps. But again, from my Jaynesian perspective ANY METHOD you can dream up that puts the true value you care about in the high probability manifold works.

You’re only limited by your imagination and the kind of information you have available.

September 18, 2013 • Daniel Lakeland

I think KDE is more or less a way to get one sample F(x) from the high probability region of a distribution over frequency distributions, where F(x) is some kind of “best frequency distribution” from which your data appears to be IID. If you want to compare the F values you will need to see the uncertainty implied by that distribution; in other words, to get more samples or somehow observe how much you know about F(x).

Even before I understood very much about Bayesian statistics, that always seemed to me to be more or less the point of KDE. For example I often used KDE to see the distribution of waist circumference in different groups of people in a dataset I was using to help pants manufacturers, and was looking for not only whether the typical circumference was different, but if the *shape* of the distribution was different in the different groups.

Anyway, I agree with you that we’re not really interested in confidence intervals (ie. repeated sampling arguments) about KDE, but we *are* interested in the precision with which we know a model for the PDF of data that is considered IID from some fixed but unknown frequency model and perhaps KDE bootstrapping or whatever could be a decent way to get that without enormous computational complexity.

September 18, 2013 • Joseph (author)

Daniel,

The vast majority of the time kernel density estimation is used, you simply aren’t interested in the frequency distribution F(x) at all, in any way.

If you really are interested in F(x), then making the assumption that the data is being drawn IID from some magical urn/population is equivalent to assuming a very specific relationship between past and future frequencies. In the vast majority of applications there is no urn and there is no such fixed “population”. So this very specific connection between past and future is just nonsense.

September 18, 2013 • Daniel Lakeland

At least the vast majority of the time that *I* used KDE I was interested in constructing an F(x) (frequency distribution) which made the actual data have high likelihood under a repeated IID sampling assumption (ie. the likelihood function was large)… and then I wanted to see if that F was similar to or different from some other F from some other data in some other situation.

Now it’s fine to argue that often IID sampling is a poor model for the future, but sometimes it’s an ok model too (in the seismology example for instance it probably is, in sufficiently well mixed survey sampling of people it’s not bad, in manufacturing or quality control it can be useful etc). But I acknowledge it’s an important caveat.

So let’s just assume we are in a situation in which we expect a future sample will look a lot like our current sample, and IID isn’t bad. We want to know what the frequencies are going to be for various values. I think we both agree that a Bayesian method would be to somehow construct a probability distribution over frequency pdfs F(x). Samples from the high probability region of this probability distribution will look like frequency distributions, and if we’ve done a good job of constructing the Bayesian probability distribution, future data will look like it comes from some frequency distribution F that is in the high probability region of our probability distribution. And if our assumption that future samples look like current samples is right, then future samples will look like they come from this F as well.

Such a process is possibly pretty computationally intensive though, so using KDE might let us get one F that would be likely to be in our high probability region without doing the full computation. And maybe bootstrapping would be likely to get us a few more F values that are also in the high probability region. In other words I think of KDE and bootstraps of KDE as a computational shortcut to get a Bayesian idea of what frequency distributions from some full Bayesian analysis look like.
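A rough sketch of the shortcut described above: a hand-rolled Gaussian KDE plus an ordinary bootstrap of it. This is only a plain bootstrap, not a Bayesian posterior; whether the resampled curves really land in the high probability region of a full analysis is the conjecture being discussed, not something the code establishes. The function names, the fixed bandwidth, and the simulated data are all assumptions for illustration.

```python
import numpy as np


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian kernel density estimate on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)


def bootstrap_kde_band(samples, grid, bandwidth, n_boot=200, level=0.9,
                       seed=0):
    """Return (kde, lower, upper): the KDE and a pointwise bootstrap band."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    curves = np.empty((n_boot, len(grid)))
    for i in range(n_boot):
        # Resample the data with replacement and re-estimate the density.
        resample = rng.choice(samples, size=len(samples), replace=True)
        curves[i] = gaussian_kde(resample, grid, bandwidth)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(curves, [tail, 100.0 - tail], axis=0)
    return gaussian_kde(samples, grid, bandwidth), lower, upper


# Simulated stand-in data; real use would pass the observed sample.
data = np.random.default_rng(1).normal(size=300)
grid = np.linspace(-4.0, 4.0, 81)
kde, lower, upper = bootstrap_kde_band(data, grid, bandwidth=0.4)
```

Comparing two such bands, one per condition, gives a cheap first look at whether two frequency distributions differ by more than resampling noise.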

If I’ve described the whole idea clearly enough, I think you will basically agree. Note that I’m perfectly happy agreeing with you that a frequentist who interprets KDE in some other way is perhaps missing the point. However, since KDE is applicable primarily to situations where repeated sampling leads to samples that are exchangeable (ie. where physically things aren’t changing much from one sample to the next) this may be a situation where frequentist methods and bayesian methods naturally converge.

September 18, 2013 • Joseph (author)

Daniel,

I don’t object to you saying that , my objection is to assuming is about the size you’d get if they were really drawn from an urn/population when in fact there is no urn/population. If you have information which says they’re going to be close then set based off that information.

So am I right in understanding that seismologists aren’t much interested in the size of the next big earthquake, and they aren’t much interested in specifying that with as little uncertainty as possible, but they are heavily interested in predicting the future frequency of earthquakes by size?

September 18, 2013 • Daniel Lakeland

I see now what you’re saying, and I think I didn’t describe the idea well enough. Suppose you have some conditions A under which you believe that the data X_A come from a frequency distribution F_A. And you have some conditions B under which the data X_B come from F_B. Now you’d like to determine whether F_A and F_B are the same or if they are different histograms (ie. if conditions A and conditions B cause different frequency distributions). You aren’t assuming that F_A and F_B are drawn from an urn full of frequency distributions, you’re just trying to find out what the data X_A and X_B tell you about the two frequencies.

Also, seismologists are interested in lots of things. One of the things they’re interested in is the mechanisms of faulting, how they occur, and what triggers earthquakes (the guys I dealt with had strong physics backgrounds). This is related to models of the frictional processes on the fault.

If you look at a particular fault, there could be some physical process that causes this fault to rupture under some particular conditions, and if the conditions need to exceed those conditions in order to store enough energy for a Magnitude X quake, then you might be able to see the frequency histogram fall off somewhat below X, and maybe there would be an “excess” of large events near this threshold slightly below X as well, because the fault tends not to store enough energy to get beyond X. If you can see that in fault A and not in fault B then you can infer things about differences in the frictional properties of the two faults for example.

September 18, 2013 • konrad

“The CI intervals for kernel-density estimates are never never never never ever relevant or even meaningful.”

I agree, provided the KDE is an approximation of a probability distribution. However, I think the general suggestion can be rephrased sensibly (and I think this is basically the point Daniel is trying to make):

If we have a generative model, i.e. a model that describes the data set (counterfactually) as having been produced by sampling from some frequency distribution, then that frequency distribution may be describable using a (possibly high-dimensional) parameter vector. Now we can think of the KDE methodology as producing a point estimate of the parameter vector. And then it makes sense to ask for a posterior distribution of the parameter vector rather than a point estimate. Or for a point estimate and a confidence/credibility interval, if an approximate answer is sufficient.

September 18, 2013 • Daniel Lakeland

konrad, yes that’s a very sensible re-statement of my point.

September 19, 2013 • Joseph (author)

Konrad and Daniel,

The last time I created a Kernel Density distribution which anyone cared about, it was a first step in trying to predict the location of IEDs (improvised explosive devices) over the next month in a specific patch of dirt somewhere on planet earth. There were 8 IEDs over that next month.

Where in this do you see any sampling? Where do you see a frequency distribution? Where do you see anything other than people trying to predict future occurrences with as little uncertainty as possible, of individual events which mother and human nature are free to arrange in any way they want?

On a good day, when everything works out perfectly, the best you can say for that “sampling from a frequency distribution generative model” blah blah is that it serves as an acceptable prediction of those future locations with a fair amount of uncertainty. And that is the absolute best case. And it’s rare. In this example the Kernel Density is just a first step and doesn’t even come close to the best case.

But even in the best case, to the extent that it confuses the real goal with determining the properties of some mystical “population” it tends to gum up or limit the results. In many cases I’ve seen it completely derail any sensible conclusions at all.

If you want to make inferences about x then create a good P(x) which describes where x is located. If you want to make inferences about a frequency f then create a good P(f) which describes where f is located. What you’re talking about doing is using a bastardized version of the latter to serve as a taped-up-glued-together version of the former. That might sort of be acceptable if it worked often, but it doesn’t, and in the process of failing it confuses everything to the point at which it’s nearly impossible to work towards the real goal without tripping up.

And as evidence of that I give you Wasserman, who thinks it a major research priority to find confidence intervals for each value of the distribution. I would love to know what exactly that has to do with predicting the location of the IED that’s going to blow you up.

September 19, 2013 • Joseph (author)

Also, to build on the last comment, when would that “data generation/sampling from frequency distribution” make sense?

When you have a great many future x’s and somehow knew their frequency distribution accurately ahead of time, but you had absolutely no other information about the x’s. In that case choose your P(x) to match that frequency distribution.

I’m not saying there are no problems that fit that description, but it’s rare in the real world. The point is that frequentists and most bayesians (who retain way too much frequentist intuition) want to interpret every statistical problem as if it was of that type.

September 19, 2013 • Daniel Lakeland

Joseph, I think it’s fair to say that the future isn’t like the past anywhere near as often as statisticians would like. But that doesn’t mean the application of KDE that konrad and I are talking about isn’t relevant.

From the seismology example, you sample earthquakes for 10 years, then split the series into the first 5 years and the second 5 years. You want to know if conditions changed on the fault such that the second 5 years produced a different distribution of earthquake sizes than the first. You generate the two KDEs and graph them. You see a difference in the distribution. That, combined with your physical theory about why such a difference should have existed, could easily constitute evidence for your theory.

I also saw a guy who was treating mice with cocaine, and then monitoring how they spent their time and where in the cage they were. He created a bastardized KDE that he sort of cooked up on his own, and then tried to detect which behaviors were different (ie. drinking, nesting, eating, mating, each of which tended to occur in different parts of the cage).

I think in most legitimate uses of KDE it is a way to get a bastardized MAP estimate for a generative model, and then compare that estimate to estimates under other conditions, so that uncertainty in the generative model becomes relevant.

I’m happy to agree that a “full” bayesian analysis of this problem might be a better way to go, but KDE is the kind of thing we can push a button and get without a couple hours of writing a bayesian model and running bayesian simulations, so it has its uses.

September 19, 2013 • Daniel Lakeland

Also another thing I should point out, in the cases I’m talking about, for the most part we’re not interested in using the KDE to predict the future, we’re using two KDEs to detect “non-exchangeability” between two sets of data. In this sense, it has a very very bayesian feel, since the data is fixed (whatever data we have from the various conditions) and we’re interested in an unobservable generative model which is more or less a very high dimensional bayesian parameter vector, and we want to see if there are differences in this parameter vector which some kind of scientific theory predicts there should be…. so I agree with you that DATA + KDE + CONFIDENCE INTERVAL -> PREDICTIONS + UNCERTAINTY is a pretty crappy way to go… but DATA IN DIFFERENT CONDITIONS + SEVERAL KDES + SCIENTIFIC THEORY -> TEST OF SCIENTIFIC THEORY is a fairly Bayesian sort of a thing even if KDE is a severe shortcut on a more full bayesian analysis.

September 19, 2013 • konrad

Joseph, you are not arguing against our proposed interpretation of KDE so much as against the use of generative models as a description of physical systems. It sounds like you are doing this because you are _only_ interested in prediction problems.

But, to echo what Daniel is saying, in basic science (as opposed to applied science) prediction problems of the type you are describing are _far less common_ than problems where the aim is merely to describe and characterize some system. In such problems, prediction is usually not a realistic aim because we know far too little about the system – e.g. in the seismology example, attempts to predict the time and location of future earthquakes are spectacularly unsuccessful, but that doesn’t mean we can’t learn a great deal from earthquake data about how the system actually works.