## When did statistics jump the shark?

Statistics jumped the shark the moment the field adopted the following definition (Gelman & Hill, page 13):

> A probability distribution corresponds to an urn with a potentially infinite number of balls inside. When a ball is drawn at random, the “random variable” is what is written on the ball.

for situations where no such “urn” or “population” existed. To explain why requires an answer to the question: when will data appear to be drawn from a frequency distribution $F$?

The answer is given by the Entropy Concentration Theorem. Suppose $F$ results from maximizing the entropy $-\sum_i F_i \ln F_i$, subject to average constraints of the form

$$\frac{1}{n}\sum_{i=1}^{n} g_j(x_i) = \bar{g}_j, \qquad j = 1, \dots, m \tag{1}$$

Then almost any data $x_1, \dots, x_n$ which satisfies (1) will look approximately like $F$.
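As a concrete numerical illustration of this concentration effect (my own sketch, not from the original post): sample sequences over the small alphabet {0, 1, 2}, keep only those satisfying a mean constraint of the form (1), and watch their empirical entropies pile up just below the maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60                                   # sequence length
xs = rng.integers(0, 3, size=(120_000, n), dtype=np.int8)

# Keep only sequences over {0, 1, 2} satisfying the mean constraint (1/n)*sum(x_i) = 1.
xs = xs[xs.sum(axis=1) == n]

# The maxent distribution on {0, 1, 2} with mean 1 is uniform (the Lagrange
# multiplier of a centered mean constraint vanishes), so the max entropy is ln 3.
max_entropy = np.log(3)

counts = np.stack([(xs == k).sum(axis=1) for k in (0, 1, 2)], axis=1)
freqs = counts / n
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(freqs > 0, freqs * np.log(freqs), 0.0)
entropies = -terms.sum(axis=1)

# Concentration: almost every constrained sequence has an empirical
# distribution whose entropy sits within a whisker of the maximum.
frac_close = (entropies > max_entropy - 0.15).mean()
```

No causal story about how the sequences were produced enters anywhere; the concentration is purely a counting phenomenon.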

If a physical system imposes (1) on the data, then each new data set will appear as though it’s a set of “random draws” from the same “population” $F$. Of course, there is no randomness here; it happens because almost every possibility results in that outcome. Almost no matter what special causes produced $x_1, \dots, x_n$, as long as (1) is satisfied, the data will likely fool anyone ignorant of the Entropy Concentration Theorem into thinking it’s a “random draw” from some infinite magical urn.

This is not a rare phenomenon. Since all the common distributions (Normal, Uniform, Poisson, Gamma, Binomial, and so on) are maximum entropy distributions subject to just such constraints, this effect covers most of the undergraduate statistics curriculum. This is why so much of common statistics is tied to “average” type quantities like those appearing in (1).

This also explains how deterministic algorithms can produce “random numbers”. To get a “random” sample from N(0,1), all you need to do is create any sequence $x_1, \dots, x_n$ which satisfies $\frac{1}{n}\sum_i x_i \approx 0$ and $\frac{1}{n}\sum_i x_i^2 \approx 1$. Your work is mostly done, since virtually any such sequence will look approximately like a N(0,1), and the fact that the sequence isn’t “randomly” generated, whatever that’s supposed to mean, is completely irrelevant.
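A sketch of this in Python (the midpoint-quantile construction is my own choice of deterministic sequence): the sequence below is computed with no random number generator at all, yet it satisfies the moment constraints and sails through a standard distribution-fit test against N(0,1).

```python
import numpy as np
from scipy.stats import norm, kstest

# A fully deterministic sequence: normal quantiles at evenly spaced midpoints.
n = 10_000
x = norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# It satisfies the N(0,1) constraints: mean ~ 0 and second moment ~ 1 ...
mean, var = x.mean(), x.var()

# ... and a Kolmogorov-Smirnov test against N(0,1) finds nothing amiss.
pvalue = kstest(x, "norm").pvalue
```

Nothing about the construction is “random” in any metaphysical sense; the sequence simply satisfies the right constraints.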

The original sin of Frequentist statistics is to misinterpret this special case and then insist that all statistics be so misinterpreted. In most real social science applications, for example, the next observations will not satisfy (1) again. The future doesn’t resemble the past that way very much. That’s the chief reason p-values and CI’s get things so horribly wrong in the social sciences.

From a Bayesian perspective, frequency distributions like $F$ are just a fact, a one-off event, to be observed, predicted, or inferred as desired, no different from any other datum. That we can sometimes predict these well is explained in essence in the posts “Noninformative Priors” and “Data Science is inherently limited”. Basically, the Entropy Concentration Theorem shows the mapping from the inputs $x_1, \dots, x_n$ to $F$ is surprisingly insensitive to those inputs whenever they satisfy (1). Being otherwise ignorant about those inputs thus doesn’t stop you from accurately estimating $F$.

It would be a kindness if Statisticians stopped fantasizing about mystical forces called “randomness” governing phantom “populations” drawn from fabled “urns”, and just got back to the real world.

August 20, 2013 · Daniel Lakeland

I think it’s fine to think of a “random variable” as a big sequence of draws from the distribution. I actually really like Per Martin-Löf’s definition of a uniform random bitstream as any stream that passes some “ultimate” computable statistical test based on the assumptions implied by Bernoulli trials. We can then define random number generators as mappings from uniform bitstreams to random variables measured at some finite precision.

The KEY, though, is not to forget which is the random variable and which is the data.

If you measured data $x_1, \dots, x_n$, did these *data* come from simple repeated sampling of a constant distribution and a random bitstream? In some instances this is an OK approximation (like maybe survey sampling, or a manufacturing assembly line).

In general, though, the Bayesian viewpoint only allows us to always take this view of *parameters*. Parameters are unobserved; when we specify a likelihood we are saying “this is our tolerance for deviations between predictions and actual data,” and that tolerance implies that the parameters have a probability distribution, which we can summarize by sampling from it repeatedly.

Sometimes if we assume some kind of “repeated sampling” generates the *data* then we can automatically get a likelihood/tolerance without much thought. But in other circumstances, we need to choose the tolerance purely in order to force our predictions to be close to our outcomes so we can see which values of the parameters were reasonable to believe in. And in these cases, we need not interpret the likelihood as anything like a repeated sampling data generating mechanism.

August 20, 2013 · Daniel Lakeland

Another way to put that is: a lot of people think “the likelihood” is kind of an automatic thing, just take whatever distribution generated the data, together with the actual data, and consider it as a function of its parameters. But this is highly misguided and comes from Frequentist intuition. There is no scientific basis to believe that the data is generated by some single “distribution”. In realistic cases we need to describe features of our tolerance for errors between prediction and measurement, and doing so need not have any a-priori choice of distribution associated with it. In fact, we can probably get good results from lots of distributions.

In my dissertation I have a problem where I use a modified gaussian process to describe the tolerance for errors of an ODE and a bunch of measurements (a highly multivariate timeseries problem). The covariance function was non-stationary to reflect my knowledge that the model predicts better at certain times (when kinetic energy is high) than at other times (when kinetic energy is dominated by thermal noise).

In other situations it might be fine to use a bunch of triangular distributions, uniform distributions, Epanechnikov distributions, t distributions, things that change at different times and in different ways, with dependencies described in all sorts of ways. You’re looking to describe what you believe is reasonable about the difference between predictions and reality. This is as subjective as, if not more subjective than, the typical prior.
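A toy illustration of that last point (my own sketch; the data, grid, and the pair of likelihoods are invented for the example): inferring a location parameter under a normal likelihood and under a heavier-tailed t likelihood lands in essentially the same place, so the exact distributional form of the tolerance need not be sacred.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=50)

mu_grid = np.linspace(0, 6, 601)          # flat prior over this range

def posterior(logpdf):
    # Grid posterior under a flat prior: normalize exp(log-likelihood).
    loglik = np.array([logpdf(data, mu).sum() for mu in mu_grid])
    w = np.exp(loglik - loglik.max())
    return w / w.sum()

# Same data, two different "tolerances for error":
post_normal = posterior(lambda d, mu: stats.norm.logpdf(d, mu, 1.0))
post_t = posterior(lambda d, mu: stats.t.logpdf(d - mu, df=4))

mean_normal = (mu_grid * post_normal).sum()
mean_t = (mu_grid * post_t).sum()
```

Both posteriors concentrate near the same value; which tail behavior you choose matters far less here than the fact that either one expresses “predictions should be within about one unit of the data.”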

August 21, 2013 · Brendon J. Brewer

“Another way to put that is: a lot of people think “the likelihood” is kind of an automatic thing, just take whatever distribution generated the data, together with the actual data, and consider it as a function of its parameters. But this is highly misguided and comes from Frequentist intuition.”

Agreed. The “sampling distribution” is not a fact or a “data generation mechanism” or anything like that, unless you’re trying to infer something about someone’s Monte Carlo code. The sampling distribution $p(x \mid \theta)$ is a model for some agent’s prior beliefs about the data’s relation to the parameters. Bayesian inference is nothing more than describing prior beliefs on the joint space of parameters and data and then updating by deleting all that is now known to be false.

Randomness has nothing to do with anything. I pretty much only use the word random to describe code that calls a “random number generator”. Other than that it’s a useless term.

August 21, 2013 · Joseph (author)

Well, the post was about frequencies, not probability distributions.

Frequentists believe their goal is to determine stable limiting frequency patterns. Bayesians like me believe our goal is to pin down individual facts with as little uncertainty as possible.

Of course, stable limiting frequencies don’t exist very often – Mother Nature is very uncooperative in that regard – which severely limits the applicability of their viewpoint.

The irony is that if the “individual fact” we wish to pin down is a frequency distribution, stable or not, then the Bayesian viewpoint can not only take care of this special example easily, but provides a great deal more insight into what’s really happening than any Frequentist was ever able to intuit.

Frequentist philosophy isn’t even that great when you really are concerned about frequencies.

August 21, 2013 · Corey

“Almost no matter what special causes produced $x_1, \dots, x_n$, as long as (1) is satisfied, the data will likely fool anyone ignorant of the Entropy Concentration Theorem into thinking it’s a “random draw” from some infinite magical urn.”

Quoted for truth. But even though the math is clear, I’d like to have a better sense of how “…a physical system imposes (1) on the data…” works. Can you give a few examples?

August 21, 2013 · Joseph (author)

Corey,

I think it rarely does. That’s why so many standard frequentist methods fall flat on their face. Take a typical example where a statistician observes a unimodal frequency distribution and tries to model it with a $N(\mu, \sigma^2)$.

After some rigmarole they will set $\mu$ to be the sample mean and $\sigma^2$ to be the sample variance, so the data satisfies conditions like those in (1).

Then they check the fit to see if it passes a test for normality. Essentially, they’re testing to see how close the theoretical entropy is to the empirical entropy (I’m skipping some steps here but that’s the long and short of it), thereby verifying that the data isn’t one of those minority cases which contradict the Entropy Concentration Theorem for mean and variance constraints.
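That entropy comparison can be checked numerically (a sketch of my own, using SciPy’s nonparametric entropy estimator): for data that would pass a normality check, the estimated differential entropy of the sample nearly matches the theoretical entropy $\frac{1}{2}\ln(2\pi e \sigma^2)$ of the fitted normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=20_000)

# Theoretical entropy of the fitted normal: 0.5 * ln(2*pi*e*sigma^2).
sigma2 = x.var()
normal_entropy = 0.5 * np.log(2 * np.pi * np.e * sigma2)

# Nonparametric estimate of the sample's differential entropy.
empirical_entropy = stats.differential_entropy(x)

# Near zero when the data "looks normal"; large when it doesn't.
gap = normal_entropy - empirical_entropy
```

A sample whose empirical entropy falls well short of the fitted normal’s entropy is exactly a sample that a normality test would flag.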

Once done, now they think they’ve modeled the “data generation mechanism”. In truth they’ve probably done just the opposite, for reasons given in that post “Data Science is inherently limited”. They’re probably very, very ignorant of what caused the data to be the way it is, since the frequency pattern they observed could be caused by almost anything that leads to those conditions (1), even if it only accidentally leads to those conditions being satisfied (data always satisfies some constraints no matter what).

But having fooled themselves into thinking they’ve learned something, they confidently proceed under the effective assumption that new data will also satisfy the same constraints (1) with the same values $\bar{g}_j$. Their p-values and CI’s are all dependent on that assumption being true. It’s almost always false.

Having said that, if you search hard enough you can find examples where conditions (1) are being physically constrained. In my oldest son’s school they try to balance student abilities in each class, so it’s not such a surprise when test scores from each class satisfy the same averages, for example. You can certainly find examples in physics where the order-of-magnitude of frequency variations can be set by thermodynamic considerations or some such.

A more productive avenue is to think about the ways in which conditions (1) can fail to hold for new data. Your choices aren’t simply (a) they hold or (b) they’re completely different. There is a whole world of possibilities between those two.

For example, for new data the conditions (1) can hold but with different values $\bar{g}_j$. Or as time goes along, some condition which initially had a zero Lagrange multiplier in the maxent procedure starts to become important (the Lagrange multiplier starts to increase), which effectively introduces a new constraint not initially present. There are a lot of interesting possibilities, many of which haven’t been explored much.

August 21, 2013 · Joseph (author)

Also Corey, I forgot to mention the most important example where it does hold. The energy of a closed system is constant as a function of time. So the microstate $x_t$ adheres to the constraint:

$$E(x_t) = E_0 \quad \text{for all } t$$

This has the form of (1). New “data” is generated each moment as the equation of motion evolves, putting the system in a new state $x_t$.

August 21, 2013 · Daniel Lakeland

From the standpoint of building a model of something, we often want to say that what’s going on is

$$d_i = M(s_i) + \epsilon_i$$

where the $d_i$ are measurements (data) and the $s_i$ are state variables (big “vectors” or more complex objects), some of which may be observed “covariates” and others of which may be unobserved “parameters”, and $\epsilon_i$ is there to acknowledge that we aren’t going to get exact agreement between the predictions and the results. $M$ could be anything from a nonlinear PDE to a simple linear regression type formula.

If our model is a good one, the $\epsilon_i$ values will be “small” (the discrepancy will be small as a fraction of the typical size of a measurement). Our model can incorporate all sorts of things which help to provide continuity of predictions between previous values and future values. Nature is always free to break our model, but when the model is good it continues to have small $\epsilon_i$ values and their general size will remain stable. It is under these conditions that modeling $\epsilon_i$ as coming from some fixed distribution is a reasonable model. If there are time or spatial variables involved, we might model $\epsilon$ as a process (like a gaussian process or something else) so that we can incorporate the connections in time.

When our model isn’t extremely good, then it does a bad job when $s$ goes to some region of state space, and then the $\epsilon_i$ values need not look like they come from the same distribution as when the model was predicting reasonably well.

I think it’s this kind of situation where your “constraints” come from: in essence, when the model is predicting well, it will constrain the discrepancies to have some properties, like a mean, or a variance, or a particular value for a particular quantile, or something. But nature is always free to do what it does, and it doesn’t have to continue to obey our model equation for all future states, so treating our discrepancy as repeated draws from a distribution can easily break.
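A minimal sketch of the situation described above (the quadratic “truth” and linear model are invented for illustration): while the state variable stays where the model fits, the discrepancies stay small and stable; push the state into a new region and they stop looking like draws from the old distribution.

```python
import numpy as np

rng = np.random.default_rng(3)

def truth(s):
    return 2.0 * s + 0.05 * s**2          # nature's actual relationship

# Fit a linear model M(s) = a*s + b on a narrow range of the state variable.
s_train = np.linspace(0, 10, 200)
d_train = truth(s_train) + rng.normal(0, 0.1, s_train.size)
a, b = np.polyfit(s_train, d_train, 1)

# Discrepancies where the model is "good": small, mean zero, stable in size.
eps_train = d_train - (a * s_train + b)

# Push the state variable into a new region: the same model now does badly,
# and the discrepancies no longer resemble the old ones at all.
s_new = np.linspace(30, 40, 200)
d_new = truth(s_new) + rng.normal(0, 0.1, s_new.size)
eps_new = d_new - (a * s_new + b)
```

The fitting procedure itself is what forced the training-range discrepancies to have mean zero and a modest spread; nature never promised to keep doing that outside the training range.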

August 21, 2013 · Joseph (author)

Daniel,

You’re retaining way too much frequentist intuition. Modeling “$\epsilon_i$ as coming from some fixed distribution” means describing the frequency distribution of the errors, as mentioned in this post.

To get a model useful in the example you described we don’t need to know the frequency distribution $f$ of the errors; we need a probability distribution $P(\epsilon_1, \dots, \epsilon_n)$. The entire point of “What do we need to model?” and “IID doesn’t mean what you think it does” was to show that what you need is a probability distribution such that the actual error vector $(\epsilon_1, \dots, \epsilon_n)$ is in its high probability manifold.

$P(\epsilon_1, \dots, \epsilon_n)$ is a completely different animal from $f$. The frequency distribution implied by the actual errors in the data isn’t usually going to look anything like one of the marginal distributions $P(\epsilon_i)$, nor should it.

I don’t know how else to say it but to reiterate: knowing where $(\epsilon_1, \dots, \epsilon_n)$ is in $\mathbb{R}^n$ is a separate question from knowing where $f$ is in function space. You need to think carefully about which one of these you want to know. If you want to model the former (as you do in your comment) then you need a distribution $P(\epsilon_1, \dots, \epsilon_n)$ such that $(\epsilon_1, \dots, \epsilon_n)$ is in its high probability manifold.

If you want to model the latter (which is not what you want for your example) then you need a probability functional $P[f]$ such that $f$ is in its high probability manifold. The subject of this post implicitly shows one way to do that.

Essentially, if $\epsilon_1, \dots, \epsilon_n$ satisfy constraints (1) then $f$ will be very close to the maxent solution $f^*$. So if $P[f]$ is a kind of functional which is sharply peaked about $f^*$ then you have good reason to believe $f$ is in the high probability manifold (assuming those equations (1) really are valid).

Either way, all probability distributions are about using knowledge K to locate something with as little uncertainty as possible (i.e. in as small a region as possible, as described by $P$). They are not frequency distributions!!! If you want to model frequencies for some reason, then let the thing being located be $f$ itself and have at it.

August 21, 2013 · Daniel Lakeland

I think you misunderstood my point. I agree that the probability distribution for $\epsilon_i$ in my equation needs to be a bayesian probability distribution and not the frequency distribution for anything. For example, it could be a different probability distribution for each $i$ and have a different parametric form each time, and each marginal one would look different, and hopefully be more narrow than the frequency distribution of the ensemble, etc etc.

But if the model is good, then it will take care of predicting $d_i$, and the *actual* $\epsilon_i$ values will generally be constrained to be “near zero” (this is implied by our definition of “good”, I guess).

Under these conditions, the ensemble of $\epsilon_i$ values that actually occur in some data set will look like draws from *some* maxent frequency distribution with the constraint that the mean = 0, just because the model induced that kind of constraint. In other words, nature doesn’t necessarily induce constraints of the form (1) that often, but the *process of modeling* can. If the model is good, it can induce other constraints on this ensemble too, keeping them relatively tightly distributed around 0 for example.

So while the frequency distribution of the ensemble of $\epsilon_i$ values and the bayesian probability distribution of the individual values need not be the same, in general the *goal of modeling*, which is to drive our Bayesian probability distribution of the errors towards a delta function at 0 individually, will also drive the frequency distribution of the ensemble of errors towards zero in some other way.

August 21, 2013 · Joseph (author)

Got it. That makes sense to me.

Here’s something to think about though. If you really know $f$ and that the $\epsilon_i$ are individually small, what does that tell you about the location of $(\epsilon_1, \dots, \epsilon_n)$ in $\mathbb{R}^n$? The answer may or may not be related to the distribution which a frequentist will want to use, namely

$$P(\epsilon_1, \dots, \epsilon_n) = \prod_{i=1}^{n} f(\epsilon_i)$$

which is what they really mean when they say “randomly drawn from a fixed population” and which results from their confusion about what’s really going on.

August 22, 2013 · Daniel Lakeland

So, if we assume that distribution for the $\epsilon_i$, it’s still a consistent Bayesian probability distribution (in that the true $(\epsilon_1, \dots, \epsilon_n)$ is in the high probability region); it’s just a lot less informative than it would be if we used all the potential information we might have about the likelihood of each error separately, right? That’s more or less the point of your previous post about what IID means, if I understand correctly.

The frequentist insistence on only using distributions of this form is essentially a kind of conservatism since these max-ent distributions are in some sense as vague as possible. When Bayesians take that frequentist intuition and use it in their Bayesian calculations, they are more or less dodging the question of how to construct a likelihood that uses all the information they might have available. When Andrew Gelman for example talks about how the typical Frequentist rails against the prior and lets the likelihood camel pass through the eye of the needle, this is one aspect of what he’s talking about I think.

August 22, 2013 · Joseph (author)

Daniel,

One way to get a distribution such that $(\epsilon_1, \dots, \epsilon_n)$ is in the high probability manifold is to observe where the $\epsilon$’s have been in the past, then create a probability distribution whose high probability region matches where old $\epsilon$’s have occurred. In practice, this is done by getting the frequency distribution (histogram) of the old $\epsilon$’s and matching a maximum entropy distribution to it.

This will work if the new $\epsilon$’s occur in the same areas as the old ones, or, if you get a string of new data, if they satisfy the same conditions (1) that the old data did. There are two problems or limitations with this, though.

One of those you mentioned. Namely, since the high probability region has to be big enough to encompass most of the past $\epsilon$’s, this represents a low information/high entropy special case. If you have any information telling you something about $(\epsilon_1, \dots, \epsilon_n)$ specifically, you can beat it using a lower entropy distribution with a smaller high probability region. To the extent Frequentists find this new distribution difficult or impossible to interpret as a frequency distribution, their philosophy is a hindrance.

A bigger problem however is that the future usually doesn’t follow the past that much, and so $(\epsilon_1, \dots, \epsilon_n)$ isn’t in the high probability region constructed from the past data. Just ask your typical stock trader; they’ll tell you all about it. To the extent Frequentists believe they’ve successfully modeled a “data generation mechanism” or a physical property called “randomness” or stable limiting frequencies, and thus believe the future will be like the past, their philosophy is a hindrance.

The future resembles the past in the way Frequentists would want basically whenever something physically forces new data to satisfy at least approximately the same equations (1) that the old data had. Without that, there’s little hope of it being true.

August 27, 2013 · Corey

I think a key concept missing from this discussion is exchangeability. The dividing line between probability and frequency becomes much clearer when it is realized that only if uncertain variables are exchangeable does expected frequency equal probability. Once that’s in sight, it becomes obvious why frequency obeys the probability axioms, and why identifying frequency with probability is inadequate, i.e., we often have information which breaks exchangeability.
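The link Corey describes can be written in one line (with the mild hedge that the identity actually requires only identical marginals, which exchangeability implies):

```latex
% For exchangeable X_1, ..., X_n, every X_i has the same marginal law, so the
% expected empirical frequency of an outcome a equals its probability:
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i = a\}\right]
  = \frac{1}{n}\sum_{i=1}^{n} \mathbb{P}(X_i = a)
  = \mathbb{P}(X_1 = a).
```

Information that distinguishes some of the $X_i$ from others breaks the symmetry, and with it the identification of probability with expected frequency.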

August 27, 2013 · Daniel Lakeland

Corey,

I never did really “get” exchangeability. Probably because I didn’t really find a comprehensive enough explanation. Your suggestion led me to look more carefully at it, and http://www.uv.es/~bernardo/Exchangeability.pdf seems like a pretty good resource.

I think you’re right: exchangeability is a kind of symmetry property. If some subset of observations is exchangeable, to me it means that we have no information with which to modify our probability model for the individual observations (our likelihood). If this is true, we *should* pick a probability model in which the ensemble of observations is consistent with being a single sample drawn from that distribution. To do anything else would imply that somehow we think the observations are “atypical”, but this breaks the symmetry assumption, since it constitutes information about the probability model.

In this sense, we can connect frequency to probability for *observed* values, (ie. in the likelihood) provided we have modeled-in (as best we can) all of our knowledge.

In order for future observed values to continue to be well predicted by such a probability model, we need the model to capture real scientific aspects of the process that generates them which *are* stable in time and hence we would expect that future observations would continue to be exchangeable with past observations.

In Joseph’s finance example, it’s clear that the passage of time (and the corresponding revealing of information about the world, such as prices, demands, and supplies of goods) automatically makes financial observations non-exchangeable.

August 27, 2013 · Joseph (author)

Corey and Daniel,

I love all that exchangeability stuff, but haven’t found it to be convincing to those not already convinced. Frequentists just view it as a physical assumption. A particularly egregious example comes from (quantum) statistical mechanics:

http://en.wikipedia.org/wiki/Identical_particles#Statistical_effects_of_indistinguishability

Perhaps more importantly, I’m convinced this would be a good avenue for research. I was fully convinced of that even before seeing this paper by Jaynes http://bayes.wustl.edu/etj/articles/applications.pdf which describes a finite version of the de Finetti representation theorem, where the “probability of a probability” distribution is sometimes negative.

There has to be more to this story than currently understood and important similar examples not yet discovered. I looked a few years ago to see if anyone had followed up on this paper. Besides a few mathematical generalizations, which were needed but don’t address fundamentals, I didn’t find anything.

Incidentally, I found out later that this particular paper was originally published in a conference proceedings which my Stats thesis adviser edited. He had no idea I’d be interested in it, but when he found out he gave me a copy of the book.

August 28, 2013 · Daniel Lakeland

I never did get that bit in statmech about indistinguishable particles either. It was always a sort of “wave your hands” and then move on kind of a thing. To me it seems to come down to what counts as a microstate over which we put a uniform (counting) prior distribution: is it a vector of energies (hence each energy is “distinguishable”), or is it the sum of all the energies, giving different N values in some sense “equal footing”?

One of the things that isn’t usually mentioned is that in actual applications of statmech we don’t know N exactly. At best we might typically know it to 3 or 4 sig-figs. If you put any reasonable uncertainty over N, and then put uniform priors on vectors of N energies, we are being inconsistent, because the number of microstates consistent with a bigger N is vastly larger (like N!). To be consistent with our knowledge of how big N is, we need to incorporate our knowledge of how much total energy there is while keeping it compatible with our separate knowledge of how many particles there are.

I can’t see how anyone who really thought about it for a few moments could think that the representation theorem, which is a pure probability theory result, could be a “physical assumption”. I’m not saying that people DON’T think that way, only that it *should* be obvious that it has nothing to do with physics.

Has anyone written a really good text on probability theory in the context of bayesian statistics? Is there even a probability theory text that discusses DeFinetti’s representation theorem, applications of entropy, concentration theorems, Kolmogorov complexity, etc instead of spending pages and pages on foundations of measure theory??

As I’ve said before: http://models.street-artists.org/2013/03/15/every-scientific-hypothesis-is-a-hypothesis-on-a-finite-sample-space/

I don’t find measure theory to be a convincingly useful framework for thinking about applications of probability, and it seems to occupy a lot of mental energy (and textbook pages) that should be expended on more important issues.

August 28, 2013 · Joseph (author)

Daniel,

You brought up so many things!

There must be dozens and dozens of physics papers railing against the indistinguishability lore in stat mech. It’s a part of the lore of Quantum Mechanics now and dutifully gets taught to everyone in statistical mechanics. It won’t go away until quantum mechanics itself gets subsumed into some greater theory.

Stat mech can and does consider variable N. This was done extensively by Gibbs. You can also include realistic errors on N and E by introducing constraints on the variance of those variables. The vast majority of the time, but not always, it makes no final difference, which is why it isn’t done more.

In graduate school, when I finally got over my hatred for Statistics and decided to take it seriously, I made the same mistake that many pure mathematicians do: I thought measure theory was the royal road to a deep understanding of statistics. I remember taking the measure theory/stats sequence after the main real analysis sequence and using it to help study for the Real Analysis qualifying exam.

It took a while to realize “measure theory = great” but “measure theory in statistics = complete waste of time”. It doesn’t illuminate a single aspect of statistics and just confuses fundamentals, while achieving nothing of use for applications.

There are no good texts on Bayesian Statistics yet, because the subject has significant development still to go. There are ones I enjoy, though, like Gelman, Jaynes, or that MacKay book, a free pdf of which you can get by googling “Information Theory, Inference, and Learning Algorithms”.

August 28, 2013 · Daniel Lakeland

Yes, I know the Grand Canonical ensemble and all that. But what I guess I didn’t make clear is that when we say there is a fixed number N of particles, and then go to say that the probability distribution over microstates is $e^{-E/kT}/Z$ vs $e^{-E/kT}/(N!\,Z)$, it actually doesn’t matter unless we’re going to consider systems that have different values of N, since $1/N!$ is just a constant, unrelated to the energies, and comes out in the normalizing wash if we only care about a single N.

In reality, we’re making 3 assumptions: (1) there’s an “average energy” scale (kT) (2) The distribution over N is something like normal around a fixed value with a small but nonzero sigma. (3) Any way that we can get a given energy is considered equally likely to occur.

Now we have to consider small variations in N, and when we do that, if we don’t normalize by the N! we cannot have both (2) and (3) at the same time. There are so many more ways to get a given energy with larger N that the unnormalized sum over microstates at a given energy makes an assumption inconsistent with the assumptions about N in (2). It seems like this is an exchangeability/de Finetti sort of situation: assumption (2) demands a representation where the distribution over N is like a normal, but this means that $P(E, N)$ must have an appropriate symmetry with respect to N so that different N and same E can be treated similarly a la (3).

I have Gelman and MacKay; I haven’t yet tackled Jaynes. I was thinking more of a book specifically about probability theory in this context, rather than a book about applications. It doesn’t surprise me that you say no such thing exists. MacKay is pretty interesting. I actually bought the paper version even though the PDF was available.

August 28, 2013 · Joseph (author)

Daniel,

There is a huge story to tell here, but this isn’t the place. I’ll just say that you’re not really making the three assumptions you’ve given.

First, you’re not assuming an average energy scale. The $\beta = 1/kT$ is the Lagrange multiplier for the energy. If the energy factors into N non-interacting particles, as in an ideal gas or an ideal gas in a gravitational field, then it makes a certain amount of sense to talk about “average energy carried by each particle” or whatever. If there is a real interaction potential, however, it doesn’t really make sense anymore to assign a portion of energy to each particle separately. But you still have a Lagrange multiplier in this case just the same. It’s not an assumption: it’s there to take care of the energy constraint, whatever form the energy has.

Second, given a microstate $x$ in phase space we can form various functions: energy, number of particles, or whatever else. You can have whatever distribution on n that you want. Call it P(n). If you only care about the first two moments of P(n) (the gaussian case essentially) then maximize the entropy subject to the constraints $E[n] = \bar{n}$ and $E[n^2] - E[n]^2 = \sigma_n^2$. You can add more constraints if you need to get a P(n) which reflects what we know about n.

Third, you’re not ever assuming every state with a given energy is equally likely to occur. You do sometimes get a $P$ from maximizing the entropy which assigns the same probability to any two states that have the same energy, but that doesn’t have to happen. Whether it does depends entirely on what constraints you use in the maxent procedure.

Remember also, the entire goal of the maxent procedure in statmech is to find a bayesian $P(x)$ which describes the location of the microstate $x$ by its high probability manifold.
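Joseph’s second point can be sketched numerically (my own example; the support, moments, and starting guess are invented): solve for the Lagrange multipliers of the maxent distribution subject to mean and variance constraints, and a discretized-Gaussian shape falls out, without ever assuming that shape up front.

```python
import numpy as np
from scipy.optimize import fsolve

# Maxent distribution on n = 0..K subject to E[n] = mu and Var[n] = sig2.
# The solution has exponential-family form p(n) proportional to
# exp(l1*n + l2*n^2), so we just solve for the two Lagrange multipliers.
K, mu, sig2 = 50, 20.0, 9.0
ns = np.arange(K + 1)

def dist(lams):
    l1, l2 = lams
    t = l1 * ns + l2 * ns**2
    w = np.exp(t - t.max())              # subtract max for numerical stability
    return w / w.sum()

def moment_gap(lams):
    q = dist(lams)
    m = (ns * q).sum()
    v = (ns**2 * q).sum() - m**2
    return [m - mu, v - sig2]

lams = fsolve(moment_gap, x0=[1.0, -0.03])
q = dist(lams)
mean = (ns * q).sum()
var = (ns**2 * q).sum() - mean**2

# The result matches a discretized Gaussian with the same moments:
gauss = np.exp(-((ns - mu) ** 2) / (2 * sig2))
gauss = gauss / gauss.sum()
shape_gap = np.max(np.abs(q - gauss))
```

The Gaussian shape was never assumed; it emerges because the first two moment constraints pick out $l_1 = \mu/\sigma^2$, $l_2 = -1/(2\sigma^2)$, and adding other constraints would produce other shapes.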

August 28, 2013 · Daniel Lakeland

Joseph, I’ll put up a follow-up post over at my blog, since we’re off on a tangent here, but it’s a tangent I’m interested in.

I will say a couple things though. From a dimensional analysis perspective kT always has to have units of energy, so it is an “energy scale” even if it’s not an “average”. On the other hand, typically as a Lagrange multiplier it *is* enforcing a first-moment constraint.

I’ll see what I can do about organizing a coherent post on this topic.