## Data Science is inherently limited

A previous post showed how diffuse distributions on a space $W$ can be used to estimate functions $f(x)$ which aren’t sensitive to $x$. This is the essence of statistics, as evidenced by the $f$’s found in coin tossing, election prediction, Statistical Mechanics, and error statistics. But while being able to easily predict $f$ is handy, sometimes we’d rather observe $f$ and learn something about $x$. Unfortunately these goals are in tension, which is why “data science” is inherently limited.

In the coin tossing example, $f$ is the frequency of heads while $x$ is the sequence of $n=500$ flips. So while we can confidently predict $.4<f<.6$ without taking a single measurement, actually observing $f$ to be within this range tells us basically nothing.

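To see this, consider the size of the spaces involved. Using logs for convenience, compare $\ln|W|$ for the full space of sequences with $\ln|W'|$ for the set $W'$ of sequences satisfying $.4<f<.6$: the two are nearly equal. The subspace $W'$ is very nearly all of $W$.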
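The size comparison can be checked by direct counting. The sketch below assumes $n = 500$ flips (a figure taken from the discussion of "the next 500 flips", not stated with the original numbers) and uses `size_W` and `size_W_prime` as illustrative names for $|W|$ and $|W'|$.

```python
import math

n = 500  # assumed number of flips (an assumption, see the lead-in)

# Total number of heads/tails sequences of length n.
size_W = 2 ** n

# Sequences whose frequency of heads f = k/n satisfies .4 < f < .6.
# Integer arithmetic avoids floating-point trouble at the boundaries.
size_W_prime = sum(
    math.comb(n, k) for k in range(n + 1) if 4 * n < 10 * k < 6 * n
)

ln_W = math.log(size_W)
ln_W_prime = math.log(size_W_prime)

print(f"ln|W|  = {ln_W:.6f}")
print(f"ln|W'| = {ln_W_prime:.6f}")
print(f"entropy reduction from learning .4<f<.6: {ln_W - ln_W_prime:.2e} nats")
```

Under that assumption the two logarithms should come out nearly identical, which is the sense in which observing $.4<f<.6$ removes almost none of the uncertainty about $x$.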

It’s for this reason that believing “ proves something about a completely fictitious frequency distribution on ” makes about as much sense as saying:

> According to the theory of magic fairies, the outcome of a dice throw should be either a 1, 2, 3, 4, 5, or 6. Since I actually got a 5 when I rolled the dice, the prediction is confirmed and magic fairies exist. They’ve been objectively verified!

Fundamentally then the following three observations are related:

Many things work this way. Gases diffuse because almost any microscopic happenings lead to diffusion. This fact allowed physicists to understand diffusion before they understood the atomic realm, but it also means observing diffusion provides no hint of Quantum Mechanics.

Power laws are a similar phenomenon. Under mild conditions almost anything that could happen will appear to satisfy a power law. Actually observing a power law tells you almost nothing about the underlying physical reality. Mandelbrot-style efforts to unlock the secrets of the universe by finding power laws everywhere couldn’t be more misguided. They’re more akin to numerology than science. (The 1/f noise described here as almost mystical is a similar example. The giveaway is that this noise occurs in many physically different systems.)

Having said all that, you do actually learn a little something from observing a power law: namely that those “mild conditions” are satisfied. Jaynes famously exploited this to extract information from frequencies obtained from throwing a dice 20,000 times. His procedure was to compute a theoretical entropy based on various physical effects and to keep adding effects until the theoretical entropy dropped down to the empirical entropy.

But once the theoretical entropy matched the empirical entropy, Jaynes had to stop. The empirical entropy serves as a kind of barrier to learning because of the Entropy Concentration Theorem. Basically, almost anything that could have been true at that point leads to frequencies indistinguishable from the ones observed. Once anything that can happen gives the same result, the result tells you nothing.
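A rough numerical illustration of entropy concentration follows. It relies on the asymptotic form of Jaynes’s result as usually stated: for $N$ trials with $k$ outcomes and no further constraints, $2N(H_{\max} - H)$ is approximately chi-square distributed with $k-1$ degrees of freedom. The rolls are simulated from a fair die (not Jaynes’s data), and `n_rolls`, `n_trials`, and the seed are arbitrary choices made here.

```python
import math
import random
from collections import Counter

def entropy(counts, total):
    """Shannon entropy (in nats) of the empirical frequencies."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

rng = random.Random(0)      # fixed seed so the sketch is repeatable
faces = 6
n_rolls = 2000              # hypothetical; far fewer than the 20,000 in the post
n_trials = 300
h_max = math.log(faces)     # entropy of the uniform distribution

# Upper 5% point of chi-square with faces - 1 = 5 degrees of freedom.
CHI2_5DOF_95 = 11.0705

inside = 0
for _ in range(n_trials):
    counts = Counter(rng.choices(range(faces), k=n_rolls))
    deficit = h_max - entropy(counts.values(), n_rolls)
    if 2 * n_rolls * deficit <= CHI2_5DOF_95:
        inside += 1

fraction_inside = inside / n_trials
print(f"entropy window width: {CHI2_5DOF_95 / (2 * n_rolls):.5f} nats")
print(f"fraction of simulated outcomes inside it: {fraction_inside:.3f}")
```

The window shrinks like $1/N$, so at 20,000 rolls nearly every possible set of frequencies sits within a sliver of the maximum entropy. That is why frequencies stop discriminating between explanations once the entropies match.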

In the end Jaynes determined which pair of faces on the dice were cut last in the manufacturing process. It’s a justly famous inference, and if this is the one thing you need to know to make millions, then you’re in business. If you need to know anything else about the special circumstances, or laws of nature, which nudged each roll in one direction or another, then you’re out of luck. You can never get it from those frequencies.

What this means for science is that those fields dependent on extracting patterns from frequency-type data, which is the majority of statistical applications today, have a long-term problem. Once they’ve harvested what little information is available, like Jaynes did with the dice, they’ve got to either (1) find a better way of doing science, (2) find novel data, or (3) stagnate. Evidence suggests they mostly do (3), occasionally do (2), and rarely do (1).

Finance is a prime example. What meager information about the universe there was in the histograms and correlations of price movements was learned long ago, and despite innumerable man-hours, computer time, and ever more sophisticated statistical methods, the field hasn’t budged in predictive ability since. You might even say it’s gone backwards, since the foundational theories taught since the 1970s are now admitted to be false.

Such is the nature of Data Science. Fortunately it wasn’t around in Newton’s time or physicists would still be baffled by coin flipping.

July 25, 2013 • konrad

“we can confidently predict .4<f<.6 without taking a single measurement”

Wait – exactly which experiment do you have in mind here, and are you suggesting it is representative of binary sequence data in general? I can easily think of a coin-tossing experiment that reliably yields f=1. And a great many repeatable binary experiments of practical interest (e.g. copying many bits of data onto a hard drive: record 0 for correctly copied bits and 1 for flipped bits) yield f outside of your range.

July 25, 2013 • Daniel Lakeland

Konrad, he’s specifically talking about flipping real, two-sided coins in a highly energetic fashion. In other words, because of the symmetry of the coin, f ~ 0.5 is pretty much guaranteed. There are plenty of binary-outcome experiments that don’t have f ~ 0.5, but even in those cases many of them have f always close to some other value. In your data-copying example, the failure rate for a brand-new hard drive is probably pretty close to 0 after all the error-correcting encoding that manufacturers use. Sure, after a few years of usage the drive may degrade a lot, but in any case if you observe f ~ 0 in the first month of usage, you won’t learn a lot about f after 5 years, because all the drives have f ~ 0 in the first month.

July 25, 2013 • Joseph (author)

Konrad, just your typical stat classroom demonstration involving coin flipping and observing the number of heads as the number of flips increases (either using a single coin, or everyone in the class using a separate coin and then aggregating the result).

One thing these demonstrations always have in common is that they never involve taking any kind of physical measurements at all.

July 25, 2013 • konrad

But in the earlier post on noninformative priors, Joseph was arguing (I thought) that the symmetry of the coin is irrelevant. (And Jaynes described skillful coin tossers who can guarantee f != .5 with energetic tossing of real, two-sided coins.)

The whole point of measuring f in the first place is that one assigns significant weight to the possibility that f != .5, (i.e. high prior probability that the distribution on x is _vastly_ non-uniform). Otherwise, the measurements will indeed be uninformative, and who would bother to collect them in the first place?

July 25, 2013 • konrad

Ah, but you haven’t seen the coin-tossing demo I used to do in _my_ class.

I would toss the coin myself, and write H on the board regardless of the way it landed…when the students started to suspect something is wrong, I would challenge them to explain their reasoning.

July 25, 2013 • Joseph (author)

Also, I might add that in most binary experiments involving real physical systems it’s probably the case that the outcomes are confined to some much smaller subspace due to some additional physical constraints/laws of physics/ or whatever.

In practice we can’t detect this because typically , which usually happens since , and so our predictions from will still be correct. In principle, if we repeated this enough we’d see that occurs more than it should, but n doesn’t have to be very large before this becomes unfeasible.

Sometimes, however, that smaller subspace overlaps substantially with the part of $W$ which isn’t in $W'$. This we can detect experimentally because $f$ will be unexpectedly close to 0 or 1.

It’s another way of looking at how prevents us from learning about things like . Those constraints/laws of physics/ or whatever are hidden from us unless we get really lucky (or use some completely different method).

July 26, 2013 • konrad

Ok, let’s see if I understand. Seems to me that:

1) Your W’ is essentially the “typical set” ( http://en.wikipedia.org/wiki/Typical_set )

2) You are deliberately avoiding use of the bias of the coin (let’s call it p) as a parameter.

My issue is that the typical set is a function of p: without inferring p, how can one know ? Or if we don’t know , how can we know whether in a given case of interest?

You also say: “Sometimes however that smaller subspace overlaps substantially with the part of $W$ which isn’t in $W'$. This we can detect experimentally because f will be unexpectedly close to 0 or 1.” But this is the point I was trying to make, which seems to me to contradict your claim that observing $.4<f<.6$ is uninformative. Seems to me that you need to weaken the claim to “$.4<f<.6$ is uninformative in those cases where we already know that f is not unexpectedly close to 0 or 1”, which would just be a tautology.

ps. How does one typeset equations in these comments?

July 26, 2013 • Joseph (author)

I’ll take the points one at a time in a series of comments. To typeset equations, just put [latexpage] on the first line of the comment. I added it to your last comment.

Saying $.4<f<.6$ is uninformative is a slight exaggeration intended for dramatic effect. The point was to drive home the fact that these outcomes occur because almost any possibility leads to them, and not because of some mysterious physical property called “randomness”.

You do learn a little. How much can be quantified by looking at how much it reduces the entropy. Using the numbers in the post: . Whether that little bit learned is valuable to you or not depends on what you’re doing of course.

July 26, 2013 • Joseph (author)

“You are deliberately avoiding use of the bias of the coin (let’s call it p) as a parameter.”

Most people think “bias of a coin” is a statement about the Inertia Tensor of the coin. I avoided talking about it because it’s well known that the “bias of the coin” so interpreted is unrelated to the frequency $f$, let alone whatever $p$ is supposed to represent.

But there is a bigger problem with introducing , which cuts to the core of the Bayesian/Frequentist divide. Implicitly you’re introducing the following distribution on the space

where . But this distribution is actually a large entropy/low information special case (unless is very close to 0 or 1 of course). Frequentist intuition is so strong that people are loath to imagine anything else. In this case the marginal distributions are all identical and satisfy for any in the high probability region of

But I’m a fully fledged Bayesian and can easily imagine a which shrinks (i.e. has lower entropy) around the true sequence that will be observed in the next 500 flips. Call it . I can even imagine it shrinking so much that eventually you get which has zero entropy.

This improved distribution, which will be considerably more useful and accurate for predicting anything you want about the next 500 flips, not only doesn’t have identical marginal distributions, but what’s worse is that its marginals no longer match $f$. It’s not even approximately true, since the marginals will all get closer to 0 or 1 while $f$ is typically near .5!

And as a Physicist, I can easily imagine collecting the kind of physical measurements needed to make this improved model/distribution a reality.
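The contrast drawn in this comment can be written out explicitly. The following is a sketch that assumes the implicit distribution is the standard IID Bernoulli model with $x_i \in \{0,1\}$ (1 for heads). That form, and the symbols $P_{\mathrm{IID}}$, $P_{\delta}$ and $x^*$, are notation introduced here for illustration and may differ from the original formulas.

$$
P_{\mathrm{IID}}(x_1,\dots,x_n \mid p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i},
\qquad
H[P_{\mathrm{IID}}] = -n\,\bigl[p\ln p + (1-p)\ln(1-p)\bigr]
$$

Every marginal is the same, $P_{\mathrm{IID}}(x_i = 1) = p$, and for $p$ near .5 the entropy is close to its maximum $n\ln 2$. At the other extreme, a distribution fully concentrated on the sequence $x^*$ that will actually occur is

$$
P_{\delta}(x) =
\begin{cases}
1 & x = x^* \\
0 & \text{otherwise}
\end{cases},
\qquad
H[P_{\delta}] = 0,
\qquad
P_{\delta}(x_i = 1) = x_i^* \in \{0,1\}
$$

so each marginal is exactly 0 or 1, even though the frequency $f(x^*) = \tfrac{1}{n}\sum_i x_i^*$ is typically near .5.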

July 26, 2013 • Joseph (author)

“My issue is that the typical set is a function of p: without inferring p, how can one know ? Or if we don’t know , how can we know whether in a given case of interest?”

There’s so much to say here, but let me confine my remarks to analyzing what happens in a typical case.

Suppose that because of physical constraints/laws of physics/cheating or whatever the outcomes are all confined to a small subset . And in typical fashion it happens that . What happens when a statistician goes to analyze the data from this?

Well first they observe a sequence and get an . Then they use this as an estimator for . Since then it will be the case that . So the model you get from using will have a high probability manifold such that .

So when the statistician goes to predict the next using , they will predict . And since that prediction will turn out to be a good one.

In other words, the usual statistics procedure for analyzing this type of experiment will seem to work in most cases even if all the sequences are secretly coming from a tiny subset of the possibilities!

Or stated another way, the success of this statistical procedure tells you almost nothing about how big the subset actually is unless the $f$’s are near 0 or 1.
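The scenario in this comment can be sketched in a few lines. Everything specific here is an assumption made for illustration: $n = 500$ flips, a hidden subset of 8 sequences, members of that subset generated by a seeded random generator as a stand-in for "whatever constraints" fix the outcomes, and a deliberately wide 4-sigma prediction interval.

```python
import math
import random

n = 500            # flips per sequence (assumed)
subset_size = 8    # hypothetical: every outcome secretly comes from these 8 sequences

# Build the hidden subset once. Picking its members at random is itself an
# assumption: it stands in for whatever mechanism restricts the outcomes.
rng = random.Random(1)
hidden_subset = [[rng.randint(0, 1) for _ in range(n)] for _ in range(subset_size)]

def freq(seq):
    """Frequency of heads (1s) in a sequence."""
    return sum(seq) / len(seq)

hits = 0
pairs = 0
for i, observed in enumerate(hidden_subset):
    p_hat = freq(observed)  # the statistician's estimate of p
    # Wide 4-sigma interval for the next frequency under the IID model;
    # the factor 2 covers sampling noise in both p_hat and the next f.
    half_width = 4 * math.sqrt(2 * p_hat * (1 - p_hat) / n)
    for j, upcoming in enumerate(hidden_subset):
        if i == j:
            continue
        pairs += 1
        if abs(freq(upcoming) - p_hat) <= half_width:
            hits += 1

hit_rate = hits / pairs
print(f"possible sequences: 2^{n}, hidden subset size: {subset_size}")
print(f"IID prediction hit rate: {hit_rate:.2f}")
```

The IID-based predictions come out well even though only 8 of the $2^{500}$ sequences can ever occur, so their success says next to nothing about the size of the hidden subset.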

July 27, 2013 • konrad

Thanks, I think I understand your point now.

First, in response to the second of these comments: Above it seemed like you were using “coin-tossing experiment” in a very narrow sense, but now it’s clear that you are generalising to cases where the IID assumptions do not apply. I think we’re in agreement that IID is a problematic concept, but I wasn’t throwing it out completely – I think in pretty much all applications which are described as coin-tossing experiments one does want to make an IID or similar assumption (what I called repeatability in a previous thread). You are allowing (where includes all available prior information about the initial conditions of the tosses) to be a function of , which is definitely not the classical coin-tossing setup.

In response to the third comment: I initially thought that is a necessary part of your argument, but now I see it is not – I would prefer to state it with .5 replaced by .3 throughout, to illustrate that we _can_ obtain an estimate of that will be useful for prediction (though this breaks down in your more general setting where there is no sense in which the different tosses are repetitions of the same experiment, so pooling of information is no longer possible). I see your main point as being that we learn almost nothing about , which still holds when .

July 27, 2013 • Joseph (author)

Konrad,

My next post is going to be “IID doesn’t mean what you think it means”. The phrase “IID assumptions do not apply” is misleading. It hints at the notion that these assumptions are in some sense physical. For example, maybe “independent” means “causally independent”, or at least “predictively irrelevant” or something.

What’s really happening is that IID is a way of making the high probability manifold large (see the post on “The Law of Trading Edges”). It’s important to make large because that creates greater opportunity for .

As long as that happens you won’t be misled. Your point estimates may not be accurate, but their associated uncertainties will be so large that your interval estimates will still contain the true value.

For this reason, IID assumptions are going to work in many situations where Frequentist intuition would lead you to think they wouldn’t. The example I gave a couple of comments ago illustrates this. In that example the IID assumption implies that but even if , everything will still be fine.

July 28, 2013 • konrad

I do think that the term IID is predominantly used to refer to a physical assumption (or at least an assumption about frequency distributions rather than probability distributions), so I hope you will use a different term to disambiguate. (In an information-based framework it doesn’t make sense to say “identically distributed” because whether this is true depends critically on what information you condition on.) But information-based versions of the IID assumption are also invalidated by your setup.

Looking forward to your next post…

August 1, 2013 • Brendon J. Brewer

IID means that the joint distribution can be written $p(x_1, x_2, \ldots, x_n \mid I) = \prod_{i=1}^{n} p(x_i \mid I)$

where $p(x_i \mid I)$ is the same for all $i$. IID is a property of probability distributions, and Bayesians certainly use IID a lot, but it is an assumption about the prior beliefs, not about anything physical.

August 1, 2013 • Joseph (author)

Brendon,

No doubt. But that’s only part of the story.