## All Probabilities are for One-off Events

Most statisticians think of sampling distributions as a kind of physical model for an infinite sequence of events. They view priors, on the other hand, as something different, since priors can be assigned to one-off hypotheses like “Republicans will win the 2016 election” or “A meteor wiped out the dinosaurs”. This conventional wisdom gets it wrong, however, since all distributions describe the probability of one-off events.

We always wish to know some x_true, which is a fact about our universe, unique to a specific time and place, never to be repeated. Our model P(x|K) and the size of its high probability region are really a reflection of how well we can pin down x_true.

To make money trading stocks in September 2013 for example, you don’t need a model describing the infinite sequence of more or less “random” stock prices. You just need to know the price of stocks in September 2013. The better you know them (i.e. the smaller the high probability region for that month) the more money you’ll make.

If you’re sailing a boat across the Atlantic from 1 Aug 2013 to 10 Aug 2013, you don’t need to model the “random variable” that is the weather as though you were describing the abstract sequence of weather data, on an abstract planet, over an arbitrary time period. You just need to know the weather in the Atlantic from 1 Aug 2013 to 10 Aug 2013, and the less uncertainty you have the safer you’ll be.

If you’re trying to measure the length of a table and obtained data using your laboratory’s ruler on 1 August 2013, then you don’t need to model the propensity of the ruler to generate errors out to infinity. You just need to know the actual numbers contained in the data you actually collected. Knowing even one of those numbers well tells you the length to a high degree of accuracy, making all other knowledge about the ruler and its errors completely irrelevant.

If you’re trying to predict the outcome of the 2012 election, you don’t need a model of how many times Obama would have won in multiple copies of our Universe. You just need to know the actual vote x_true cast on 6 November 2012 among the space of all possible votes. The more you can use polls or other information to shrink the high probability region around the true x_true, the better you’ll be at predicting the outcome.

If I wish to predict the percentage of heads in the 100 coin flips I’m about to make at noon on 8 Aug 2013, then I use a distribution P(x|K) to describe what I know about the future outcome x. Using P(x|K) to predict x is no different in principle than making inferences using Pr(“A meteor wiped out the dinosaurs”). Whether that distribution could be used to predict other x’s observed on a different day is beside the point. The outcome I’m trying to predict is not one element of some mystical “population”. It’s a physical fact, like the current temperature of the room I’m sitting in or whether the dinosaurs were wiped out by a meteor.
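To make the coin example concrete, here is a minimal sketch (my own construction, not from the post) of one way to encode P(x|K) for the 100 flips as a Binomial(100, 1/2) distribution, and then extract its smallest 95% high probability region:

```python
from math import comb

# A sketch of P(x|K): what I know about the 100 flips, encoded as a
# Binomial(100, 1/2) distribution over the number of heads.
n, p = 100, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

# Smallest set of outcomes carrying at least 95% of the probability:
# the "high probability region" for this one-off set of flips.
region, mass = [], 0.0
for k in sorted(pmf, key=pmf.get, reverse=True):
    region.append(k)
    mass += pmf[k]
    if mass >= 0.95:
        break
print(min(region), max(region))  # roughly 40 to 60 heads
```

Nothing here refers to repeated days of flipping; the distribution just locates the one outcome I am about to observe.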

In only one instance can the “random variables” mythology be maintained. If we’re concerned about two sequences x_1 and x_2 which are described using P_1(x_1|K_1) and P_2(x_2|K_2), then sometimes it makes sense to use a common distribution P(x). Such models allow us to imagine that x_1 and x_2 might have been “drawn” from the “population” P(x). These models tend to be low information/high entropy special cases, since the high probability region of P(x) must be big enough to include both x_1 and x_2.

This is just a special case though. The great sin of Frequentist statistics is to force-fit all of statistics into this example. But even here, we really are only interested in x_1 and x_2, each viewed as one-off events. And if there’s any asymmetry in our information about them, we can get better results using separate models, thereby breaking the “random variables” illusion.

In reality, all probabilities have the status of those devilish priors which Frequentists want excised from statistics. All probabilities are for one-off events.

UPDATE: I changed “singular” to “one-off” based on a suggestion by Corey since it really captures what I meant better.

**Corey** (August 8, 2013)

(Nitpick: since “singular” has a math jargon meaning at odds with its usage here, I’d have preferred “one-off” or something similar.)

I affirm all of this. Probability is always logically prior to empirical frequency whenever the latter concept makes sense. In such cases, it is always possible and often sensible to take a global “one-off event” perspective of the collection of “random events”.

Jaynes touches on this issue in the PTLOS chapter on coding theory in which he discusses its applicability to communication channels that will only ever be used for one message.

**Joseph** (August 8, 2013)

Corey,

I like “one-off” much much better, so I made the change throughout.

**Brendon J. Brewer** (August 8, 2013)

YES!

There’s no such thing as a repeated event anyway. Flip a coin ten times, every flip was different.

**konrad** (August 8, 2013)

Hmmm, I agree with most of the specifics, but am I the only one here who disagrees with the general direction?

IMO virtually every interesting problem involving reasoning from incomplete information falls in the “special case” described at the end. Questions relating to pooling of information (Under which circumstances can we pool information from different observations? How should we do this? What is the underlying justification?) lie at the heart of everything. General distrust of pooling any information may be a safe option, and may get us further than many people expect, but surely it must be possible to get still further in many cases where pooling is justified? Are you denying the utility of hierarchical models?

E.g.: “you don’t need to model the “random variable” that is the weather as though you were describing the abstract sequence of weather data” – the question is not whether we _need_ to model the weather this way, it is whether making the assumptions required to model it this way can give us something that the simpler assumption-free approach does not. Sometimes stronger assumptions give us more.

And I think the key to understanding exactly which assumptions induce typical information-pooling methodologies and when they are justified must lie in a PTLOS-style approach.

**Joseph** (August 8, 2013)

That’s a good point which is worth looking at closer.

“IMO virtually every interesting problem involving reasoning from incomplete information falls in the ‘special case’”

I understand why you say this, but remember that since Frequentists try to make every application of statistics look like that special case, it’s easy to get this impression. Two quick points and then a longer one at the end. First, as pointed out in the post, the “special case” represents a low information solution, which makes it easy to get, but also means the solution is nowhere near as useful as people might like to think. The most useful applications of statistics do not involve this special case!

Second, the requirement to model one-off events while retaining the Frequentist intuition developed in that special case seriously warps parts of statistics. Time Series analysis is a prime example. In Time Series analysis you’re trying to model things like stock prices next month, but you have to do it in a way that can be justified using the Frequentist intuition developed from that special case. The net result of these conflicting goals is a bunch of statistical tools which are both more difficult than they need to be and far less general and useful than they could be. That’s the price that’s paid for not understanding the basic problem in a correct Bayesian way, as one of modeling one-off events.

Finally, the hierarchical/multilevel modeling stuff makes far more sense from the above Bayesian viewpoint and is often incomprehensible from a Frequentist perspective, which is why the field is the province of objective Bayesians like Gelman. But without going into that, let me make a general observation.

Recall that the goal of modeling one-off events is to find a distribution P(x|K) which contains the true value x_true in its high probability region and, ideally, this region is as small as possible. There are potentially an infinite number of ways to achieve this goal, most of which probably haven’t been discovered yet. Our strategies for doing so are only limited by our ingenuity and the variety of problems we might work on.

One general strategy would be to enlarge the high probability manifold of P(x|K) as much as possible, so that x_true is bound to be in it. This is the basis of the successful Maximum Entropy method, for example.

Another strategy is the following: collect a bunch of past values of x and see what region W they lie in. From that, construct a distribution whose high probability manifold equals W. Then if future values are similar to past values, you’ll find that x_true lies in the high probability manifold of the distribution as well. In reality, the assumption about the present being similar to the past isn’t true anywhere near as often as people assume, but if it holds then this is a reasonable strategy.

A fair amount of “data collection + model building + prediction/inference” is really an example of this kind of strategy.
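As a rough illustration of this strategy (all numbers here are invented for the sketch), one can build an empirical region W from past observations and then check how often future values land in it, both when the similarity assumption holds and when it fails:

```python
import random

random.seed(1)

# Invented example data: 200 past values of x.
past = sorted(random.gauss(50, 5) for _ in range(200))
lo, hi = past[5], past[-6]   # trim ~2.5% per tail: an empirical region W

# If the future resembles the past, new x's keep landing in W...
future = [random.gauss(50, 5) for _ in range(1000)]
cov_similar = sum(lo <= x <= hi for x in future) / len(future)

# ...but if the world shifts, the strategy silently fails.
shifted = [random.gauss(80, 5) for _ in range(1000)]
cov_shifted = sum(lo <= x <= hi for x in shifted) / len(shifted)

print(cov_similar, cov_shifted)  # high coverage vs. almost none
```

The second print is the point: nothing in the construction of W warns you when the past-resembles-future assumption stops holding.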

**konrad** (August 9, 2013)

One reason I think the “special case” is important is this:

You seem to like examples with 10 measurements of the same quantity. But it seems to me that few applications have such a luxury – e.g. systems change over time (with dynamics that are not known completely) and measurement takes time, so in practice we are more likely to have 10 measurements of 10 related but different quantities. So yes, if we ignore the fact that the different measured quantities are somehow related we are indeed in a low information setting (simply due to the lack of data per variable). You only got to a high-information setting in the first place by assuming your 10 measurements describe the same quantity – this is a pooling assumption of the same type as the past-future similarity assumption you discuss. In practice we may need to relax the complete pooling assumption to partial pooling, without going all the way to no pooling.

**Armchair Guy** (August 16, 2013)

I’m trying to understand this post. It seems to be saying the goal of inference should be to quantify certainty about a fixed random variable using a probability distribution. The distribution quantifies uncertainty about the measurements we have. Is this interpretation correct?

If so, how would the quality of such an inference be judged in principle? If 5 analysts came up with 5 different “manifolds”, what determines whether one is better than another?

**Daniel Lakeland** (August 16, 2013)

The goodness of a model should be judged relative to the purpose. Utility theory and such give us a way to compare how good models are. For example a finance model should be judged by the money it makes or saves. A model for mechanical failures by some mixture of how much it costs to implement and how many injuries it prevents etc.

**Armchair Guy** (August 16, 2013)

Daniel,

I agree, but that’s precisely my question: if we have only a one-off measurement, how can we evaluate any of those statements in principle? E.g., if you are building a model for last month’s daily stock prices and can only use it to make confidence statements about those prices, what does it mean to ask how much money the model makes or saves?

**Joseph** (August 16, 2013)

Armchair Guy,

“I agree, but that’s precisely my question: if we have only a one-off measurement, how can we evaluate any of those statements in principle?”

Easy in principle. The distribution P(x|K) is to be judged by what it implies about x_true. Here x_true is a fixed quantity or parameter or a one-off event. The distribution defines the location of this x_true. Intuitively, P(x|K) says that x_true is in the high probability region W.

So for example, if x_true = 100, then both N(100, 10^2) and N(150, 200^2) would be good, but a distribution like N(150, 10^2), whose high probability region misses 100, would not.

It’s ok that there can be more than one good distribution. N(100, 10^2) is just more informative about x_true than N(150, 200^2), that’s all. This is a feature we need, since some models are built on more information than others. In truth, there are some states of knowledge K1 and K2 such that P(x|K1) ~ N(100, 10^2) and P(x|K2) ~ N(150, 200^2).

Greater knowledge about x_true allows us to specify it with less uncertainty.
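This criterion can be sketched in a few lines (the encoding is mine; I use mu ± 2·sigma as a stand-in for the high probability region of N(mu, sigma^2)):

```python
# Judge a model N(mu, sigma^2) by whether it places x_true inside its
# high probability region, taken here as mu +/- 2*sigma.
def locates(x_true, mu, sigma):
    return abs(x_true - mu) <= 2 * sigma

x_true = 100
good_tight = locates(x_true, 100, 10)    # informative and correct
good_vague = locates(x_true, 150, 200)   # correct but uninformative
bad = locates(x_true, 101, 0.01)         # precise but wrong
print(good_tight, good_vague, bad)  # True True False
```

Both "good" models pass; they differ only in how much they narrow down x_true.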

**Armchair Guy** (August 16, 2013)

Joseph,

Thanks. I agree both N(100, 10^2) and N(150, 200^2) are good distributions. But it appears you would only know that if you knew x_true = 100. For example, if an analyst claimed that P(x | K1, K2) ~ N(101, 0.0001) was a great model he built using a special secret inference method, we might be suspicious but it isn’t inconsistent with the other two models. In fact, if we know x_true = 100, we know it’s a bad model. Is there any procedure to discover whether it’s a good model (given we don’t know x_true and the analyst refuses to reveal the secret inference method)?

**Joseph** (August 16, 2013)

My comment merely explained what the goal is. How it’s achieved is up to you. There are actually quite a few ways it can be achieved, and you can probably think of many more. Here are a few examples:

1. Observe where x’s occurred in the past and hope they occur in the same location in the future. Most classical modeling follows this route.

2. Know bounds for x_true so that you can define its location within a certain region.

3. Expand the high probability region W so much it can’t help but contain x_true. This is the essence of the maxent approach.

4. Observe some function f(x_true). It may not tell you x_true exactly, but you’ll learn enough about it to say something.

5. Some systems can be related to a deeper state space. Thus x = f(z), where z is usually an element of some much bigger microstate space. Knowledge of z can be exploited to find a suitable P(x|K).

Really the possibilities are only limited by your ingenuity. It’s up to you to take a real state of knowledge K about x_true and convert it into a P(x|K) in such a way that x_true lies in the high probability region.

If you have no knowledge, then spread P(x|K) out so much that the entire space is equal to the high probability manifold.

**Daniel Lakeland** (August 16, 2013)

I think Armchair Guy is asking a slightly different question: not “how do we build models where x_true is in the high probability manifold?”, but rather, how do we build models such that when we use them in the future, the new x_true will still be in the high probability manifold, *and* given that we’ll never know x_true either in the past or the future, how can we determine whether it really *is* in the high probability region? In other words, how do we perform model comparison, especially if we consider our model as only relevant to the data we already had.

Essentially, he’s asking about both “goodness of fit” and “future predictive accuracy”. I’d say that there are several known techniques, including things like collecting extra data, fitting your model to a subset, and then testing the model on the held-out set. This works when you have a fairly large amount of data. There is also fitting your model to past data and then observing its performance on future data; finance models, weather models, and pretty much any predictive time series models have this quality.

**Armchair Guy** (August 16, 2013)

I think we’re discussing slightly different things although there is overlap. There’s:

A. The goal,

B. How we attempt to achieve the goal (building model and estimator/manifold), and

C. Assessing how well an estimator/manifold achieves the goal (i.e. measuring the performance of the estimator/manifold)

I think you’re talking about #A and #B, and I’m talking about coming up with a procedure that can do #C.

My question is, if various pundits claimed to use their ingenuity to come up with various estimates/manifolds, how could you choose between them? Or if a pundit gave you a really bad manifold, how would you know it’s bad?

Daniel said, “For example a finance model should be judged by the money it makes or saves.”

Which I agree with, except in this framework you just have the measurements on which you built your model and no more, so you don’t have any additional data to test whether the model makes money or not.

**Armchair Guy** (August 16, 2013)

Daniel,

Thanks, that is indeed what I was asking. It seemed to me that what was being suggested in the original post is that the goal of inference is restricted to making statements about observed data.

Part of what I’m confused about is, since “all distributions describe the probability of one-off events”, does the concept of “future predictive accuracy” make sense?

If we think of our model as describing only the probability of one-off events, how can it be relevant to any extra data collected without exactly the types of assumptions we are trying to avoid (relationship between past observations and future observations)?

**Joseph** (August 16, 2013)

Armchair Guy,

Not sure I understand the confusion. If I have a (posterior) distribution which says “Apple stock will close on 16 Aug 2013 between $450 and $500”, then you check that prediction to see whether it came true.

For next Monday you predict “Apple stock will close on 19 Aug 2013 between $470 and $490”. Again you just check to see whether that prediction is true.

It’s possible though that you might want to reuse your old prediction again for next Monday: “Apple stock will close on 19 Aug 2013 between $450 and $500”. No problem, just check to see if it came true.

Either way your profit and loss depends on your ability to make these predictions for individual days. In no way does it depend on the histogram of Apple stock prices.

Let’s say though, for the sake of argument, that you do care about the frequency histogram of stock prices. You might make the prediction “The histogram of Apple’s stock prices for the month of September 2013 has a shape similar to a Cauchy pdf, to within a given approximation”.

Again, just wait until September is over and verify if the prediction is good or not. Models are always and forever judged based on the truth or falsity of what they imply about the real world.
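One hedged sketch of such an after-the-fact check (the data here are simulated stand-ins, not real prices, and the bins and tolerance are my choices): compare observed bin frequencies against the frequencies a standard Cauchy pdf predicts:

```python
import math
import random

random.seed(3)

def cauchy_pdf(x, x0=0.0, gamma=1.0):
    return 1.0 / (math.pi * gamma * (1.0 + ((x - x0) / gamma) ** 2))

# Stand-in for the month's observed data: samples from a standard Cauchy
# via the inverse-CDF method. A real check would use the actual prices.
obs = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(5000)]

# Compare observed bin frequencies to the predicted Cauchy frequencies.
max_err = 0.0
for lo, hi in [(-2, -1), (-1, 0), (0, 1), (1, 2)]:
    observed = sum(lo < x <= hi for x in obs) / len(obs)
    predicted = cauchy_pdf((lo + hi) / 2) * (hi - lo)  # midpoint rule
    max_err = max(max_err, abs(observed - predicted))

print(max_err < 0.05)  # "shape similar to a Cauchy, within tolerance"
```

The prediction about the histogram is itself a one-off claim about a particular month, verified or refuted once that month's data are in.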

**Daniel Lakeland** (August 16, 2013)

Joseph: I think he means that suppose you have some family of models parameterized by some parameters, and you require that only parameters which put the actual data from this month into the high probability region of the predictions are allowed. You do bayesian inference, and you get a pdf over the parameters… all well and good.

Now, does your model apply to next month? Your answer could be something like: pick some high probability values of the parameters, run the model forward to make predictions, and wait to see if they come “true” (i.e. data within the high probability region of the predictions).

I think Armchair’s point is that this implicitly means that we assume the future will be well predicted by a model which was fit to one-off data from the past. I certainly agree that this is true, but I don’t think it necessarily means we need a “repeated sampling” interpretation.

I don’t see any reason to believe that our views on the meaning of Bayesian PDFs will change the fundamental “Problem of Induction” as the philosopher Hume stated it. When we build a model, we almost always want it to generalize to new conditions, but there is nothing *logically* to stop the universe from suddenly obeying entirely different laws of physics as of tomorrow morning. The fact that it is possible to approximately predict things using models seems to be a happy accident. But in general, models for a sequence of measurements need not be anything like repeated samples from a single distribution.

The point as I see it of “probabilities on one-off events” is that when building models, we are free to ignore the potential correspondence between the probability over a given measurement and the long-term frequency histogram over similar measurements (or we could potentially build our model that way also). It doesn’t mean we aren’t *allowed* to run our model and get new probability distributions for predictions of as-yet-unobserved events. But it *does* mean that we need not assume those distributions will be the same ones as for past events.

**Armchair Guy** (August 16, 2013)

Daniel,

Thanks again. That is what I was getting at.

Regarding your last paragraph: based on the discussion so far, it seems the modeler has two options: 1) assume future observations have the same distribution as past ones (or depend on the past in some known way), or 2) assume the past and future are unconnected.

Unless I’m still misunderstanding the framework, the first option seems to be what we’re trying to avoid. If so, we’re left with the second option. Under that option, I still don’t see how you could compare different estimators/manifolds. You could apply them to future observations, but the results wouldn’t mean anything — what’s the point of doing so if the past tells you nothing about the future? How would it tell you anything about the quality of the estimator?

**Daniel Lakeland** (August 16, 2013)

Armchair: I think there’s a third option: assume that we can predict the future in *some* way from the information we get from past observations, make those predictions, give those predictions some distribution which expresses your uncertainty (but not necessarily the same one that the past observations had, i.e. not repeated sampling from a constant distribution), and then determine whether your distribution over predictions is validated by the actual data that happens in the future.

To put it another way, a “deterministic” modeler would like to predict some measurement from some knowledge, maybe it’s “amount of rain today given a bunch of wind speed, direction, humidity and barometer readings”. A Bayesian modeler would like to predict the same thing, but acknowledges that he can’t predict a *single* value; instead he must predict a distribution over values. So he builds a model which predicts distributions in some way. He calibrates the model by predicting values that already happened and comparing them to data. Then he uses the calibrated model to predict the future, and if his actual readings come out in the high probability region of his predictions, he calls himself successful.

(I realize now that this is sort of sexist wording, but oh well, it’s a blog post and I have diapers to change, so I’m not going to rewrite it.)

**Joseph** (August 17, 2013)

” 1) assume future observations have the same distribution as past ones (or depend on the past in some known way), or 2) assume the past and future are unconnected.”

Armchair, that’s not what I’m saying at all. The point was to clarify the nature of the probabilities we use. This was done as a lead-in to the following two posts, which define what our goal in creating probability distributions is, and then how we might achieve that goal.

If the best you can do in achieving those goals is to use the same distribution for everything, or treat the past as unconnected to the future then that’s fine in principle. If that’s the best that can be done, then it’s not going to be terribly useful typically, but that’s unconnected to questions of principle.

The point is that in the real world, we don’t model infinite streams of stock prices or errors. We are only interested in finite data which is unique to a given time, place and circumstance. That perception changes a great deal about how we go about modeling. In particular, it makes it easier to go beyond those two special cases (1) and (2) that you mentioned.

**Armchair Guy** (August 17, 2013)

Joseph,

I see. But if we need the distribution of future observations to be the same as (or closely related in a known way to) that of past ones, then what are the practical consequences of the distinction? Suppose we are modeling the distribution of stock prices on 19 Aug 2013 alone. But consider these scenarios:

a) 5 copies of 19 Aug 2013 in 5 “randomly generated” parallel universes

b) 5 days of stocks (19th through 23rd) assuming they are independent and the distribution stays the same, in our universe

When comparing the quality/utility of manifold(s) for 19 Aug 2013 (a unique quantity), we appeal to something like b) which appears to be a frequency interpretation of probability, as much as a) is. We just gave it a different name (5 days in one universe instead of the same day in 5 universes).

It seems that, even if we are conceptually modeling a unique quantity, we need to appeal to repeats of some sort to be able to test the inference’s quality.

**Joseph** (August 17, 2013)

Armchair,

There are lots of practical consequences. One is that it breaks the prob=freq identity. See the post “The Definition of a Frequentist” for an example.

Another practical consequence is given toward the end of this post “IID doesn’t mean what you think it does”.

Note those practical consequences are not minor, nor are they rare. Indeed their implications affect almost all non-trivial applications of statistics.

But the bottom line is that “It seems that, even if we are conceptually modeling a unique quantity, we need to appeal to repeats of some sort to be able to test the inference’s quality.” is false, both in theory and in practice.

**Daniel Lakeland** (August 17, 2013)

I think it’s still the case that we must assume that the future will be like the past (otherwise we can’t use things like “laws of physics” which we have only verified in the past), but we need not assume any “repeated sampling” from a given fixed distribution.

**Joseph** (August 18, 2013)

Daniel,

What about the following scenario: x ranges over some enormous space of possibilities, and there is one x such that F(x) ≠ 20, while all the other x’s are such that F(x) = 20.

In that case we might easily have that x never repeats as time evolves, so that the future is never like the past, but we can still predict very reliably and accurately that F(x) = 20.

This would be true not only when x never repeats, but also when the nature of the equations of motion changes constantly. I suppose you could say that F is unchanged, but even that could be modified. The location of the one value where F(x) ≠ 20 could rotate as a function of time and never repeat.

Hence everything changes and never repeats, but we can still reliably predict F(x) = 20. This may seem stretched and fanciful, but in practice statistics is tied far closer to such F’s than is commonly realized.
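A concrete toy version of this F (the particular numbers are my own): one exceptional state in an enormous space, yet F(x) = 20 is predictable even though x itself never repeats:

```python
import random

random.seed(4)

N = 10**12                     # size of the state space for x
special = random.randrange(N)  # the single state with F(x) != 20

def F(x):
    return 0 if x == special else 20

# 1000 distinct states: x literally never repeats, yet the prediction
# F(x) = 20 is reliable because the exception is one state in 10^12.
xs = random.sample(range(N), 1000)
hit_rate = sum(F(x) == 20 for x in xs) / len(xs)
print(hit_rate)  # 1.0 with overwhelming probability
```

The reliability comes from the structure of F over the whole space, not from any repetition in the sequence of x's.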

**Daniel Lakeland** (August 19, 2013)

Although your example is perfectly logically true, and there ARE many statistical problems that have sort of that character (most likely a little more spread out than either exactly 20 or, in one case, “something else”), I still believe that the laws of physics are more or less of the nature of Newton’s laws when observed on reasonably large physical scales at reasonably low energy relative to mc^2, and I believe that those equations are not changing in time on the time scale of human observation. Because of this I can write equations that are predictive of past observations, and expect them to be predictive of future observations as well.

When it comes to more complex phenomena where our ability to write equations of motion is much more limited, say ecological interaction models for example, I still think that there are aspects of the past which remain unchanged in the future, such as the need for predators to eat prey, and the requirement of all animals to have a source of water, etc., so that I can use these “principles” as connections between the past and the future, without giving them a “repeated sampling from a constant distribution” interpretation.

I don’t think you’ll disagree strongly with me here, but it’s worth putting this out there explicitly, because from a practical perspective, at some point we need to consider the science of the processes we’re interested in, and actually build the models, and if you’re used to the idea of “repeated sampling” but are willing to break away from it, it’s useful to discuss what other kinds of “time continuity” we can induce other than repeated sampling.