The Amelioration of Uncertainty

## Max Planck and the Foundations of Statistics

Statistics is full of old and difficult ideas. It’s time for something new and simple. Well, it’s not actually new, but it will seem that way to most. The story begins with the physicist Max Planck over a century ago.

Planck’s 1912 summary of his researches on Black Body Radiation included a chapter titled “Probability and Entropy”. This chapter had a specific purpose. Previously statistical ideas had been applied to gases where the problem was to use known functions of the microstate, like the energy, to predict other functions of the microstate. But the Black Body Radiation problem was physically and mathematically a different beast. Technically it involved using functions of the amplitudes in the Fourier expansion of the Electric field to predict other functions of those amplitudes (fields).

Nevertheless, Planck wanted to use those statistical ideas from thermodynamics to solve the Black Body Radiation problem. To do so he had to first clarify the universal and extra-physical nature of Statistical Mechanics. That was the explicit aim of the “Probability and Entropy” chapter.

It’s easy to see how Planck succeeded, because despite the differences, they are abstractly the same kind of problem. We don’t observe some x directly, but know some function f(x) = f_obs and wish to predict some other function g(x).

Planck’s solution is simplicity itself. To illustrate, consider the set V of x’s compatible with the observed value of f:

V = { x : f(x) = f_obs }

Now within that domain, compute the number W(g) of x’s compatible with each value of g:

W(g) = #{ x in V : g(x) = g }

Then to predict g just choose the value g_max which maximizes W(g). Graphically the situation is as follows:

If W(g_max) is close to the size of V as shown in the picture, then there will appear to be a functional relationship connecting f_obs and g. This relation will seem to the observer like a “law of nature”:

g_max = G(f_obs)     (1)

All that’s needed to make this law a reality is for x_true to be among that majority of V associated with g_max. That’s it. There’s no requirement that multiple x_true’s occur equally often among V. There needn’t even be more than one x_true. Rather success depends entirely on a simple concrete criterion involving the one x_true that actually exists, and everything else is irrelevant.

Notice too that the law (1) will appear stable even if x_true is jumping around as a function of time. This is easy to see from the illustration below:

Again, the only thing we need to make this true is that x_true stays within that majority region. In no way is it required that the x_true’s fill up V, or occupy each point of V equally often in an ergodic sense.

Obviously, to make this work it helps to have W(g_max) as large as possible. For this reason it’s natural to use the following ratio,

p = W(g_max) / |V|

as a measure of the “strength” of the law expressed in (1).
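The whole recipe fits in a few lines of code. Here is a minimal Python sketch of it; the microstate space, f (total heads in a sequence of coin flips), and g (heads in the first half) are illustrative choices of mine, not from the post:

```python
from itertools import product
from collections import Counter

# Hypothetical finite microstate space: x ranges over all 16-flip coin
# sequences. f(x) = total heads (the observed macro quantity),
# g(x) = heads among the first 8 flips (the quantity to predict).
N = 16
def f(x): return sum(x)
def g(x): return sum(x[:8])

f_obs = 8  # suppose we observe 8 total heads

# V: the x's compatible with the observation f(x) = f_obs
V = [x for x in product((0, 1), repeat=N) if f(x) == f_obs]

# W(g): the number of x's in V compatible with each value of g
W = Counter(g(x) for x in V)

# Planck's prediction: the g that the majority of V points to
g_max = max(W, key=W.get)
p = W[g_max] / len(V)   # the "strength" of the law (1)

print(g_max, round(p, 3))  # -> 4 0.381
```

With this tiny space p is far from 1, but the shape of the recipe, compute V, count, take the majority value, is exactly the one described above.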

If p happens to be close to 1 it will create a “separation” between law (1) and whatever natural process determines x_true. No matter how crazy the physics governing x_true is, it’ll be hidden from any observer only looking at (1). Indeed, it’s precisely this “separation” phenomenon which creates the appearance of distinct branches of science like physics, chemistry, and biology.

There is an exception to this though. If x_true overlaps with the part of V where g(x) ≠ g_max, as in the picture below, then the law (1) will appear to fail. If observed, this provides us with quite strong information about the nature of x_true and amounts to a kind of statistical learning.

Planck doesn’t mention this “learning” directly, but it’s hard to believe he wasn’t aware of it since the greatest scientific example of it was the discovery of Quantum Mechanics, and Planck’s research was the key step in that discovery.

This may not look like the foundation of statistics. It’s both simpler and very different in outlook from anything you’ll see in a Statistics class. But I assure you a generalization of it easily serves as the foundation for every successful application of Statistics ever made and leads to quite a few new ones.

October 30, 2013
• October 31, 2013Joseph

Since none of the usual suspects have anything to say (Corey?, Daniel?, Brendon?, Konrad?), I’ll add in a few points:

-If you use the Boltzmann entropy then observing g_obs = g_max, the value which maximizes the entropy, tells you almost nothing about x_true. If however g_obs ≠ g_max, then you’ve learned quite a bit about x_true in the sense that you can narrow its location to a much smaller subset of V. That’s the essence of how “entropy” gets associated with “information”.

-The separation phenomenon applies to many other things. It’s the reason why Statistics professors can do successful coin flip demonstrations in the classroom without doing any physics or taking any measurements.

-There’s nothing specific to physics here. That “law” could just as easily have been something from macroeconomics or biology.
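The first point can be checked numerically with a toy version of the setup (a hypothetical 16-flip space with f = total heads and g = first-half heads; my choice, not from the post):

```python
from itertools import product

# V: all 16-flip sequences compatible with observing 8 total heads
N = 16
V = [x for x in product((0, 1), repeat=N) if sum(x) == 8]

def count(k):
    # number of x's in V whose first-half head count g(x) equals k
    return sum(1 for x in V if sum(x[:8]) == k)

# Observing the entropy-maximizing value g_max = 4 leaves x_true in a
# large chunk of V; observing the atypical value 8 narrows it to a sliver.
print(count(4), count(8), len(V))  # -> 4900 1 12870
```

Seeing g_obs = 8 pins x_true down to 1 of 12870 compatible microstates, while g_obs = 4 leaves nearly 5000 candidates: exactly the asymmetry between "learning almost nothing" and "learning quite a bit".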

• October 31, 2013Corey

I don’t have much to say, other than, “Yup.”

• October 31, 2013Daniel Lakeland

I like this perspective; it accords with my own view of continuum mechanics, which I see as modeling regionally averaged measurements rather than the actual behavior of the detailed configuration of the material. I don’t see anything particularly controversial here, but I think it would be interesting for you to develop an example further in the context of something where statistics are regularly used but which is not physics-oriented and not a coin flip experiment.

So, for example, perhaps a public opinion polling example or drug discovery and testing or something. Show how this perspective helps to interpret the results of such experiments and/or different perspectives on predictions of the future that come out of these different interpretations of statistics.

• October 31, 2013Daniel Lakeland

In particular I’d like to point out the phrase “compatible with” in your proposal to “compute the number of x’s compatible with each value of g”.

To the extent that two quantities A and B do not equal exactly they are not compatible with the mathematical statement A=B. In actual sciences, quantities are always finite precision. In one of my blog articles I argue that every quantity ever measured in the past or future of the human race will always have a precision less than something like 120 bits (I don’t remember the exact number but around there). Furthermore realistic measured quantities these days have fewer than 31 bits or so: it takes a very good A/D converter and careful shielding and signal processing to get this precision, and even if you’re hand-counting votes, as we saw in the Bush vs Gore election, we rarely have the precision we really think we do.

So it is possible for two measured quantities to be exactly equal due to the finite precision. But equal under finite precision is not a good criterion for “compatible with” because high precision measurements can have noise caused by essentially uninteresting phenomena not associated with the questions at hand (such as the thermal fluctuations of electrons in the instrument, radio waves from local television stations, vibrations caused by minor earthquakes, whatever).

So we have to define “compatible with” in a different way. We can say that two measured quantities A and B are compatible when |A − B| < ε for some cutoff ε. Specifying any given cutoff produces a discontinuous transition between compatible and incompatible, so we may prefer to describe a “degree of compatibility” as a continuous positive function of the difference in measurements with a peak at 0 and which is decreasing for any value away from zero. This sounds familiar: one of the most common of these types of functions is the Gaussian density.
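A minimal sketch of this “degree of compatibility”, with the Gaussian scale s and the hard cutoff eps as arbitrary assumed parameters:

```python
import math

def compatibility(a, b, s=1.0):
    # Gaussian degree of compatibility: 1.0 when a == b,
    # falling off smoothly as the measurements disagree
    return math.exp(-0.5 * ((a - b) / s) ** 2)

def hard_compatible(a, b, eps=0.1):
    # the discontinuous hard-cutoff version: |a - b| < eps
    return abs(a - b) < eps

print(compatibility(3.0, 3.0))  # -> 1.0
print(compatibility(3.0, 3.5) > compatibility(3.0, 5.0))  # -> True
```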

Out of all this foofaraw we get the Bayesian concept of Likelihood. A measurement or set of measurements D is compatible with a prediction under a model F whose parameters are q to the extent measured by some compatibility function L(D|F(q)).

Frequentists often require L to have the form that you get automatically from a frequency interpretation of probability and IID sampling. In this sense, Frequentists are maybe mathematically equivalent to Bayesians who have some kind of (irrational?) prior over the acceptable types of “compatibility” models.

So when Bayesians talk about the posterior probability of the parameter “q”, and sometimes about it “clustering around the true value of q”, instead of the “true value” we should probably say “the most compatible value”.

• October 31, 2013Joseph

Daniel,

It’s Halloween so I’ll have to answer later. I’ll just say for now that the reason we have to generalize:

f(x_true) = f_obs

to

x_true in V

is precisely because our information rarely in practice consists of a precise f_obs.

I like this simple example though, because although it requires significant generalization the basic logic is the same. And this example is so simple you can’t help but see the correctness of it.

There is much, much more to say!

• October 31, 2013Brendon J. Brewer

I agree with Corey, it was a good post, and I really liked the diagrams. These kinds of diagrams are really helpful in understanding statistical mechanics.

As I’ve been promoted to a usual suspect, I do have one comment. Being able to say the “vast majority” of plausible states would predict a certain thing is only possible if you have a prior on the phase space, which is usually taken to be uniform as it is then invariant to Galilean transformations.

• October 31, 2013Brendon J. Brewer

Btw Joseph, next time there’s a MaxEnt in North America you should definitely come.

• October 31, 2013konrad

Nice post, I agree with the essence as far as it goes (except: in law 1 I think you meant g when you wrote g_max: g_max is defined as just a function of f_obs, it is not an observed quantity; the law that one is tempted to infer is that the observed quantity g is equal to g_max). I’m still waiting for the connection between this and what we were discussing in the previous thread, but I suspect it will emerge in due course. I am certainly in agreement that these ideas are foundational to probability theory, much more so than the generative model framework; where we disagree is that I think the generative model framework is (a) not in conflict and (b) very useful in practice, probably more so than these ideas.

Your argument above is informal, but I think it is solid when observations are noiseless and we have a uniform prior on x_true. In practice, it needs to be formalized to circumvent issues related to observation noise (as pointed out by Daniel) and to deal with cases where we have a non-uniform prior on x_true (necessary if the domain of x is not finite). I think there is a natural way to write it out in terms of probability theory, but I’m not sure if we would agree on the details. One important issue is your decision to maximise the cardinality of the set of x values compatible with f_obs and g – with observation noise this corresponds (in the formalization I have in mind) to reporting the posterior mode as your prediction for g, which is only appropriate in some contexts. I’m not sure if entropy is needed anywhere, except for assigning a maxent prior if that’s the way you choose to go.

• October 31, 2013Joseph

from Brendon’s comment:

“Being able to say the “vast majority” of plausible states would predict a certain thing is only possible if you have a prior on the phase space, which is usually taken to be uniform ”

“Your argument above is informal, but I think it is solid when observations are noiseless and we have a uniform prior on x_true.”

There is a fundamental misunderstanding here. I’m not making a probabilistic argument. I’m working up to defining probabilities in a pre-probabilistic setting. The goal is to understand what a Bayesian probability is and where it comes from. So forget about probabilities and distributions and rethink everything from scratch.

I actually defined a p (in truth a probability) in the post. But I didn’t invoke or assume it by hypothesis. I defined it from something more primitive and “pre-probabilistic”. Its definition, purpose, and properties are thus clear. In particular, it’s clearly not the frequency of anything.

But what I don’t understand is how this misunderstanding arose. In the very next paragraph under the equation (1), I gave the explicit requirements needed to make equation (1) appear to be true to someone who can observe both f_obs and g. Also, in the paragraph under the second picture, I gave explicitly the concrete condition needed to make equation (1) true when x_true is changing in time.

Those conditions were so concrete, so unobjectionable, so obviously true, that I don’t understand why you both (Konrad and Brendon) felt the need to invoke a completely fictional frequency distribution. Why? You already have the truth in hand. If x_true lies in, and stays in, the region associated with g_max then equation (1) holds. Period. There’s no getting around it. It isn’t approximately true, or true on average, or sometimes true, or mostly true. As long as that condition holds it’s just plain true.

It makes no difference how or why it holds. X_true could sit in one spot, or it could explore a little part of that region associated with g_max, or it could roam freely over that region. It doesn’t matter. As long as the conditions holds then so does equation (1).

Where does the impulse come from to replace this very simple, correct, intuitive, concrete condition with a made up, irrelevant, fantasy frequency distribution?

Really, I’m hoping someone can give me some insight on this, because I just don’t get it and I’m at a loss to explain it any other way.

• October 31, 2013Joseph

Maybe I could rephrase my puzzlement this way:

Do you agree that if all someone knows is f_obs then g_max is the best guess they can make for g? If not, then on what basis could you ever justify another g?

If you do think g_max is the best guess then why would we ever need to invent a fictional frequency distribution? We already have the best guess you can make.

Is it that you think this fictional frequency distribution is needed to explain how equation (1) could hold in practice? If so you already have an explanation for that. It holds in practice whenever x_true is in, and remains in, the region associated with g_max. Why do we need to invoke any other principle to explain this, let alone a fictional frequency distribution?

• October 31, 2013Joseph

Or how about this: suppose there was only one x_true and no others were ever possible. X_true is unique and can never be repeated in any sense. In this case there’s zero possibility of “x_true being uniformly distributed”. Or distributed in any way whatsoever.

In this scenario, which claim in the post fails?

• October 31, 2013Corey

Joseph,

I think when Brendon and konrad wrote “prior on phase space”, they didn’t mean any kind of frequency distribution — just a Bayesian probability distribution — or maybe even just a measure for quantifying the size of sets. For finite sets, cardinality (i.e., counting measure) is obviously the correct way to go; but what if the domain of x is continuous? I think that in that case you need a (sigma-finite?) measure just to compute the ratio of the size of two sets.

In a similar vein, in your reply to Daniel, you gave an integral involving log p(x). Since p(x) is a density, it generally has units (the inverse of the units of x) and taking its logarithm results in an integral that is not invariant to transformation. Jaynes gives a slightly different expression for the generalization of entropy to continuous domains; his expression avoids the lack-of-invariance problem, but it does so by introducing — and leaving unspecified — a function m(x) that amounts to the Radon-Nikodym derivative of exactly the measure Brendon and konrad were asking for.

(By the way, I think konrad is correct that in (1), g_max should be g.)

• October 31, 2013Corey

(I meant discrete finite sets.)

• November 1, 2013Brendon J. Brewer

You make one minor pedantic point and get accused of frequentism? What is this?

Corey is right – I was invoking a probability distribution in the Jaynesian sense, not a frequency distribution. It doesn’t matter that x_true is a once-off variable with a fixed value, it has a probability distribution describing the fact that its value is unknown.

I mentioned two circumstances (also discussed by Daniel and Corey) that can make g_max not be the best guess for g. If you don’t want to cast the argument in probability language, I am still happy to go along with it but only when the domain of x is finite and discrete and the observations are noiseless (in which case you probably need to change the diagrams, because they strongly suggest a continuous domain).

If you want p as defined in the post to become your definition of probability, you have a lot of work to do to convince us that this quantity behaves like ordinary probabilities. Do you have a reason not to go the Cox-Jaynes route?

• November 1, 2013Joseph

Ah. I take it all back and apologize. These damned frequencies are the bane of my existence and I thought “I can’t escape the wretched things!”. Everything else is much more tame:

“Jaynes gives a slightly different expression for the generalization of entropy to continuous domains”

-Σ p_i log(p_i / M_i)

or

-∫ p(x) log(p(x) / m(x)) dx

are the real expressions for entropy. It’s best not to think of them as generalizations. Gibbs used them explicitly in his statistical mechanics.

The choice of M or m is a way of telling the machinery to “count things over a given space”. That’s the entirety of what they mean. If you make them uniform, then you’re telling it to count possibilities over x space (i space).

As a practical matter you only need them because sometimes you want to count over a space different from the one explicitly being mentioned.

For example, suppose you maximize the entropy with respect to two constraints. There are two ways you can actually do this. One is to maximize the entropy with uniform M subject to both constraints.

The second method is to maximize the entropy with uniform M subject to one constraint. Once you have the answer then maximize the entropy with respect to the second constraint using M=answer from first maximization.

Or, more realistically, maybe you don’t know about the first constraint directly. Then you can just maximize the entropy subject to the second constraint, and M becomes a kind of stand-in or substitute for the first (unknown) constraint.
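A minimal numerical sketch of this use of M (my construction, not from the thread): discrete maximum entropy with measure M, maximizing -Σ p_i log(p_i / M_i) subject to a mean constraint, which gives p_i proportional to M_i exp(λ f_i); the Lagrange multiplier λ is found by bisection. The first call uses uniform M; the second reuses the first answer as the M for a second constraint, so M stands in for the earlier constraint:

```python
import math

def maxent(f, F, M=None, lo=-50.0, hi=50.0):
    # maximize -sum_i p_i * log(p_i / M_i) subject to sum_i p_i * f_i = F;
    # solution has p_i proportional to M_i * exp(lam * f_i), and the mean
    # is increasing in lam, so bisection on lam finds the constraint value
    M = M or [1.0] * len(f)
    def dist(lam):
        w = [m * math.exp(lam * fi) for m, fi in zip(M, f)]
        Z = sum(w)
        return [wi / Z for wi in w]
    for _ in range(200):
        lam = (lo + hi) / 2.0
        p = dist(lam)
        if sum(pi * fi for pi, fi in zip(p, f)) < F:
            lo = lam
        else:
            hi = lam
    return p

values = [0, 1, 2, 3, 4, 5]
p = maxent(values, F=2.0)                 # uniform M: ordinary maxent
print(round(sum(pi * vi for pi, vi in zip(p, values)), 3))  # -> 2.0

# Two-stage use: the first answer becomes the measure M for the second
# constraint, so M is a stand-in for the first constraint.
indicator = [1, 0, 0, 0, 0, 1]
p2 = maxent(indicator, F=0.5, M=p)
print(round(sum(pi * gi for pi, gi in zip(p2, indicator)), 3))  # -> 0.5
```

As described above, the second-stage result has the form M_i exp(λ g_i): the first constraint now lives entirely inside M.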

• November 1, 2013Joseph

“g_max is defined as just a function of f_obs, it is not an observed quantity; the law that one is tempted to infer is that the observed quantity g is equal to g_max)”

The whole “who observed what stuff” makes it easier to explain but is not fundamental. Just take it to mean “if they had it within their power to observe everything in (1)”.

The point is that given f_obs, your choices for g are many. So the relation isn’t a function. To make it a function you have to select one value out of all the possibilities for g.

Given a “good” choice of g, you’ll get what appears to be a “macroscopic” law, like PV=nRT, or a law of macroeconomics, or population biology, or continuum mechanics, or whatever. It will also lead to “laws” in many situations that we don’t normally think of in terms of “macro” and “micro”.

There are many subtleties to carrying this out in practice. You can already see one in that PV=nRT example. T isn’t like f_obs or g_max. It’s actually a Lagrange multiplier for the Energy. E plays the role of f_obs. Using the Lagrange multiplier is equivalent to using the constraint value in most cases, and if it can be given physical meaning, is sometimes more convenient.

• November 1, 2013Joseph

“I am still happy to go along with it but only when the domain of x is finite and discrete”

I always think in terms of finite sets in statistics for reasons given by Daniel:

http://models.street-artists.org/2013/03/15/every-scientific-hypothesis-is-a-hypothesis-on-a-finite-sample-space/

My background is much more pure math than anything, and I have absolutely no problem with measure theory, but it’s not fundamental to statistics at all. I take it as axiomatic that if a problem or issue can’t be expressed in terms of finite sets, then it might be a problem for mathematics, but it’s not a problem for statistics.

If measure theory is giving us problems or paradoxes then we are free to use a different mathematical structure. Physicists went through a similar experience over the Dirac delta function. Mathematicians screamed bloody murder about the non-rigorous nature of what physicists were doing. Their solution was copious amounts of very sophisticated, but in a sense limited, delta-epsilon proofs to put it all on a firm foundation. Then Lighthill came along and showed in a remarkably thin book that what physicists were doing was rigorous after all. Just not in the way Mathematicians were hoping:

http://www.amazon.com/Introduction-Generalised-Functions-Cambridge-Monographs/dp/0521091284/ref=sr_1_1?s=books&ie=UTF8&qid=1383306167&sr=1-1

• November 1, 2013Joseph

“If you want p as defined in the post to become your definition of probability, you have a lot of work to do to convince us that this quantity behaves like ordinary probabilities.”

I wouldn’t say I have a lot of work. The laws of probabilities make sense for counting frequencies. Everyone presumably agrees with that. Well, they make sense for counting things in general.

You and Daniel both mentioned the possibility that f_obs isn’t the result of a precise measurement. That is one kind of generalization. In general though, there can be a very wide variety of ways of specifying V. That is to say, there can be many ways in which our state of information leads us to believe x_true is in V.

If you think about it, any of those other possibilities for getting “x_true in V” leave all the rest of the logic intact, without modification.

• November 1, 2013Joseph

Daniel,

Continuum Mechanics is an interesting playground, because in principle everything in Continuum Mechanics should be derivable in a way similar to equation (1). In practice, I think there is a huge opportunity here because it looks to me like there is a great deal of continuum mechanics left to be discovered.

Much of that new Continuum Mechanics is going to look weird. The theories themselves will look new, which is only a minor stumbling block. A bigger problem is that the derivations are going to look truly bizarre to anyone who doesn’t understand Jaynes. Which is probably why they haven’t been discovered already.

The short answer to all your great points about measurement uncertainties is to quote Jaynes. If the observation f_obs has measurement uncertainty, then maximize the entropy subject to both the constraint fixing the mean of f at f_obs and a second constraint fixing the spread of f about f_obs.

But this seems to be an unsatisfying answer to everyone. Essentially, Jaynes is using a generalization of the MAXENT principle which he never states explicitly, or uses in full generality. Nor have I seen any awareness of it from others (except possibly for one paragraph in Gibbs). This generalization makes MAXENT far more understandable and far more useful in practice. But although it’s simple mathematically, I don’t think anyone will get it without first understanding the kinds of things I’m talking about in this post.

• November 1, 2013Joseph

“(in which case you probably need to change the diagrams, because they strongly suggest a continuous domain)”

There are a finite number of pixels on any computer screen, so I think it’s all good.

• November 1, 2013Joseph

“but I think it would be interesting for you to develop an example further in the context of something where statistics are regularly used”

I’ve already done this in one example. It was in the post “IID doesn’t mean what you think it does”.

To make the connection note that the actual errors in the data lie in some subset of the space of possible error sequences. Call this subset V.

The goal then is to predict the value of G, where G is defined to be 1 whenever the errors put the usual estimate close to the true value, and is equal to zero otherwise.

Notice that this is no mere reinterpretation. The probability distribution used to define V looks absolutely nothing like the frequency distribution of errors. So almost everyone thinks that distribution would give horrendous results and would never use it in practice if they had an inkling about the real frequency distribution.

But once you realize that distribution is just defining V and doing a good job at it, then the fact that 95% of V makes G = 1 is all the justification we need for guessing G = 1, which is equivalent to the interval estimate given in that post.

Note: the true value was .

In other words, this interpretation gives us real power to get things done in real problems that most statisticians have no idea is there.

• November 1, 2013Corey

What kind of information do I need in order to justify a particular choice of M (or m)?

• November 1, 2013Corey

Or equivalently, how do I know to count possibilities over $y = h(x)$ instead of $x$? (I use $h$ to avoid name collision with the $f$ and $g$ discussed in the post.)

• November 1, 2013Joseph

Corey,

In this problem the information we had about x_true was that x_true is in V. This is equivalent to a uniform distribution over V.

If the only thing we know about x_true is that it’s in V, then you want to count possibilities over V.

If y = h(x) is an appropriate transformation, then you can transform

-Σ p(x) log p(x)

to

-Σ p(y) log(p(y) / M(y))

for some appropriate M. Once again the purpose of M is to define what you’re counting over. Its presence is telling you that you’re actually counting over x.

To take it one step further: suppose, for example, we’re monks doing genetic studies on peas. If we know about DNA we might work with x directly. If we used y it would be purely for convenience. If we’ve never heard of DNA however, we might be forced to use y as a stand-in for a hypothesized deeper structure like DNA.

• November 1, 2013konrad

Hmm, lots to comment on there. I’ll take it one at a time:

1) Re g_max: I’m not sure what you’re trying to say – are you saying that it was not a typo? If so, why? (Seems like a straightforward typo to me – as written, (1) is just a trivial consequence of the definition of g_max; what you wanted to write was an apparent law that would be “discovered” by an observer, namely the observation that g = g_max.)

2) Finite measurement accuracy is a justification for treating the domain of f (and g) as finite and discrete, but not for x (which is never measured anyway). I suspect you are limiting the framework a *lot* by treating x as finite and discrete (we do not know that all physical systems can be adequately described this way). But ok, let’s go ahead and discuss just that case.

3) Treating f_obs as a finite-accuracy discrete measurement (of a more precise and possibly continuous underlying quantity) means that you absolutely cannot ignore the consequences of finite precision. If you want to continue using a hard constraint this can only be done by treating f_obs and g as intervals (of length >0) on the real line rather than as points (this allows us to assume noise-free measurement in the sense that we know the underlying quantity does not lie outside this interval; the only alternative is to explicitly bring noise into the framework). The constraint changes from f(x) = f_obs to f(x) lying in an interval around f_obs, and similarly for g – once you do this I think the argument goes through without problems.

4) There is a caveat regarding what we want when predicting g. If we just want the most likely measurement outcome (at the full measurement resolution), g_max is the best prediction. But in many cases we may be unable to predict g reliably with that degree of precision. In such cases, we may be more interested in predicting that g will lie in a larger contiguous region (the larger the region, the more reliably we can predict it). But it need not be the case that the best such prediction contains g_max. (Analogous to the fact that the smallest contiguous 95% credibility interval of a posterior need not include the mode.) I think this is an important issue which is well addressed by casting things in a probabilistic framework (if you first compute a posterior distribution for g, you can answer any type of prediction question – this is not the case without probabilities).

5) “I wouldn’t say I have a lot of work. The laws of probabilities make sense for counting frequencies. Everyone presumably agrees with that. Well, they make sense for counting things in general.” There is a big gulf between this and demonstrating that your quantities are the same thing as probabilities. One possibility is that they may turn out to be frequencies, which I think you will agree are not the same thing as probabilities.

6) “But this seems to be an unsatisfying answer to everyone.” It’s definitely unsatisfying when stated without motivation, yes.

7) “The probability distribution used to define V looks absolutely nothing like the frequency distribution of errors. So almost everyone thinks that distribution would give horrendous results and would never use it in practice if they had an inkling about the real frequency distribution.” This way of putting it bothers me. When the frequency distribution is known, the probability distribution is *equal* to the frequency distribution. When the frequency distribution is unknown, it is unknown (and hence unusable) – the best we can do is to use the probability distribution. When we gain an “inkling” about the frequency distribution, the probability distribution changes to incorporate this knowledge. In all cases, the probability distribution is the best we can do – it represents exactly the information we have about the frequency distribution. If it does not, it is not the probability distribution conditioned on all available information (in which case it is necessary to be more explicit about what information one is conditioning on).

• November 1, 2013Joseph

(1) No it’s not a typo. Knowing f_obs does not define a unique value for g. Mathematically we can create a unique g by simply selecting one. Call this unique value g’. Physically, if we choose this g’ well, then whenever we go to measure {f,g} we may find that f = f_obs always leads to g’. If that happens, it’ll seem like there’s a “law” of the form g’ = G(f_obs).

(2)-(3) You’re worrying about mathematical details which are completely irrelevant to fundamentals. This is a blog post. I’m not going to write out every delta-epsilon. You, Corey, Brendon, and Daniel (soon?) are Ph.D.s. You can supply your own proofs.

(4) “If we just want the most likely measurement outcome”. I haven’t defined “likely” in any sense whatsoever. So we definitely aren’t after the “most likely outcome”. g_max isn’t the most likely, it’s merely our best guess given that all we know is f_obs.

There are a hundred variations on this problem. Once the basic logic is seen, I assume everyone will see how to change the logic for the new problem.

(5) “There is a big gulf between this and demonstrating that your quantities are the same thing as probabilities”

In a formal mathematical sense there isn’t a big gulf. Philosophically, my goal isn’t to show their equivalence to probabilities as usually conceived, either by Bayesians or Frequentists. My goal is to understand them at a deeper more primitive level in order to clear up problems, but mostly to extend current practice.

Put it this way: if I can’t show that p is the same as a probability, then I’ll reject probabilities, not p.

(6) Indeed. That missing explanation is the missing puzzle piece which makes it possible to see the whole picture.

(7)”When the frequency distribution is known, the probability distribution is {\em equal} to the frequency distribution”

This is seriously untrue. Or at least it would lead to disaster in that problem. Suppose we somehow knew the frequency distribution (future errors were provided in that post), but didn’t know the errors themselves. In that problem, it would lead to all kinds of nonsense. I could tweak it a little and it would lead to epic levels of nonsense.

Even if you knew that frequency distribution, you’d still be far better off using the probability distribution given in that post even though they differ substantially.

• November 1, 2013konrad

1) You seem to be agreeing with me. It’s g’ and not g_max.

2-3) This is not about the mathematical details, it’s about deciding how broadly applicable your framework is. By your own admission it only applies to finite discrete x (my point 2 is that this is a severe limitation).

4) Agreed, I shouldn’t have used the word “likely”. But the point here is that your solution (maximizing the cardinality) only works for one of the hundred variations. I pointed out one variation where it breaks. This doesn’t concern you?

5) I find the development of probability theory as a unique set of rules for reasoning under conditions of incomplete information very convincing (that is, I find the Cox-Jaynes axioms very convincing and am unaware of any error in the mathematical derivation of probability theory from those axioms). The problem under discussion involves reasoning under conditions of incomplete information, hence probability theory applies. The ratio p is not nearly on as strong a footing as the Cox-Jaynes derivation (it seems ad hoc), so if it is in conflict with probability I will reject it.

7) I’m not sure we are using probability distribution and frequency distribution in the same sense here. At any rate that’s a different discussion.

• November 1, 2013 Joseph

(1) I don’t get what you’re saying at all. I reread all your previous comments on it again, but I still don’t get it. My attempt to remove g_max to illustrate the logic didn’t work (hint: g’ is going to be g_max in the only case we’re interested in).

Look, PV=nRT isn’t true for every microstate, just like g(x) = Psi(f(x)) isn’t going to hold for every x. But it can happen that the latter does hold for so many x that when we go to test it in practice it always seems to hold (i.e. the actual x is always in the right region).

If this happens then the latter equation will appear to be a stable “law” just like the ideal gas law. The ideal gas law appears to us macro-observers as an experimentally verifiable and stable “law” even though we know in principle that there are exceptional microstates which violate it.

2-3-4) The goal was to illustrate the basic logic with an example so simple that all conceptually extraneous details are removed. The reason for this is that long experience with those extraneous details has shown me that they trip nearly everyone up, making communication impossible.

That was the goal. The goal wasn’t to give a general framework for probability theory. The goal wasn’t to describe how to solve more general problems. So no, the problems you describe don’t concern me because I do know how to generalize it, and I know what those solutions look like.

5) I agree with you completely about the Cox-Jaynes axioms and I think there is precisely zero chance any development here is going to be inconsistent with them. I am after all describing a viewpoint which I learned from Jaynes.

But while I don’t disagree with the Cox-Jaynes formulation, it’s not the whole story. One example of that is Maxent, which once generalized slightly, is the third leg of Statistics together with the sum and product rule. Another example is illustrated by a post a while back by Daniel titled “where do likelihoods come from”.

For all the Frequentist bleating about “reference priors”, they have absolutely no problem with what amounts to “reference likelihoods”. Daniel, working in a more physical setting, wasn’t able to fall back on any standard likelihood functions, and had trouble seeing where they really come from in a way that made physical sense.

And since I’m in the business of denying the frequency nature of probability distributions, even sampling distributions, it’s especially important to understand where they do come from. To do that, it’s not enough to just say “well whatever the answer you get is it has to look like Bayesian probabilities with the sum and product rule and all that”.

I agree 100% that it does have to look like that, but that doesn’t fully answer the question.

• November 1, 2013 Joseph

Maybe I can better state what I was getting at by my comment about preferring P over “probabilities” if they were in conflict.

In the post, it’s clear exactly why we’re maximizing W(g) and what that does for us.

But the usual Statistical modeling would go like this: first invoke a probability distribution, which in practice we just pull out of thin air, but in this case just happens to be P(g) proportional to W(g). Conceptually, we’re thinking about this as modeling some “data generation mechanism”, whatever the hell that’s supposed to mean.

Then we predict by finding the mode of P(g), which is not the only thing we could do, but we won’t object too strongly, because if P(g) is sharply peaked about g_max as in the pictures, then it won’t make a difference whether we use the mean or mode.

Mathematically the viewpoint in the post and this traditional Statistical viewpoint are the same. Neither conflicts with the Cox-Jaynes axioms. But they are very different conceptually, with the former having a far stronger foundation in my mind. From the post we can see clearly what’s happening when we maximize W(g) and why it works. We can see exactly why we use W(g) and not something else. In particular, we can see that it applies even if there is only ever going to be one x, and any kind of frequency interpretation is impossible.

If these two viewpoints differ conceptually, then I’m going with the one in the post.

One last try on point 1 (which really isn’t important, I’m just puzzled why you can’t see what I’m saying):
g_max = Psi(f_obs) _is_ going to hold regardless of x, because you _defined_ g_max as the value of g that maximises W(g). Given f_obs, g_max doesn’t depend on x at all – that’s why the “law” should involve g (the quantity of interest that the law is trying to predict).

As for 2-3-4: ok, I see your goal and I’ll repeat my initial comment: nice post, I agree as far as it goes. Personally I’d prefer getting there via the Cox-Jaynes route, but I see your motivation in not wanting to limit your audience.

5: I went back and posted a comment on that post of Daniel’s. I think it’s becoming clear what this discussion is about. In short, I agree that the two viewpoints differ conceptually, but I think one can make use of both – I don’t think they are ever in conflict. There’s more to say, and I hope to comment more later.

• November 4, 2013 Joseph

That first point is important; it was key to the whole post, and Corey thought it was a typo as well.

Mathematically, g_max = Psi(f_obs) is just a definition, but physically it’s a prediction. The key question is why do we expect this prediction to consistently hold true in the laboratory? It’ll consistently hold when W(g) is so sharply peaked about g_max that almost every possible x is consistent with g = g_max.

Consider an ideal gas in a box of volume V_0. A given energy (equivalent to T) is consistent with pretty much any volume V the gas might occupy. To get a functional relationship between the two, like PV=nRT, we have to pick one of those possible values.

We do this by picking the value of V which maximizes the entropy log W(V). This maximizing value will turn out to be V_0. So the ideal gas law should really read P V_0 = nRT.

But so far that’s just a prediction. In real life though, we do observe that ideal gases fill up the containers they’re in. Why? Because W(V) is massively concentrated about its maximum value.
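That concentration is easy to check numerically. For N independent particles, the fraction of microstates in which every particle sits inside a sub-volume V is (V/V_0)^N, so even a 1% shortfall in volume is astronomically rare. A minimal sketch (the particle count N and the 99% threshold are illustrative choices, not anything from the post):

```python
# Fraction of microstates of an ideal gas confined to 99% of its box.
# For N independent particles, W(V) / W(V_0) = (V / V_0)^N.
N = 10_000  # illustrative; a real gas has ~10^23 particles, making this far smaller

fraction = 0.99 ** N
print(fraction)  # ~2.2e-44: microstates violating "the gas fills the box" are negligible
```

With a physically realistic N the fraction is so small that "the gas fills its container" behaves as an exceptionless law, even though exceptional microstates exist in principle.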

• November 4, 2013 Joseph

“Personally I’d prefer getting there via the Cox-Jaynes route, but I see your motivation in not wanting to limit your audience.”

Ah! I see where the confusion is. Not only do I find the Cox-Jaynes route convincing, but I have no objections to using it to convince others of the wisdom of Bayes.

But it’s not my goal here to find a way to convince people of Bayes. My real goal is to motivate a particular probability distribution, namely P(g) proportional to W(g). The Cox axioms are silent on which distribution to use. Think of this as merely a deeper understanding of the Cox-Jaynes route (which, it should be noted, I got from Jaynes).

The inability to specify the form of P(g) is a big problem, because in this post I’m showing the basics of how to create macroscopic laws (g = Psi(f)) out of microscopic systems (the x’s). In practice this effort has been severely hampered by the inability to see intuitively in most cases what form P(g) should have and to understand why particular forms work like they do.

The goal of the post is to explain why we use P(g) proportional to W(g) rather than some other functional form. The usual explanation would be that we assume a uniform distribution on V, which is conceived of as a frequency distribution.

Hopefully, I’ve shown that this choice has absolutely nothing to do with, and in no way requires, that the x’s be sprinkled uniformly over V. Deriving macroscopic laws really does require a Bayesian understanding of probability distributions and not a Frequentist one.

Now that we have a better understanding of what P(g) really means, where it comes from, and why it works, we can proceed to more interesting cases of deriving macroscopic laws from micro foundations.

• November 4, 2013 Daniel Lakeland

So, if a measurement implies that x is in V, and there is a value g_max which would be implied by the vast majority of states in V, then g = g_max will almost always turn out to be true in practice regardless of what value x actually took on in V. When g differs from g_max, on the other hand, we learn something very specific about x: that it’s in a very small subset of V.

Ok, but the reason I thought it would be interesting to generalize to some non-statmech, non-physics type of situations is that very often we don’t have such a dominant set, and instead any measurement we make leaves plenty of ambiguity about what value of g we should predict. Like suppose the largest subset of x’s consistent with any given measurement of f fills up only 1/1000 of V. Perhaps that’s just a topic for another time.

• November 4, 2013 Joseph

Daniel,

The first paragraph nails it exactly. Notice that although the discussion was inspired by physics, nothing in it used anything from physics (like conservation of energy for example).

It’s true we don’t always have such a dominant set V. What happens in that case is open ended so I can’t give a definitive answer, but here’s a couple of examples to give the flavor of it.

Example 1: suppose f_obs is jumping around as a function of time, and further suppose V is changing with time as well. Now it can happen that for some values we get a definite prediction for g and for others we don’t.

Since f_obs is bouncing around, that inability to get a definite prediction implies the onset of turbulence. The boundary separates the region where g is predictable from where it isn’t. Which in this scenario translates physically into separating the region where g is stable from where it varies wildly and unpredictably.

Example 2: Suppose in macroeconomics we try to use some macrovariable f to predict GDP. We do the theoretical calculation and it shows that knowing f still leaves plenty of ambiguity in GDP. This means that we’re unlikely to observe a law GDP = Psi(f) in practice.

The usual response is to just accept that we’re stuck with probabilistic models and to get moving with our regressions, time series analysis, or whatever our favorite stat tool is. But we do have another option. We could search around for a second macrovariable h. We might even have to invent a new one no one ever thought of before. If we choose well, then it might turn out that the pair (f, h) does predict the GDP well. So now we’ll get a macroscopic law of the form GDP = Psi(f, h).

If h really is new, we might not be able to observe it easily, but now that we know how important it is, we can put in the extra effort needed to observe it.

If on the other hand our theoretical prediction shows the above law should hold, and it turns out not to, then we just discovered some extremely important new economics. Either way we’re getting the Nobel Prize!

• November 4, 2013 Daniel Lakeland

Sounds more or less like general regression analysis.

Re point 1 (again). I don’t think I’m misunderstanding this. E.g. the way Daniel summarized it is exactly how I also understand it. I think we’re just using language differently somehow.

“Mathematically, g_max=Psi(f_obs) is just a definition, but physically it’s a prediction” – this makes no sense at all. If it’s a definition, it _cannot_ also be a prediction. Saying it’s a definition is saying that g_max and Psi(f_obs) are just two different sets of symbols, both of which denote the _same_ concept. For a=b to be a prediction, it has to be the case that a and b denote _different_ concepts, so that we are actually predicting something (namely that, on measuring a and b, those measurements will turn out to be the same). If a=b is the _definition_ of a, then a and b cannot be measured separately because they are just synonyms.

g=g_max would be a prediction: saying that, when measured, the measured value will be equal to the calculated one.
g = Psi(f_obs) would also be a prediction (the same one, because per definition g_max is just shorthand for Psi(f_obs) ).
g_max = Psi(f_obs) is not a prediction, it’s just a tautology (or a restatement of an earlier definition).

• November 4, 2013 Joseph

“If it’s a definition, it _cannot_ also be a prediction.”

We can derive PV=nRT mathematically, but it still might be wrong in the laboratory.

We can formally define the mapping Psi mathematically, but when we measure both f and g it’s possible that we see some g different from Psi(f_obs).

• November 4, 2013 Joseph

I think I have it now. This should clear it up:

g_max is defined as the value of g which maximizes W(g) or, alternatively, maximizes the entropy log W(g).

g_max = Psi(f_obs) is the definition of Psi.

“when we measure both f and g it’s possible that we see some g different from g_max”. Exactly. We don’t necessarily expect to see g=g_max, but when we do see it we call it a law.
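The distinction being agreed on here can be written out explicitly (my rendering, using the thread’s symbols):

```latex
% Definition (a tautology; nothing to test):
g_{\max} \;:=\; \operatorname*{arg\,max}_{g} W(g) \;=\; \Psi(f_{\mathrm{obs}})

% Prediction (an empirical claim that can fail in the laboratory):
g_{\text{measured}} \;=\; g_{\max}
```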

• November 6, 2013 Joseph

Konrad, next time I’m in Cali we got to get together.

Maybe this will make it more concrete. Given f_obs and g, we could define the multiplicity of x’s associated with both simultaneously: W(f_obs, g) = the number of x with f(x) = f_obs and g(x) = g.

For a fixed f_obs there will in general be multiple possible values of g. If we predict that we’ll see the g with the greatest multiplicity (i.e. highest entropy), and this prediction turns out to be true, it implies the observed values are solutions to the following equation (under some conditions):

g_obs = argmax_g W(f_obs, g)    (1)

This gives us an explicit equation relating f_obs and g_obs. Equation (1) is the “law” I was talking about.

This usage exactly corresponds to common historical examples like “the Ideal Gas Law is pV=nRT”.
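The counting procedure above is simple enough to spell out in code. A toy sketch (the state space and the functions f and g are hypothetical choices for illustration, not anything from the post): enumerate a finite set of x’s, tabulate the multiplicity W(f_obs, g), and take the g with the largest count as the prediction Psi(f_obs).

```python
from collections import Counter
from itertools import product

# Toy microstates: x is a tuple of 10 binary coordinates (2^10 = 1024 states).
states = list(product([0, 1], repeat=10))

def f(x):  # the observed macrovariable (hypothetical choice)
    return sum(x) % 3

def g(x):  # the macrovariable we want to predict (hypothetical choice)
    return sum(x)

f_obs = 1
V = [x for x in states if f(x) == f_obs]  # states compatible with the observation
W = Counter(g(x) for x in V)              # multiplicity W(f_obs, g)

g_max = max(W, key=W.get)                 # the prediction Psi(f_obs)
print(g_max, W[g_max], len(V))            # prints: 4 210 341
```

Here the modal g accounts for only about 62% of V, so g = Psi(f) would not look like a law in practice; the stable laws discussed in the post arise when, as in the statistical-mechanics case, the modal class swamps everything else.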

Ok, back to the rest of the discussion (sorry for my slow rate of posting, I’m struggling to make time for this).

“But the usual Statistical modeling would go like this: first invoke a probability distribution, which in practice we just pull out of thin air, but in this case just happens to be [P(g) proportional to W(g)]. Conceptually, we’re thinking about this as modeling some “data generation mechanism” whatever the hell that’s supposed to mean.”

Here you are using the word “probability” and the notation “P(g)” to refer to propensities rather than probabilities. You are either conflating probabilities and propensities, or accepting the frequentist definition while neglecting to highlight the distinction. You seem to do this quite often: when other people refer to probability you assume they mean frequency or propensity. But the conceptual difference between probability and propensity is huge, and very relevant for this discussion.

We can distinguish three conceptual approaches:
1) probability but no propensity: this is what you are advocating here
2) probability and propensity: this is what I am advocating; I claim it represents a mainstream Bayesian approach
3) propensity but no probability: this is the frequentist approach

My aim has been to argue that approach 2 is useful, and should not be lumped together with approach 3. Your criticism e.g. in “Mother Nature makes fools of Statisticians” seems mostly directed against a naive version of approach 3 (one where point estimates of propensities are regarded as truth even when they are not well established).

• November 8, 2013 Joseph

That’s a nice clarification.
Yes I think the propensity stuff is bogus. The examples where it seems to work in statistics are actually relying on a counting argument no different fundamentally from the one in this post (essentially the entropy concentration theorem or a generalization thereof). So regardless of whatever philosophy you want to adopt, it is simply a technical fact that “propensity” isn’t needed to explain any of those effects where people think it’s needed. Just like you don’t need an ergodic theorem to explain the results in this post. A simple counting argument already accounts for them and, more importantly for applications, exploits that phenomenon better.

So here’s my position:

When you combine Statistics with other fields like physics, biology, economics or whatever, Statistics brings no physical knowledge or assumptions to the table. All it does is count possibilities and then make predictions/inferences which will be true almost no matter which one of those specific possibilities is true.

This viewpoint simply and easily covers every application of statistics out there and leads to quite a few new ones. And moreover, it explains them without any of the conceptual problems of other approaches.

A strong claim. You are challenging me to come up with an application where propensity is useful – this is easily done by just re-casting your own coin-tossing scenario.

Suppose the coin-tossing metaphor represents infection by a newly discovered virus: each toss corresponds to a newly infected patient, and the binary outcome is whether the patient lives or dies. The virus is quite virulent – we have 10,000 newly infected patients, but so far we have only observed the outcomes of 100 cases. We want to predict what will happen to the 10,000 new cases (and more generally, whether this will be a seriously threatening epidemic).

Now, it seems that your recommendation is to reject the (propensity-based) idea that some viruses are more dangerous than others. Rather, you want to ignore the outcomes of the 100 observed cases and give a prediction from the p=.5 binomial: the virus will kill 3,000-7,000 of the 10,000 patients over the next month – batten down the hatches, we need to go into emergency mode!

But that’s clearly ludicrous. Especially once you look at the data and discover that in the 100 observed cases there were no deaths (or 1 death, if you’re going to argue that the number 0 is a special case deserving different treatment). In statistics, we are typically interested in those cases where the data _are_ informative.

Ignore “over the next month” above – a remnant from editing the example.

• November 10, 2013 Joseph

Every time I use a simple example to bring out the logic, I get blasted for not solving some other problem.

Different problems have different solutions. I gave a solution for one very specific state of knowledge in order to illustrate a point. Nowhere did I say other states of knowledge were impossible.

Had you considered taking the point I was making and applying it yourself to the new problem?

It requires a little work (but not much) to see that Statistics adds nothing to subjects like physics, biology, economics other than the counting principle described in my last comment. Doing so clears away an enormous number of conceptual problems and significantly increases the opportunities for Statistics.