The Amelioration of Uncertainty

Bayes Theorem and the Forward Arrow of Time.

Shalizi wrote a paper purporting to show that Statistical Mechanics couldn’t be an application of Bayesian Statistics because Bayesian updating implies that entropy decreases. The confusion arises because empirical entropies $S_{em}$, computed from frequencies, are conflated with theoretical entropies $S_{th}$, computed from probabilities. This mix-up is natural for a Frequentist like Shalizi, but since it’s at the heart of every misunderstanding about Statistical Mechanics, it’s worth looking at an example so simple anyone can see what $S_{em}$ and $S_{th}$ mean and why they sometimes move in different directions.

A “state” will be a point in the space $\{0,1\}^{100}$ denoted by $x_t$. Initially the system starts at the zero vector $x_0 = (0, \dots, 0)$ and evolves in time according to an equation of motion $x_{t+1} = F(x_t)$. Since the transition itself isn’t of interest, we’ll simulate this by randomly flipping each coordinate some percentage of the time. The empirical entropy is

$$S_{em} = -f_1 \ln f_1 - (1 - f_1) \ln(1 - f_1)$$

where $f_1$ is the fraction of 1’s in the state $x_t$. Note $S_{em}$ is a “physical” fact that can be “measured” from the state $x_t$.
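As a minimal sketch of how the empirical entropy can be “measured” from a state, the R function below computes it from a 0/1 vector. The name `empirical_entropy` and the convention that 0·log(0) counts as 0 are assumptions added for illustration; they are not part of the original code.

```r
# Empirical entropy of a binary state vector, computed from the fraction of 1's.
# Assumed convention: terms with zero frequency contribute 0 (0 * log 0 := 0).
empirical_entropy <- function(x) {
  f1 <- mean(x)            # fraction of coordinates equal to 1
  p <- c(f1, 1 - f1)       # frequencies of 1's and 0's
  p <- p[p > 0]            # drop zero-frequency terms
  -sum(p * log(p))
}

s_ordered <- empirical_entropy(rep(0, 100))       # the zero vector
s_jumbled <- empirical_entropy(rep(c(0, 1), 50))  # half the coordinates are 1
```

The zero vector gives the minimum value 0, while a state with half its coordinates equal to 1 gives the maximum value log(2).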

As time goes on, we can imagine researchers learning the “equations of motion” and using this knowledge to create distributions $P_t(x)$ which shrink more and more around the true current state $x_t$. The high probability manifold of $P_t(x)$ describes where $x_t$ resides in exactly the same way that errors were “located” by a distribution in this post. Whether this “learning” occurs via Bayes’ theorem or not is immaterial; it’s sufficient that it happens. So consider the following distribution which “knows” the true state better and better as time goes on:


As time evolves, any predictions about $x_t$ made using the distribution $P_t(x)$ will be increasingly accurate. In particular, they will give increasingly precise interval estimates for the physical quantities $f_1$ or $S_{em}$. Using the R code given below we can see both entropies in action:


The upward slope of $S_{em}$ is the arrow of time, while the downward slope of $S_{th}$ is our increasing information.

There’s no mystery here: the physical state is becoming more jumbled while our knowledge about that state is becoming more precise. One has nothing to do with the other. The same thing happens in coin tossing. If we measure the initial conditions of each flip we can predict the outcome. Yet if we did this a hundred times, we’d still likely get roughly 50% heads because almost every possibility has that result. Those two facts aren’t in contradiction. Statistics doesn’t contravene Physics, at least not for Bayesians.
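The claim that “almost every possibility has that result” can be checked by direct counting. The sketch below counts, among all 2^100 possible heads/tails sequences, the share whose fraction of heads lies between 40% and 60%. The 40–60% window and the variable names are choices made here for illustration.

```r
n <- 100
# Number of length-n sequences with k heads is choose(n, k); there are 2^n sequences in total.
# Share of all sequences whose heads fraction lies between 40% and 60%:
frac_near_half <- sum(choose(n, 40:60)) / 2^n
print(frac_near_half)
```

This is a statement about counts of possibilities, not about the frequency of anything, which is why it holds even when each individual flip is perfectly predictable.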

UPDATE: Just to emphasize – a probability distribution with a decreasing entropy $S_{th}$ is used to accurately predict an observable entropy $S_{em}$ which is increasing. If you find this simple mathematical fact impossible to digest, maybe it’s time to rethink your philosophy of statistics.

APPENDIX: Since the equation of motion has a stochastic element to it, the red curve will look slightly different each time the code is run.


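x = rep(0, 100)   # initial state: the zero vector
y = 2             # coordinates are binary, so a flip is addition mod 2
sem = c()         # empirical entropy S_em at each time step
sth = c()         # theoretical entropy per coordinate, S_th/100
h = function(p) ifelse(p > 0 & p < 1, -p*log(p) - (1-p)*log(1-p), 0)  # binary entropy, 0*log(0) taken as 0
for (i in 1:50) {
  x = (x + rbinom(100, 1, .02)) %% y   # flip each coordinate with probability 0.02
  f1 = mean(x)                         # fraction of 1's in the current state
  sem = c(sem, h(f1))
  # ASSUMPTION: the original definition of sth did not survive. This stand-in uses a
  # distribution that gets each coordinate wrong with probability 0.5/i.
  sth = c(sth, h(0.5/i))
}

plot(sth, type="o", pch=21, col="blue", ann=FALSE, ylim=c(0,log(2)), xlim=c(0,50), yaxs="i", xaxs="i")
lines(sem, type="o", pch=22, lty=2, col="red")

title(main="A Tale of Two Entropies", xlab="Time", ylab="Entropy")
legend(40, log(2)/2, c("S_em","S_th/100"), cex=0.8, col=c("red","blue"), pch=c(22,21), lty=c(2,1))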

August 29, 2013
  • August 30, 2013 Daniel Lakeland

    Strangely though, I think your example shows that Shalizi is right. We obviously can’t identify physical entropy with the entropy of a Bayesian probability distribution describing our knowledge of the world. Such an identification would imply that obtaining knowledge about the state of the world would reduce the entropy of the world. Clearly, we need to distinguish between “the size of the phase space in which the current state is thought to be” and “the size of the phase space that could energetically have been reached during the passage of time”.

    It is conceivable, when you include the energy cost of obtaining information and its own tendency to widen the reachable phase space of the universe, that you could obtain a result where the entropy of the universe and the Bayesian probability entropy *might* be identifiable. Then, zeroing in on the phase space vector of a system would naturally cost energy and increase the entropy of the rest of the universe (i.e. the measurement instruments, your brain, the atmosphere of your laboratory, etc.).

    I know there have been some attempts to discuss this in physics literature in the past, but I haven’t really followed those attempts.

  • August 30, 2013 Joseph

    Actually, I didn’t think Shalizi was wrong so much as incomplete. An incompleteness he’s not going to be able to fill because he equates prob=freq.

    Sometimes in stat mech S_em=S_th and sometimes it doesn’t. There are states of information in which the identification makes sense. A great deal depends on what you’re doing and why. In general, having things physically jumbled can interfere with our ability to know things. Sometimes claims in stat mech which appear to be about P are actually statements about F (and vice versa).

    One thing I’m sure of though is that if you interpret the typical probabilities in Statistical Mechanics as frequencies, either over an ensemble of states in phase space, or occupational frequency over time, there is absolutely no way to verify them. In either case it would take unimaginably longer than the universe will exist to do so. So much for frequentist objectivity.

  • August 30, 2013 Joseph

    Also, Daniel if you’ve ever read Shalizi’s review (which I love) of Stephen Wolfram’s book (which I hated) he talks about Jaynes’s interpretation of Statistical Mechanics. It’s clear that Shalizi took the results in his paper as decisively refuting Stat Mech as an application of Bayesian Statistics.

    Jaynes’s view of Stat Mech wasn’t just an interpretation, by the way. He and his students applied it extensively to non-equilibrium statistical mechanics. See here:

    This phenomenal work I fear is being lost. The results are heavy on the Quantum Statistical Mechanics, making it absolutely inaccessible to anyone but physicists. They on the other hand usually aren’t going to get the Bayesian aspect of it at all. I own the book and after studying it for a while, I’m convinced it only partially gets all the results along those lines. There’s great research still to be done.

    The existence of all those results should have given Shalizi pause. The work is full of probability distributions being updated with new information and applied to non-equilibrium (i.e. entropy increasing) situations. The fact that the die-hard Bayesians who did the research never encountered a contradiction with their Bayesianism speaks volumes.

  • August 30, 2013 Daniel Lakeland

    Right, I agree that Shalizi thinks of his paper as decisive refutation (and I also loved his review and have no interest in anything Wolfram says). I am more intrigued about the ideas you’re exploring here, the separation of frequency and probability, and some kind of principles of how to use the two in the context of statmech.

    I’m fairly ignorant of QM, I mean much less so than a typical engineer, but much more so than a typical physics grad student. Also, for the most part I have a problem with the foundations of QM. I have no problem with the fact that QM works and is unintuitive but as I pointed out in a post recently: I feel like there are a lot of foundational issues that physicists have closed-off avenues to and that they tend to regroup around “axiomatic” teachings of QM to ignore these foundational issues.

  • August 31, 2013 Cosma Shalizi

    Joseph, thanks for engaging with something I wrote in 2004, and was not, in retrospect, some of my clearest writing. Unsurprisingly, I agree with Daniel — what you’re showing is that a Boltzmann entropy (your S_em) increases, as it should, while the entropy of the posterior distribution over states (your S_th) does not. I completely agree that these are mathematically compatible; that’s what I was trying to show more generally. But the whole point of the information-theoretic approach to stat. mech. was to _identify_ those entropies.
    Go read Jaynes’s 1963 derivation of the 2nd law — I paraphrase it fairly closely in my paper, section II B — and see how you’d reconcile it with your calculations here.

    As for why MaxEnt often works in statistical mechanics, and why max-ent-ish things can be useful for non-equilibrium processes, that’s a different story.

    I should add that I don’t think my paper was a “decisive refutation”.

  • September 1, 2013 Joseph


    Increasing Entropy has two completely separate meanings in statistical mechanics which are impossible to separate without breaking the link between frequencies and probabilities.

    One is a measure of disorder, which $S_{em}$ typifies above. The other is a Bayesian version. Basically start with two phase volumes (i.e. high probability manifolds of some Bayesian probability distribution) $W_1$ and $W_2$ such that we can associate two theoretical entropies $S_1 = \ln |W_1|$ and $S_2 = \ln |W_2|$.

    If $W_1$ is a subset of $W_2$ we can then ask the question: what is the (Bayesian) probability that a state $x$ bouncing around this volume will accidentally wind up in $W_1$? This will be given by

    (1)   $P(x \in W_1) = |W_1| / |W_2| = e^{S_1 - S_2}$

    If $S_1 \ll S_2$ then this probability is incredibly small. So small we’d basically never see it happen. This result is summarized by saying the system doesn’t go from a high entropy state to a low entropy state, but they’re talking about theoretical entropies like $S_{th}$ above. Of course “disorder” represents a special case of the above type of reasoning. Basically the number of “disordered” states is typically far greater than the number of “ordered” states (only typically though; you can get negative temperatures in real physical systems and so on). But the above “Bayesian” version is far more general.
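    As a rough numerical illustration of how small such a probability gets, take the 100-coordinate toy system from the post. The choice of the larger volume as all 2^100 states, the smaller volume as the “ordered” states with at most 10 ones, and the names `log_W1` and `log_W2` are assumptions made here, not something from the post.

```r
n <- 100
log_W2 <- n * log(2)                  # theoretical entropy of the full volume: all 2^n states
log_W1 <- log(sum(choose(n, 0:10)))   # theoretical entropy of the sub-volume: at most 10 ones

# Probability that a state spread uniformly over the larger volume lies in the smaller one:
# ratio of the two volumes, i.e. exp of the entropy difference.
p_ordered <- exp(log_W1 - log_W2)
print(p_ordered)
```

    Even with only 100 coordinates the ratio is far too small to ever observe by chance; with the particle numbers typical of statistical mechanics the exponent is astronomically larger.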

    The entropy of an ideal gas is theoretical entropy. It’s derived by theoretically calculating the phase volume equation. This in a sense can be “measured”. We don’t actually have “entropy scales” like we do for weight since the actual microstate is only one point of the phase volume. The entropy of the ideal gas is theoretically constructed from temperature and pressure readings in a way that in principle wouldn’t work for every state of the gas (or wouldn’t work if our model is wrong), but in practice does for reasons given by the Entropy Concentration Theorem (i.e. most states are “typical” states in some sense). Also note it isn’t unique. If you use different macroscopic variables you’ll get an entirely different entropy (phase volume) out of that theoretical construction for the same actual microstate. Indeed if you use 6n distinct macroscopic variables where equation you’ll define the microstate completely and the theoretical entropy will be equation.

    Also note that our state of information about the microstate is that equation. That is what we actually know and measure about it. In no way, shape, or form do we know, or have reason to believe that such microstates occur equally often if we repeated this equation times or that they have a given occupational frequency if we observed it for equation lifetimes. The distributions Gibbs taught us to use put their high probability manifold over that volume equation, which is why they work. Not because of any impossible to verify, meaningless frequencies.

    But there is far, far more to the story than this. Oftentimes instead of learning, we are losing information about the state of the system, and so you’ll see the theoretical entropy increase. Oftentimes we actually aren’t losing information, but choose to ignore some of it because it leads to easier calculations which still get the answer we want. The theoretical entropy can increase in this case too.

    The example in the above post is a good candidate for that. If all you wanted to do was predict $f_1$ after 50 time steps then the distribution $P_t(x)$ in the post is complete overkill. You could use a far more diffuse distribution (higher $S_{th}$) and still get a good answer. Indeed if Statistical Mechanics didn’t have this “informational flexibility” we wouldn’t have much use for it.
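    One way to sketch this point is to predict the fraction of 1’s after 50 steps using nothing but the flip probability, ignoring the actual state entirely. A coordinate that starts at 0 and flips with probability p at each step equals 1 after t steps with probability (1 - (1 - 2p)^t)/2. The independent-coordinate distribution used below is an assumed stand-in for “a far more diffuse distribution”, and the variable names and seed are illustrative.

```r
set.seed(1)                      # arbitrary seed, only so the run is repeatable
p_flip <- 0.02; n <- 100; steps <- 50

# Diffuse prediction: each coordinate is 1 with probability q, independently.
q <- (1 - (1 - 2 * p_flip)^steps) / 2
pred_sd <- sqrt(q * (1 - q) / n)   # predicted spread of the fraction of 1's

# One simulated run of the same flipping dynamics used in the post.
x <- rep(0, n)
for (step in 1:steps) x <- (x + rbinom(n, 1, p_flip)) %% 2
f1 <- mean(x)

cat("predicted:", q, "+/-", 2 * pred_sd, " simulated:", f1, "\n")
```

    This distribution has nearly the maximum possible entropy per coordinate, yet its interval estimate for the fraction of 1’s is only a few percentage points wide on either side.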

    None of this can be sorted out unless you clearly separate probabilities from frequencies, which is why after 100 years non-equilibrium statistical mechanics basically hasn’t budged. It’s still stuck with a few near-equilibrium cases and some limited examples where we can intuit the right answer/assumptions. Most physicists don’t study non-equilibrium stat mech and merely learn a slight elaboration of the statmech/thermodynamics already given by Gibbs over a hundred years ago. The subject is considered dead by most physicists.

    If however you want to see some non-equilibrium statistical mechanics that goes considerably beyond that, check out that Grandy book referenced above and the actual results they got. He was one of Jaynes’s students though, and those equations/results aren’t going to make sense unless you understand them in Bayesian terms and not as frequencies of anything. They are in fact the direct generalization of his Bayesian interpretation of statistical mechanics to the non-equilibrium case. So you might even say the Bayesian interpretation is “not only defended, but also applied”.

  • September 1, 2013 Cosma Shalizi

    Non-equilibrium statistical mechanics is very far from being a dead or stuck subject. Much of the most active work in the field rests on large deviations principles, for which relative entropy / Kullback-Leibler divergence is indeed important, but it shows up on the basis of purely probabilistic (sometimes combinatorial) arguments which don’t apply to all stochastic processes, not from Jaynesian claims about the logic of inductive inference. I recommend Touchette and Harris’s pedagogical review of the large deviations approach to non-equilibrium statistical mechanics (arxiv:1110.5216) if you’re interested. But the flat assertion that non-equilibrium statistical mechanics is “considered dead by most physicists”, or that the only people pushing it forward are the Jaynesians, is very hard to reconcile with the contents of a typical issue of Journal of Statistical Physics or Physical Review E.

    Now, I am afraid I cannot agree with you that the Boltzmann entropy log W is “Bayesian”, because it is incoherent for a Bayesian agent not to condition on all available information, which very much includes not only the current macroscopic state, but also the whole history of macroscopic states known to the agent. This means that the Bayesian agent’s posterior distribution over microscopic states is not a function of the thermodynamic state. Of course the thermodynamic entropy (defined via dS = (1/T) dQ) is a state-function, and so is the Boltzmann entropy, and for equilibrium systems both are equal to the Shannon entropy of the corresponding Gibbs distribution (at least in the thermodynamic limit…), and typically keeping track of the history by conditioning serves no useful purpose in equilibrium. But that sort of ignoring-of-evidence is incoherent from a Bayesian viewpoint, rather than being properly justified from Bayesian inference. One might still want to defend an epistemic interpretation of probability here, as reflecting rational ignorance, but this would be a non-Bayesian epistemic probability.

    (I’m not going to try to make an argument from authority, but I did do my Ph.D. thesis on self-organization in far-from-equilibrium systems, and don’t need lecturing on the rudiments of statistical mechanics.)

  • September 1, 2013 Joseph


    I didn’t say there weren’t lots of papers published in non-equilibrium stat mech, or that there hadn’t been any far-from-equilibrium success. Look at my words again. You were lucky you got to do a thesis on the topic, since my physics professors just snickered when I suggested doing the same.

    The second paragraph gets to something fundamental though. Maybe we should call my position Jaynesian rather than Bayesian, because neither he nor I have the slightest conceptual problem with conditioning on less information than we have available. Since “Bayesianism” really refers to interpreting probability distributions as something other than frequencies, and not some philosophical viewpoint held by Savage or de Finetti or whoever you’re referring to by the words “coherent Bayesian”, the change in terminology should alleviate some confusion.

    Certainly we can condition on the entire past, or those parts that are known. There are many examples of that in Grandy’s book. But if it’s convenient we can ditch some of this information regardless of whether the system is in equilibrium or not. Basically if a system is in a region equation then we are free to use any other superset equation since equation as well. Averaging over equation may be mathematically much more tractable and have no appreciable effects on the parts we care about. For example, if I use equation I may get that some volume equation whereas if I use equation I may get that equation which has greater uncertainty, but is still going to be useful in many cases.

    But let’s say I observe equation which I learn from measuring some function of the true microstate equation and put a uniform distribution on equation. Then I use this distribution to calculate equation. What would have to be true for this to be a valid Frequentist calculation? We’d have to interpret that uniform distribution as a frequency of some kind. But however you interpret it, it would take vastly longer than the history of the universe to verify as a frequency distribution. Let’s say we could somehow magically do this. Do we have any reason to believe that the equation’s under repeated preparation of the same system, or equation under time evolution really has a uniform frequency? Basically none. Certainly having just measured equation gives us no reason to suppose that’s true.

    So under a frequency interpretation, the whole thing is a bust. But if we look at that equation as being something like a Bayesian prior probability for a parameter, whose function is merely to describe where (through its high probability manifold) in state space equation resides, then that calculation looks very different. It basically says that almost every possibility for equation compatible with our knowledge equation leads to equation. Since equation is one of those possibilities, our best guess is that it’s one of that vast majority that make the volume close to equation. Since the counts are so extreme in statmech, we’d basically never observe a violation unless the system was being forced away from equation by either us or Mother Nature.

    So we’re in business. Even if the “true” frequency distribution differs incredibly from the uniform distribution we’re still in business. The actual frequency distribution could be equation for example, because the system always goes into the same microstate under repeated preparation. That distribution is vastly different from the uniform distribution, yet the conclusion that the volume is close to equation is still perfectly good!

    The number .001 doesn’t describe for this example the variability we’d see if we could repeat the whole thing equation times, but it does accurately describe our uncertainty given that we know the measured value of equation and nothing else.

    The frequency properties of those distributions in statistical mechanics aren’t known and aren’t needed. Whenever we actually do care about frequencies, such as $f_1$ in the post, we can take care of them using probability distributions which aren’t equal to the frequency of anything – just as I did in the post.

    P.S. you’re not the only one who reads these comments, and not everyone is familiar with stat mech.

  • September 1, 2013 Daniel Lakeland

    Cosma, thanks for stopping by! It seems to me that you can only identify the two entropies under very specific assumptions about the information available to the observer. The Micro-Canonical Ensemble is a good example. It’s impossible to observe a Micro-Canonical ensemble because it’s explicitly forbidden to interact with it by definition. If you can’t interact with it, you can’t know anything about it except what’s implied by the laws of physics, such as that the energy will remain constant. The standard analysis of the micro-canonical ensemble uses essentially Laplace’s principle of indifference to say that every state with a given energy has equal probability of being the current state. Since there is no observing and information gathering, this remains the probability, and so the potential for divergence of Boltzmann and Bayesian entropies doesn’t enter.

    There are some additional things to point out. There’s a timescale involved. Suppose a Bayesian statmech person takes a snapshot of a system, and gets approximate measurements of the position and velocity of particles (we’ll have to use classical mechanics here for the moment). There are always measurement errors. After a very short time, the observer can calculate the new positions and velocities and not be too far off. But the Lyapunov exponent of the dynamics will determine how quickly the predictions and the actual dynamics diverge. In typical statmech conditions, such as a gas in a piston, the timescale for divergence is likely extremely short. So after any reasonable time interval our earlier detailed information is likely useless. This may explain why max-ent is so successful in many cases, our historic information is totally out of date on realistic timescales.
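    The timescale argument can be sketched with a toy chaotic system. The example below swaps in the logistic map at r = 4 (Lyapunov exponent ln 2) for the gas, purely as an illustration; the starting values and the 1e-10 “measurement error” are assumptions made here.

```r
# Toy stand-in for chaotic dynamics: the logistic map x -> 4 x (1 - x).
logistic <- function(x) 4 * x * (1 - x)

truth <- 0.3             # the "actual" initial condition
guess <- 0.3 + 1e-10     # the observer's measurement, off by a tiny error

gap <- numeric(60)       # distance between prediction and reality at each step
for (step in 1:60) {
  truth <- logistic(truth)
  guess <- logistic(guess)
  gap[step] <- abs(truth - guess)
}

print(signif(gap[c(1, 10, 20, 30, 40)], 3))
```

    The gap roughly doubles at each step, so an error of one part in ten billion becomes as large as the system itself within a few dozen steps, after which the detailed initial measurement no longer constrains the prediction.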

  • September 1, 2013 Joseph


    It really is odd that you got the “coherent Bayesian” thing from Jaynes, since he really didn’t hold any such position as the one you’re attributing to him, which even a casual read would reveal.

    In general you can have two different Bayesian distributions $P(x|K)$ and $P(x|K')$. If you denote both of them as $P(x)$ and use them in the same calculation, then you’re being “incoherent”. But if both K and K’ are true then there’s nothing “incoherent” about using either one or the other consistently in a given calculation. They’re both still “coherent” with reality. In particular if K’ contains all the information in K plus some additional truths, then you can switch from $P(x|K')$ to $P(x|K)$ if it’s convenient.

  • September 2, 2013 Joseph

    Dr. Shalizi,

    I’m going to put out a post later this week explaining my position (Jaynes’s position actually) better and more clearly than I could in the comments above. I think I can make it so clear that at least there won’t be any misunderstandings, even if there are still disagreements.

    Basically, you can have the second law, decreasing informational entropy as in the post above, and increasing disorder, all without being in contradiction. You can have all this easily and much more but you have to drop the frequentist interpretation of the probability distributions and understand statmech as statistical inference.

    This is not a minor technicality either. As I hope to make clear in the post, it really blows the doors open in statmech (and some other fields too). It drastically changes the nature of the subject in a very practical way. So beyond being one of my favorite subjects, there’s a real point to getting it stated as clearly as possible.
