## Noninformative Priors

The esteemed Dr. Wasserman claimed “This is a general problem with noninformative priors. If $\pi(\theta)$ is somehow noninformative for $\theta$, it may still be highly informative for sub-parameters, that is for functions $\psi = g(\theta)$ where $\theta \in \mathbb{R}^d$ and $\psi \in \mathbb{R}$.”

Not only is it not a problem, but it’s the key to Statistics and fundamental to the philosophy of science.

Consider flipping a coin 500 times and predicting the percentage of heads $f$. From the coin’s symmetry we know each of the $2^{500}$ sequences will have equal probability (propensity?) allowing us to model it as a random process. A uniform distribution on the space of sequences implies it’s likely that $f$ falls in a narrow interval around $.5$. The randomness assumption can be confirmed by flipping a coin and observing $f$ in this interval. An easy path from objective knowledge to secure inferences if ever there was one.

Unfortunately, every part of the last paragraph is nonsense.

Observing a fraction of heads $f \approx .5$ provides no evidence for “randomness”. Almost all sequences have the property $f \approx .5$ so this outcome is usually observed no matter what causes are active. Or if you like, $f \approx .5$ is likely to be observed regardless of the “true” distribution, even if it’s extraordinarily non-uniform. Something like this has to be true since $2^{500}$ is so large only a minuscule fraction of the sequences will ever occur. The next sequence observed is thus being drawn from a tiny subspace of all possibilities and yet in practice we do observe $f \approx .5$.
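The claim can be poked at numerically. What follows is a minimal sketch, not the post's own calculation: it builds a distribution that puts all its mass on 200 haphazardly chosen sequences out of $2^{500}$, with very unequal weights, and asks how much of that mass lands near $f = .5$. The interval $[.4, .6]$, the support size, the weighting scheme, and the function names are all illustrative assumptions. A distribution deliberately concentrated on atypical sequences (all heads, say) would of course behave differently, which is the "almost" in "almost any".

```python
import random

N_FLIPS = 500

def fraction_heads(seq):
    """Fraction of 1s (heads) in a 0/1 sequence."""
    return sum(seq) / len(seq)

def make_lopsided_distribution(n_support=200, seed=0):
    """Build a wildly non-uniform distribution on the 2**500 sequences.

    All probability mass sits on `n_support` sequences, and the weights
    on those sequences are themselves very unequal (assumed scheme).
    """
    rng = random.Random(seed)
    support = [[rng.randint(0, 1) for _ in range(N_FLIPS)] for _ in range(n_support)]
    raw = [rng.random() ** 8 for _ in range(n_support)]  # heavily skewed weights
    total = sum(raw)
    weights = [w / total for w in raw]
    return support, weights

def mass_near_half(support, weights, lo=0.4, hi=0.6):
    """Total probability assigned to sequences with lo <= f <= hi."""
    return sum(w for s, w in zip(support, weights) if lo <= fraction_heads(s) <= hi)

support, weights = make_lopsided_distribution()
print(mass_near_half(support, weights))
```

The distribution here is about as far from uniform as one can get in terms of how many sequences it allows, yet observing $f$ near $.5$ would not distinguish it from the uniform one.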

Moreover, the outcome of a flip depends far more on the coin’s initial conditions than on its Inertia Tensor. The “equal propensity” assumption is not a physical property of the coin, but rather a strong assumption about initial conditions. Statisticians have no clue when the assumption might hold, especially since they never measure Moments of Inertia or initial conditions and mostly wouldn’t know what to do with the information if they had it.

The uniform assumption is thus based on nothing. It’s a noninformative prior of exactly the kind Wasserman (and seemingly Gelman) claim are a lost cause.

It’s amazing reliable predictions can come from a noninformative prior. The mystery is explained by examining the thing predicted. If $x$ is a sequence of 0s and 1s representing the outcomes of the 500 flips, then the mapping $x \mapsto f(x)$ to the fraction of heads is highly non one-to-one (both in a strict and an approximate sense). The effort succeeds because we’re estimating a function $f$ which is largely insensitive to $x$ and so ignorance about $x$ isn’t much of a hindrance in predicting $f$!
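How non one-to-one the mapping is can be counted exactly. The sketch below is an illustration with assumed interval endpoints and function names: it counts how many of the $2^{500}$ sequences share a given head count and what fraction of all sequences fall in a band around $f = .5$.

```python
from math import comb

N_FLIPS = 500

def preimage_size(k, n=N_FLIPS):
    """Number of 0/1 sequences of length n with exactly k heads (f = k/n)."""
    return comb(n, k)

def fraction_of_sequences(k_lo, k_hi, n=N_FLIPS):
    """Exact fraction of all 2**n sequences with head count in [k_lo, k_hi]."""
    return sum(comb(n, k) for k in range(k_lo, k_hi + 1)) / 2**n

print(f"sequences with f = .5 exactly: {preimage_size(250):.3e}")
print(f"fraction with .45 <= f <= .55: {fraction_of_sequences(225, 275):.4f}")
print(f"fraction with .40 <= f <= .60: {fraction_of_sequences(200, 300):.6f}")
```

An astronomically large number of distinct sequences map to the same value of $f$, which is the strict sense of "non one-to-one"; the band fractions capture the approximate sense.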

Everything in statistics is like this. The outcome of the 2012 Presidential election is a highly non one-to-one mapping from the space of possible votes into the set of possible winners. In Statistical Mechanics, the Energy is a highly non one-to-one mapping from the $6N$-dimensional Phase Space into $\mathbb{R}$. When you average data to estimate a mean $\mu$, you’re implicitly using the non one-to-one mapping $(x_1, \ldots, x_n) \mapsto \frac{1}{n}\sum_i x_i$. Statistics is a one trick pony and this is its one trick.

But there’s more. That you can be ignorant about one space, but well informed about a function of that space, allows for a kind of “separation” or “disconnect” between domains. For coins the “separation” allows us to predict the fraction of heads $f$ without knowing Euler’s Equations for Rigid Body motion. In Statistical Mechanics it allowed Physicists to derive […] before they knew anything about Quantum Mechanics. Indeed without this disconnect, we wouldn’t survive long. It’s this “separation” that allows us to drive cars safely while ignorant of almost everything causally affecting us. In a similar way, we can do Physics without first knowing everything about Biology and vice versa. That we have any separate successful branches of Science at all is a happy consequence of Wasserman’s “general problem”.

July 21, 2013, george

Can you please supply some references for your arguments?

In particular, who in your “nonsense” paragraph actually, in print, claims that “the randomness assumption can be confirmed” in the way you describe?

Also, can you give details on what assumptions, if any, you are making? Coin-flipping examples generally assume independence of each flip, which has implications for the possible distributions of the sequences of coin flips.

July 22, 2013, Joseph (author)

George,

Statistics isn’t Theology. You’re allowed to check the numbers yourself. For example, do a simple estimate of how long it would take to actually generate every sequence even one time. If you’re not familiar with the Physics then please see Jaynes’ “Chapter 10 The Physics of ‘Random Experiments’”. Or if you’d rather hear it from a Statistician see Gelman’s “You can load a die, but you can’t bias a coin”.
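One way to do that estimate, as a back-of-the-envelope sketch: the generation rate of a billion sequences per second and the rough age of the universe are assumed figures chosen only to set the scale.

```python
N_SEQUENCES = 2 ** 500             # number of distinct 500-flip sequences
RATE_PER_SECOND = 10 ** 9          # assumed: one billion sequences generated per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10    # rough figure, order of magnitude only

years_needed = N_SEQUENCES / RATE_PER_SECOND / SECONDS_PER_YEAR
print(f"{years_needed:.2e} years, "
      f"or {years_needed / AGE_OF_UNIVERSE_YEARS:.2e} ages of the universe")
```

Changing the assumed rate by many orders of magnitude makes no practical difference to the conclusion.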

In regards to your second paragraph, I’ll cite every introductory statistics textbook and almost every statistician I’ve ever run across. Frequentists love to brag about how they’re objectively verifying their assumptions this way. There’s no hope of checking the uniform distribution on the $2^{500}$ sequences directly, so what they’ll do is imagine there is a p=prob(heads) which is implicitly assumed to be IID. Then they’ll run a test with $H_0: p = .5$ and fail to reject it because $f$ is close to .5. It’s different wording, but it’s entirely equivalent to what I described.
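For concreteness, here is a minimal sketch of the kind of test being described: an exact two-sided binomial test of $H_0: p = .5$. The function name and the choice of a strictly alternating sequence as input are illustrative assumptions. The test only looks at the head count, so a sequence with obvious structure sails through; a runs test would flag this particular sequence, but that is a different test of a different property.

```python
from math import comb

def two_sided_binomial_p_value(heads, n, p0=0.5):
    """Exact two-sided p-value for H_0: p = p0, given `heads` in `n` flips.

    Sums the probabilities of all outcomes no more likely than the observed one.
    """
    def pmf(k):
        return comb(n, k) * p0**k * (1 - p0)**(n - k)

    observed = pmf(heads)
    tail = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)

alternating = [i % 2 for i in range(500)]  # 0,1,0,1,... clearly structured
heads = sum(alternating)
print(heads / 500, two_sided_binomial_p_value(heads, 500))
```

Failing to reject here says nothing about whether the $2^{500}$ sequences are equally probable; it only says the head count is unremarkable.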

In regards to your last paragraph, assuming a uniform distribution on a sequence implies “independence” but I don’t like to say it that way because it has the wrong connotations. Coin flips are never “causally independent” for example. You could allow for something more general, such as a distribution on the $2^{500}$ sequences which shrinks around the true sequence that appears in the next 500 flips, getting ever closer to a point mass on that sequence.

Unfortunately, although this distribution would be highly useful and accurate for predicting properties of the next 500 coin flips, it also has the failing that every marginal distribution P(heads on ith flip) gets closer to 0 or 1. Thus the most useful distribution violates the equivalence P(heads)=“frequency of heads” in extreme ways and that is just the sort of thing people have trouble wrapping their heads around.

July 22, 2013, Daniel Lakeland

George, you might say that Entsophy is a “realist” when it comes to physics. I guess I am too, or at least we both are to some approximate extent.

The point is, when you do two coin flipping experiments, the first coin flip affects the molecules of the air, of the table it lands on, of the coin itself, etc. There is no way to do two coin flips in succession where the fact that the first coin flip was done has *NO* effect on the second.

Statisticians are happy to ignore this for the same reasons that Physicists are happy to ignore the exact conditions of all the molecules in a 2 liter bottle of soda. A few statistical quantities are usually enough to get what we care about: pressure, temperature, concentration of CO2, fraction of CO2 in gaseous form, etc. For a statistician, the fact that you’re doing the flip at a different time and that the coins didn’t hit each other in the air, didn’t have nearly identical statistics of their initial conditions (angular momentum, orientation, etc) is enough to make a statistician happy most of the time.

Although it’s technically the case that butterflies in Southeast Asia today affect the weather next year, we model that kind of effect as randomness. Randomness is not a thing though; you can’t order randomness in a can from Amazon.com. It’s a model for a lot of stuff that happens that we don’t care about.

However, if you look at many many frequentist statistical tests and methods, there are caveats about the need for “iid normal errors” or “homoskedasticity”, or the like. There are specific tests designed to test for normality or the like: Anderson-Darling, Chi-Squared, Shapiro-Wilk, Kolmogorov-Smirnov.

We need so many tests in part because different frequentist methods rely on different aspects of some underlying distribution. Perhaps some test doesn’t perform well if you don’t have small tails, a well known variance, symmetry of the distribution of errors, whatever. It really is true that there are tons of statistical textbooks that discuss these “violations of assumptions” and tests designed to detect them.

The main assumptions that Entsophy is making in the coin flipping example are essentially Newtonian mechanics and a coin flip in which the initial conditions have a lot of angular momentum whose vector is oriented nearly in the plane of the coin (so that the side that is “up” is changing rapidly), where the heights at which they’re let go vary by at least a few centimeters from flip to flip (so that the flight time is spread around in a region rather than constantly exactly the same), and where the coin bounces off a hard surface and comes to rest on its own (so that you’re not catching it when a particular side is up on purpose for example). There’s no assumption on the nature of the randomness, because there is no real randomness. In effect, those kinds of initial conditions will produce sequences that are indistinguishable from the predictions made by iid Bernoulli trials due to the fact that the space of trajectories is so vastly larger than the space of Bernoulli sequences (which is already enormous!) that if the trajectories are spread over even a small region of the trajectory space the outcomes will be relatively uniformly spread over the outcome space.
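A toy version of this can be simulated. The sketch below is a deliberately simplified model, not the setup described above: the coin is caught at launch height rather than bouncing, and the launch-speed and spin-rate ranges are assumed numbers. Each flip is fully deterministic given its initial conditions; the only variation is a modest spread in those conditions from flip to flip.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2

def lands_heads(v_up, omega):
    """Deterministic toy flip.

    The coin starts heads-up, is launched upward at v_up (m/s) while spinning
    at omega (rad/s) about an axis in its own plane, and is caught at launch
    height after time 2*v_up/G. Heads shows if the heads face points upward
    at that moment. No bounce is modeled.
    """
    flight_time = 2 * v_up / G
    return math.cos(omega * flight_time) > 0

def heads_frequency(n_flips=10_000, seed=1):
    """Fraction of heads when initial conditions vary modestly between flips."""
    rng = random.Random(seed)
    heads = 0
    for _ in range(n_flips):
        v_up = rng.uniform(2.0, 2.5)       # assumed spread in launch speed
        omega = rng.uniform(100.0, 300.0)  # assumed spread in spin rate
        heads += lands_heads(v_up, omega)
    return heads / n_flips

print(heads_frequency())
```

Because the coin turns over many times in flight, small changes in the initial conditions flip the outcome, and the heads frequency comes out near one half without any randomness in the mechanics itself.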

July 22, 2013, konrad

I don’t think every part of the “nonsense” paragraph is nonsense.

First, it is important to accept that coin-tossing is just a handy metaphor for any repeatable experiment with a binary outcome – where repeatable means that the _known_ part of the initial conditions is the same. The unknown part of the initial conditions is, of course, unknown – which is why we are using probability in the first place. This means that we are conditioning on the same information when calculating probabilities for different tosses, except that in later tosses we can also condition on the outcome of earlier tosses. The precise details of the experiment are not really of interest, but it is important to accept that when people talk about the propensity of a coin they mean the propensity of the coin-tossing experiment (which as Jaynes pointed out can easily be set up to have a consistent bias).

Define the propensity as the limit of the predictive probability after observing many tosses – after a large number of tosses we have essentially learned all we are going to learn for the purpose of predicting future outcomes, and the predictive probability converges. We need to accept that when frequentists use the word “probability” they are referring to the propensity.

It is true that propensities can be measured or confirmed experimentally. If you set up an experiment with a propensity of .3, you will not observe f close to .5, despite your point about the vast majority of potential outcomes having f close to .5.
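Under the IID model this comment has in mind, the claim is easy to quantify. The following is a small sketch with an assumed band of 225 to 275 heads out of 500 standing in for "f close to .5"; the function name is illustrative.

```python
from math import comb

def prob_f_in_range(p, n=500, k_lo=225, k_hi=275):
    """P(k_lo <= heads <= k_hi) for n IID flips with P(heads) = p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_lo, k_hi + 1))

print(prob_f_in_range(0.5))  # most of the probability mass
print(prob_f_in_range(0.3))  # vanishingly small
```

The calculation takes the IID assumption as given, which is exactly the assumption under dispute in the post, so it illustrates the comment's position rather than settling the argument.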

July 22, 2013, Joseph (author)

Konrad, with a few caveats and developments I’m fine with this. But that third paragraph is still total nonsense.

Specifically, the claim that a symmetric coin implies an equal probability, or propensity, for each sequence is complete nonsense.

The claim that observing $f \approx .5$, or equivalently failing to reject $H_0$, is evidence for the equal probability assumption, is complete nonsense for reasons given in the fifth paragraph.

The claim that observing $f \approx .5$ is evidence that the coin is symmetrical isn’t even true for reasons given by Jaynes and Gelman.

July 22, 2013, george

I don’t think we’re getting very far here.

> You’re allowed to check the numbers yourself

I never claimed otherwise. I asked for references (i.e. someone in print claiming what’s claimed above to be common) – and got none.

> so what they’ll do is imagine there is a p=prob(heads) which is implicitly assumed to be IID.

No. First, the assumption is explicit – and there is a huge literature on what happens when that assumption is violated, and to what extent methods can be adapted to deal with such situations. Second, it’s the events that are IID; p is a parameter, that is not random.

> It’s different wording, but it’s entirely equivalent to what I described.

No it’s not. You make no distinction between failing to reject a null hypothesis and accepting a null hypothesis. Again, there is a massive statistical literature on why these two differ, going back to Fisher.

Daniel: I don’t disagree about caveats, or that a ton of statistical texts discuss tests of assumptions. But for many tests/confidence intervals used in frequentist work, there are versions that relax these assumptions; often all we actually need is some form of independent sampling – see the theory of M estimation – and Central Limit Theorems provide all the rest, without the need to test assumptions. See also http://www.jstor.org/stable/3533623 for why testing assumptions can backfire.

July 22, 2013, konrad

I think it comes down to assigning the right interpretations to words – I do think it is possible to assign them in such a way that what frequentists say is not nonsense. And absolutely nobody thinks (once they bother to think about it, at least) that we are really talking about physical measurements of actual coins.

When people talk about coins, I always interpret it metaphorically – this is the only way to give them the benefit of the doubt (compare Jaynes’s writing on Bertrand’s paradox – we want to credit people with making sense, so we should read what they say from a perspective in which it makes sense). So “symmetric coin” should just be _defined_ to mean “experimental setup in which both outcomes appear with approximately equal frequency (or exactly equal, in applications where this makes sense)”, otherwise it is clearly nonsense. Anybody would agree that the way in which a coin is tossed dominates the physical measurements of the coin in terms of importance for predicting the outcome, and pointing this out does not refute the idea of repeatable experiments with measurable propensities.

This deals with your first objection: “(approximately) symmetric coin” means “(approximately) equal propensity” by _definition_, it is not a claim.

As for your 2nd objection: first note that observing $f \approx .5$ is not the same as failing to reject $H_0$. In practice, failing to reject $H_0$ means that the difference between the propensity and .5 is too small to reliably distinguish from zero given the available sample size. With small sample sizes (including $n=0$: I think you’ll agree that our claims should hold in the trivial case) we fail to reject $H_0$ regardless of the propensity – obviously this cannot be evidence in favour of symmetry.

On the other hand, observing $f \approx .5$ _is_ evidence in favour of symmetry (read, in favour of the propensity being close to .5), and becomes both stronger evidence of symmetry (we become more certain that the propensity is close to .5) and evidence of stronger symmetry (we can tighten our definition of “close”) as the sample size increases (here it matters whether we are talking about exact or approximate symmetry; in many/most applications exact symmetry can be ruled out a priori, which is why I’m talking about strength of symmetry). Similarly, observing $f \approx .3$ is evidence in favour of the propensity being close to .3.
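One way to make "stronger evidence as the sample size increases" concrete is a Beta posterior for the propensity. This is a sketch under assumptions the comment does not state: IID flips and a uniform prior on the propensity. The function name is illustrative.

```python
from math import sqrt

def posterior_summary(heads, n):
    """Mean and standard deviation of the Beta(1 + heads, 1 + n - heads)
    posterior for the propensity, which results from a uniform prior on [0, 1]."""
    a, b = 1 + heads, 1 + n - heads
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

for n in (0, 10, 100, 1000, 10000):
    mean, sd = posterior_summary(heads=n // 2, n=n)  # observed f = .5 at each size
    print(f"n={n:>6}  posterior mean={mean:.3f}  posterior sd={sd:.4f}")
```

With $n = 0$ the posterior is just the prior, matching the point about the trivial case, and the posterior spread then shrinks roughly like $1/\sqrt{n}$.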

July 22, 2013, Joseph (author)

George, if you’re dead set on interpreting what I wrote in the most asinine way possible then go right ahead. But, I don’t have the time to spell out the silliness of comments like “You make no distinction between failing to reject a null hypothesis and accepting a null hypothesis” to someone who clearly isn’t interested in doing the hard thinking.

Even though this is just a blog, there are a number of loosely stated but definite mathematical claims in the above post. If you can show one of them is in error, then I’m glad to listen, but otherwise “go pound sand” as we say in the Marines.

Oh, and do a google search on “biased coin hypothesis testing”. You’ll find about 150,000 references where people “test” whether a coin is “biased” by whether or not f is close to .5. I didn’t make it up.

July 22, 2013, Daniel Lakeland

George, “independent” sampling is a model that has no physical basis. Everything that will happen in the future physically depends on a vast number of things that happened in the past to *some* degree. So what you need is a test that determines whether there is an important difference between the predictions made by an independent sampling model, and the actual outcomes. Per Martin-Löf essentially defined a random sequence as a sequence that doesn’t fail some theoretically computable test based on the assumptions of basic probability theory; this is pretty much equivalent to other important definitions of randomness. In essence, random is as random does.

The thing about such tests is that in practice, they can only tell you that the data you’ve seen so far is indistinguishable from random iid sampling, they can’t tell you what will happen in the future. So for example so long as your electronic measuring instrument was operating properly perhaps your experiment works fine according to iid sampling theory, but all you need is for some contamination signal to leak in from an intermittently failing noisy power supply at the other end of your laboratory and suddenly your measurements have serial correlations, and decoupling to the actual outcomes, and the appropriate distribution is some kind of robust mixture model, and etc. In practice you need either physical realism to design a data analysis in such a context, or you need to choose a new statistical distribution that models this “new kind of randomness” which is really just another way of saying “a more complex phenomenon, which we still aren’t predicting exactly but which has different regular features than before”

I’m with Konrad though, it’s not all meaningless, a propensity is a model, just like Newtonian Mechanics is a model. So long as the things we do are pretty similar from one experiment to another, it can make sense to say that the outcomes will be consistent with the predictions that an iid model with a certain propensity makes. When that’s true the task of modeling can become a lot simpler. If we want to know more than such a model can tell us (such as with very high precision, the particular value of the *next* “coin flip”) then we have to work a lot harder than we do if we just want to predict say the number of successes we will get in the next 30 or 100 repetitions of very similar experiments.

July 22, 2013, Daniel Lakeland

I guess in my take on it above, you have to say that the “propensity” is a perfectly well defined property of the *model* of the phenomenon, not a property of the physical stuff that is actually going on.

July 22, 2013, Joseph (author)

Konrad and Daniel,

I don’t think you’re disagreeing with anything I wrote (besides a few subtleties that is). The fact is Frequentists believe they are making assumptions which are sensible and testable. They aren’t. They’re actually assuming a noninformative prior, which works because mathematically they’re estimating a function $f$ which is largely insensitive to the underlying sequence $x$.

It’s like saying I can successfully predict a constant function $f(x) = c$ by assuming any old distribution on the space of $x$. Actually observing $f = c$ doesn’t verify any of the assumptions!

The fact that you or I could exploit our deeper understanding of the physics to make reasonable models under the right conditions is beside the point. They aren’t exploiting any physics and they definitely aren’t exploiting any experimentally measured values or any other kind of objectively known facts. Hence “noninformative prior”.

July 23, 2013, Brendon J. Brewer

Interesting post. Reminds me of a lot of Jaynes’ writing. Sometimes these “flat” priors over huge spaces make amazingly good predictions about certain things, but they can often be inappropriate too.

July 23, 2013, george

Daniel: Thanks for the discussion. I agree with Konrad too.

Joseph: Calling me names, eh? I’m sure the marines would be proud of you.

If you won’t discuss the field of statistics in the way it’s understood by statisticians (who, for example, are really careful about failing to reject vs accepting null hypotheses) you’re not going to get far with reference-free rants about how they are all Doing It Wrong.

July 23, 2013, konrad

Joseph and Daniel: Of course we’re not in disagreement on most things, but it seems to me we _are_ in disagreement when you both say that propensity claims are modeling assumptions rather than empirical claims, and when Joseph says that such claims cannot be tested empirically. This to me is the key distinction between probability and propensity, and lies at the heart of the Bayesian/frequentist dispute.

I claim that it is possible to set up a repeatable binary experiment (under a suitable definition of “repeatable”, along the lines I suggested above), which will sometimes output 0 and sometimes 1, as determined by variable, unobserved, initial conditions. I claim that it is possible to set it up in such a way that, over the time scale that the experiment will be observed, there is no systematic shift in initial conditions which would render inference about those conditions practical (this is my version of saying that we cannot improve prediction by modeling correlation between observations, or that the experiment is sensibly modeled as IID – if we don’t add this caveat we will end up either with a time-variable propensity or with a propensity that is averaged over all time and hence not fully informative for prediction in situations where time is known). So far I think you’ll agree.

Next I claim that, once we have set up the above experiment, it is a fixed physical entity with an objectively existing frequency distribution (propensity) that can be measured, and that the measurement becomes more accurate with increasing sample size. As one example: if propensities weren’t physically measurable, nothing in thermodynamics (temperature, pressure, etc) would be measurable.

July 23, 2013, Joseph (author)

Uh George, I didn’t call you any names. But if you really can’t be bothered to do two seconds worth of googling here goes. These are literally the first three examples I could find out of 150,000 just by googling “biased coin hypothesis testing”.

Here’s “How do we tell if a coin is fair?” by some Dr. Hand, an applied math professor at NYU, doing exactly what I claimed:

http://www.math.nyu.edu/~hand/teaching/2006-2007/fall/ma224/Sec3-Hypothesis-Tests.pdf

Here’s a tutorial on Hypothesis testing addressing the question “Suppose that you are trying to decide whether a coin is fair or biased in favor of heads” from csus.edu

http://www.csus.edu/indiv/j/jgehrman/courses/stat50/hypthesistests/9hyptest.htm

Here’s checking whether a coin is fair from Wikipedia:

en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair#Estimator_of_true_probability

I really didn’t make this up. They’re doing exactly what I claimed and there are literally hundreds of thousands of examples of real professors doing the exact same thing. It’s an extremely common textbook/homework example.

Here’s another stat course tutorial (also off the very first page of google hits): “One researcher believes a coin is ‘fair,’ the other believes the coin is biased toward heads.”

www3.nd.edu/~rwilliam/stats1/x24.pdf

Denying something that’s obviously true and then asking for references is just bad faith arguing, and if you do it, I’m going to call you on it.

July 23, 2013, Joseph (author)

“when Joseph says that such claims cannot be tested empirically”

Konrad, I think you need to look closer at what I was saying. I was saying the claims made by Frequentists about each sequence being equally likely (in essence their “randomness” assumption) can’t be checked.

And they can’t. You can’t verify the uniform distribution on $2^{500}$ sequences and actually observing $f \approx .5$ tells you almost nothing since this result is compatible with almost any distribution on that $2^{500}$ space, even if it’s extraordinarily different from the uniform distribution.

Which part of that do you find objectionable?

July 23, 2013, Joseph (author)

Also, Konrad, I wasn’t ignoring your other points, it’s just I think you’re basically correct due to the effects of the Entropy Concentration theorem.

There’s much more I’d love to say on the topic, but frankly I don’t think it’s possible to explain my views on it without everyone misunderstanding the hell out of it. I will say though, as far as applications go, the last word hasn’t been said on this subject and there are still major useful results left to be discovered.

July 23, 2013, konrad

Fair enough, you can view it that way – it’s the approach taken by Jaynes, who pointed out the surprising result that even this minimal information is enough to reproduce the binomial distribution, which ordinarily is thought of as requiring rather stronger assumptions. But whereas Jaynes argued that stronger assumptions are not _required_ for certain purposes, you are going further by declining to make those assumptions and then claiming that certain things _cannot be done_ – but those claims are true only in the absence of stronger assumptions.

Essentially you are denying that repeatability is a useful concept, and this forces you to represent the problem in a much higher-dimensional space, where you do not share information across observations. But this has serious limitations, such as limiting you to a sample size of 1 (i.e. 1 sequence) in all cases – which is why you can’t verify anything. It seems to me that one can do more by introducing the notion of repeatability – after all, we often have real world situations where we want to make stronger assumptions, and I think this is what people have in mind when discussing coin-tossing examples (though of course they typically don’t flesh out the details, preferring to just chant the magic “IID” mantra – which saves time and effort, but amounts (I think) to a stronger assumption than is required).

I’m not sure if Jaynes moves on to introduce repeatability assumptions? (My knowledge of that chapter is sketchy.) At any rate I think it can be done properly along the lines I sketched above (where I think my assumptions are quite weak, but stronger than yours), and that one would end up with something very similar to traditional theory.

July 23, 2013, konrad

As to “I don’t [think] it’s possible to explain my views on it without everyone misunderstanding the hell out of it” – given that your views are heavily based on Jaynes’s, which are not that widely known, this is probably true as far as the general public is concerned. On the other hand, in a small circle of people who have read at least part of his book and are on board with its general direction we can hope to achieve more.

July 31, 2013, Brendon J. Brewer

Been thinking about this a bit, and I think I understand the point. A lot of the “standard” distributions that are often used (either as priors or as sampling distributions) can be derived by putting a flat prior on some big space.

For example, when doing sampling from a “population” like in an elementary stats problem, you’re actually saying that you have a uniform prior over the (large) space of possible sequences of the identities of people selected. E.g. if the population has size 1000 and you choose 3 people then {343, 122, 654} has the same prior probability as {432, 12, 999} or any other sequence. This allows you to do things like equate your sampling distribution (which describes your prior uncertainty about the data you’re going to get) to the known frequency distribution of the population.
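This derivation can be checked by brute force on a small case. In the sketch below the population of six people and their yes/no attribute are hypothetical, as are the function names; the only modeling input is a uniform prior over ordered sequences of distinct identities.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical tiny population: person id -> attribute (e.g. smoker or not).
population = {0: "yes", 1: "no", 2: "no", 3: "yes", 4: "no", 5: "no"}

def sampling_distribution(population, sample_size=3):
    """Distribution of the number of 'yes' people in a sample, derived from a
    uniform prior over ordered sequences of distinct identities."""
    sequences = list(permutations(population, sample_size))
    weight = Fraction(1, len(sequences))  # every sequence equally probable
    dist = {}
    for seq in sequences:
        k = sum(population[person] == "yes" for person in seq)
        dist[k] = dist.get(k, 0) + weight
    return dist

def prob_first_is_yes(population, sample_size=3):
    """Probability that the first person selected is a 'yes' under the same prior."""
    sequences = list(permutations(population, sample_size))
    hits = sum(population[seq[0]] == "yes" for seq in sequences)
    return Fraction(hits, len(sequences))

print(sampling_distribution(population))
print(prob_first_is_yes(population))  # equals the population frequency of 'yes'
```

The flat prior on identities reproduces the familiar hypergeometric sampling distribution, and the chance that any one selected person is a "yes" matches the population frequency.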

July 31, 2013, Joseph (author)

Brendon,

Yes, at some point in deriving the standard probability models there’s always an assumption of uniformity on some underlying space. Most people don’t view these as uninformative priors, but rather they view them as verified legitimate frequency distributions.

But in reality they are uninformative priors. We get away with using flat distributions on a space because we’re trying to predict or infer values of functions which are mostly insensitive to where in that space the true point lies. So while assuming a flat distribution will usually lead to a successful model, it’s also true that the vast majority of highly informative distributions (but not quite all of them) would have made the model successful as well.

In other words, the success of the model doesn’t turn that uninformative prior into an accurate statement about the frequency distribution. That success is going to be consistent with a very large number of highly non-uniform frequency distributions.

It also explains how you can use “ignorance” to make successful inferences or predictions. It’s simple really. If the value of $f(x)$ isn’t sensitive to $x$ then you don’t have to know much about the true value of $x$ in order to guess $f(x)$.

August 16, 2013, Daniel Lakeland

Konrad, in reading your comment about “I claim that it is possible to set up a repeatable binary experiment…”

So we set up this experiment, to make it more concrete (pun intended), suppose it’s a sequence of concrete crush-test cylinders made from a single pile of materials out in the concrete yard. We go, dig up the appropriate mix of stuff from the piles, mix it together, and pour 10 test cylinders. We can do this 10 times a day for a month or so before we need to order more materials. Our binary outcome is whether it takes more or less than force “F” to crush the cylinder after 21 days of curing time.

Now, we’d both like to say that this results in quite repeatable experiments. But there’s nothing to say that in any given day we didn’t maybe do something wrong, like mis-measure some critical ingredient, or mis-label the cylinder so we test it after not-enough curing time, or whatever. That is, maybe things are quite repeatable under sort of “normal” conditions, but there’s nothing to say that in any given smallish subset of the conditions we don’t have something else going on (and quality-control guys will probably tell you this happens way way way more often than we’d really like).

So, if you want to measure the propensity of cylinders to crush at a certain load, and you do it for the first 200 cylinders on the first 2 days, and get some incredibly tight bounds on the frequency histogram, that doesn’t really mean that say 4 or 5 days from now we won’t get something really quite inconsistent with this data.

Now, your point is that you wanted to constrain yourself to “over the time scale that the experiment will be observed, there is no systematic shift” and that would be nice, but given the observable information, there’s no way to predict ahead of time that we mis-mixed or mis-labeled or mis-calibrated our testing machine, or whatever. So we have to include these possibilities in our model because they *can* happen on the timescale we observe, and then when we do that, we also need serial correlations, because all the cylinders on days with too much water will all crush at low force levels that day.

Or, we could choose to model the whole mess as just a binary outcome experiment, and we’ll need a longer run to get a stable p value, and we’ll have to hope that we don’t hire a new person and they suddenly have higher propensity for errors, or that as people learn to do the experiment, the propensity for error making is decreasing… We’ll get a less informative model, but it won’t be overly-confident either.

The same thing is true whether we’re talking subatomic collisions in the LHC or survey sampling in the census. On realistic timescales, things *do* change, noise in the instruments, the dedication of the census employees to getting full coverage, whatever. Sometimes we can ignore those changes to good effect, but sometimes we can’t. I basically claim that this is a modeling decision we have to make.
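The cost of ignoring day-level effects in the cylinder example can be sketched with a small simulation. Every number here is made up for illustration (how often a "bad" day occurs, the failure chances on normal and bad days), and the function name is hypothetical; the schedule of 10 cylinders a day for 30 days follows the example.

```python
import random
from math import sqrt
from statistics import pstdev

def simulate_month(rng, days=30, per_day=10, p_bad_day=0.2,
                   p_fail_normal=0.1, p_fail_bad=0.9):
    """Fraction of cylinders crushing below force F in one simulated month.

    On a 'bad' day (e.g. too much water) every cylinder poured that day
    shares the same high failure chance, which correlates their outcomes.
    """
    failures = 0
    for _ in range(days):
        p_fail = p_fail_bad if rng.random() < p_bad_day else p_fail_normal
        failures += sum(rng.random() < p_fail for _ in range(per_day))
    return failures / (days * per_day)

rng = random.Random(2)
monthly = [simulate_month(rng) for _ in range(2000)]
p_hat = sum(monthly) / len(monthly)
sd_actual = pstdev(monthly)               # spread of the monthly frequency across months
sd_iid = sqrt(p_hat * (1 - p_hat) / 300)  # spread an IID binomial model would predict
print(p_hat, sd_actual, sd_iid)
```

With these assumed numbers the month-to-month spread of the observed frequency is a good deal wider than a plain binomial model would claim, which is the sense in which an IID model can be overconfident.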

August 27, 2013, Christos Argyropoulos

Interesting points; I’m surprised that exchangeability as a way to ground statistical models wasn’t mentioned (even though alluded to).

Informally one assumes the everlasting persistence of objects/situations that are amenable to analysis by collecting a finite number of observations that are exchangeable (can substitute for each other in inference). Then a large number of (predictive) probability measures arise with their parameters etc.
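For readers who haven't met exchangeability, here is a minimal sketch of one standard exchangeable model: sequential prediction by Laplace's rule of succession. It is chosen here purely as an illustration and is not taken from the comment. The check is that a sequence's probability depends only on how many ones it contains, not on their order, even though the individual flips are not independent.

```python
from fractions import Fraction
from itertools import permutations

def sequence_probability(seq):
    """Probability of a 0/1 sequence under Laplace's rule of succession:
    P(next = 1 | h ones in n trials so far) = (h + 1) / (n + 2)."""
    prob, ones = Fraction(1), 0
    for n, outcome in enumerate(seq):
        p_one = Fraction(ones + 1, n + 2)
        prob *= p_one if outcome == 1 else 1 - p_one
        ones += outcome
    return prob

base = (1, 1, 0, 1, 0)
probs = {sequence_probability(order) for order in set(permutations(base))}
print(probs)  # a single value: the order of outcomes does not matter
```

Observations "can substitute for each other in inference" in exactly this sense: any reordering of the data leads to the same probabilities.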

This is a useful conceptual model but one that requires this everlasting persistence assumption. If this is not justifiable (and in most data science applications this assumption is not reasonable), then a throw-away one-off model over an extensive state space (as the ones advocated by Jaynes, Caticha and this blog) is reasonable.

As a general comment about the material in this blog, Jaynes started off MaxEnt because he wanted to propose an alternative to quantum statistics and only later gravitated to Bayesianism, convinced by the presentations of Jeffreys, Keynes and Cox, who wanted to extend deductive reasoning in the context of uncertainty. His work thus took him to explore different areas over time, so if the blogger wants to shut frequentism down (a noble goal) the other aspects of Jaynes’s work should also be considered.

August 27, 2013, Joseph (author)

Christos,

This must be the day for exchangeability. Right before yours there were two more comments here: http://www.entsophy.net/blog/?p=130#comment-47432

I responded somewhat there. “Quantum Statistics” usually refers to things like the Fermi-Dirac and Bose-Einstein statistics in Statistical Mechanics. I think you may have been referring to conceptual problems with probabilities in Quantum Mechanics itself.

My knowledge of Jaynes comes almost entirely from his papers, including the physics ones, rather than his book. He describes his progression in a number of papers and I think your history is a little backwards. Jaynes latched onto Bayes very early, and didn’t see how it transformed statistical mechanics (and Maxent) until he saw Shannon’s paper. With that understanding, it was only natural that he would try to use those insights to “fix” quantum mechanics. He didn’t succeed in breaking the quantum muddle, but his related physics papers, including his neoclassical theory, are still interesting as hell.

Incidentally, although it surely must seem like frequentists are my target, it’s actually Bayesians who retain way too much frequentist intuition.