The esteemed Dr. Wasserman claimed: “This is a general problem with noninformative priors. If $\pi(\theta)$ is somehow noninformative for $\theta$, it may still be highly informative for sub-parameters, that is for functions $\tau = g(\theta)$ where $\theta \in \mathbb{R}^d$ and $\tau \in \mathbb{R}$.”
Not only is it not a problem, but it’s the key to Statistics and fundamental to the philosophy of science.
Consider flipping a coin 500 times and predicting the percentage of heads. From the coin’s symmetry we know each of the $2^{500}$ possible sequences will have equal probability (propensity?), allowing us to model it as a random process. A uniform distribution on the space of sequences implies it’s likely the percentage of heads $h$ will land in a narrow interval around $50\%$ (say $45\% \le h \le 55\%$). The randomness assumption can be confirmed by flipping a coin and observing $h$ in this interval. An easy path from objective knowledge to secure inferences if ever there was one.
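For concreteness, the arithmetic itself is easy to check (the 45% to 55% band is just the illustrative interval above): under a uniform distribution on sequences the number of heads is Binomial(500, 1/2), so the interval’s probability can be summed exactly. A quick Python sketch:

```python
from math import comb

# Under a uniform distribution over all 2^500 head/tail sequences,
# the number of heads follows a Binomial(500, 1/2) distribution.
# Sum the exact probability that the percentage of heads lands in
# the illustrative interval 45%..55%, i.e. 225..275 heads.
n = 500
total = 2 ** n
in_interval = sum(comb(n, k) for k in range(225, 276))
print(f"P(45% <= h <= 55%) = {in_interval / total:.4f}")   # roughly 0.98
```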
Unfortunately, every part of that paragraph is nonsense.
Observing $h$ in this interval provides no evidence for “randomness”. Almost all of the $2^{500}$ sequences have the property $45\% \le h \le 55\%$, so this outcome is usually observed no matter what causes are active. Or if you like, $h \approx 50\%$ is likely to be observed regardless of the “true” distribution, even if it’s extraordinarily non-uniform. Something like this has to be true since $2^{500}$ is so large that only a minuscule fraction of sequences will ever occur. The next sequence observed is thus being drawn from a tiny subspace of all possibilities, and yet in practice we do observe $h \approx 50\%$.
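To see the point numerically, here is a small simulation. The “sticky” Markov chain below, with its 0.9 persistence, is my own made-up stand-in for a wildly non-uniform distribution over sequences; it piles enormous probability on run-heavy sequences, yet the percentage of heads still scatters around 50%.

```python
import random

def sticky_sequence(n=500, persistence=0.9):
    """A sequence whose flips repeat the previous outcome 90% of the time."""
    seq = [random.randint(0, 1)]
    for _ in range(n - 1):
        stay = random.random() < persistence
        seq.append(seq[-1] if stay else 1 - seq[-1])
    return seq

random.seed(1)
# Nothing like a uniform distribution over sequences, but the percentage
# of heads still clusters loosely around 50% rather than anywhere else.
fractions = [100 * sum(sticky_sequence()) / 500 for _ in range(20)]
print([round(f, 1) for f in fractions])
```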
Moreover, the outcome of a flip depends far more on the coin’s initial conditions than on its Inertia Tensor. The “equal propensity” assumption is not a physical property of the coin, but rather a strong assumption about initial conditions. Statisticians have no clue when the assumption might hold, especially since they never measure Moments of Inertia or initial conditions and mostly wouldn’t know what to do with the information if they had it.
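To illustrate what “depends on initial conditions” means, here is a toy calculation in the spirit of Keller’s rigid-coin model. The launch speed and spin rates below are invented numbers, not measurements of any real coin: the coin leaves the hand with speed $v$ and spin rate $\omega$, stays airborne for $t = 2v/g$, turns through $\omega t$ radians, and the outcome is taken to be the parity of the number of half-turns completed.

```python
from math import pi

g = 9.81          # gravity, m/s^2
v = 2.0           # assumed launch speed, m/s
t = 2 * v / g     # time in the air

# Sweep the initial spin rate over a narrow band, 0.05 rad/s apart.
for i in range(11):
    omega = 100.0 + 0.05 * i
    half_turns = int(omega * t / pi)
    side = "heads" if half_turns % 2 == 0 else "tails"
    print(f"omega = {omega:6.2f} rad/s -> {side}")

# A change of a fraction of a percent in the initial spin flips the outcome,
# while the coin's mass distribution never entered the calculation.
```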
The uniform assumption is thus based on nothing. It’s a noninformative prior of exactly the kind Wasserman (and seemingly Gelman) claims is a lost cause.
It’s amazing that reliable predictions can come from a noninformative prior. The mystery is explained by examining the thing predicted. If $x$ is a sequence of 0s and 1s representing the outcomes of the 500 flips, then the mapping $x \mapsto h(x)$ is highly non one-to-one (both in a strict and an approximate sense). The effort succeeds because we’re estimating a function $h(x)$ which is largely insensitive to $x$, and so ignorance about $x$ isn’t much of a hindrance in predicting $h$!
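Both senses are easy to exhibit. In the strict sense, an astronomical number of distinct sequences share a single value of $h$; in the approximate sense, mangling $x$ in a hundred places barely moves $h(x)$. A quick sketch (flipping 100 positions is an arbitrary choice):

```python
from math import comb
import random

# Strict sense: exactly C(500, 250) different sequences (roughly 10^149)
# all map to the single value h = 50%.
print(f"sequences with exactly 250 heads: {comb(500, 250):.3e}")

def h(seq):
    return 100 * sum(seq) / len(seq)   # percentage of heads

# Approximate sense: take one sequence, flip 100 randomly chosen bits,
# and compare h before and after. The sequence changes a lot; h doesn't.
random.seed(0)
n = 500
x = [random.randint(0, 1) for _ in range(n)]
y = x[:]
for i in random.sample(range(n), 100):
    y[i] = 1 - y[i]
print(f"h(x) = {h(x):.1f}%   h(y) = {h(y):.1f}%")
```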
Everything in statistics is like this. The outcome of the 2012 Presidential election is a highly non one-to-one mapping from the space of possible votes into the set of possible winners. In Statistical Mechanics, the Energy is a highly non one-to-one mapping from the $6N$-dimensional Phase Space into $\mathbb{R}$. When you average data to estimate $\mu$, you’re implicitly using the non one-to-one mapping $(x_1, \dots, x_n) \mapsto \bar{x}$. Statistics is a one trick pony and this is its one trick.
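The averaging case is worth spelling out: wildly different data sets are identical as far as $\bar{x}$ is concerned. The numbers below are made up purely to make the point.

```python
import statistics

# Three very different data sets, all collapsed onto the same point by the
# non one-to-one mapping (x_1, ..., x_n) -> x-bar.
a = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
b = [0.0, 0.0, 0.0, 20.0, 20.0, 20.0]
c = [-90.0, 110.0, 10.0, 10.0, 5.0, 15.0]

for name, data in (("a", a), ("b", b), ("c", c)):
    print(name, statistics.mean(data))   # every line prints 10.0
```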
But there’s more. That you can be ignorant about one space, but well informed about a function of that space, allows for a kind of “separation” or “disconnect” between domains. For coins the “separation” allows us to predict $h$ without knowing Euler’s Equations for Rigid Body motion. In Statistical Mechanics it allowed Physicists to derive macroscopic laws before they knew anything about Quantum Mechanics. Indeed, without this disconnect we wouldn’t survive long. It’s this “separation” that allows us to drive cars safely while ignorant of almost everything causally affecting us. In a similar way, we can do Physics without first knowing everything about Biology and vice versa. That we have any separate successful branches of Science at all is a happy consequence of Wasserman’s “general problem”.