The Amelioration of Uncertainty

## What do we need to model?

Part of the communication difficulty between Bayesians and Frequentists is that they’re modeling different things using similar mathematics. So it’s worth looking closely at a simple example to see what each is hoping to achieve with their methods.

Suppose we take a series of measurements $Y_i = \mu + \epsilon_i$ in hopes of estimating the unknown $\mu$. The “data generation mechanism” of the errors is IID $N(0, 10^2)$. To be concrete I simulated 10 errors from this distribution and got:

$$\epsilon = (-14.8,\ 12.6,\ -5.5,\ 2.1,\ -3.4,\ 14.6,\ 7.7,\ 13.5,\ 4.0,\ 5.6) \qquad (1)$$

Of course these errors would be unknown to us since we only directly observed the measurements $Y_i$.

The goal of Frequentists is to model the “data generation mechanism” which describes the propensity of the measuring device to give off errors on out to infinity. Their choice of IID $N(0, 10^2)$ will be judged based on how well it approximates the frequency of the errors as the number of measurements goes to infinity.

Bayesians have a very different goal. Their job is to pin down those one-off, unique, never-to-be-repeated numbers in (1) as much as possible. Their choice of $P(\epsilon)$ will be judged on how well it identifies the location of $\epsilon$ in the space $\mathbb{R}^{10}$.

Our particular Frequentist happens to be an extremely good modeler of frequencies. Somehow they learn the “data generation mechanism” is IID $N(0, 10^2)$ and use this to model the errors. They report the following 95% Confidence Interval for $\mu$:

$$[97.4,\ 109.8]$$
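As a check of that interval, here is a minimal sketch (in Python rather than the R used in the comment thread), computing $\bar{Y} \pm 1.96\,\sigma/\sqrt{n}$ with the known $\sigma = 10$; the measurement values are taken from the R session log in the comments:

```python
import math

# Measurements Y_i = mu + eps_i from the post (mu = 100, errors
# simulated from N(0, 10^2); values copied from the R log)
y = [85.23791, 112.61733, 94.45481, 102.06267, 96.59621,
     114.55341, 107.70005, 113.45277, 104.01727, 105.60854]

n = len(y)
sigma = 10.0                           # error sd, known to the frequentist
ybar = sum(y) / n
half = 1.96 * sigma / math.sqrt(n)
ci = (ybar - half, ybar + half)
print(ci)                              # approximately (97.4, 109.8)
```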

Not too shabby. The Bayesian isn’t as good a modeler as the Frequentist since they aren’t able to intuit the exact properties of the “data generation mechanism” on out to infinity, but they’re not total amateurs either. After some work, they’re able to describe potential values of $\epsilon$ using the high probability region of $N(m, 2^2 I)$ with $m = (-13, 13, -4, 3, -4, 13, 8, 13, 5, 5)$.

These probabilities aren’t equal to any frequencies and as a model of the “data generation mechanism” they fail miserably. But they do a reasonable job of describing where $\epsilon$ lies in $\mathbb{R}^{10}$. So the Bayesian computes the 95% Credibility Interval and gets:

$$[98.5,\ 101.0]$$
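A sketch of that computation (Python; $m$ is the Bayesian's vector of error predictions and $s = 2$ their residual error sd, both taken from the R log in the comments): with a uniform prior the posterior for $\mu$ centers at $\overline{Y - m}$ with sd $s/\sqrt{n}$.

```python
import math

y = [85.23791, 112.61733, 94.45481, 102.06267, 96.59621,
     114.55341, 107.70005, 113.45277, 104.01727, 105.60854]
m = [-13, 13, -4, 3, -4, 13, 8, 13, 5, 5]   # Bayesian's error predictions

n = len(y)
s = 2.0                                      # Bayesian's residual error sd
center = sum(yi - mi for yi, mi in zip(y, m)) / n
half = 1.96 * s / math.sqrt(n)
cred = (center - half, center + half)
print(cred)                                  # approximately (98.5, 101.0)
```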

The Bayesian answer is clearly an improvement over the Frequentist one.

Unfortunately, things only get worse for the Frequentist from here. There’s no way for the Frequentist to improve their answer by improving their model. They already have the correct “data generation” model which exactly describes how the data was generated. The Bayesian however can continue to improve their description of the true numbers in ever more accurate detail. Eventually if they’re good enough they’ll identify $\epsilon$ exactly, at which point they’ll know $\mu$ exactly.

Frequentists seem to be modeling the wrong thing. So what motivates them to do this?

Well, they do it because if their highly dubious assumptions about the “data generation mechanism” magically turn out to be true, then they’ll get intervals which wrongly identify the magnitude of $\mu$ a fixed percentage of the time in measurements that will never actually be made.

It’s left as an exercise for the reader to spot the flaws in that.

UPDATE: I’m not sure why my point here is so difficult for people to get. The $\mu$ is an unknown, but fixed parameter. It doesn’t have a frequency distribution of any kind. If you try to model its frequency distribution then you’re seriously clueless. If on the other hand, you try to describe (using a function $P(\mu)$) where in $\mathbb{R}$ this fixed parameter resides, then you’re in business. It’s that simple.

August 9, 2013

Nice post, but I’m confused on a few points:

1) You start by specifying (despite your consistency in arguing against such setups) a setup that really has a nondeterministic IID data generation mechanism (this is your real-world setup, not a model). Then why do you call the model which is in exact agreement with this setup “highly dubious”? Even in the real world (where we know such a model is false) it may (open for debate) still be the best available model. But in the described setup it’s not only the best available model, it’s a _perfect_ model.

2) Your modellers clearly have access to information you didn’t give us. How did the frequentist find the data generating distribution? That’s only possible from an infinitely large (or large enough up to precision requirements) data set, or access to an oracle who knows the data generating mechanism. But they also seem to not have access to the same information as each other – otherwise the Bayesian would also have been able to infer the data generating distribution. This does not sound like a fair contest.

3) The frequentist methodology is aimed at making claims about future observations under the same setup (or, after retraining on different data, under a different setup that still contains a fixed data-generating mechanism). The motivation for this is that the orthodox statistician is in the business of making black-box estimators that can be sold to non-specialists, with multiple-use guarantees – the statistician’s aim is to maximise the number of satisfied customers. Given that (as per your stipulation) we do have a setup with a fixed (and in principle learnable) data generating mechanism, it is entirely reasonable to suppose that there may be future measurements – why say (in this particular setup) that such measurements will never be made?

• August 9, 2013 · Joseph

(1) All the talk about “data generation mechanism” and using the same distribution to generate the data as used to model it, was for purely expositional purposes. I was just speaking the language of Frequentists so they’d get what I was saying.

Personally, I don’t believe any of that stuff. I believe when we collect data, we’re just observing concrete aspects of the universe as it evolves, that we don’t know limiting frequencies of much of anything, and that the real reason such Frequentist fantasies work is because they’re really doing what the Bayesian is doing.

(2) There are two issues: (a) what is our goal? And (b) how do we achieve it? In the post I’m only talking about (a) and not mentioning (b) at all because I’ve found trying to cover both confuses people too much. Once we have a clear idea of exactly what the goal is, it’s relatively easy to discuss how it might be achieved.

(3) Well for the above I was imagining that we’d collected all the data we’re going to collect. There’s always a finite limit. What would happen if we were to collect more data is complicated to talk about since almost anything could happen. But it’s worth considering at least one fairly general but also a good deal more physically realistic case. Suppose because of unknown laws of nature/hidden constraints/cheating the errors all come from some relatively small subset of the high probability manifold of that IID $N(0, 10^2)$.

When you go to repeat this many times what you’ll typically get is a very binary outcome depending on where in that space the subset lies. You’ll either get that those 95% CI’s actually have 100% coverage or you’ll get that they actually have something like 30% coverage.
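That binary behavior is easy to simulate. A sketch (Python; the two constraint distributions here are made-up stand-ins for a “small subset” of the $N(0, 10^2)$ high probability region): errors confined near the center give near-total coverage, while errors confined to an offset patch make the nominal 95% coverage collapse.

```python
import math
import random

def coverage(err_draw, mu=100.0, sigma=10.0, n=10, trials=2000, seed=1):
    """Fraction of nominal-95% CIs, ybar +/- 1.96*sigma/sqrt(n), that
    cover mu when the errors actually come from err_draw rather than
    from IID N(0, sigma^2)."""
    rng = random.Random(seed)
    half = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        ybar = mu + sum(err_draw(rng) for _ in range(n)) / n
        hits += (ybar - half <= mu <= ybar + half)
    return hits / trials

# errors confined to a tight patch around the center: near-100% coverage
print(coverage(lambda r: r.gauss(0, 2)))
# errors confined to a patch offset from the center: coverage collapses
print(coverage(lambda r: r.gauss(7, 2)))
```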

Those “guarantees” that Frequentists talk about are pretty much complete nonsense as physical facts. You’ll be hard pressed to find any examples anywhere of (1-alpha)% CI’s which actually had (1-alpha)% coverage except in highly contrived instances like a simulation designed to achieve just that.

Those “guarantees” are based on the assumption that upon repeated trials $\epsilon$ will move about $\mathbb{R}^{10}$ in just the right way. If you really think about what that means physically, you’ll realize that it amounts to an incredibly strong assumption about the way the physical universe behaves and evolves. Not only do we almost never have such information, but it’s almost never true.

In reality what happens is that we generally do have some actual information about the ruler. For example we know that if the ruler has good subdivisions down to the 1mm level, then to an order of magnitude we know the size of the errors. So this tells us that $\epsilon$ is in some hypersphere with a reasonable radius. By assuming a model like IID $N(0, \sigma^2)$ we’re using a distribution whose high probability manifold corresponds to that hypersphere. That’s how we find a $P(\epsilon)$ whose high probability manifold contains $\epsilon$.

That’s what’s really happening and it doesn’t require making any grand physical assumptions which statisticians have absolutely no evidence for and rarely turn out to be true after the fact. By taking a “majority vote” over that high probability manifold we are then making the best guesses we can make given our state of information. There’s no guarantee those best guesses are true of course, but to do better you’re going to need additional real information and not made up Frequentist fantasies.

And this isn’t just hypothetical. There are instances, for example in quality control, in which we do have additional information about trends in the errors and so on.

• August 9, 2013 · george

Can you say more about how your Bayesian is getting to their answer? (and continuing to improve it, post-hoc). In particular you haven’t described their prior, which makes the post somewhat nebulous.

Also, where do you get the idea that Bayesians want to identify the error terms (epsilons) in the dataset at hand? The Bayesian goal (see e.g. Hoff’s intro book, Ch 1) is usually stated as quantifying uncertainty about a population parameter (mu, here) based on the data at hand (the Y’s, here) – which is what the posterior does. It’s certainly possible, as a Bayesian, to report what’s known about the epsilons, but it’s not what the credible interval does.

And regarding “highly dubious assumptions” which can “magically” be true; please look into results on robustness, e.g. M-estimation. Many procedures can be justified in frequentist terms based on large sample results alone, typically (but not always) with simple random sampling as the only really important assumption. I know you’re not a fan of sampling as a fundamental concept, but for those doing experiments where the sampling is part of their data collection (e.g. doing a survey) sampling assumptions are far from being just a convenient bit of math; they really do reflect how the data were generated.

George: Given Y, the relationship between mu and epsilon is deterministic and known. So quantifying uncertainty about epsilon is exactly equivalent to quantifying uncertainty about mu.

Joseph: Ok, so we can get at what you meant by removing the information about how you generated the errors and moving the statement about the data-generating mechanism down, treating it as a modelling assumption made by the frequentist but not the Bayesian. But I’m still unclear on which further assumptions are made by the modellers – clearly they need more information/assumptions to calculate results such as those given. And I don’t see why you allow the Bayesian but not the frequentist to further refine the model afterwards (presumably by adding more assumptions)?

“Those “guarantees” are based on the assumption that upon repeated trials epsilon will move about R in just the right way. If you really think about what that means physically, you’ll realize that it amounts to an incredibly strong assumption about the way the physical universe behaves and evolves. Not only do we almost never have such information, but it’s almost never true.” – The assumption can be stated as “the errors are well described as having been produced by a shared data-generating mechanism, with specified parametric form”. I agree this is a strong assumption, and one of my motivations for following this series of posts is to think about how it can be relaxed. But I don’t want to throw it out completely, partly because in many applications it is actually true (not just reasonable but _true_ – because the assumption says “well described” rather than “actually generated by”). And in many cases it is also possible to generate further data from the same distribution.

• August 9, 2013 · Joseph

George,

I didn’t describe how either statistician got their models in order to emphasize what their goals were. But there’s nothing unusual about post hoc improvements in models. Perhaps an example will make the whole thing a little more concrete. The Bayesian, realizing what he needs to do, got the schedule of who made the measurements that day. He saw that “Lefty” took the first measurement, and since Lefty only has one hand his measurements always fall short. Next up was “Short Linda,” who because of her stature always tends to get positive errors.

As ridiculous as that sounds a part of factory quality control involves using the knowledge of who was working which shift to make determinations of a very similar nature.

• August 9, 2013 · Joseph

“Also, where do you get the idea that Bayesians want to identify the error terms (epsilons) in the dataset at hand?”

Well there’s some poetic license there. Most Bayesians have way too much frequentist intuition gained from their early introduction to statistics to go full Bayesian. That’s what they should be doing though.

Note though, the credibility interval is just the high probability manifold. A big part of what I was saying is that all distributions, whether they are posteriors for $\mu$ or models for the errors, have exactly the same status. They describe the location of one-off non-repeatable things like $\mu$ and $\epsilon$. These distributions are good when these true values are in the high probability region of the distribution. Or stated another way, when the true values are in the credibility intervals created from those distributions.

• August 9, 2013 · Joseph

“And regarding “highly dubious assumptions” which can “magically” be true; please look into results on robustness, e.g. M-estimation”

And to all that theory I’ll respond “do what almost no statistician does and go into the lab and actually check what the errors on measuring devices look like.” If you do so you’ll immediately be struck by the question which almost every thoughtful statistician has asked at one point or another: “why are normality assumptions so unreasonably effective when they can’t possibly be true most of the time?”

There is a bigger point though. In reality errors aren’t the result of a nice NIID simulation like the one above. Consequently they have all kinds of patterns in them that make them look non-normal and non-independent. But here’s the thing: the Bayesian goal and process described above doesn’t require the errors to look normal or be independent!

If the errors had been exactly $-13, 13, -4, 3, -4, 13, 8, 13, 5, 5$ then both statisticians would have produced intervals which contained 100.

For that matter, if the errors had been $1, 1, 1, 1, 1, -1, -1, -1, -1, -1$ then this would have been in the high probability manifold of $N(0, 10^2 I)$ and would thus have achieved the Bayesian’s goal. Indeed if these had been the errors, the interval estimate for $\mu$ would still contain 100. That’s why the normality assumption is so unreasonably effective. Frequentists just misunderstand what’s really going on.
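A quick sanity check of the frequentist side of those claims (Python sketch; the Bayesian interval also depends on their error predictions $m$, which aren't specified for these hypothetical patterns, so only the frequentist CI is checked):

```python
import math

mu, sigma, n = 100.0, 10.0, 10
half = 1.96 * sigma / math.sqrt(n)        # frequentist half-width, about 6.2

results = []
for errors in ([-13, 13, -4, 3, -4, 13, 8, 13, 5, 5],
               [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]):
    ybar = mu + sum(errors) / n           # sample mean of the Y's
    results.append(ybar - half <= mu <= ybar + half)
print(results)                            # [True, True]
```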

• August 9, 2013 · george

konrad: I appreciate there are connections, but the (standard) Bayesian goal is to describe what we know about one-dimensional mu, given a prior and the data, and not to state what we know about 10-dimensional epsilon. Mu reflects a population, vector epsilon does not, so these are different goals. One can learn about mu based on priors and sufficient statistic(s) alone, this is not true of the elements of epsilon.

Joseph: Without knowing what analyses were done – so that we know what these analyses would give with other data – your statement that the Bayesian answer is “clearly an improvement” can’t be seen as reflecting anything except the Bayesian being lucky with this particular dataset. If you want to think about it another way, note it would be trivial to devise a frequentist analysis that gives more precise intervals than Bayesian ones for some data, and then to only show it for an example of such data. But it wouldn’t convince anyone of anything.

And regarding Lefty/Short Linda, what you’re describing is almost exactly the point made (famously) in Cox 1958, with regard to good frequentist analysis. Nothing stops a frequentist analysis using extra information of the sort you describe.

• August 9, 2013 · george

Joseph: you write that;

… you’ll immediately be struck by the question which almost every thoughtful statistician has asked at one point or another “why are normality assumptions so unreasonably effective when they can’t possibly be true most of the time” …

One answer is in M-estimation – i.e. “all that theory” that you dismiss. One way of understanding why Normality assumptions (which I agree won’t actually be true) turn out to be ‘effective’ is through noting that in many common problems they happen to give, to good approximations, the same answers we get *without* Normality assumptions, and indeed without any other parametric assumptions. The reasons for this all boil down to the Central Limit Theorem(s) happening to be true. If you’ve never heard of these arguments, try Stefanski and Boos 2002. These ideas are typically not in elementary statistics courses but these days any decent grad program will teach them.

Also, you write that;

“If the errors had been exactly -13, 13, -4, 3, -4, 13, 8, 13, 5, 5 then both statisticians would have produced intervals which contained 100.”

In the absence of any description of what these statisticians are doing – ideally to the point of being able to reproduce it – then the reader has no reason to believe this. Describing what your “full” Bayesian does might also be a helpful way to express what it is you think Bayesians “should be doing”.

• August 10, 2013 · Brendon J. Brewer

“Normality assumptions (which I agree won’t actually be true)”

Normality is an assumption about the robot’s prior state of knowledge. It makes no sense to say it is either true or false.

“turn out to be ‘effective’ is through noting that in many common problems they happen to give, to good approximations, the same answers we get *without* Normality assumptions,”

Yes, I think this is why we can get away with “standard” models so often.

• August 11, 2013 · Joseph

George,

“If the errors had been exactly -13, 13, -4, 3, -4, 13, 8, 13, 5, 5 then both statisticians would have produced intervals which contained 100. In the absence of any description of what these statisticians are doing – ideally to the point of being able to reproduce it – then the reader has no reason to believe this.”

Seriously? With those errors the actual data taken is 87, 113, 96, 103, 96, 113, 108, 113, 105, 105. Now given a $N(m, 2^2 I)$ distribution for the errors (and a uniform prior on $\mu$) it is an elementary problem in first quarter undergraduate statistics to get either the CI or the Bayesian interval. I’m not going to spell out the steps for this, just like I’m not going to explain what integrals are or how to multiply numbers. If you can’t reproduce those steps yourself to verify the claims I made, then that’s your problem not mine.

Looking at the robustness of these procedures under frequency distributions other than normal completely misses the point of everything I’m saying. The $\mu$ doesn’t have a frequency distribution! It’s an unknown, but fixed parameter!

I don’t doubt that in practice Frequentists will get reasonable answers. I never said otherwise and that wasn’t the point. Suppose someone simply told the Frequentist that the first error was -14.8. Then they’d look at the first measurement and immediately conclude that $\mu = 100$ exactly. Yet their original CI is still “correct” seeing as how it’s based on the exactly correct “data generation” model.

On a simple problem like this one, I’m sure Frequentists would recognize implicitly that their goal was to determine that fixed, but unknown parameter with as little uncertainty as possible, and they’d quietly discard their perfectly “correct” confidence interval and just report $\mu = 100$.

• August 11, 2013 · Joseph

Note: I added an update to the post. If you just don’t get it, then you just don’t get it. I can’t explain it any simpler.

• August 11, 2013 · Joseph

Brendon

“Normality is an assumption about the robot’s prior state of knowledge. It makes no sense to say it is either true or false.”

I don’t think that’s quite right. It is true that a distribution for that fixed parameter represents a state of knowledge about that parameter.

Specifically, when you use a distribution $P(\mu)$, you are saying “my state of knowledge is that $\mu$ lies in the high probability manifold of $P(\mu)$.”

Since $\mu$ can lie in an infinite number of sets, there can be an infinite variety of “states of knowledge” that we might have about it. That is to say, there are an infinite number of distributions which have $\mu$ in the high probability manifold.

On the other hand, you can’t just use any old $P(\mu)$. In the example in the post, you couldn’t use a distribution whose high probability manifold misses 100 to describe where $\mu$ is located. That distribution clearly doesn’t describe the location of $\mu$ correctly. You might still think of that distribution as being a “state of knowledge”, but the “knowledge” in question isn’t true.

• August 11, 2013 · Joseph

And George,

A couple more points: “These ideas are typically not in elementary statistics courses but these days any decent grad program will teach them” I went to a highly ranked graduate school in statistics.

Second, if there are any persistent correlations at all (even small ones), that will destroy the Central Limit Theorem. The CLT won’t even approximately hold.
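That claim is easy to see numerically. In this sketch (Python, with equicorrelated errors as a simple stand-in for a “persistent correlation”), the sd of the sample mean stops shrinking like $1/\sqrt{n}$ and gets stuck near $\sqrt{\rho}\,\sigma$:

```python
import math
import random

def sd_of_mean(n, rho, sigma=1.0, trials=5000, seed=0):
    """Empirical sd of the sample mean of n errors that all share a
    pairwise correlation rho (built from one common Gaussian factor)."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        z = rng.gauss(0, 1)                  # component shared by all errors
        eps = [sigma * (math.sqrt(rho) * z
                        + math.sqrt(1 - rho) * rng.gauss(0, 1))
               for _ in range(n)]
        means.append(sum(eps) / n)
    mbar = sum(means) / trials
    return math.sqrt(sum((x - mbar) ** 2 for x in means) / trials)

# independent errors: sd of the mean shrinks like 1/sqrt(n)
print(sd_of_mean(100, rho=0.0))              # about 0.10
# a small persistent correlation: sd of the mean is stuck near sqrt(0.1)
print(sd_of_mean(100, rho=0.1))              # about 0.33
```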

But this brings up a more important point (which really should be another post):

The Normal distribution is a Maximum Entropy distribution (subject to a constrained mean and variance).

One consequence of this is that if you start out with any distribution (probability or frequency) and subject it to any process (mathematical or physical) which preserves the mean and variance, but increases the entropy, then that process will drive the distribution toward a Gaussian.

This is an enormously powerful observation. It’s very general, especially since the same remark applies to all the other maxent distributions (with different constraints obviously). It’s also powerful because there are many natural processes which do preserve the mean and variance, so it is relevant in a lot of situations that we care about. The Central Limit Theorem is just one example of this and not even the most interesting one at that.
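The CLT instance of this remark can be sketched numerically (Python): a standardized sum of $k$ IID exponential draws keeps mean 0 and variance 1 while its shape drifts toward the Gaussian; its skewness, $2/\sqrt{k}$ in theory, visibly decays toward the Gaussian’s 0.

```python
import random

def skewness(xs):
    """Sample skewness: third central moment over variance^1.5."""
    n = len(xs)
    mbar = sum(xs) / n
    var = sum((x - mbar) ** 2 for x in xs) / n
    return sum((x - mbar) ** 3 for x in xs) / n / var ** 1.5

def standardized_sum(k, rng):
    """Sum of k IID exponential(1) draws, shifted and scaled so that
    mean 0 and variance 1 are preserved as k grows."""
    return (sum(rng.expovariate(1.0) for _ in range(k)) - k) / k ** 0.5

rng = random.Random(42)
for k in (1, 4, 36):
    sample = [standardized_sum(k, rng) for _ in range(20000)]
    print(k, round(skewness(sample), 2))
# skewness decays like 2/sqrt(k): roughly 2.0, 1.0, 0.33
```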

But none of this is relevant for the discussion in the post. The distributions involved really only serve one purpose: to describe the location of the fixed numbers $\mu$ and $\epsilon$ in space. There’s no “process” involved. Gaussians were used purely for convenience and their familiarity.

• August 11, 2013 · george

Joseph, you write that;

“it is an elementary problem in first quarter undergraduate statistics to get either the CI or the Bayesian interval”

It’s elementary to get *a* confidence interval and *a* Bayesian interval. (It’s also elementary to get them to agree exactly.) But there are infinitely many of both so, particularly when discussing foundations, you need to say which analyses your frequentist and Bayesian are using to get their different answers. Note that this is not the same as spelling out the arithmetic.

Your original post does not provide sufficient information for the reader to “reproduce those steps”. The post might be clear to you, but it’s not clear to the reader.

Also, when we “suppose someone simply told the Frequentist” the answer, i.e. the true value of mu, then the Frequentist can (trivially) give the point interval mu as a valid 95% confidence interval, and also one of optimal width. A frequentist analysis that defiantly ignores pertinent information is a straw man.

And regarding Central Limit Theorems – note the plural – there are (many) CLT results for correlated variables.

Finally, regarding being “seriously clueless”; you’re accusing a long, thoughtful and useful literature of being idiotic. Your basis for this is that the deviations of observations from the population mean (i.e. the epsilon) can never have a frequency distribution. But that can’t hold in situations where we *know*, by design, how observations are randomly sampled. Could you maybe try to explain your approach without attempting to rubbish everyone else?

• August 11, 2013 · Brendon J. Brewer

“You might still think of that distribution as being a “state of knowledge”, but the “knowledge” in question isn’t true.”

Agreed.

• August 12, 2013 · Joseph

George,

Yes I can. The next post is about that.

“A frequentist analysis that defiantly ignores pertinent information is a straw man”

That’s the exact opposite of what I said. I said they would NOT ignore pertinent information and that this is an implicit recognition that their job isn’t to model the “data generation mechanism” but is to describe the fixed parameter $\mu$ with as little uncertainty as possible.

The fact is that even when you’ve perfectly modeled the true “data generation mechanism” and you’ve got the perfect answer according to Frequentist principles, anyone with any additional knowledge of those fixed numbers, no matter how slight, can beat that “perfect answer”.

Here’s the R session log to get those intervals in the post:

```r
> rnorm(n=10, m=0, sd=10)
 [1] -14.762088  12.617334  -5.545190   2.062673  -3.403790  14.553407
 [7]   7.700046  13.452766   4.017266   5.608544
> x=.Last.value
> y=x+100
> y
 [1]  85.23791 112.61733  94.45481 102.06267  96.59621 114.55341 107.70005
 [8] 113.45277 104.01727 105.60854
> mean(y)-(10/sqrt(10))*1.96
[1] 97.43203
> mean(y)+(10/sqrt(10))*1.96
[1] 109.8282
> m=c(-13,13,-4,3,-4,13,8,13,5,5)
> mean(y-m)+(2/sqrt(10))*1.96
[1] 100.9697
> mean(y-m)-(2/sqrt(10))*1.96
[1] 98.49048
```

• August 12, 2013 · george

Thanks for the code. Your Bayesian seems to be doing something far different from most of the textbook ones, in which the posterior for mu depends just on the prior for mu, known sigma, and vector Y – and not vector m.

What is m and where did it come from?

Also, as per Konrad’s earlier point, why does the Frequentist know sigma exactly and the Bayesian get sigma wrong by a factor of five? This difference seems to account for a lot of the “improvement” you claim the Bayesian achieves.

Regarding what you said (or not) about straw men – you wrote that “There’s no way for the Frequentist to improve their answer by improving their model” and claimed that a Frequentist who was “simply told” the true value of mu couldn’t use this information. You don’t seem to distinguish between a valid confidence interval (that covers the truth in 95% of experiments) and one that is valid *and* makes full use of the available information. The former is a straw man, the latter not so much.

• August 12, 2013 · Joseph

George,

You’ve completely misread just about everything. I never said they would ignore the information; I said the exact opposite. What I said was they can’t simply improve their model of the data generating mechanism because they already have the correct model for it. What they will do is quietly ignore their exactly correct model and go with the better answer. I’ve repeated this three or four times now.

Your first three paragraphs show that you don’t get a single part of what I’m saying. I can’t tell if you’re just yanking my chain or serious, but I’ll try again just in case:

The Bayesian’s $P(\epsilon)$ is not a frequency. It’s merely a way to describe where the fixed errors $\epsilon$ lie in $\mathbb{R}^{10}$. The parameters (mean and variance) of that distribution are just there to shift the high probability manifold of $P(\epsilon)$ over a region that contains $\epsilon$. That is their only purpose and meaning.

I could have used something else. For example, I might have said “$\epsilon$ is within a radius of 4 of the point $m$” and used the indicator function on that set as “the distribution”. The computation would have been harder, but the point would be exactly the same.

The calculation is a straightforward application of Bayes’ theorem with a uniform prior for $\mu$. Given the $N(m, 2^2 I)$ distribution for the errors in the post and $Y_i = \mu + \epsilon_i$, the posterior for $\mu$ is $N(\overline{Y-m},\ 2^2/10)$. Then I just got the 95% Bayesian Credibility Interval from this posterior in the usual manner.
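That analytic posterior is easy to verify numerically. A sketch (Python): put the $N(m_i, 2^2)$ likelihood terms on a fine grid of $\mu$ values under a flat prior and read off the posterior mean and sd, which should come out as $\overline{Y-m} \approx 99.73$ and $2/\sqrt{10} \approx 0.632$.

```python
import math

y = [85.23791, 112.61733, 94.45481, 102.06267, 96.59621,
     114.55341, 107.70005, 113.45277, 104.01727, 105.60854]
m = [-13, 13, -4, 3, -4, 13, 8, 13, 5, 5]
s = 2.0

# unnormalized log-posterior under a flat prior: each error
# eps_i = y_i - mu is modeled as N(m_i, s^2)
def log_post(mu):
    return sum(-0.5 * ((yi - mu - mi) / s) ** 2 for yi, mi in zip(y, m))

# posterior mean and sd from a fine grid over mu in [90, 110)
grid = [90.0 + i * 0.001 for i in range(20000)]
w = [math.exp(log_post(g)) for g in grid]
tot = sum(w)
post_mean = sum(g * wi for g, wi in zip(grid, w)) / tot
post_sd = math.sqrt(sum((g - post_mean) ** 2 * wi
                        for g, wi in zip(grid, w)) / tot)
print(post_mean, post_sd)        # about 99.73 and 0.632
```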

• August 12, 2013 · george

You said “There’s no way for the Frequentist to improve their answer”. If by “answer” you mean “model for generation of Y” instead of “inference on mu” then you’re not describing the question the Frequentist actually addresses – which is inference on mu.

But back to less philosophical questions;

What is m and where does it come from? (And why isn’t the Frequentist using whatever source of information led us to choose this m?)

How is the posterior for univariate mu a 10-dimensional multivariate Normal?

NB I am not yanking anyone’s chain. You make extraordinary claims about the idiocy of standard statistical methods, and the superiority of something roughly Jaynesian. I want to see your claims explained and/or justified, clearly, with a fair representation of what the standard methods do and don’t achieve, i.e. without “poetic license”.

• August 12, 2013 · Joseph

George,

These questions have been gone over multiple times. Just reread and think until you get it.

Maybe one thing to help you focus. This post had a very specific purpose. The purpose was to determine whether the statistician should be striving to either:

(a) figure out things about the “data generation mechanism”, resulting in a frequency distribution for the errors, or
(b) figure out things about the fixed parameter $\mu$, resulting in a distribution which has the same philosophical status as a prior for $\mu$.

• August 12, 2013 · Brendon J. Brewer

The P(errors) being a multivariate normal is the prior distribution. Once you get the data you are way more informed about the errors in the data set.

• August 12, 2013 · Daniel Lakeland

Let me add some hopefully clarifying ideas here:

the “m” we assume comes from some kind of model; pretend it’s a deterministic model that includes some basic scientific knowledge. So not only does the data generating mechanism come from an IID normal, but it also comes from some actual process, and if we can somehow gain some information about the scientifically described causal process, then we can predict some of the “randomness”. An example might be some kind of computer in your shoe predicting the roulette wheel. To a very good approximation, in the long run, every slot in the wheel is evenly sampled. But in any GIVEN run you could predict it a little better by timing the ball (see the book “The Eudaemonic Pie” for the actual 1970s historical events this example is based on).

Joseph’s main point is that *there is no one-true frequency distribution* even in his example where there is. None of the Bayesian’s individual m values are based on *the long run frequency* of anything, they’re based on individual, one-off model predictions. Nevertheless, they give a better result, precisely because they give up on interpreting P as a frequency of anything, and instead allow it to be a “credibility” of a model prediction.

• August 12, 2013 · Joseph

The sum and product rules of probability theory are the key tools for propagating counts. Frequentists imagine that they are only useful for propagating frequency counts, but in truth they will propagate counts of anything: states, possibilities, or whatever.

So given the range of possibilities for $\epsilon$ described by the high probability manifold of the Bayesian’s $P(\epsilon)$, this induces, through the relation $Y_i = \mu + \epsilon_i$, a range of possibilities for $\mu$.

This “range of possibilities” for $\mu$ is described by the high probability manifold of the posterior. The 95% Credibility Interval of the posterior is just that “range of possibilities”.

Basically, the result shows that 95% of all possible values of $\epsilon$ lead to a $\mu$ in $[98.5, 101.0]$. Without having more information that could be used to confine $\epsilon$ to a smaller range of possibilities, this is the best we can do.

• August 12, 2013 · Brendon J. Brewer

How are you typing mathematics? Test: \mu \alpha \pm \in

Is this idea of a “high probability manifold” well defined? I think the idea works well in colloquial speech but if you try to make it precise you can confuse yourself. e.g. is $(0, 0, \dots, 0)$ in the high probability manifold of a 10000-dimensional unit Gaussian?

• August 12, 2013 – Joseph

Just put “latexpage” at the beginning of the comment with brackets [ ] in place of the quotes. I’ve been meaning to figure out a way to have this automatically inserted so you don’t have to do it by hand.

How about defining the high probability manifold like this .

• August 12, 2013 – george

Daniel; thanks for trying.

If a frequentist has a roulette-timing machine in their boot, their analysis should use it; the goal of frequentist analysis is not to replicate the data generating mechanism conditional on as little information as possible. See Cox 1958, and subsequent literature.

Similarly, if a “textbook” Bayesian – i.e. one interested in inference on mu directly, who might view Y as iid samples – has information that the Y are not simple iid samples – e.g. that the samples were measured, that we know the measurements, and that some came out slower than others (thus breaking exchangeability) – then they should use that information.

Joseph; you are evidently frustrated, and so am I. But I don’t see any argument here – or even an attempt at an argument, frankly – that gives a situation where Frequentist, “textbook” Bayesian, and your “should be doing” Bayesian/Jaynesian get a fair comparison. (Konrad didn’t, either.)

Regarding your (a)/(b) difference, I am far from convinced and suggest you try again. Give a physical example – Daniel’s roulette wheel is fine, or use the setup in Cox 1958 – where all the available information is stated clearly up front. State which actual analyses your Frequentist etc would then do, what they’d learn, and (if you can) why it would serve them so terribly badly.

Bye.

• August 13, 2013 – Joseph

George,

This post was not about showing Bayesians getting better results than Frequentists in a fair comparison, nor did it have anything to do with Bayesians having a better sampling theory which somehow Frequentists weren’t able to use.

Once again you’ve confirmed that you simply can’t understand what I’m saying.

• August 13, 2013 – Daniel Lakeland

George, I took the main point to be the one in the last paragraph of my previous post. To be philosophically consistent, a Frequentist cannot use a model like the Bayesian’s, in which the probability distributions do not have some kind of long-run frequency interpretation.

Of course, most likely Frequentists all over the place violate this philosophical principle a lot (they use “normal” when things are not anything like “normal”, for example). But in principle, if the sampling of Joseph’s measurements really is long-term N(100,10) and the Bayesian has a special model in which he can predict things with what seems to be a little smaller uncertainty via some additional knowledge, then we have to ask what the Frequentist is allowed to do in response to this similar knowledge? The problem is, what happens when we don’t have an opportunity to find out what the long-run frequency distribution of the Bayesian’s predictive model is? Suppose we’re going to measure, say, the position of a rocket that blows up. There is no opportunity to make further measurements once it’s blown up, and we need to know whether the rocket took the path it was supposed to. The Frequentist can’t evaluate the model based on the long-run frequency of errors. What next?

You could argue for example that we could shoot multiple rockets. But suppose there’s no money to do that unless we can show the first one did what it was supposed to. (It’s easy to translate this into a drug-discovery, manufacturing quality-control, or other context where we have very limited resources and have to make decisions before having anything we might call “long run frequencies”.)

Joseph’s point seems to be that if we know that typically measurements are within a certain distance of correct, we can describe this using pretty much any shape of probability distribution whose high-probability-region includes values only within about that distance from the measurement. Normal happens to be useful, but indicator functions on balls or mixtures of normals or mixtures of indicator functions, or raised cosine curves, or lots of other things could work as well. None of these are based on any kind of real “frequency” interpretation.