When we teach about the Pythagorean Theorem, we almost always, at some point, use a 3-4-5 triangle. The numbers are friendly, and they work. We don’t usually make this explicit, but I bet that many of us also carry that triangle around in our own heads as an internal prototype for how right triangles work—and we hope our students will, too. (The sine-cosine-1 triangle is another such prototype that develops later.)

In teaching about (frequentist) hypothesis testing, I use the Aunt Belinda problem as a prototype for testing a proportion (against 0.5). It’s specific to me—not as universal as 3-4-5.

Part of this Bayesian quest, I realized, is to find a great example or two that really make Bayesian inference clear: some context and calculation that we can return to when we need to disconfuse ourselves.

### The Paper Cup Example

Here’s the one I was thinking about. I’ll describe it here; later I’ll explain what I think is wrong with it.

I like including empirical probability alongside the theoretical. Suppose you toss a paper cup ten times, and 8 of those times it lands on its side. At that point, from an empirical perspective, P( side ) = 8/10. It’s the best information we have about the cup. Now we toss it again and it lands on its side. Now the empirical probability *changes* to 9/11.

How can we use a Bayesian mechanism, with 8/10 as the prior, to generate the posterior probability of 9/11?

It seemed to me (wrongly) that this was a natural. The point of Bayesian inference is that our conclusions change with additional data; Bayesians tell us that the computations are not that hard; they imply that it’s conceptually actually easier; and here is a clear situation in which our inference about the parameter P(side) changes with data.

So at breakfast at ICOTS, I asked a Bayesian sitting at my table: how do you formulate this? Her response: “It’s not that simple.”

I found that response enormously frustrating in the light of Bayesians’ enthusiasm for their method. And since I’m a closet Bayesian trying hard to come out, I really wanted an easy, clear prototype to help my understanding—and my eventual evangelism.

Well. One cannot condemn the field because of a short, inadequate response from a hungry conference-goer. But I still want that example. So let’s fast-forward to the books I’ve been reading. They have examples, and some are better than others. Let’s compare two.

### The Rigged Coin Example

In one example, you have a coin that might be rigged. You know (for sure) that the probability of heads is either 0.2, 0.5, or 0.8. You have reason to believe that these three possibilities are equally likely. You flip the coin ten times and get seven heads. Now what is the probability that the coin is fair?

You can see, vaguely, how we’re going to use conditional probability to answer this question—it’s analogous to the question from before: *if she’s wearing a crown, what’s the probability that she’s a princess?* But there are a bunch of details to get right in how we formulate the calculation, and we’ll postpone that for another post.

Instead, let’s focus on problems with this example. There are at least two:

- Why do we think that the three possibilities are equally likely? I’m not actually too worried about this one. As much as we want our examples to be authentic and realistic, we can’t always do it. We could invent a more complicated mechanism—we have three coins certified to have these probabilities, and we draw one out of a bag—but that may just be overkill.
- Why use a coin? This is more serious, and here’s why: while I was working through this example, I was constantly distracted by the possibility that the coin’s parameter might be 0.55 or 0.618. The parameter is, by its nature, continuous, and we’re looking at it in a categorical context.

I think it’s best that the initial example be categorical. Then we can add things up in order to compute probabilities; when we go all continuous, we’ll have to integrate. Better to stick with addition while we’re developing conceptual understanding and intuition.
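To make that “adding things up” concrete, here’s a minimal sketch of the discrete update for the rigged-coin setup (equal priors on 0.2, 0.5, and 0.8; seven heads in ten flips). This is just a preview of the kind of calculation we’ll walk through properly later:

```python
# Discrete Bayesian update for the rigged-coin example:
# three candidate values of P(heads), equal priors, 7 heads in 10 flips.
from math import comb

def discrete_posterior(priors, heads, flips):
    """Return the posterior over candidate values of p."""
    # prior * binomial likelihood for each candidate p
    unnorm = {p: prior * comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
              for p, prior in priors.items()}
    total = sum(unnorm.values())          # the "adding things up" step
    return {p: w / total for p, w in unnorm.items()}

post = discrete_posterior({0.2: 1/3, 0.5: 1/3, 0.8: 1/3}, heads=7, flips=10)
# post[0.5] -- the updated probability that the coin is fair -- is about 0.37
```

Notice that no integration is needed: with only three candidate values of the parameter, normalizing is a three-term sum.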

### The M&Ms Example

This is from Downey’s *Think Bayes*:

> [Before 1995,] the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.
>
> Suppose a friend of mine has two bags of M&M’s, and he tells me that one is from 1994 and one from 1996. He won’t tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow one came from the 1994 bag?

Again, this is a contrived situation, but that doesn’t bother me. It is a little convoluted, but at least it’s purely categorical, so it doesn’t distract the reader with the chance of an underlying probability not in the set.

Does the categorical nature of the example outweigh the complexities (twelve probabilities in the problem statement, of which we need maybe 4; the fact that we have to keep track of two M&Ms, one from each bag)? I’m not sure—but at the moment, I like the second one better. Next time, we’ll do the calculations.

You can start to imagine how the calculations will go: there will be questions like, *what’s the probability that the first M&M is yellow given the fact that it’s from the 1994 bag*? (0.2). We’ll construct formulas using Bayes’s Theorem to figure out various probabilities. Both examples lend themselves to a ribbon-chart or two-way-table kind of thinking; we can imagine using diagrams like the ones I use with conditional probability (from the previous post) to help make sense of things. For the M&Ms, there will be columns for the two different possibilities (a.k.a. hypotheses): the yellow M&M is from 1994 or 1996. Next time.
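As a preview, that two-column bookkeeping is small enough to sketch in a few lines (hypothesis A: the yellow M&M is from the 1994 bag; hypothesis B: it’s from the 1996 bag; equal priors):

```python
# Two hypotheses about which bag the yellow M&M came from, equal priors.
prior_a = prior_b = 0.5

# Hypothesis A: yellow from the 1994 bag AND green from the 1996 bag.
like_a = 0.20 * 0.20          # P(yellow | 1994 mix) * P(green | 1996 mix)
# Hypothesis B: yellow from the 1996 bag AND green from the 1994 bag.
like_b = 0.14 * 0.10          # P(yellow | 1996 mix) * P(green | 1994 mix)

total = prior_a * like_a + prior_b * like_b
p_a = prior_a * like_a / total
# p_a works out to 20/27, about 0.74
```

Again, everything is a sum over a small set of hypotheses: no continuous parameter lurking in the background.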

### Return to the Paper Cups: Where I Went Wrong

But before we close, notice that we can’t easily cast the paper-cup example I want that way. I’ve flipped it ten times, now we’ll get one more data point. Where’s the two-way table? Even though the question is all about updating probabilities based on data, it doesn’t easily fit the Bayes-situation paradigm.

Where did I go wrong? Let’s review: the *empirical* probability of a cup landing on its side changed from 8/10 to 9/11 (0.80 to about 0.82). But what’s the *actual* probability? We don’t know. The empirical probability is an estimate. We expect it to be in the vicinity. We’d be surprised if it were, say, 0.3.

But in Bayesian terms, *we’re not calculating an actual, specific probability*. Instead, we’re calculating our *belief* in a probability—and unless we’re certain (and we’re not), that belief is a distribution. In this case, it’s a smear between about 0.5 and 0.95, a *continuous* probability distribution. So there are two probabilities going on here: the probability of landing on its side, P(Side), and *completely separately*, this distribution of our beliefs, where we might say, the probability *density* of our belief that P(Side) is 0.80 is some number, say 4.2.

Aside on learning: O stats teachers, you know how it takes a while (sometimes longer than the course) for a student to really understand the difference between a distribution of the data and a sampling distribution? There are two distributions, and they mean different things. I think I’m going through the analogous phase in becoming a card-carrying Bayesian: there are two probabilities, and they mean different things. It’s also kinda meta, like so many things these days: our prior and posterior distributions are probability distributions of—probabilities.

The Bayes world won’t give us a number shifting from 0.80 to 0.82. Instead, we get a prior *distribution* of our belief about P(Side) which will center near 0.8 but have some spread. (After all, with 8 “sides” in 10 flips, are we *certain* that P(Side) is 0.8? Of course not!) Then, after the next flip, another *side*, that distribution will shift north towards about 0.82.

I think that the peaks of these distributions are called *maximum likelihood estimates*, or MLEs. If I’m right, that’s one thing Bayesian calculations will give us.

I promise to get to that, and show you the distributions, but this “simple” example was unwittingly continuous. It will therefore require integration; it’s not simple. So next time, we’ll look for a really great categorical example or two and do the calculations involving sums.

Finally, if I’m right in that analysis above, notice how much insight I got out of wrestling with the wrong example. (Also, if I’m right, I really wish that Bayesian had made this explanation. Maybe she did and I needed more experience to hear it.)

### Comments

Yes, the 8/10 or 9/11 is a maximum-likelihood estimate for the parameter p, and you don’t need Bayesian statistics for MLEs (they are more of a frequentist estimation technique). What you have in a Bayesian context is a prior distribution of possible values for p (for this example, a beta distribution works well, because it is a conjugate prior for the binomial likelihood), then an updated posterior distribution (again a beta distribution). The popular prior distribution to use is the beta distribution with parameters (a=1, b=1), because that has all values of p being equally likely.

If the counts of your observations are 8 and 2, the posterior distribution for p is the beta distribution with a=8+a_prior, b=2+b_prior. Because the prior parameters get added to the counts to get the posterior parameters, the prior parameters are often referred to as pseudocounts. The mean of a beta distribution is a/(a+b), so using the (1,1) prior (add-one pseudocounts), your prior mean estimate would be 1/2 and your posterior mean estimate (8+1)/(10+2)=0.75 for 8 out of 10 and (9+1)/(11+2) = 0.769 for 9 out of 11. The full posterior beta distribution can be plotted (it peaks at (a-1)/(a+b-2) if a>1, b>1, which is 0.8 and 0.818 as one would like, but the means are closer to the middle than the modes), and things like credible intervals or distributions of various functions of p can be determined.
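The beta arithmetic in that comment is easy to check with a few lines (a sketch, using the (1, 1) uniform prior):

```python
# Beta-binomial update for the paper cup: under a Beta(a, b) prior,
# the posterior after the flips is Beta(successes + a, failures + b).
def beta_update(successes, failures, a_prior=1, b_prior=1):
    a = successes + a_prior
    b = failures + b_prior
    mean = a / (a + b)
    # The mode (peak) exists only when a > 1 and b > 1.
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return mean, mode

mean1, mode1 = beta_update(8, 2)   # 8 "sides" in 10 flips
mean2, mode2 = beta_update(9, 2)   # after one more "side"
# mean1 = 0.75, mode1 = 0.8; mean2 is about 0.769, mode2 about 0.818
```

The modes land exactly on 8/10 and 9/11, which is the connection between the empirical probabilities and the peaks of the posterior distributions.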

I can’t think of a numerical example right now, but going together with your Aunt Belinda problem I do know Bayesian testing is good for parapsychology (I have seen it in an actual paper), because you want to start with the prior assumption that psychic powers are very, very unlikely.

Yeah — somewhere in this blog, I have a reference to a wonderful talk by Jessica Utts (just do a search) on the topic. It addresses the deep question: how much evidence do you need to change your mind?
