As laid out (apparently not too effectively) here, I’m on a quest, not only to learn about Bayesian inference at last, but also to assess how teachable it is. Of course I knew the basic basics, but anything in stats is notoriously easy to get wrong, and hard to teach well. So you can think of this in two complementary ways:
- I’m trying to ground my understanding and explanations in basic principles rather than leaping to higher-falutin’ solutions, however elegant; and
- I’m watching my own wrestling with the issues, seeing where I might go off-track. You can think of this as trying to develop pedagogical content knowledge through introspection. Though that sounds pretty high-falutin’.
To that end, having looked critically at some examples of Bayesian inference from the first chapters of textbooks, I’m looking for a prototypical example I might use if I were teaching this stuff. I liked the M&Ms example in the previous post, but here is one that’s simpler—yet one which we can still extend.
There are two coins. One is fair. The other is two-headed. You pick one at random and flip it. Of course, it comes up heads. What’s the probability that you picked the fair coin?
Let’s say that F is “we picked the fair coin.” Before we flipped it, P(F) = 0.5. Now we want the probability of F, given that we get heads: P(F|H). That is, the coin flip, the heads, is data. And we want to use the data to update our understanding.
The obvious (smells of Monty Hall) answer is that flipping makes no difference. We picked it at random. It’s gonna stay 50%. (Depending how rabid we are, we might say, “you already picked the coin; the probability is 1 or 0.” But we won’t.) But imagine continuing to flip. Ten times. All heads. Now what do you think? Clearly, it’s really unlikely that it’s the fair coin. In this situation, we could imagine doing a traditional null-hypothesis test. How many times out of a thousand would you get all heads in ten flips of a fair coin? About once. We would reject that null as implausible. But we would be forbidden from making a statement about the probability that the null hypothesis, F, is true.
But Bayes lets us do just that. We can enumerate the possibilities (for one flip) in a diagram:
Our choice of column, the coin, is 50–50. Within each column, the flip is fair, but on the left side (two-headed coin, green boxes), the two results are the same: both heads. So the diagram enumerates four equally likely possibilities. We can count up the number of times heads shows up (3) and how many of those are from the fair coin (1), giving us our probability: P(F|H) = 1/3.
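The counting argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `COINS`, `enumerate_outcomes`, and `prob_fair_given_all_heads` are invented here, and each coin is modeled as a pair of faces so that every listed outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

# Each coin is described by its two faces; the two-headed coin has "H" twice.
COINS = {"fair": ("H", "T"), "two-headed": ("H", "H")}

def enumerate_outcomes(n_flips=1):
    """List every equally likely (coin, faces shown) combination."""
    outcomes = []
    for coin, faces in COINS.items():
        for shown in product(faces, repeat=n_flips):
            outcomes.append((coin, shown))
    return outcomes

def prob_fair_given_all_heads(n_flips=1):
    """Count the all-heads outcomes and the share that came from the fair coin."""
    outcomes = enumerate_outcomes(n_flips)
    all_heads = [o for o in outcomes if all(face == "H" for face in o[1])]
    fair = [o for o in all_heads if o[0] == "fair"]
    return Fraction(len(fair), len(all_heads))

print(len(enumerate_outcomes(1)))    # 4 possibilities for one flip
print(prob_fair_given_all_heads(1))  # 1/3
```

Counting works here only because both coins are equally likely to be picked, so every listed outcome carries the same weight.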
How does this correspond to the (un)familiar formula? Here it is, with A and B:
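Written out in the usual notation for two events A and B, Bayes's rule is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```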
For our situation, A is F: we have a fair coin. B is H: we flipped heads.
Does it work?
- P(F) is the probability that we picked the fair coin originally: 1/2.
- P(H|F) is the probability that, if we picked the fair coin, we’d get heads: 1/2.
- P(H) is the hard one: the probability that, in this context, not knowing which coin we have, we get heads. We need the diagram or its equivalent. The result: 3/4.
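Substituting those three values into Bayes's rule, with A = F and B = H, gives the same 1/3 that counting boxes in the diagram did:

```latex
P(F \mid H) = \frac{P(H \mid F)\, P(F)}{P(H)}
            = \frac{\tfrac{1}{2} \cdot \tfrac{1}{2}}{\tfrac{3}{4}}
            = \frac{1}{3}
```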
Notice a potential pitfall. We touched on this last time: the two letters in the formula—A and B; F and H—are categories from different attributes, different variables. I’ll bet there is a tendency to say, “there are two possibilities, A and B.” No: there are two possibilities, A and not A. Within each, there are two possibilities, B and not B.
What happens if we flip the coin another time, and get another head? There are two ways to do this. First, we can enumerate all the possibilities for both tosses, like this:
With one more head, the probability that we have the fair coin has fallen from 1/3 to 1/5. In the Bayes formula, P(HH|F) is 1/4 and the denominator, P(HH) is 5/8:
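Putting those pieces into Bayes's rule, with the original P(F) = 1/2 as the prior:

```latex
P(F \mid HH) = \frac{P(HH \mid F)\, P(F)}{P(HH)}
             = \frac{\tfrac{1}{4} \cdot \tfrac{1}{2}}{\tfrac{5}{8}}
             = \frac{1}{5}
```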
But there’s another way. Suppose we don’t analyze both tosses at once? Suppose we’ve already done one toss. In that moment, we know that the probability that we have the fair coin P(F) is 1/3—not the original P(F) = 1/2.
Now we get one more head. How does that change our estimate of the probability? Here’s a diagram showing one more flip (not two as before), but starting with P(F) = 1/3.
If you look at the areas, you can see that since each of the two green H’s is twice the size of the orange one, the orange H is 1/5 of the total area of H. That is, the probability that the coin is fair (orange) given that it’s heads, P(F|H), comes out to 1/5, just as it did when we enumerated both tosses at once.
In the formula, as we said, P(F) is 1/3. (It’s the prior.) P(H|F) is still 1/2. It’s a fair coin, after all. P(H) is harder. With this diagram, we can count up area (and get 5/6), but pretty soon we’ll need to do that thing where you add up (conditional) probabilities weighted by the “outer” probabilities. (At least one book refers to dividing by this quantity as normalizing, which makes sense; it’s the size of the universe of “we flipped heads.”) In this case,
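the weighted sum has one term per coin. Writing ¬F for "we picked the two-headed coin" (notation introduced here for illustration) and using the updated P(F) = 1/3:

```latex
P(H) = P(H \mid F)\, P(F) + P(H \mid \neg F)\, P(\neg F)
     = \tfrac{1}{2} \cdot \tfrac{1}{3} + 1 \cdot \tfrac{2}{3}
     = \frac{5}{6}
```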
With that, our formula works out:
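Plugging in, P(F|H) = (1/2 × 1/3) / (5/6) = 1/5, which matches the two-toss enumeration. The same one-flip-at-a-time update can be sketched in Python. The function name `update_on_heads` and the use of `fractions.Fraction` are choices made for this illustration, and the sketch assumes every flip comes up heads.

```python
from fractions import Fraction

def update_on_heads(prior_fair):
    """Return P(fair | heads) after one more head, given the current P(fair)."""
    p_heads_if_fair = Fraction(1, 2)
    p_heads_if_two_headed = Fraction(1)
    # Normalizing constant: total probability of heads, weighted by the prior.
    p_heads = (p_heads_if_fair * prior_fair
               + p_heads_if_two_headed * (1 - prior_fair))
    return p_heads_if_fair * prior_fair / p_heads

p_fair = Fraction(1, 2)  # coin picked at random
history = []
for _ in range(10):
    # The posterior from the previous flip serves as the prior for this one.
    p_fair = update_on_heads(p_fair)
    history.append(p_fair)

print(history[0], history[1], history[9])  # 1/3 1/5 1/1025
```

After ten heads in a row the probability of the fair coin is down to 1/1025, which is the kind of statement the null-hypothesis test would not let us make.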
Reflecting on Priors
Although I’m convinced this analysis has been correct, I’m uneasy (on behalf of future students) about P(F) being the prior. I think the potential confusion comes from the implied temporal nature of the calculation. How is it that P(F|H) is the same kind of thing as P(F), only later?
On one hand, it’s easy: you have a probability of having the fair coin, P(F); then you flip heads (H); the result, the posterior, is the probability of the fair coin given that we flipped heads, P(F|H).
On the other hand, we probably first learned about Bayes’s formula in a non-sequential context. Suppose we study the association between sex and whether your first name ends in a vowel. (Not surprisingly, more girls’ names end in vowels.) If we’re about to meet a person chosen from a group of 100 boys and 100 girls, the chance that it’s a girl is 50%. But if we learn that the person’s name ends in a vowel, the probability that they are female rises.
It’s not the sequence, then, that the situations have in common, but this: the “other” quantity (the one on which we condition: in our examples, H, flipping heads; or V, the name ending in a vowel) is additional information that we can use to make a better estimate of a probability, and therefore make a better prediction.
So the “prior”-ness of the prior is not temporal, but rather informational. We’re comparing the situation without the additional information to the situation with it. The difference is data.
Notice how this is like regression or ANOVA, where additional information (the x value, the predictor) gives us a more precise prediction for the quantity of interest (y, the response).
The Initial Prior—and subjectivity
In our coin example, we specified that we chose one of the two coins at random. This gave us our initial 50–50 split and the associated diagram.
But suppose that we didn’t think it was so likely that a coin was double-headed. Maybe we think there’s only a 1% chance that the coin is two-headed. Then the calculation will come out differently. For one thing, getting two heads in a row will not seem that strange; it’s not good evidence against the fair coin. (We’ll explore this in a later post.) That is, our initial opinion about the probability affects our results—and will ultimately influence any decisions we make based on those results.
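To get a feel for how much the initial prior matters, here is a rough sketch comparing the 50–50 prior with the 1%-two-headed prior after two heads. The helper `posterior_fair` is a name made up for this illustration, and it assumes the only alternatives are a fair coin and a two-headed one.

```python
from fractions import Fraction

def posterior_fair(prior_fair, n_heads):
    """P(fair | n_heads heads in a row), fair coin vs. two-headed coin."""
    like_fair = Fraction(1, 2) ** n_heads  # P(all heads | fair)
    like_two_headed = Fraction(1)          # P(all heads | two-headed)
    evidence = like_fair * prior_fair + like_two_headed * (1 - prior_fair)
    return like_fair * prior_fair / evidence

for prior in (Fraction(1, 2), Fraction(99, 100)):
    post = posterior_fair(prior, 2)
    print(f"prior {float(prior):.2f} -> posterior {float(post):.3f}")
# prior 0.50 -> posterior 0.200
# prior 0.99 -> posterior 0.961
```

With the skeptical prior, two heads move the probability of the fair coin only from 0.99 to about 0.96, whereas the 50–50 prior drops to 0.2.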
Opinion seems inescapable here; and that could give some folks pause. I think I’m happy to accept that any time we have a probability, there is some element of subjectivity. But helping students accept that—and not conclude that therefore all answers are equivalent—will be something to pay attention to.