It’s such a joy when my daughter asks for help with math. It used to happen all the time; it’s rare now. She just started medical school, and had come home for the weekend to get a quiet space for concentrated study.
“Dad, I have a statistics question.” Be still, my heart!
“It’s asking, if you have a random mRNA sequence with 2000 base pairs, how many times do you expect the stop codon AUG to appear? How do you figure that out?”
I got her to explain enough about messenger RNA so that I could picture this random sequence of 2000 characters, each one A, U, G, or C, and remembered from somewhere that a codon was a chunk of three of these.
“I think it’s more of a probability or combinatorics question than stats…” I said. (I was wrong about that; interval estimates come up later. Read on.)
The First Thing to Go Wrong
“Yeah,” she said, “it’s a permutation thing, right?”
I saw right then how she could get derailed. She was picturing some dimly-remembered bit of math, the part that had binomial coefficients and ratios of factorials, when in fact this one is all about the fundamental counting principle: a fancy way to say that there are 4 times 4 times 4 different codons (four choices for each of three letters), and that’s 64. If every codon is equally likely, any given three-letter spot in the sequence is AUG one time in 64. So the answer ought to be 2000/64, or about 31.
We talked through that and she got it, no problem.
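Here is a minimal Python sketch of that counting argument, for anyone who wants to see the 64 listed out. The names (BASES, codons, and so on) are mine, not part of the original problem, and it counts the 1998 three-letter windows in a 2000-letter sequence rather than a round 2000, which barely changes the answer.

```python
from itertools import product

BASES = "AUGC"

# Fundamental counting principle: one choice of base for each of three
# positions, so 4 * 4 * 4 possible codons.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
num_codons = len(codons)

sequence_length = 2000
# A sequence of 2000 letters contains 1998 three-letter windows.
num_windows = sequence_length - 2

# Assuming equally likely, independent bases, each window is AUG with
# probability 1/64, so the expected count is the number of windows over 64.
expected_count = num_windows / num_codons

print(num_codons, round(expected_count, 1))  # prints: 64 31.2
```

Nothing here needs n or k; it is just "what do I multiply?" followed by one division.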
But I was left with a feeling that we’d hit an important rock in the daughter’s mathematical ocean. And here’s what I think it is:
We take our successful students (like the daughter) and rocket them through combinatorics so that they can, however briefly, understand how to calculate binomial coefficients, distinguish combinations from permutations, and solve problems about license plates with 6 non-repeating letters.
But what they need is the ability to recognize a counting situation and reason about it, saying, there are 26 ways to choose the first letter, 25 ways to choose the second, and so forth, and to know deeply that you multiply these numbers to get the total number of configurations.
That is, we go on to the formula too soon. The formula comes from a product (and quotient if order doesn’t matter) which we foolishly render with factorials. I mean, factorials are elegant and the formula is general, but I think its form obscures the underlying meaning. It erases why it works. So much so that when you get a simpler problem (like the genetics one) you wonder “what are n and k?” instead of “what do I multiply?”
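To make the point concrete, here is a small Python sketch of the license-plate example (6 non-repeating letters from 26). It is only an illustration, with variable names of my own choosing; it shows that the factorial formula is the same product in disguise.

```python
import math

# Six non-repeating letters where order matters:
# 26 ways to pick the first, 25 for the second, and so on.
choices = [26, 25, 24, 23, 22, 21]
by_product = math.prod(choices)

# The formula n! / (n - k)! with n = 26 and k = 6 gives the same number;
# the factorials just cancel everything below 21.
by_formula = math.factorial(26) // math.factorial(26 - 6)

# If order doesn't matter, divide by the 6! ways to arrange the six letters.
unordered = by_product // math.factorial(6)

print(by_product, by_formula, unordered)  # prints: 165765600 165765600 230230
```

The product form says what is being counted; the factorial form only says how to compute it.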
This happens all the time, but I didn’t expect to see it right then. (Perhaps permutation and combination formulas should go on my little darlings list.) So let’s add it to whatever collection of examples we ought to be compiling about abstraction and formalism getting in the way of understanding.
The Second Thing to Go Wrong
As I sat there I naturally constructed a Fathom simulation to check. I also wanted to check, just to make sure, that I got the same answer when I grabbed three letters randomly, 2000 (well, 1998) times, as I got when I made a list of 2000 letters and then looked at all 1998 three-letter sequences.
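I did this in Fathom, but the same check can be sketched in Python. What follows is not the Fathom document, just a rough equivalent under the problem's assumptions (equally likely, independent bases); the function names, the seed, and the number of trials are arbitrary choices of mine.

```python
import random

BASES = "AUGC"
TARGET = "AUG"
LENGTH = 2000
WINDOWS = LENGTH - 2  # 1998 three-letter windows


def count_in_independent_triples(rng):
    """Grab three random letters, 1998 separate times; count the AUGs."""
    return sum(
        "".join(rng.choices(BASES, k=3)) == TARGET for _ in range(WINDOWS)
    )


def count_in_sequence(rng):
    """Make one list of 2000 letters; count AUG over all 1998 windows."""
    seq = "".join(rng.choices(BASES, k=LENGTH))
    return sum(seq[i:i + 3] == TARGET for i in range(WINDOWS))


rng = random.Random(1)  # fixed seed so a run can be repeated
TRIALS = 300

mean_triples = sum(count_in_independent_triples(rng) for _ in range(TRIALS)) / TRIALS
mean_sequence = sum(count_in_sequence(rng) for _ in range(TRIALS)) / TRIALS

print(f"independent triples: {mean_triples:.1f}")
print(f"sliding windows:     {mean_sequence:.1f}")
```

Both averages should land near 1998/64, about 31.2. The overlapping windows in the second method are not independent of one another, but that affects the spread of the counts, not their average.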
At which point I saw the stats/modeling issue. And it’s this:
You know the prof wanted the answer “31.” It’s the orthodox expected value. But in fact, if you do it, you shouldn’t actually expect to get exactly 31. It’s a relatively rare result. (See the graph.) 31 is the average of the answers you get, but you wouldn’t be surprised if you got anything from maybe 22 to 43.
I’m not just being an interval-estimate weenie. Honest. The medical school peeps are training doctors and researchers. The assumptions of the problem—the codes are equally likely, they’re independent—are a model. We make models to see their consequences; we use them to predict, and we use reality to evaluate our models. So if you actually have 2000 base pairs of mRNA and you count the AUGs, and you get 37 of them, what do you conclude? If you did the problem and got 31, and that’s all you did, you might decide that something was wrong, whereas we can see in the graph that it’s completely consistent with the original assumptions.
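If you don’t have the graph handy, here is a hedged Python sketch that rebuilds the same kind of picture numerically. It assumes the problem’s model (equally likely, independent bases), and the seed, the trial count, and the choice of a middle-95% range are mine, not anything from the original problem.

```python
import random

BASES = "AUGC"
TARGET = "AUG"


def count_aug(rng, length=2000):
    """Count AUG across every three-letter window of one random sequence."""
    seq = "".join(rng.choices(BASES, k=length))
    return sum(seq[i:i + 3] == TARGET for i in range(length - 2))


rng = random.Random(2)  # fixed seed so a run can be repeated
TRIALS = 2000
counts = sorted(count_aug(rng) for _ in range(TRIALS))

mean = sum(counts) / TRIALS
share_exactly_31 = counts.count(31) / TRIALS

# Rough middle-95% range, read straight off the sorted simulated counts.
low = counts[int(0.025 * TRIALS)]
high = counts[int(0.975 * TRIALS) - 1]

print(f"mean count:            {mean:.1f}")
print(f"share that are 31:     {share_exactly_31:.3f}")
print(f"middle 95% of counts:  {low} to {high}")
print(f"is 37 inside that?     {low <= 37 <= high}")
```

The mean comes out near 31, yet only a small share of runs give exactly 31, and a count of 37 sits comfortably inside the middle of the distribution.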