I’ve just posted twice about activities on variation. Variation is an important topic in stats. In order to decide on the best path through the material, it might help to ask why we care about variation in the first place.
Here’s a quarter-baked musing: variation matters in two related domains, prediction and classification.
Prediction
When we want to predict what will happen next, knowledge of variation can inform us about the range of phenomena to expect. For instance, if you know your timer runs out of sand in 85 to 90 seconds, you can make plans. You can wash your hands or go pick up the newspaper and be confident that the sands will still be falling when you return. If you don’t care about five seconds’ advantage, you know that it’s a fair tool to use when you and Aloysius play Perquackey.
This is what I’ve been clumsily trying to get at in class (and, having written this, maybe I can be a little more agile). But there’s more to it than that. We want to make the best predictions possible, and there are two ways to tell if a prediction is any good:
- It’s right.
- It’s narrow.
The first is obvious. The second is not, but is the point of a lot of statistics. That is, with my timer, I will be right every time if I predict that it will take between a second and a year for the sands to run out. But as a prediction, it’s useless. It’s too broad.
So if we can narrow the timer range down from [85, 90] to [86, 88], and we’re still right every time, we would say that it’s a better prediction. It’s a clearer, more precise description of the phenomenon. (The wider range may be good enough for some purposes; you don’t have to look for a better prediction all the time.)
It gets interesting, however, when the two goals—being right and being narrow—conflict. Let’s concoct another example: in Rivertown, USA, the water rises every spring to between 15 and 20 feet. If we know it’s always in that range, we build the levee to 20 feet and forget it. But if the river only ever gets to 19 feet, it’s more economical to stop there, at 19 feet. That is, there is a cost to making a broader prediction, a benefit to making it narrower.
But there is also a cost to making it too narrow. Suppose that every 20 years or so the river rises to 21 feet. We have a dilemma: do we spend the extra money to raise the wall to 21 feet, or accept the occasional cost of a flood? That is, what is the balance between being narrow and being right? This is a central issue in “official” stats, for example, in confidence intervals, but I think it’s one my regular stats kids can grapple with immediately, especially if we can concoct a setting where the costs and benefits are clear. (Note to self: the DataGames “lawn darts” scenario just got better with this insight…)
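To make that trade-off concrete, here’s a tiny simulation sketch in Python. Every number in it (construction cost per foot, flood damage, the chance of a 21-foot spring) is invented, but it shows how “narrow versus right” turns into a cost calculation:

```python
import random

# A toy version of the levee dilemma: balance the cost of building a higher
# wall against the occasional cost of a flood. Every number here is invented.

random.seed(1)

COST_PER_FOOT = 200_000   # hypothetical construction cost per foot of levee
FLOOD_COST = 60_000       # hypothetical damage each time the river tops the levee
HORIZON = 50              # years we expect the levee to last
TRIALS = 10_000           # simulated springs, to estimate how often it floods

def crest():
    """Most springs the river crests between 15 and 20 feet,
    but about one spring in 20 it reaches 21."""
    return 21.0 if random.random() < 0.05 else random.uniform(15.0, 20.0)

crests = [crest() for _ in range(TRIALS)]

for height in (19, 20, 21):
    flood_rate = sum(c > height for c in crests) / TRIALS
    expected_cost = height * COST_PER_FOOT + flood_rate * HORIZON * FLOOD_COST
    print(f"levee {height} ft: floods about {flood_rate:.0%} of springs, "
          f"expected {HORIZON}-year cost ${expected_cost:,.0f}")
```

With these made-up numbers, 20 and 21 feet come out close, which is the point: how narrow you should be depends on what being wrong costs.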
Experience, together with the mathematics of variation, also helps us predict values we have not yet observed. If we have timer measurements of 86.41, 86.44, and 86.49 seconds, and we want to describe a secure range for future values, we’d be foolish to use [86.41, 86.49]: it’s very unlikely that just three measurements happened to capture the smallest and largest values the timer can produce. On the other hand, we’d be pretty confident in excluding a value as different as, say, 20.00 seconds, because the variation among the existing values is tiny compared with that difference, even though there are only three of them.
Put another way, knowing about variation also helps us know what not to expect.
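Here’s a quick way to check that intuition with a simulation in Python. I’m pretending the timer’s true times are roughly normal around 86.45 seconds with a small spread (both numbers invented to match the three measurements above), and asking how often a fourth measurement lands inside the range of the first three:

```python
import random

# How trustworthy is the range of just three measurements?
# Simulate a timer whose times are roughly normal around 86.45 s,
# take three measurements, and ask: does a fourth measurement
# land inside the range of the first three?

random.seed(2)

TRIALS = 100_000
inside = 0

for _ in range(TRIALS):
    sample = [random.gauss(86.45, 0.04) for _ in range(3)]
    fourth = random.gauss(86.45, 0.04)
    if min(sample) <= fourth <= max(sample):
        inside += 1

print(f"fourth value inside the three-measurement range: "
      f"{inside / TRIALS:.1%} of the time")
# In theory this is (n - 1)/(n + 1) = 2/4 = 50% for n = 3, so the observed
# range of three values is a poor "secure range" for future ones.
```

The observed range of three measurements misses a future value about half the time; “secure” it is not.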
Classification
Suppose we’re watching a bunch of sparrows. They’re all small and brown, but they do vary: some are a little bigger, some smaller, some more brown, some a little less. Once in a while, a particularly chubby sparrow appears; our reaction might be “wow, what a fat sparrow!”
But then one shows up that’s seriously surprising. It’s huge. There are two possibilities:
- Our internal “range of sparrow sizes” needs adjusting.
- It’s not a sparrow.
So we use the size of the variation among sparrows to help us separate the sparrows from the non-sparrows; quantifying variation facilitates classification.
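Here’s that idea as a minimal sketch in Python: quantify the variation among the sparrows, then flag a newcomer that sits way outside it. The masses are invented, and the three-standard-deviation cutoff is just one plausible rule, not the “right” one:

```python
import statistics

# Quantifying variation to classify: if a new bird's mass is far outside
# the typical spread of sparrow masses, maybe it isn't a sparrow.
# The masses below (in grams) are invented for illustration.

sparrow_masses = [27.1, 29.4, 25.8, 30.2, 28.0, 26.5, 31.3, 27.7, 29.0, 28.6]

mean = statistics.mean(sparrow_masses)
sd = statistics.stdev(sparrow_masses)

def how_surprising(mass):
    """Return how many standard deviations a mass is from the sparrow mean."""
    return (mass - mean) / sd

for bird, mass in [("chubby sparrow", 33.0), ("mystery bird", 95.0)]:
    z = how_surprising(mass)
    verdict = "plausibly a sparrow" if abs(z) < 3 else "probably not a sparrow"
    print(f"{bird}: {mass} g is {z:+.1f} SDs from the mean -> {verdict}")
```

Move the cutoff and you trade misclassified chubby sparrows for missed non-sparrows; that tension is a cousin of the levee dilemma above.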
I’ll stop here. It’s time for breakfast. But the astute reader (which I hope includes me when I reread this) will see how this connects not only to classification but also to hypothesis testing in orthodox stats and thence, of course, to the nature of science.
It also connects to that still-unwritten essay about how there are no categorical variables.