# Stability: A Central Idea for the New Stats

If we replace the old Normal-distribution-based paradigm with a new one featuring randomization, we happily shed some peculiar ideas. For example, students will never have to remember rules of thumb about how to assess Normality or the applicability of Normal approximations.

This is wonderful news, but it’s not free. Doing things with randomization imposes its own underlying requirements. We just don’t know yet what they all are. So we should be alert, and try to identify them.

Last year, one of them became obvious: stability.

(The idea also appeared recently in something I read. I hope I find it, or that somebody will tell me, so I don’t go on thinking I was the first to think of it, because I never am.)

What do I mean by stability? When you’re using some random process to create a distribution, you have to repeat it enough times—get enough cases in the distribution—so that the distribution is stable, that is, its shape won’t change much if you continue. And when the distribution is stable, you know as much as you ever will, so you can stop collecting data.

Here’s where it arose first:

It was early in probability, so I passed out the dice. We discussed possible outcomes from rolling two dice and adding. The homework was going to be to roll the dice 50 times, record the results, and make a graph showing what happened. (We did not, however, do the theoretical thing. I wrote up the previous year’s incarnation of this activity here, and my approach to the theory here.)

But we had been doing Sonatas for Data and Brain, so I asked them, before class ended, to draw what they thought the graph would be, in as much detail as possible, and turn it in. We would compare their prediction to reality next class.

What did the student graphs look like? Kind of like the previous year’s: The most experienced and sophisticated among them remembered the theory and drew a symmetrical ziggurat centered on 7. One was even more sophisticated and drew a bell-shaped curve. Some students drew flat distributions extending from 2 to 12. A couple showed some variation; was it insight or was it sloppiness?

Of course, none of them looked like the real graphs. I know, and you know, that fifty rolls sounds like a lot to a high-school student, but it’s nowhere near enough to get something that resembles the theoretical distribution. So we had a chance in the next class to have some surprises comparing prediction to reality. For example, if you predicted that 7 would have the most, but your data had 8 with the most, what do you conclude?

We then combined all the class data (another part of the homework was to enter the data into a common pool; we used Fathom Surveys) and saw that, gosh, once we had 1000 cases, the thing looked more like a symmetrical hump.
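You can mimic the pooled class data in a few lines. Here’s a minimal Python sketch (not the Fathom workflow we actually used); the function name and the optional seed are mine, for illustration only:

```python
import random
from collections import Counter

def roll_sums(n, seed=None):
    """Roll two dice n times and tally the sums 2..12."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n))

# 50 rolls: the details are untrustworthy -- 7 is often not the peak.
print(roll_sums(50))

# 1000 rolls: the symmetrical hump centered on 7 emerges.
print(roll_sums(1000))
```

Running the 50-roll line a few times makes the point vividly: the tallies jump around from run to run in a way the 1000-roll tallies do not.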

So: even without doing the theoretical dance, we could come up with an important realization: in this situation, even with 50 examples, you can’t trust the details of your results.

When we made the Fathom simulation of two dice, students could re-randomize and see sets of 50 rolls in rapid succession, and get a strong picture of how widely the distributions varied. Then they could add 950 more pairs, and see how the distribution of 1000 rolls was more stable. It still varied, but it didn’t change nearly as much as the distributions with N = 50.
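That re-randomizing experience can even be quantified. A hedged Python sketch (again, not the Fathom version) that measures how much the proportion of 7s bounces around across 200 simulated runs, at N = 50 versus N = 1000; the names and the seed are mine:

```python
import random
import statistics

def proportion_of_sevens(n, rng):
    """One run: roll two dice n times, return the fraction of sums equal to 7."""
    return sum(rng.randint(1, 6) + rng.randint(1, 6) == 7 for _ in range(n)) / n

rng = random.Random(2024)  # arbitrary seed, just for repeatability
for n in (50, 1000):
    runs = [proportion_of_sevens(n, rng) for _ in range(200)]
    print(f"N={n}: run-to-run spread (stdev) of the proportion of 7s "
          f"is about {statistics.stdev(runs):.3f}")
```

The spread at N = 1000 comes out several times smaller than at N = 50, which is the “it still varied, but not nearly as much” observation in numbers.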

All this is good, and points out the importance of giving students experience with randomness.

But my point here is that in order to use empirical, random-based results, students need to recognize stability in order to draw conclusions. Here are some thoughts about that:

• What do we really mean by stable? The answer is partly qualitative: the shape stays kinda the same. But it can also be quantitative: stable enough for what you’re after.
• A simple example: suppose you suspect a pair of dice of being loaded for sixes. So you’re looking for the probability of rolling a 12. You need to decide how precisely you want to know that probability (±1%?) and adjust N accordingly. The key is that you can simulate it in software and look at multiple runs—which is the equivalent of increasing N manyfold—so that when you do it only once in reality (with the suspect dice) you know the (informal) margin of error.
• Later in the course, suppose you’re doing a more sophisticated analysis. You’re collecting a sampling distribution for some measure such as your estimate for the total number of German tanks. How many do you collect? Enough so the distribution looks stable (that’s the qualitative part), or, if you want to know something specific such as the percent of time you underestimate the total number, see how that varies in multiple runs.
• Suppose we’re simulating the Aunt Belinda problem. Students need a collection of 20 coins, from which they compute a measure, the number of heads. Then they collect that measure, but how many times? Enough so that the resulting [sampling] distribution is stable. But the students often get confused. How many cases do I need in this collection? How many in the measures? It may be that explicitly recognizing stability as a criterion will help them.
• Stability also impacts small-number statistics. When students did the Census project, some had situations where they had very small numbers in some groups. For example, they might have found that 75% of Armenian-Americans went to college compared with only 50% of Turkish-Americans. What do you conclude? Nothing, once you realize the sample sizes are 4 and 2. I could get it across to students by saying that the proportions are unstable. We had used the term in the simulation context, and it made sense here too. Figuring out whether this is a legitimate connection might be useful.
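The loaded-dice bullet above can be sketched in code. Assuming fair dice as the model, this Python fragment estimates P(12) at several values of N and uses the spread across 100 simulated runs as the informal margin of error; the function name and seed are my own inventions:

```python
import random
import statistics

def estimate_p12(n, rng):
    """One run: the fraction of n two-dice rolls that total 12 (fair-dice model)."""
    return sum(rng.randint(1, 6) + rng.randint(1, 6) == 12 for _ in range(n)) / n

rng = random.Random(7)  # arbitrary seed for repeatability
for n in (100, 1000, 10000):
    runs = [estimate_p12(n, rng) for _ in range(100)]
    half_width = 2 * statistics.stdev(runs)  # informal +/- 2 SD margin of error
    print(f"N={n}: estimates of P(12) vary by about +/-{half_width:.3f} "
          f"around the true 1/36 = {1/36:.3f}")
```

Reading off the output tells you which N meets your precision target: by N = 10000 the informal margin is comfortably inside ±1%, so that’s roughly how many times you’d have to roll the suspect dice for real.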
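The Aunt Belinda bullet’s two-collection confusion may be easier to see in code, where the two numbers live in different places: the 20 is fixed by the problem, while the number of measures is what stability governs. A Python sketch, with hypothetical names:

```python
import random
from collections import Counter

def heads_in_20(rng):
    """One case-level collection: flip 20 coins, return the measure (# of heads)."""
    return sum(rng.random() < 0.5 for _ in range(20))

def sampling_distribution(num_measures, rng):
    """The measures collection: repeat the 20-coin experiment num_measures times."""
    return Counter(heads_in_20(rng) for _ in range(num_measures))

rng = random.Random(0)  # arbitrary seed for repeatability
# The 20 never changes; num_measures is the knob you turn until stable.
print(sampling_distribution(100, rng))
print(sampling_distribution(5000, rng))
```

Asking “how many cases do I need?” now has two distinct answers: exactly 20 in the coin collection, and “enough to be stable” in the measures collection.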