There has been a lot of happy chatter recently about doing statistical tests using randomization, both on the APStat listserv and at the recent ICOTS9 conference. But testing is not all there is to inference; estimation is the other side of that coin. In the case of randomization, the “bootstrap” is the first place we turn to make interval estimates. In the case of estimating the mean, we think of the bootstrap interval as the non-Normal equivalent of the orthodox, *t*-based confidence interval. (Here is a YouTube video I made about how to do a bootstrap using Fathom.) (And here is a thoughtful blog post by list newcomer Andy Pethan that prompted this.)

But Bob Hayden has recently pointed out that the bootstrap is not particularly good, especially with small samples. And Real Stats People are generally more suspicious of the bootstrap than they are of randomization (or permutation) tests.

But what do we mean by “good”?

One response would be to measure it as we would the traditional confidence interval (CI): how frequently does the interval capture the true mean? In particular, does it capture it at the same rate as it claims? A true 90% CI should have the property that, if you repeat the process of finding the interval many times, 90% of the intervals generated would contain the true mean.

(Notice that having to say it this way is very hard, and why we like the bootstrap procedure as a way to help students understand sampling variability and the idea of an interval estimate. But this post is about measuring the effectiveness of the bootstrap, and commenting on whether that is a good idea.)

So let’s do that. It’s probably been done many times before; I remember seeing a poster about it at ICOTS, but I can’t find the reference, as usual. Besides, it’s instructive to do it yourself. And in this case, I can’t do it in Fathom for technical reasons, so I have an excuse to (re-)learn some Python. (code here)

## Here’s the plan:

1. Generate random data with some sample size *n*. We’ll know the mean of the underlying population, but our sample won’t have that mean.
2. Compute a 90% orthodox CI for that sample, and see if it contains the true mean.
3. Compute a 90% bootstrap “plausibility” interval for the same sample, and see if it contains the true mean.
4. Repeat steps 1–3 1000 times and count up the captures.
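The orthodox interval in step 2 is the sample mean plus or minus a *t* critical value times the standard error. Here is a minimal sketch of that step, not necessarily how the linked code does it. To stay within the standard library, it assumes a small lookup table of critical values copied from a standard *t* table (with SciPy you would more likely call `scipy.stats.t.ppf`); the names `T_CRIT_90` and `orthodox_ci` are my own for illustration.

```python
import math
import random
import statistics

# Assumption: two-sided 90% critical values t(0.95, n - 1) from a standard
# t table, only for the sample sizes n used in this post.
T_CRIT_90 = {3: 2.920, 5: 2.132, 10: 1.833, 30: 1.699, 100: 1.660}


def orthodox_ci(sample):
    """Return the 90% t-based interval (low, high) for the mean of sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error uses the sample standard deviation (n - 1 denominator).
    se = statistics.stdev(sample) / math.sqrt(n)
    margin = T_CRIT_90[n] * se
    return mean - margin, mean + margin


random.seed(1)  # fixed seed so the sketch is repeatable
sample = [random.random() for _ in range(10)]  # step 1: uniform data, true mean 0.5
low, high = orthodox_ci(sample)                # step 2
print(f"90% CI: ({low:.3f}, {high:.3f}); captures 0.5? {low <= 0.5 <= high}")
```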

Step 3 is the hard part. Here it is, broken down:

**3a**: Sample with replacement *n* times from our sample to make a “bootstrap sample”

**3b**: Compute its mean.

**3c**: Repeat 3a and 3b many times (call this *m* times), collecting the means to make a “bootstrap distribution of the means.”

**3d**: Find the 5th and 95th percentiles of that distribution.

Those are the limits of our “plausibility interval.” The bootstrap distribution is a distribution of means of our samples, so it’s really a sampling distribution, and this is where it connects, conceptually, to a sampling distribution from a randomization test or even to the orthodox, Normal-based sampling distribution on which the CI is based.
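Steps 3a through 3d could be sketched as follows. This is an illustration under my own assumptions rather than the linked code: the function name `bootstrap_interval` is made up, and the percentile rule (leave the same whole number of bootstrap means in each tail) is just one of several reasonable conventions.

```python
import random
import statistics


def bootstrap_interval(sample, m=800, level=0.90):
    """Naive percentile bootstrap interval (low, high) for the mean."""
    n = len(sample)
    means = []
    for _ in range(m):                               # 3c: repeat m times
        resample = random.choices(sample, k=n)       # 3a: n draws with replacement
        means.append(statistics.fmean(resample))     # 3b: mean of the bootstrap sample
    means.sort()
    # 3d: cut off an equal number of means in each tail
    # (40 of them when m = 800 and level = 0.90).
    tail = round(m * (1 - level) / 2)
    return means[tail], means[m - 1 - tail]


random.seed(3)  # fixed seed so the sketch is repeatable
sample = [random.random() for _ in range(10)]
low, high = bootstrap_interval(sample)
print(f"90% plausibility interval: ({low:.3f}, {high:.3f})")
```

Because every bootstrap mean is an average of values drawn from the sample itself, the interval can never reach outside the range of the sample, which is one way to see why very small samples give it trouble.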

## Results: Uniform Source Distribution

Before we start, we need to specify the source distribution.

For the first test, I did the simplest thing: I used the canned random function. That is, the data were uniformly distributed on the interval [0, 1). The true mean was 0.5.

For the bootstrap, I computed *m* = 800 bootstrap samples (so there were 40 points in each tail of the sampling distribution) for each of the 1000 runs. What fraction of the runs captured the mean?

If everything is perfect, we expect to capture the mean 0.900 of the time:

n | bootstrap | orthodox |
---|---|---|
3 | 0.631 | 0.864 |
5 | 0.815 | 0.898 |
10 | 0.856 | 0.899 |
30 | 0.887 | 0.903 |
100 | 0.900* | 0.900* |
100 | 0.886 | 0.895 |

* This actually happened. That’s why I did it again for the next row.

If I did it right, Bob’s complaint is totally true: if we measure the bootstrap by how well its capture percentage matches the percentiles we use to calculate the interval, it does badly at small sample sizes, but is comparable to the orthodox CI procedure by the time you get to *n* = 100.
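For anyone who wants to try the whole experiment, here is a self-contained sketch of the capture-counting loop for the uniform source. It is a reconstruction under my own assumptions, not the linked code: the function names are invented, the *t* critical values come from a standard table rather than from SciPy, and the demo at the bottom uses fewer runs and bootstrap samples than the 1000 and 800 described above so that it finishes quickly.

```python
import math
import random
import statistics

# Assumption: two-sided 90% critical values t(0.95, n - 1) from a standard t table.
T_CRIT_90 = {3: 2.920, 5: 2.132, 10: 1.833, 30: 1.699, 100: 1.660}


def orthodox_ci(sample):
    """90% t-based interval for the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    margin = T_CRIT_90[n] * statistics.stdev(sample) / math.sqrt(n)
    return mean - margin, mean + margin


def bootstrap_interval(sample, m):
    """90% naive percentile bootstrap interval for the mean."""
    n = len(sample)
    means = sorted(
        statistics.fmean(random.choices(sample, k=n)) for _ in range(m)
    )
    tail = round(0.05 * m)  # number of bootstrap means cut from each tail
    return means[tail], means[m - 1 - tail]


def capture_rates(n, true_mean, draw=random.random, runs=1000, m=800):
    """Return (bootstrap rate, orthodox rate) of capturing true_mean."""
    boot_hits = orth_hits = 0
    for _ in range(runs):
        sample = [draw() for _ in range(n)]          # fresh sample each run
        low, high = orthodox_ci(sample)
        orth_hits += low <= true_mean <= high
        low, high = bootstrap_interval(sample, m)
        boot_hits += low <= true_mean <= high
    return boot_hits / runs, orth_hits / runs


random.seed(2014)  # fixed seed so the sketch is repeatable
boot, orth = capture_rates(n=10, true_mean=0.5, runs=300, m=200)
print(f"n = 10: bootstrap {boot:.3f}, orthodox {orth:.3f}")
```

When reading the output, remember that a capture rate is itself an estimate. With 1000 runs and a true rate near 0.9, its standard error is about √(0.9 × 0.1 / 1000) ≈ 0.0095, so two honest runs at the same *n* can easily differ by a point or two in the second decimal place, as the two *n* = 100 rows do.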

## Skewed Source

That was for a uniform distribution. What if we feed it something skewed? For the next table, I took the random number on [0, 1) and raised it to the fourth power (see the illustration for the shape). Now the interval is the same, but most of the data piles up on the left, close to zero. The true mean is 0.2.
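The true mean comes from a short integral: for a uniform number *U* on [0, 1), the expected value of *U*⁴ is the integral of *u*⁴ from 0 to 1, which is 1/5. A quick sketch to check that numerically and to see the skew (the name `skewed_draw` is just for illustration):

```python
import random
import statistics


def skewed_draw():
    """One value from the skewed source: a uniform [0, 1) number to the 4th power."""
    return random.random() ** 4


random.seed(4)  # fixed seed so the sketch is repeatable
values = [skewed_draw() for _ in range(100_000)]
mean = statistics.fmean(values)
median = statistics.median(values)
# Theory: mean is 1/5 = 0.2; median is 0.5 ** 4 = 0.0625, far below the mean.
print(f"empirical mean {mean:.3f}, empirical median {median:.3f}")
```

The median sitting so far below the mean is the skew in a nutshell: half of the data lands below 0.0625.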

We expect the orthodox procedure to do worse at small sample sizes, but improve rapidly as the CLT has its way. How does the bootstrap do?

n | bootstrap | orthodox |
---|---|---|
3 | 0.531 | 0.729 |
5 | 0.733 | 0.811 |
10 | 0.833 | 0.857 |
30 | 0.874 | 0.879 |
100 | 0.895 | 0.899 |

As you can see, both procedures have trouble at small samples, but the bootstrap catches up to the orthodox more quickly than before—between *n* = 10 and *n* = 30.

## Reflection

As one who wants to escape the tyranny of the Normal distribution and the confusion it sometimes wreaks, I’m disappointed that the bootstrap didn’t do better. At least, when the distribution is skewed, it’s not a lot worse than the traditional technique, as long as you have *n* = 10 or so.

On the other hand, I could quote myself:

For the reasons Bob outlines (or, more precisely, because I never had heard a convincing argument that [the bootstrap] turned out the same as an orthodox CI), I always try to call an interval generated by the naïve bootstrap (the only one I know how to do) a “bootstrap” interval or a “plausibility” interval. And it’s what I use with my non-AP class. I like the word “plausibility” for this, as it carries with it the qualitative idea of surprise: would you be surprised to learn that the mean were outside the interval? (APStat, Mar 17, 2014)

That is, even I never claimed that the percentages worked out correctly, but rather that we could use the technique at least to get students to think about plausible values—and maybe eventually assign ranges to answers instead of solid numbers.

Having said that, a 90% plausibility interval with *n* = 3 and a skewed source came dangerously close to missing half the means!

On yet another hand or so, if we get real about our students and their data:

- I hope they don’t draw many conclusions with *n* = 3 in any case.
- We don’t make many decisions with a value right on the edge of the interval (or with a *P*-value right near 0.05). And when we do, I hope we’re appropriately unsure; maybe we need more data to be more certain.

Wow, thank you for this, and for the code to test myself. (Still working on getting SciPy running on Windows, but I see the concept from your code.) For whatever reason I never questioned the accuracy of the simple bootstrap interval, that is, whether it captures the population mean as often as predicted. However, I am especially interested not in the result but in the process of how you test, because in the process of checking it, you demonstrate a strong conceptual understanding of what a confidence interval MEANS and why we would want to use one. The improved accuracy of all methods at a larger sample size is also important. I like all of the discussion and debate these ideas could spark in class.

Thanks! I wonder how hard it would be for students to really “get” this. It’s so meta; but I think you’re right: if they can construct a simulation to test whether a procedure does what it claims, that may be good evidence that they understand the procedure. Stay in touch! Let me know what happens!