## Capture/Recapture Part One

If you’ve been awake and paying attention to stats education, you must have come across capture/recapture and associated classroom activities.

The idea is that you catch 20 fish in a lake and tag them. The next day, you catch 25 fish and note that 5 are tagged. The question is, how many fish are in the lake? The canonical answer is 100: having 5 tagged in the 25 suggests that 1/5 of all fish are tagged; if 20 fish are tagged, then the total number must be 100. Right?

Sort of. After all, we’ve made a lot of assumptions, such as that the fish instantly and perfectly mix, and that when you fish you catch a random sample of the fish in the lake. Not likely. But even supposing that were true, there must be sampling variability: if there were 20 out of 100 tagged, and you catch 25, you will not always catch 5 tagged fish; and then, looking at it the twisted, Bayesian-smelling other way, if you did catch 5, there are lots of other plausible numbers of fish there might be in the lake.

Let’s do those simulations.
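Here is one way those simulations might look in Python rather than Fathom. This is a minimal sketch, not the Fathom setup itself: the function names, the seed, the number of trials, and the list of candidate lake sizes are all assumptions made for illustration. It takes the assumptions above at face value (perfect mixing, random catches) and looks in both directions: how the tagged count varies when the lake really holds 100 fish, and how often other lake sizes would also produce exactly 5 tagged.

```python
import random
from collections import Counter

def tagged_in_catch(total_fish, tagged, catch_size, rng):
    """Catch `catch_size` fish at random, without replacement; count the tagged ones."""
    lake = [True] * tagged + [False] * (total_fish - tagged)
    return sum(rng.sample(lake, catch_size))

def simulate_catches(total_fish, tagged, catch_size, trials, seed=0):
    """Repeat the catch many times and tally how many tagged fish turn up."""
    rng = random.Random(seed)
    return Counter(
        tagged_in_catch(total_fish, tagged, catch_size, rng) for _ in range(trials)
    )

# Direction 1: the lake really has 100 fish, 20 tagged. How much does the count vary?
counts = simulate_catches(100, 20, 25, trials=5000)
for k in sorted(counts):
    print(f"{k:2d} tagged: {counts[k]}")

# Direction 2: for other (hypothetical) lake sizes, how often do we see exactly 5 tagged?
for lake_size in (60, 80, 100, 125, 150, 200):
    c = simulate_catches(lake_size, 20, 25, trials=5000)
    print(f"lake of {lake_size}: P(5 tagged) is about {c[5] / 5000:.3f}")
```

If the second loop shows several lake sizes producing 5 tagged fish about as often as 100 does, that is the sense in which 100 is only one of many plausible answers.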

## How Good is the Bootstrap?

There has been a lot of happy chatter recently about doing statistical tests using randomization, both on the APStat listserve and at the recent ICOTS9 conference. But testing is not everything inferential; estimation is the other side of that coin. In the case of randomization, the “bootstrap” is the first place we turn to make interval estimates. In the case of estimating the mean, we think of the bootstrap interval as the non-Normal equivalent of the orthodox, t-based confidence interval. (Here is a YouTube video I made about how to do a bootstrap using Fathom.) (And here is a thoughtful blog post by list newcomer Andy Pethan that prompted this.)

But Bob Hayden has recently pointed out that the bootstrap is not particularly good, especially with small samples. And Real Stats People are generally more suspicious of the bootstrap than they are of randomization (or permutation) tests.

But what do we mean by “good”?

## Stability: A Central Idea for the New Stats

If we replace the old Normal-distribution-based paradigm with a new one featuring randomization, we happily shed some peculiar ideas. For example, students will never have to remember rules of thumb about how to assess Normality or the applicability of Normal approximations.

This is wonderful news, but it’s not free. Doing things with randomization imposes its own underlying requirements. We just don’t know what they are. So we should be alert, and try to identify them.

Last year, one of them became obvious: stability.

(It also appeared recently in something I read. I hope I find it, or that somebody will tell me where it was, so I don’t think I was the first to think of it, because I never am.)

What do I mean by stability? When you’re using some random process to create a distribution, you have to repeat it enough times—get enough cases in the distribution—so that the distribution is stable, that is, its shape won’t change much if you continue. And when the distribution is stable, you know as much as you ever will, so you can stop collecting data.
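One rough way to put a number on "the shape won't change much" is sketched below. It is only an illustration under stated assumptions: the function names are made up, two-dice sums stand in for "some random process," and "change" is measured as the largest shift in any bar's relative frequency when you double the number of rolls.

```python
import random
from collections import Counter

def dice_sum_proportions(rolls):
    """Relative frequency of each possible two-dice sum (2 through 12)."""
    counts = Counter(rolls)
    return {s: counts[s] / len(rolls) for s in range(2, 13)}

def largest_shift(n, rng):
    """Roll 2n times; report the biggest change in any bar between
    the first n rolls and all 2n rolls."""
    rolls = [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(2 * n)]
    first = dice_sum_proportions(rolls[:n])
    full = dice_sum_proportions(rolls)
    return max(abs(first[s] - full[s]) for s in range(2, 13))

rng = random.Random(1)
for n in (50, 500, 5000, 50000):
    print(f"after {n} rolls, doubling moves the tallest-moving bar by "
          f"{largest_shift(n, rng):.4f}")
```

When doubling the number of cases barely moves any bar, continuing to collect tells you little that is new, which is the stopping rule described above.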

Here’s where it arose first:

It was early in probability, so I passed out the dice. We discussed possible outcomes from rolling two dice and adding. The homework was going to be to roll the dice 50 times, record the results, and make a graph showing what happened. (We did not, however, do the theoretical thing. I wrote up the previous year’s incarnation of this activity here, and my approach to the theory here.)

But we had been doing Sonatas for Data and Brain, so I asked them, before class ended, to draw what they thought the graph would be, in as much detail as possible, and turn it in. We would compare their prediction to reality next class.

## Nuts and Bolts: Estimates from a Randomization Perspective

Suppose we believe George Cobb (as I do) that it’s better to approach stats education from a randomization perspective. What are some of the details—you know, the places where the devil resides?

Let’s set aside my churning doubts after this last year and look instead at issues of mechanics.

Long ago, when Fathom began, Bill Finzer and I had a discussion in which he said something like,

If you want to simulate a die, it’s best to have students sample from a collection of {1,2,3,4,5,6} rather than use some random-number generator, because sampling is a fundamental and unifying process.

I thought he was wrong, that sampling in that case was unnecessarily cumbersome, and would confuse students—whereas using something quicker would get them more quickly and rewardingly to results.
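To make the two positions concrete, here is a small sketch in Python (not Fathom). The names `DIE`, `roll_by_sampling`, and `roll_by_generator` are invented for this illustration. Both approaches simulate a fair die; the difference is that the sampling version works unchanged on any other collection.

```python
import random
from collections import Counter

DIE = [1, 2, 3, 4, 5, 6]  # the "collection" of faces to sample from

def roll_by_sampling(n_rolls, rng, collection=DIE):
    """Bill's way: sample one case at a time, with replacement, from a collection."""
    return [rng.choice(collection) for _ in range(n_rolls)]

def roll_by_generator(n_rolls, rng):
    """The quicker way: ask a random-number generator for an integer from 1 to 6."""
    return [rng.randint(1, 6) for _ in range(n_rolls)]

rng = random.Random(0)
print(sorted(Counter(roll_by_sampling(600, rng)).items()))
print(sorted(Counter(roll_by_generator(600, rng)).items()))

# The sampling version carries over to any other collection, for example
# a hypothetical lopsided die with two 6s and no 1.
print(sorted(Counter(roll_by_sampling(600, rng, [2, 3, 4, 5, 6, 6])).items()))
```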

But things in class have made me remember this insight of Bill’s, and this is as good a place as any to record my reflections.

We have been learning about interval estimates. In my formulation, you have to choose between two procedures:

• If you’re estimating a proportion, set it up as a 2-collection simulation. In the “short-cut” version, you set the probability to the observed proportion (p-hat), collect measures, and then look at the 5th and 95th percentiles of the result. This gives you a 90% “plausibility interval” you can use as an estimate of the true probability. (If you’re a stats maven and are bursting to tell me that this is not truly a confidence interval, I know; this is a short cut that I find useful and understandable. More about this later.)
• If you’re estimating some other statistic (such as the mean), set it up as a bootstrap: sample, with replacement, from the existing data. Calculate the measure you’re interested in. Collect measures from that bootstrap sample; they will vary because you get different duplicates. Look at the 5th and 95th percentiles of the resulting distribution. This is the 90% plausibility interval for the true population mean (or whatever statistic you’ve calculated). (Details here.)
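The two procedures above can be sketched in Python as follows. This is an illustration under assumptions, not the Fathom documents themselves: the function names are invented, the data values are hypothetical, and the 5th and 95th percentiles come from `statistics.quantiles` with its default method, which may differ slightly from how Fathom computes percentiles.

```python
import random
import statistics

def percentile_interval(measures, lower=5, upper=95):
    """Return the lower and upper percentiles of a list of collected measures."""
    cuts = statistics.quantiles(measures, n=100)  # 99 cut points; cuts[k-1] is the k-th percentile
    return cuts[lower - 1], cuts[upper - 1]

def proportion_shortcut(p_hat, n, trials, rng):
    """Procedure 1: simulate n cases with probability p_hat, collect the proportion each time."""
    measures = []
    for _ in range(trials):
        successes = sum(rng.random() < p_hat for _ in range(n))
        measures.append(successes / n)
    return percentile_interval(measures)

def bootstrap_interval(data, statistic, trials, rng):
    """Procedure 2: resample the data with replacement, collect the statistic each time."""
    measures = []
    for _ in range(trials):
        resample = rng.choices(data, k=len(data))  # same size as the data, with replacement
        measures.append(statistic(resample))
    return percentile_interval(measures)

rng = random.Random(0)
# Hypothetical: 30 successes observed in 50 trials, so p-hat = 0.6.
print("proportion:", proportion_shortcut(0.6, 50, 2000, rng))
# Hypothetical data set of 12 measurements.
heights = [61, 63, 64, 64, 66, 67, 67, 68, 70, 71, 72, 75]
print("mean:", bootstrap_interval(heights, statistics.mean, 2000, rng))
```

Passing the statistic in as a function is a design choice that mirrors defining a measure: swapping `statistics.mean` for `statistics.median` changes what is estimated without touching the resampling.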

I noticed that the students who struggle the most have tended, for whatever reason, to sample when they don’t have to. Often it doesn’t affect the result, it just takes more time. But I started noticing that when they had been doing bootstrapping (mean, not proportion) and then had to do a proportion problem, they would tend to set it up as a bootstrap. It works. And doing it presents a more unified way of looking at estimates.
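Here is a sketch of what "setting a proportion problem up as a bootstrap" might look like, with hypothetical data and invented names. One reason it works: resampling with replacement from a list of 0s and 1s whose fraction of 1s is p-hat is the same thing as generating cases with probability p-hat, so the bootstrap and the short-cut produce the same kind of distribution.

```python
import random
import statistics

def bootstrap_proportion_interval(outcomes, trials, rng):
    """Bootstrap a proportion: resample the 0/1 outcomes with replacement
    and collect the proportion of 1s from each resample."""
    n = len(outcomes)
    measures = []
    for _ in range(trials):
        resample = rng.choices(outcomes, k=n)
        measures.append(sum(resample) / n)
    cuts = statistics.quantiles(measures, n=100)  # cuts[4] is the 5th percentile, cuts[94] the 95th
    return cuts[4], cuts[94]

# Hypothetical data: 30 successes (1) and 20 failures (0), so p-hat = 0.6.
outcomes = [1] * 30 + [0] * 20
print(bootstrap_proportion_interval(outcomes, 2000, random.Random(0)))
```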

## How much understanding do we want, anyway?

So. Long time no post, too many words to write elsewhere, mea culpa, mea culpa. But let me get to it:

Since forever (well, last year) I have been trying to run a (regular) statistics course using a randomization approach à la George Cobb. And let’s be clear, this is a radical randomization course.

What do I mean by that? Two things, I think:

• Randomization is not just a means of making traditional stats more approachable, but rather a substitute for traditional stats.
• If you go this way, students need to know how to construct the technological tools they use.

This is a noble goal, but it’s sure not universally working. Some students get it. But there are always some that do—despite our best efforts.

But others, well, it could be senioritis, but after a whole year with some of this class, I’m beginning to think that it’s something developmental. No traction whatsoever. It could be me; I sure spend a lot of time thinking that I suck as a teacher, and of course I make bonehead instructional choices. But if it’s not me, it may be impossible. Which grieves me because I so want this to work.

But let me take some space here to describe what I’m thinking.

If any reader doesn’t know what the heck I’m talking about, check out this old post. If you’re saying, “yeah, randomization, cool,” know that we use Fathom, everybody has it at home so I can (and do) assign homework, and we use randomization procedures exclusively. So if we want to do a hypothesis test, you invent a measure for the effect you’re seeing, figure out what the null hypothesis is, simulate it, and compare the resulting sampling distribution of the measure to the test statistic.

The beauty of this approach is supposed to be (and I still believe it) that students reallio-trulio have to face the real meaning of the stats, that whole subjunctive thing, over and over again.

The problem (referring to the two bullets above) is twofold:

• The business of constructing a null hypothesis and keeping track of the measure and the sampling distribution is just plain hard.
• Making the simulation is also hard.

And the combination is just brutal.

One possible response: do a lot of problems where the measure is always about the mean or standard deviation. Develop the whole thing as if you’re about to use the Normal distribution, but stop short of that. Give the students template documents (or write applets) so they can see the distributions they get but don’t have to actually build the machinery themselves.

Here’s a rebuttal:

Why do we want to do randomization in the first place? Because traditional frequentist Normal-distribution-based stats are limiting and create incorrect hooks in kids’ heads. They get bits of formulas stuck in there and don’t understand what’s really going on. They don’t distinguish (for example) between the distribution of the data and the sampling distribution of the mean. And this, in turn, leads to a slavish devotion to turning some crank and finding < 0.05 and nothing else.

And with technology like Fathom, they can actually do what the big dogs do: create statistics (measures) that accomplish whatever they want, and use those to describe what’s going on. This leads to what I think should be one of the main goals of high-school math education: showing students that they can use symbolic mathematics as a tool to accomplish what they want. If you just use means, you fall prey to the tyranny of the center (I still need to write that article!) and lead them to complacent orthodoxy.

But in fact students cannot all get it. The measures mechanism, as splendid as it is, is too hard to use. Most people don’t use it. There are a few fluent advanced Fathom users, but no more. Why else did Fifty Fathoms sell so well? So you can’t expect the most math-troubled, insecure students to be able to use it no matter how sweetly you cajole them.

And of course it’s not just Fathom—it’s the field. Statistical thinking is hard. Lower your expectations. A few people will get it now. A few more will in college. That will have to be enough.

In any case, this year’s experience has ground me down. I won’t be coming back next year. Maybe after some time out of the classroom I’ll be able to face it again. Any suggestions are welcome.

## Fantasy and Reality in Inference

In which he describes his approach to inference.

The null hypothesis is never true.

I guess I knew this at some level, but I never really got it till this Spring. Then it hit me that this was worth telling students. (Can I get them to discover it? Maybe.)

Let me back up a bit and approach it from the direction of Aunt Belinda.

### Aunt Belinda: A Touchstone Situation

Aunt Belinda claims to have power over flipping coins. She takes 20 nickels and throws them into the air. When they land, there are 16 heads. How should we interpret this result?

I want kids to learn to ask, “is it plausible that the coins are fair and Belinda has no special powers?” and realize that they can answer that question by flipping 20 fair coins over and over again, and seeing how often you get 16 or more heads.

Setting aside a lot of other discussion (no, she refuses to do it again) and what I hope is obvious pedagogy (the first time you see this, everybody gets 20 actual coins and has to do it a few times, chasing the rollers all over the classroom), we get Fathom to do the simulation because it saves so much time. Early on, I posted a graph showing the result of a whole lot of simulated 20-coin events, and reproduce it here.

At this point, we confront the basics of statistical inference. (These are also the bullet points in one of my learning goals, a.k.a. standards.)

• P-value. Seven out of 1000. Is it plausible? Students need to distinguish plausible from possible. Ideally, we also set some plausibility limit on this empirical P-value (which is what this 0.007 is after all) that depends on the circumstance and how willing you are to be wrong (oooh! Type I errors!). This last year, I mentioned that a lot but basically punted and explained that an orthodox reasonable value was 0.05.
• The null hypothesis is that the coins are fair, and there are no special powers. Articulating a null hypothesis is important. I began my discussion of the null by saying that it’s often the dull hypothesis: the situation when nothing of interest is going on.
• The sampling distribution is the one in the picture: repeated results from trials where the null hypothesis is true. They are not the same because random events come out differently even when the coins are fair.
• The test statistic is 16 heads out of 20. It’s what you compare to the sampling distribution to assess whether the result is plausible.
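The four pieces in the list above map onto a few lines of Python. This is a sketch with invented function names, standing in for the Fathom simulation rather than reproducing it. The exact binomial calculation at the end is an added cross-check, not part of the classroom approach; it comes to about 0.006, in line with the seven-in-a-thousand empirical figure.

```python
import random
from math import comb

def heads_in_toss(n_coins, rng):
    """One trial under the null hypothesis: toss n fair coins, count the heads."""
    return sum(rng.random() < 0.5 for _ in range(n_coins))

def empirical_p_value(observed_heads, n_coins, trials, rng):
    """Build the sampling distribution and see how often it is
    at least as extreme as the test statistic."""
    extreme = sum(
        heads_in_toss(n_coins, rng) >= observed_heads for _ in range(trials)
    )
    return extreme / trials

def exact_p_value(observed_heads, n_coins):
    """Cross-check: exact binomial probability of observed_heads or more."""
    favorable = sum(comb(n_coins, k) for k in range(observed_heads, n_coins + 1))
    return favorable / 2 ** n_coins

# Test statistic: Aunt Belinda's 16 heads out of 20.
print("empirical:", empirical_p_value(16, 20, 10000, random.Random(0)))
print("exact:    ", round(exact_p_value(16, 20), 4))
```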

We then draw a conclusion, in this case, to reject the null hypothesis. That is, we think that something (we’re not sure what) is interfering with a fair toss of the coins. And we admit that it is possible (but not plausible) that we’re wrong and the coins are in fact fair.

What goes wrong?

## What Went Right

Yikes. Another couple months. And a lot has happened: I experienced senioritis firsthand, our house has been rendered uninhabitable by a kitchen remodel (so we’re living out of suitcases in friends’ spare rooms), and my first year of actually teaching stats has drawn to a close.

It is time to reflect.

My tendency is to flagellate myself about how badly I suck, so (as suggested by Karen E, my brilliant and now former assistant head of school, we’ll miss you, Karen!) let me take a deep breath and report first on what seemed to work. Plenty of time for self-flagellation later.

### Resampling, Randomization, Simulation, and Fathom

The big overarching idea I started with—to approach inferential statistics through resampling à la George Cobb—worked for me and for at least some students. It is not obvious that you can make an entire course for these students with randomization as the background. I mean, doing this is mapping an entirely new path through the material. Is it a good path? I’m not certain, but I still basically believe it is.

To be sure, few of my students got to the point where they automatically chose the right technique at every turn. But they did the right thing a lot, and most important for me, I never had to leave them in some kind of mathematical dust where they made some calculation. For example, (and I may be wrongly proud of this) we got through an entire year of statistics without introducing the Normal distribution. This may seem so heretical to other teachers, it deserves a post of its own. Later. The point here is that no student ever was in a position of calculating NormalCDF-of-something and not understanding what it really meant.

Did they perform randomization tasks and not really understand? Sure. But when they did, they did so “closer to their data,” so they had a better chance to fix that non-understanding. They didn’t rely (for example) on the Central Limit Theorem—which, let’s face it, is a black box—to give them their results.

### Fathom and Technology

Fathom was a huge success throughout. It was great to be able to get them all the software and assign homework in Fathom. They enjoyed it, and really became quite adept at using the tool.

One big question was whether they would be able to use the “measures” mechanisms for creating their own simulations. Basically, they can. It’s a big set of skills, so not all of them can do everything we covered, but in general, they understand how to use the software to implement randomization and simulation techniques. This goes hand in glove with actually understanding what these procedures accomplish.

We also became more and more paper-free as the year went on, setting and turning in more and more assignments as pdfs. The “assignment drop box” wasn’t perfect, but it worked well enough.

### Starting SBG

I decided to try standards-based grading, at least some local version of it, in this first year. On reflection, that was pretty gutsy, but why wait? And it worked pretty well. Most importantly, students overwhelmingly approved; the overall comment was basically, “I like knowing what’s expected.” Furthermore (and this may be a function of who the kids were more than anything else, but I’ll take it) there was hardly any point-grubbing.

It is also satisfying to look over my list of 30-ish standards and see that

• They largely (but not completely) span what I care about.
• They set standards for different types of mastery, ranging from understanding concepts to using the technology to putting together coherent projects.

They need editing, and I need to reflect more about how they interact, but they are a really good start.

### Flipping and Video

At the semester break, I decided to take a stab at “Flipping the Classroom.” This was a big win, at least where I used it most—in giving students exposition about probability.

There is a lot that can go wrong with videos as instruction (the Khan brouhaha is a good example; see this Frank Noschese post for a good summary of one view) and I want to explore this more. But the basic idea really works, and the students recognized it: if it’s something you would lecture about, putting it on the video has two big plusses:

• They can stop and rewind if they don’t get it
• You can do it over until you get it the way you want. No more going back and saying, “when I said x it wasn’t quite right…”

My big worry is that if I assign videos as homework, hoping to clarify and move on in class, the lazy student may watch but will blow off thinking, assuming that they can get me to cover it again. I need to figure out a non-punitive way around that problem; or maybe it’s not so bad simply to be able to use class time for the first repetition…

### Some Cool Ideas

Besides these essentially structural things, I had some frankly terrific ideas during the year. Some I have mentioned before, but let me list just four, just as snippets to remind me what they were; later if I get to it I’ll elaborate:

• Using sand timers and stopwatches to explore variability.
• Going to the nearby freeway overpass to sample cars.
• Using the school’s library catalog to do random sampling.
• Going to the shop to make dice that were not cubes.

There were other curricular successes such as using old material from Data in Depth—particularly the Sonatas—for work during the first semester.

### Wonderful Kids

I can’t say enough about how much I appreciate the students. Again, I could do better at helping create a really positive class culture, but they did pretty damned well on their own. They got along well, took care of each other, exchanged good-natured barbs, were good group members and contributors.

Even the most checked-out seniors, already accepted into college and having reached escape velocity: they may not have worked very hard outside of class, and assignments may have slipped, but in class they were engaged and still learning. And some juniors did strong, strong work that will make writing college recs easy next year.

And I got a couple of those letters—teachers, you know the ones I mean—that make it worth the effort.

So all in all, a good year. Much to improve, yes. But it’s worth savoring what went right.

## Never mind winning the race. Are we on the right track?

It’s another crisis of confidence in stat-teacher land. It’s actually not as bad right now as it was over the last week, all thanks for the improvement be to wonderful students. But still. I feel like Ralph Rackstraw in HMS Pinafore:

…in me there meet a combination of antithetical elements which are at eternal war with one another. Driven hither by objective influences — thither by subjective emotion — wafted one moment into blazing day by mocking hope — plunged the next into the Cimmerian darkness of tangible despair, I am but a living ganglion of irreconcilable antagonisms. I hope I make myself clear…

No? To me neither. In any case, the end of the third quarter fast approaches, and we’re battering away at the Gates of Inference. Will we get inside in time to do anything with it?

I’ve been really pleased with this semester’s arc so far. Staying empirical, mostly. Starting with some hands-on probability, learning to simulate in Fathom, then building up the simulation skills while addressing increasingly realistic and relevant problems. And I like my choice of aiming for “scrambling” situations; we’re now doing randomization tests with student-constructed measures to assess group differences in settings that the students choose. They don’t know they’re called randomization tests, and we’re picking strong associations (so P is generally ≤ 0.001), so everything is obvious, but they are mostly doing them. We’ve been saying the inference-y words a lot without adding the principles to the learning goals (yet), so this is mostly mechanical—but the students are gradually getting the idea.
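For readers who have not seen a "scrambling" test, here is a minimal Python sketch of the idea. Everything specific in it is an assumption for illustration: the data are hypothetical, the group labels "A" and "B" are made up, and the difference of means is just one possible student-constructed measure.

```python
import random
import statistics

def difference_of_means(values, labels):
    """One possible measure of a group difference: mean of group A minus mean of group B."""
    a = [v for v, g in zip(values, labels) if g == "A"]
    b = [v for v, g in zip(values, labels) if g == "B"]
    return statistics.mean(a) - statistics.mean(b)

def scramble_test(values, labels, measure, trials, rng):
    """Scramble the group labels many times; report the observed measure and the
    fraction of scrambles whose measure is at least as far from zero."""
    observed = measure(values, labels)
    scrambled = list(labels)
    at_least_as_big = 0
    for _ in range(trials):
        rng.shuffle(scrambled)  # breaks any real link between group and value
        if abs(measure(values, scrambled)) >= abs(observed):
            at_least_as_big += 1
    return observed, at_least_as_big / trials

# Hypothetical data with a strong association between group and value.
values = [12, 15, 14, 16, 13, 17, 22, 25, 21, 24, 26, 23]
labels = ["A"] * 6 + ["B"] * 6
observed, p = scramble_test(values, labels, difference_of_means, 5000, random.Random(0))
print("observed difference:", observed)
print("fraction of scrambles at least as extreme:", p)
```

Because the measure is passed in as a function, a student's own invention (a difference of medians, a ratio, a count above some cutoff) drops in without changing the scrambling machinery.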

So it seems good! Robin Lock even commented! More videos got made! And I have yet to mention the Normal distribution, which I view as a very good thing. I mean, imagine: actually understanding the basics of stats without having to break out the Normal.

But then two things happened:

• I read parts of a few chapters in Workshop Statistics (Rossman, Chance, and the same Lock)
• I started realizing how much else I wanted to get to

As to the first, nothing is quite so depressing as seeing that somebody else has done a much better job of organizing a bunch of material. Of course, they take a more traditional path through this thicket (they do include the Normal; then again, theirs is a college class), but they have a dizzyingly terrific set of activities that build well on one another. So I wonder if I should have bitten the bullet, bought a set of these, and hewn closely to their curriculum instead of going my own way on this.

And as to the second, I know I want to see if I can use this randomization approach on other forms of inference, both tests and estimates. But I also want them to have time for projects and a lot else, like expected value and gambling. And, save me, but I worry how much I need to expose them to more orthodox stats approaches, so that later when they tell a professor they took stats in high school and the prof asks, “well, is this a t situation, or is this where we use chi-square?” they will actually be able to answer.

It all combines to fill me with doubts and feelings of total doofusism, that I have stupidly led these students into some box canyon where they can’t quite understand something that, if they did, would not quite be enough to get the big picture I think is so important. I have not described this well, but it’s a start. Another whole big slab of self-loathing comes from bad use of time and lousy follow-through.

Meanwhile, the scrambling video. From way last month. This one uses ScreenFlow instead of Camtasia:

## Another vid: Fathom simulations with sampling

The classic randomization procedure in Fathom has three collections:

• a “source” collection, from which you sample to make
• a “sample” collection, in which you define a statistic (a measure in Fathom-ese), which you create repeatedly, creating
• a “measures” collection, which now contains the sampling distribution (okay, an approximate sampling distribution) of the statistic you collected.

This is conceptually really difficult; but if you can do this (and understand that the thing you’re making is really the simulation of what it would be like if the effect you’re studying did not exist: the deeply subjunctive philosophy of the null hypothesis, coupled with modus tollendo tollens…much more on this later), then you can do all of basic statistical inference without ever mentioning the Normal distribution or the t statistic. Not that they’re bad, but they sow confusion, and many students cope by trying to remember recipes and acronyms.

My claim is that if you learn inference through simulation and randomization, you will wind up understanding it better because (a) it’s more immediate and (b) it unifies many statistical procedures into one: simulate the null hypothesis; create the sampling distribution; and compare your situation to that.
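As a rough Python analogue of the three-collection setup (the real thing lives in Fathom), the sketch below uses one function per collection. All names and numbers are hypothetical: a source in which 30% of cases are "yes" plays the role of the null hypothesis, each sample has 40 cases, and the observed proportion of 0.45 is invented for the comparison at the end.

```python
import random

# "Source" collection: a hypothetical population under the null, 30% "yes".
SOURCE = ["yes"] * 30 + ["no"] * 70

def draw_sample(source, size, rng):
    """The "sample" collection: cases drawn at random, with replacement, from the source."""
    return rng.choices(source, k=size)

def measure(sample):
    """The statistic (a measure, in Fathom-ese) defined on the sample."""
    return sample.count("yes") / len(sample)

def collect_measures(source, size, how_many, rng):
    """The "measures" collection: the statistic from many fresh samples,
    which approximates the sampling distribution."""
    return [measure(draw_sample(source, size, rng)) for _ in range(how_many)]

def fraction_at_least(measures, observed):
    """Compare your situation to the simulated sampling distribution."""
    return sum(m >= observed for m in measures) / len(measures)

rng = random.Random(0)
measures = collect_measures(SOURCE, size=40, how_many=5000, rng=rng)
print("fraction of simulated samples with proportion >= 0.45:",
      fraction_at_least(measures, 0.45))
```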

Ha. We’ll see. In class, we have just begun to look at these “three-collection” simulations. I made a video demonstrating the mechanics, following the one on one- and two-collection sims described in an earlier post. They are all collected on YouTube, but here is the new one.

## Making Simulations in Fathom: another scaffold, technical challenges

I’ve posted recently about “flipping the classroom,” the idea of putting the exposition—the lecturing—in little digestible vodcasts to be watched at home, (ideally) leaving more time for discussion, one-on-one work, etc., and (ideally) preventing me from nattering on and boring my students.

In that effort I made a series of vids about probability. Now we’re making simulations in Fathom, exploring empirical probability, and beginning on the road to inference. (We’re avoiding the orthodox terminology for now: don’t tell the students, but they’re simulating the conditions of the null hypothesis in order to compare the test statistics to the sampling distributions they create in the simulations. See the post about randomization.)

It’s going OK, but once you use randomness and make measures, you’re no longer in beginning Fathom. It’s conceptually harder as a whole, and the mechanics of the software inevitably ramp up in difficulty as well. So I’ve made a video that’s all about the mechanics of doing this in Fathom with one and two collections. (The three-collection case is coming…)

You wanna see it? Here it is:

Anyway, in that effort, I thought that the easy-peasy way to make the videos—using Keynote—was not sufficient. So I used Camtasia Studio, which was really fun and worked fine.

I’m looking into ScreenFlow for capture as well, and Vimeo for distribution.

Note: I had trouble for a while with getting the resolution right in YouTube. Coulda sworn that one of the Camtasia presets for YouTube was 480 x 640, but it’s 380 x 640. Text came out looking crummy, like this: