Strom’s Credibility Criterion

Long ago, way back when Dark Matter had not yet been inferred, I attended UC Berkeley. One day, a fellow astronomy grad student mentioned Strom’s Credibility Criterion in a (possibly beer-fueled) conversation—attributed to astronomer Stephen Strom, who was at U Mass Amherst at the time.

It went something like this:

Don’t believe any set of data unless at least one data point deviates dramatically from the model.

The principle has stuck with me over the years, bubbling in the background. It rose to the surface on a recent trip to the mysterious east (Maine) to visit a physics classroom field-testing materials from a project I advise called InquirySpace.

Background

There is a great deal to say about this trip, including some really great experiences using vague questions and prediction in the classroom, but this incident is about precision, data, and habits of mind. To get there, you need some background.

Students were investigating the motion of hanging springs with weights attached. (Vague question: how fast will the spring go up and down? Answer: it depends. Follow-up: depends on what? Answers: many, including weight and ‘how far you pull it,’ i.e., amplitude.)

So we make better definitions and better questions, get the equipment, and measure. In one phase of this multi-day investigation, students studied how the amplitude affected the period of this vertical spring thing.

If you remember your high-school physics, you may recall that amplitude has no (first-order) effect (just as weight has no effect in a pendulum). So it was interesting to have students make a pre-measurement prediction (often, that the relationship would be linear and increasing) and then turn them loose to discover that there is no effect and to try to explain why.

Enter Strom, after a fashion

Let us leave the issue of how the students measured period for another post. But one very capable and conscientious group found the following periods, in seconds, for four different amplitudes:

0.8, 0.8, 0.8, 0.8

Many of my colleagues in the project were happy with this result. The students found out—and commented—that their prediction had been wrong. So the main point of the lesson was achieved. But as a data guy, I heard the echo of Stephen Strom.


Stability: A Central Idea for the New Stats

[Figure: sums from 50 pairs of dice. Not convincingly triangular. 😛]

If we replace the old Normal-distribution-based paradigm with a new one featuring randomization, we happily shed some peculiar ideas. For example, students will never have to remember rules of thumb about how to assess Normality or the applicability of Normal approximations.

This is wonderful news, but it’s not free. Doing things with randomization imposes its own underlying requirements. We just don’t know what they are. So we should be alert, and try to identify them.

Last year, one of them became obvious: stability.

(The idea also appeared recently in something I read. I hope I find it again, or that somebody will tell me where, so I don’t go on thinking I was the first to think of it, because I never am.)

What do I mean by stability? When you’re using some random process to create a distribution, you have to repeat it enough times—get enough cases in the distribution—so that the distribution is stable, that is, its shape won’t change much if you continue. And when the distribution is stable, you know as much as you ever will, so you can stop collecting data.
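
Here is a minimal sketch of that stopping rule in Python (rather than Fathom, which is what we actually use); the 0.02 threshold and the doubling schedule are arbitrary choices of mine, not anything official:

    import random
    from collections import Counter

    def proportions(counts, n):
        # turn counts of each sum (2 through 12) into proportions of n rolls
        return {s: counts.get(s, 0) / n for s in range(2, 13)}

    counts = Counter()
    n = 0
    batch = 50          # start with the homework-sized batch of rolls
    prev = None

    while True:
        for _ in range(batch):
            counts[random.randint(1, 6) + random.randint(1, 6)] += 1
        n += batch
        curr = proportions(counts, n)
        if prev is not None:
            shift = max(abs(prev[s] - curr[s]) for s in range(2, 13))
            if shift < 0.02:
                break       # the shape has (roughly) stopped changing
        prev = curr
        batch *= 2          # keep doubling until the histogram settles

    print("stable after about", n, "rolls")

Run it a few times and the n it reports bounces around; even the judgment that a distribution is stable is itself a little unstable.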

Here’s where it arose first:

It was early in probability, so I passed out the dice. We discussed possible outcomes from rolling two dice and adding. The homework was going to be to roll the dice 50 times, record the results, and make a graph showing what happened. (We did not, however, do the theoretical thing. I wrote up the previous year’s incarnation of this activity here, and my approach to the theory here.)

But we had been doing Sonatas for Data and Brain, so I asked them, before class ended, to draw what they thought the graph would be, in as much detail as possible, and turn it in. We would compare their prediction to reality next class.


Tangerines: Why Do Paired Tests?

It was supposed to be a lesson about interval estimates.

The Original Plan

Here’s what I did: I bought ten tangerines from the cafe, so each pair of students would have its own personal tangerine. They were going to weigh the tangerine (I swiped a balance from Chemistry, good to 0.1 grams) and write the value on the white board. Then, when we had all ten, we’d enter the data and have something good for a bootstrap. We would see whose tangerines were outside the 90% interval, muse about how our impression of the mean weight of tangerines had changed since we weighed our own tangerines, and discuss how it was possible that more than 10% of the fruit was outside the 90% interval.
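
For the record, here is roughly what that bootstrap looks like, sketched in Python rather than Fathom; the weights below are invented for illustration, not the class's actual tangerines:

    import random
    import statistics

    # hypothetical tangerine weights in grams (not the real data)
    weights = [86.2, 91.5, 78.9, 102.3, 95.1, 88.7, 83.4, 99.0, 90.6, 85.8]

    boot_means = []
    for _ in range(1000):
        resample = [random.choice(weights) for _ in weights]   # sample with replacement
        boot_means.append(statistics.mean(resample))

    boot_means.sort()
    low, high = boot_means[50], boot_means[949]   # roughly the 5th and 95th percentiles
    print(f"90% plausibility interval for the mean weight: {low:.1f} to {high:.1f} grams")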

As usual, other activities took longer than I thought and most of the class was not ready to weigh their tangerines.

Mulan and Lisel were ready, though, so I had them weigh the tangerines and put the weights on the board. That way we could at least do the bootstrap with some actual, right-there data.

But we didn’t even get to that, so I saved the tangerines for the next day. And here’s where the wonderful thing happened.

The two girls had not only recorded the data, they had numbered the tangerines and labeled them with a Sharpie.

Opportunity Taken

So the next class, when we were ready to weigh them, I could ask the students whether they thought the weights would be the same today (Wednesday) as they had been last class (Monday). After a brief discussion, they agreed that since eventually they would be dried-out “desert tangerines,” they would have dried out a little (a kid got to use osmosis and was proud of himself) and they would weigh a little less.
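
Here, in Python and with invented numbers (not the girls' actual measurements), is the nub of why pairing matters: the per-tangerine loss is tiny compared with the tangerine-to-tangerine spread, so only the paired differences have any hope of showing it:

    import statistics

    # invented Monday and Wednesday weights (grams) for the ten labeled tangerines
    monday =    [86.2, 91.5, 78.9, 102.3, 95.1, 88.7, 83.4, 99.0, 90.6, 85.8]
    wednesday = [85.9, 91.1, 78.6, 101.9, 94.8, 88.3, 83.1, 98.6, 90.3, 85.5]

    diffs = [m - w for m, w in zip(monday, wednesday)]   # loss for each tangerine

    print("spread of the weights themselves:", round(statistics.stdev(monday), 2), "g")
    print("spread of the paired differences:", round(statistics.stdev(diffs), 2), "g")
    print("mean loss per tangerine:", round(statistics.mean(diffs), 2), "g")

An unpaired comparison of the two days would bury a loss of a few tenths of a gram inside a spread of several grams; tangerine by tangerine, it is obvious.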


Writing Issue in Stats Projects

In the new book of student projects, I make the point that students are getting to do technical writing, but that this is not a technical writing course, so I was pretty forgiving. Mostly this was because, for this class (at this time) it would have sucked the life out of the project to insist that students improve the projects after they were turned in and graded. I did give feedback, but not all students (read: few) took the chance to revise the projects for the book.

But if I were ever to do it again, what could I insist on? Where did I think they could improve with better instruction?

Another way to put it is, how would I change the learning goals on which I assess them?

There are a couple of things I noticed about the student work that I want to record here while they’re fresh:

  • As adults, we bring a lot of what we know to bear in order to tell what’s reasonable or what’s surprising and interesting. We’re developing that in students, but they’re really just beginning. This is flagrantly true when we look at financial data—they don’t have a clue about what’s a high salary or how much it costs to rent an apartment—but their lack of experience shows up elsewhere too. This is a huge opportunity to turn our classrooms into places where they learn useful stuff.
  • As writers, they are not always very skilled in building technical arguments. The reasoning has to be tighter than (or at least different from) what they’re being trained to do in history and English. Maybe more to the point, technical writing offers more chances for internal inconsistencies, and those need to be rooted out and expunged.
  • As writers, they’re not as willing as they should be to revise. Changing direction seems to be a bigger chore than it should be given that they’re using word processing. Where we might think, no biggie, it’s only five pages, they say, OMG, it’s five whole pages!

A typical example

When we started the projects, a couple of students wanted to document Civil War deaths. (Interestingly, using Census data to make a new estimate of Civil War deaths was just in the news.) To try to do this, one of them picked a southern state and collected data from 1850, 1860, and 1870. He was expecting it to rise from 1850 to 1860 (a kind of baseline increase) and then, if not decline in 1870, at least not increase as much.

Nuts and Bolts: Estimates from a Randomization Perspective

Suppose we believe George Cobb (as I do) that it’s better to approach stats education from a randomization perspective. What are some of the details—you know, the places where the devil resides?

Let’s set aside my churning doubts after this last year and look instead at issues of mechanics.

Long ago, when Fathom began, Bill Finzer and I had a discussion in which he said something like,

If you want to simulate a die, it’s best to have students sample from a collection of {1,2,3,4,5,6} rather than use some random-number generator, because sampling is a fundamental and unifying process.

I thought he was wrong, that sampling in that case was unnecessarily cumbersome, and would confuse students—whereas using something quicker would get them more quickly and rewardingly to results.
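
In code the two options look like this (Python standing in for Fathom); computationally they are interchangeable, which is why the argument is really about which one generalizes:

    import random

    faces = [1, 2, 3, 4, 5, 6]

    roll_by_sampling = random.choice(faces)    # Bill's way: sample from a collection
    roll_by_generator = random.randint(1, 6)   # the quicker way: generate a random integer

    print(roll_by_sampling, roll_by_generator)

Either line gives a fair die; the question is which habit scales up to bootstrapping and scrambling later on.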

But things in class have made me remember this insight of Bill’s, and this is as good a place as any to record my reflections.

We have been learning about interval estimates. In my formulation, you have to choose between two procedures:

  • If you’re estimating a proportion, set it up as a 2-collection simulation. In the “short-cut” version you set the probability to the observed probability (p-hat), collect measures, and then look at the 5th and 95th percentiles of the result. This gives you a 90% “plausibility interval” you can use as an estimate of the true probability; there is a sketch of this procedure just after the list. (If you’re a stats maven and are bursting to tell me that this is not truly a confidence interval, I know; this is a short cut that I find useful and understandable. More about this later.)
  • If you’re estimating some other statistic (such as the mean), set it up as a bootstrap: sample, with replacement, from the existing data. Calculate the measure you’re interested in. Collect measures from that bootstrap sample; they will vary because you get different duplicates. Look at the 5th and 95th percentiles of the resulting distribution. This is the 90% plausibility interval for the true population mean (or whatever statistic you’ve calculated). (Details here.)
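
Here is the promised sketch of the first procedure, translated out of Fathom into Python; the sample size and observed count are invented:

    import random

    n = 50            # invented sample size
    observed = 32     # invented number of "successes" in the sample
    p_hat = observed / n

    # simulate many samples of size n, each using probability p-hat
    sim_props = []
    for _ in range(1000):
        successes = sum(1 for _ in range(n) if random.random() < p_hat)
        sim_props.append(successes / n)

    sim_props.sort()
    low, high = sim_props[50], sim_props[949]   # roughly the 5th and 95th percentiles
    print(f"90% plausibility interval for the true proportion: {low:.2f} to {high:.2f}")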

I noticed that the students who struggle the most have tended, for whatever reason, to sample when they don’t have to. Often it doesn’t affect the result, it just takes more time. But I started noticing that when they had been doing bootstrapping (mean, not proportion) and then had to do a proportion problem, they would tend to set it up as a bootstrap. It works. And doing it presents a more unified way of looking at estimates.
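
And it works because a proportion is just the mean of a column of zeros and ones, so the bootstrap recipe carries over unchanged; a quick sketch using the same invented sample as above:

    import random

    # the invented sample again, recoded as 1 = success, 0 = not
    data = [1] * 32 + [0] * 18

    boot_props = []
    for _ in range(1000):
        resample = [random.choice(data) for _ in data]   # sample with replacement
        boot_props.append(sum(resample) / len(resample))

    boot_props.sort()
    print("90% plausibility interval:", boot_props[50], "to", boot_props[949])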


How much understanding do we want, anyway?

So. Long time no post, too many words to write elsewhere, mea culpa, mea culpa. But let me get to it:

Since forever (well, last year) I have been trying to run a (regular) statistics course using a randomization approach à la George Cobb. And let’s be clear, this is a radical randomization course.

What do I mean by that? Two things, I think:

  • Randomization is not just a means of making traditional stats more approachable, but rather a substitute for traditional stats.
  • If you go this way, students need to know how to construct the technological tools they use.

This is a noble goal, but it’s sure not universally working. Some students get it; then again, there are always some who do, despite our best efforts.

But others, well, it could be senioritis, but after a whole year with some of this class, I’m beginning to think that it’s something developmental. No traction whatsoever. It could be me; I sure spend a lot of time thinking that I suck as a teacher, and of course I make bonehead instructional choices. But if it’s not me, it may be impossible. Which grieves me because I so want this to work.

But let me take some space here to describe what I’m thinking.

If any reader doesn’t know what the heck I’m talking about, check out this old post. If you’re saying, “yeah, randomization, cool,” know that we use Fathom, everybody has it at home so I can (and do) assign homework, and we use randomization procedures exclusively. So if we want to do a hypothesis test, you invent a measure for the effect you’re seeing, figure out what the null hypothesis is, simulate it, and compare the resulting sampling distribution of the measure to the test statistic.
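
Stripped of the Fathom machinery, the whole ritual fits in a few lines of Python; the example is invented (17 heads in 20 flips, null hypothesis: the coin is fair, measure: the number of heads):

    import random

    trials = 20
    observed = 17     # invented test statistic: heads actually seen in 20 flips

    # simulate the null hypothesis (a fair coin) many times
    null_measures = []
    for _ in range(5000):
        heads = sum(1 for _ in range(trials) if random.random() < 0.5)
        null_measures.append(heads)

    # compare the sampling distribution of the measure to the test statistic
    as_extreme = sum(1 for h in null_measures if h >= observed)
    print("estimated P(17 or more heads under the null):", as_extreme / 5000)

The code is the easy part; deciding what the null is and what measure captures the effect is where the real work lives.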

The beauty of this approach is supposed to be (and I still believe it) that students reallio-trulio have to face the real meaning of the stats, that whole subjunctive thing, over and over again.

The problem (referring to the two bullets above) is twofold:

  • The business of constructing a null hypothesis and keeping track of the measure and the sampling distribution is just plain hard.
  • Making the simulation is also hard.

And the combination is just brutal.

A less-radical approach might be:

Do a lot of problems where the measure is always about the mean or standard deviation. Develop the whole thing as if you’re about to use the Normal distribution. But stop short of that. Give the students template documents (or write applets) so they can see the distributions they get but don’t have to actually build the machinery themselves.

Here’s a rebuttal:

Why do we want to do randomization in the first place? Because traditional frequentist Normal-distribution-based stats are limiting and create incorrect hooks in kids’ heads. They get bits of formulas stuck in there and don’t understand what’s really going on. They don’t distinguish (for example) between the distribution of the data and the sampling distribution of the mean. And this, in turn, leads to a slavish devotion to turning some crank and finding < 0.05 and nothing else.

And with technology like Fathom, they can actually do what the big dogs do: create statistics (measures) that accomplish whatever they want, and use those to describe what’s going on. This leads to what I think should be one of the main goals of high-school math education: showing students that they can use symbolic mathematics as a tool to accomplish what they want. If you just use means, you fall prey to the tyranny of the center (I still need to write that article!) and lead them to complacent orthodoxy.

Which leads to the re-rebuttal:

But in fact students cannot all get it. The measures mechanism, as splendid as it is, is too hard to use. Most people don’t use it. There are a few fluent advanced Fathom users, but no more. Why else did Fifty Fathoms sell so well? So you can’t expect the most math-troubled, insecure students to be able to use it no matter how sweetly you cajole them.

And of course it’s not just Fathom—it’s the field. Statistical thinking is hard. Lower your expectations. A few people will get it now. A few more will in college. That will have to be enough.

In any case, this year’s experience has ground me down. I won’t be coming back next year. Maybe after some time out of the classroom I’ll be able to face it again. Any suggestions are welcome.

Whooosh, and another semester zooms by

Crikey, I sure haven’t posted!

Some news and reflection:

This last semester in regular Statistics is the first time I have taught a class for the second time. I went back and forth the whole time in my off-the-cuff assessment of whether it was any better. Easier, sure: I definitely was able to use material from last year. But did it take any less time? No. And most important, was I any better at paying attention to the students? There I’m not sure; I really felt, a lot of the time, that I was neglecting the most experienced students while I helped those who were having the most trouble. Or, more generally, that despite getting better as a teacher, there are many ways in which I suck. This is not as self-flagellatory as it sounds (as Walter Brennan used to say, “no brag, just fact”) but rather the continuing realization that this path requires continuing improvement.

All of which is alleviated somewhat by seeing the student work, especially on the semester projects. Everybody can improve, of course, but they were, in general, kind of wonderful.


What Went Right

Yikes. Another couple months. And a lot has happened: I experienced senioritis firsthand, our house has been rendered uninhabitable by a kitchen remodel (so we’re living out of suitcases in friends’ spare rooms), and my first year of actually teaching stats has drawn to a close.

It is time to reflect.

My tendency is to flagellate myself about how badly I suck, so (as suggested by Karen E, my brilliant and now former assistant head of school, we’ll miss you, Karen!) let me take a deep breath and report first on what seemed to work. Plenty of time for self-flagellation later.

Resampling, Randomization, Simulation, and Fathom

The big overarching idea I started with—to approach inferential statistics through resampling à la George Cobb—worked for me and for at least some students. It is not obvious that you can make an entire course for these students with randomization as the background. I mean, doing this is mapping an entirely new path through the material. Is it a good path? I’m not certain, but I still basically believe it is.

To be sure, few of my students got to the point where they automatically chose the right technique at every turn. But they did the right thing a lot, and most important for me, I never had to leave them in some kind of mathematical dust where they were making calculations they didn’t understand. For example (and I may be wrongly proud of this), we got through an entire year of statistics without introducing the Normal distribution. This may seem so heretical to other teachers that it deserves a post of its own. Later. The point here is that no student was ever in a position of calculating NormalCDF-of-something and not understanding what it really meant.

Did they perform randomization tasks and not really understand? Sure. But when they did, they did so “closer to their data,” so they had a better chance to fix that non-understanding. They didn’t rely (for example) on the Central Limit Theorem—which, let’s face it, is a black box—to give them their results.

Fathom and Technology

Fathom was a huge success throughout. It was great to be able to get them all the software and assign homework in Fathom. They enjoyed it, and really became quite adept at using the tool.

One big question was whether they would be able to use the “measures” mechanisms for creating their own simulations. Basically, they can. It’s a big set of skills, so not all of them can do everything we covered, but in general, they understand how to use the software to implement randomization and simulation techniques. This goes hand in glove with actually understanding what these procedures accomplish.

We also became more and more paper-free as the year went on, assigning and turning in more and more work as PDFs. The “assignment drop box” wasn’t perfect, but it worked well enough.

Starting SBG

I decided to try standards-based grading, at least some local version of it, in this first year. On reflection, that was pretty gutsy, but why wait? And it worked pretty well. Most importantly, students overwhelmingly approved; the overall comment was basically, “I like knowing what’s expected.” Furthermore—and this may be a function of who the kids were more than anything else, but I’ll take it—there was hardly any point-grubbing.

It is also satisfying to look over my list of 30-ish standards and see that

  • They largely (but not completely) span what I care about.
  • They set standards for different types of mastery, ranging from understanding concepts to using the technology to putting together coherent projects.

They need editing, and I need to reflect more about how they interact, but they are a really good start.

Flipping and Video

At the semester break, I decided to take a stab at “Flipping the Classroom.” This was a big win, at least where I used it most—in giving students exposition about probability.

There is a lot that can go wrong with videos as instruction (the Khan brouhaha is a good example; see this Frank Noschese post for a good summary of one view) and I want to explore this more. But the basic idea really works, and the students recognized it: if it’s something you would lecture about, putting it on the video has two big plusses:

  • They can stop and rewind if they don’t get it.
  • You can do it over till you get it the way you want. No more going back and saying, “when I said x it wasn’t quite right…”

My big worry is that if I assign videos as homework, hoping to clarify and move on in class, the lazy student may watch but will blow off thinking, assuming that they can get me to cover it again. I need to figure out a non-punitive way around that problem; or maybe it’s not so bad simply to be able to use class time for the first repetition…

Some Cool Ideas

Besides these essentially structural things, I had some frankly terrific ideas during the year. Some I have mentioned before, but let me list just four, just as snippets to remind me what they were; later, if I get to it, I’ll elaborate:

  • Using sand timers and stopwatches to explore variability.
  • Going to the nearby freeway overpass to sample cars.
  • Using the school’s library catalog to do random sampling.
  • Going to the shop to make dice that were not cubes.

There were other curricular successes such as using old material from Data in Depth—particularly the Sonatas—for work during the first semester.

Wonderful Kids

I can’t say enough about how much I appreciate the students. Again, I could do better at helping create a really positive class culture, but they did pretty damned well on their own. They got along well, took care of each other, exchanged good-natured barbs, were good group members and contributors.

Even the most checked-out seniors, already accepted into college and having reached escape velocity: they may not have worked very hard outside of class, and assignments may have slipped, but in class they were engaged and still learning. And some juniors did strong, strong work that will make writing college recs easy next year.

And I got a couple of those letters—teachers, you know the ones I mean—that make it worth the effort.

So all in all, a good year. Much to improve, yes. But it’s worth savoring what went right.

Never mind winning the race. Are we on the right track?

It’s another crisis of confidence in stat-teacher land. It’s actually not as bad right now as it was over the last week, all thanks for the improvement be to wonderful students. But still. I feel like Ralph Rackstraw in HMS Pinafore:

…in me there meet a combination of antithetical elements which are at eternal war with one another. Driven hither by objective influences — thither by subjective emotion — wafted one moment into blazing day by mocking hope — plunged the next into the Cimmerian darkness of tangible despair, I am but a living ganglion of irreconcilable antagonisms. I hope I make myself clear…

No? To me neither. In any case, the end of the third quarter fast approaches, and we’re battering away at the Gates of Inference. Will we get inside in time to do anything with it?

I’ve been really pleased with this semester’s arc so far. Staying empirical, mostly. Starting with some hands-on probability, learning to simulate in Fathom, then building up the simulation skills while addressing increasingly realistic and relevant problems. And I like my choice of aiming for “scrambling” situations; we’re now doing randomization tests with student-constructed measures to assess group differences in settings that the students choose. They don’t know they’re called randomization tests, and we’re picking strong associations (so P is generally ≤ 0.001), so everything is obvious, but they are mostly doing them. We’ve been saying the inference-y words a lot without adding the principles to the learning goals (yet), so this is mostly mechanical—but the students are gradually getting the idea.
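
For anyone who hasn't met a scrambling (randomization) test, here is a bare-bones Python sketch with made-up data; the measure is the difference of group means, though in class the students invent their own measures:

    import random
    import statistics

    # made-up responses for two groups the students might compare
    group_a = [12.1, 14.3, 15.0, 13.8, 16.2, 14.9]
    group_b = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1]

    def measure(a, b):
        # the "measure" of the effect: here, difference of means
        return statistics.mean(a) - statistics.mean(b)

    observed = measure(group_a, group_b)

    pooled = group_a + group_b
    count_as_extreme = 0
    for _ in range(2000):
        random.shuffle(pooled)                     # scramble the group labels
        scrambled = measure(pooled[:len(group_a)], pooled[len(group_a):])
        if scrambled >= observed:
            count_as_extreme += 1

    print("observed measure:", round(observed, 2))
    print("estimated P under scrambling:", count_as_extreme / 2000)

With made-up groups this far apart, the count comes out at or near zero, which matches the class experience: we pick strong associations, so everything is obvious.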

So it seems good! Robin Lock even commented! More videos got made! And I have yet to mention the Normal distribution, which I view as a very good thing. I mean, imagine: actually understanding the basics of stats without having to break out the Normal.

But then two things happened:

  • I read parts of a few chapters in Workshop Statistics (Rossman, Chance, and the same Lock)
  • I started realizing how much else I wanted to get to

As to the first, nothing is quite so depressing as seeing that somebody else has done a much better job of organizing a bunch of material. Of course, they take a more traditional path through this thicket (they do include the Normal; then again, theirs is a college class), but they have a dizzyingly terrific set of activities that build well on one another. So I wonder if I should have bitten the bullet and bought a set of these—and hewn closely to their curriculum instead of going my own way on this.

And as to the second, I know I want to see if I can use this randomization approach on other forms of inference, both tests and estimates. But I also want them to have time for projects and a lot else, like expected value and gambling. And, save me, but I worry how much I need to expose them to more orthodox stats approaches, so that later when they tell a professor they took stats in high school and the prof asks, “well, is this a t situation, or is this where we use chi-square?” they will actually be able to answer.

It all combines to fill me with doubts and feelings of total doofusism, that I have stupidly led these students into some box canyon where they can’t quite understand something that, if they did, would not quite be enough to get the big picture I think is so important. I have not described this well, but it’s a start. Another whole big slab of self-loathing comes from bad use of time and lousy follow-through.

Meanwhile, the scrambling video. From way last month. This one uses ScreenFlow instead of Camtasia:

[Embedded video: the scrambling screencast]

Analyzing Two Dice

Back at the beginning of the semester, I said that I wanted kids to get a picture in their heads about adding two dice. Here are some (belated) results.

On the first quiz, way back in January, I asked this question:

Aloysius says, “if you roll two dice and add, the chance that you get an even number is P(even) = 0.5 because half of all whole numbers are even.”

Use an area model to help show that he’s right that P = 0.5, but explain why his reasoning is wrong.

I expected students to add up the various probabilities for sums of 2, 4, 6, 8, 10, and 12 (1/36 + 3/36 + 5/36 + 5/36 + 3/36 + 1/36), get 18/36, and say it was 1/2. How silly I was. They were much more creative.

I also hoped they would say that the reasoning was bad because the numbers are not equally likely. How silly I was.
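
For what it's worth, a brute-force check of both points (the 18/36, and the fact that the eleven sums are nowhere near equally likely) is just the 36-cell area model written as Python:

    from collections import Counter
    from fractions import Fraction

    # all 36 equally likely combinations of two dice
    sums = Counter(a + b for a in range(1, 7) for b in range(1, 7))

    even_count = sum(count for s, count in sums.items() if s % 2 == 0)
    print("P(even sum) =", Fraction(even_count, 36))   # 1/2, as Aloysius claimed
    print(dict(sums))   # but the individual sums are far from equally likely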

Right Answers

Anyhow, some diagrams. First, the most popular. I hadn’t anticipated the “checkerboard” aspect of the area diagram and how easy it would be to see that half of the (equally-likely) combinations were even. It also tempts one to make an analogous problem with five-sided dice:

[Figure: student one’s diagram, the “checkerboard” response.]
[Figure: student one’s legend and explanation for the “checkerboard” response.]

But then, several students used this kind of diagram, which is kind of brilliant and totally unexpected (by me at least):

[Figure: a different approach; this student reasons about the sums of even and odd numbers rather than from the canonical diagram.]

Wrong Answers

Of course, there were still islands of trouble, for example:

[Figure: a third student’s response; this student does not get how the area model works despite having watched the videos.]

We also have this one; where it came from, no one (least of all the student) is sure. I include it for all you teachers out there who will nod and say, “yup, you never know what you’re gonna get.”

[Figure: the strangest response, an unusual area model. No one knows why the student used the sizes in the diagram.]

What do we make of this? The best thing is that the student at least knew that something was wrong with the diagram, and owned up to it on the paper. This is something I have been asking them to do, and I get it really seldom.

But then, the denominator 441 is in fact the number of little squares in the diagram (21 x 21), but of course 220.5 of them are not colored in; that’s just the number you’d have to have to make the probability 0.5.

So I have an assessment problem. I’m reasonably convinced that the right answers show some level of understanding, but I can’t really tell, from the wrong answers, what’s going wrong.

Parte Deux: Why Aloysius was Wrong

Although most of the diagrams were good, most of the responses to why Aloysius’s reasoning was bad were not. Here’s one that makes me doubt myself to the core:

His reasoning is wrong because you never know what you’re going to roll but you do know that 50% of the possible sums are even but not that you’ll roll an even number/sum 50% of the time.

Here’s one that shows a good observation (but still doesn’t complete the catch):

Aloysius’s reasoning is wrong because in the spread of #s (sums) that you can get, which is 2–12, there are more even #s. What he should have said was that

I don’t know

Then an attempt to use the vocabulary:

The probability of rolling two dice and having an even sum is mutually exclusive…

There were also a handful of non-responses, a handful of good ones, and the rest something like the above. Many were enough longer that I will spare you reading them.

Anyhow, I really like the question in principle, so what should I do about this? On the second quiz, I put in another Aloysius question—I gave them a Fathom simulation with lots of mistakes, and asked them to identify the mistakes and fix them. I think that went better. I will also be insisting on corrections in order to do re-takes to improve the scores on the corresponding learning goals. That will at least force them to confront what went on and think about it again.

But now it’s time for dinner.