Tangerines: Why Do Paired Tests?

It was supposed to be a lesson about interval estimates.

The Original Plan

Here’s what I did: I bought ten tangerines from the cafe, so each pair of students would have its own personal tangerine. They were going to weigh the tangerine (I swiped a balance from Chemistry, good to 0.1 grams) and write the value on the whiteboard. Then, when we had all ten, we’d enter the data and have something good for a bootstrap. We would see whose tangerines were outside the 90% interval, muse about how our impression of the mean weight of tangerines had changed since we weighed our own tangerines, and discuss how it was possible that more than 10% of the fruit was outside the 90% interval.

As usual, other activities took longer than I thought and most of the class was not ready to weigh their tangerines.

Mulan and Lisel were ready, though, so I had them weigh the tangerines and put the weights on the board. That way we could at least do the bootstrap with some actual, right-there data.

But we didn’t even get to that, so I saved the tangerines for the next day. And here’s where the wonderful thing happened.

The two girls had not only recorded the data, they had numbered the tangerines and labeled them with a Sharpie.

Opportunity Taken

So the next class, when we were ready to weigh them, I could ask the students whether they thought the weights would be the same today (Wednesday) as they had been last class (Monday). After a brief discussion, they agreed that since the tangerines would eventually become dried-out “desert tangerines,” they must already have dried out a little (a kid got to use osmosis and was proud of himself), so they would weigh a little less.

We weighed them and, sure enough, they weighed less.

Then I got to ask: “We all agree that the tangerines are lighter. What sort of statistical process could we use with these data to show that they really did get lighter—you know, show that the smaller numbers are not due to chance?”

Well. We were learning about bootstrapping and interval estimates. We had just discussed how, if two intervals don’t overlap, that’s good evidence for a “real” difference. So they went off to find the two intervals and compare them. And here they are:

90% bootstrap intervals for the mean weight of 10 tangerines on Monday (top) and Wednesday (bottom).

If you’re unfamiliar with bootstrapping, it’s an alternative way to construct interval estimates, in the spirit of confidence intervals. You repeatedly calculate the mean of a resample taken with replacement from the original data; here we put lines at the 5th and 95th percentiles of all those resampled means. This gives you a 90% interval of plausible values for the true population mean. See the bottom of this post for a video.
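If you like seeing the machinery, here is that whole recipe as a small Python sketch. (We did ours in Fathom; the function name `bootstrap_mean_interval` is just something made up for illustration.)

```python
import numpy as np

def bootstrap_mean_interval(data, n_resamples=5000, level=0.90, seed=None):
    """Percentile-bootstrap interval for the mean of `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Resample with replacement, same size as the original sample,
    # and record the mean of each resample.
    means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    # For a 90% interval, draw the lines at the 5th and 95th percentiles.
    tail = 100 * (1 - level) / 2
    return np.percentile(means, [tail, 100 - tail])
```

Feed it the Monday weights from the table at the end of the post and you should get an interval close to the one in the top graph.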

You can see that on Monday (the top) the interval goes from about 82 to 86 grams; on Wednesday it’s from about 79 to 84 grams.

The intervals definitely went down, but they have a big overlap. So, class: does it look as if the difference could be due to chance?

Uneasiness in the room. According to what we’ve been learning, we cannot rule out chance.

But, but… every single tangerine lost weight, about two grams apiece. It’s obvious that the tangerines are getting lighter. How come the bootstrap didn’t show it?

Hooray! I get to introduce a paired procedure. Because the girls had labeled the tangerines, we know which was which, and we can calculate each tangerine’s difference from Monday to Wednesday. What’s the mean difference? About 2 grams, and we can use the same bootstrap procedure to get an interval estimate for that mean:

90% bootstrap interval for the mean difference in weight between Monday and Wednesday.

Aha! The plausible values for the decrease span the range (1.93, 2.22) grams. So we happily reject a decrease of 0 as grossly implausible.
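And in code, the paired analysis is just the same machinery pointed at the differences. A minimal sketch, reusing the made-up `bootstrap_mean_interval` from above and the weights from the data table at the end of the post:

```python
import numpy as np

# Weights in grams, from the data table at the end of the post.
monday    = np.array([75.2, 83.0, 82.6, 84.7, 82.2, 80.5, 83.1, 85.2, 90.4, 88.7])
wednesday = np.array([73.0, 81.3, 80.3, 82.8, 80.4, 78.3, 81.3, 82.7, 88.5, 86.3])

# The pairing is the whole point: because the girls labeled the fruit,
# we can subtract tangerine by tangerine.
losses = monday - wednesday              # every entry is positive
print(losses.mean())                     # about 2.07 grams
print(bootstrap_mean_interval(losses))   # roughly (1.9, 2.2)
```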

Students were still uneasy. Understandably.

I went on to draw more pictures and describe a thought experiment: Okay, suppose we got 20 tangerines on Monday, weighed 10, and ate them. Then we weighed the 10 remaining ones on Wednesday. We would not be able to distinguish the two samples statistically, because of the overlap. After all, here are the original individual tangerine weights, not the bootstrapped means:

Original tangerine data. Notice how the scale is huge compared to the scale in the bootstrap graphs above.

I mean, just look at them: they could easily be drawn from the same population. It’s only when we connect them:

Same data, but with the tangerines connected to themselves.

that we see how systematic the decrease is. We can do it because we have more information: the identities of the tangerines.
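If you want to draw that connected picture yourself, here is a minimal matplotlib sketch, assuming the `monday` and `wednesday` arrays from the earlier snippet:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for m, w in zip(monday, wednesday):
    # One segment per tangerine: its Monday weight to its Wednesday weight.
    ax.plot([0, 1], [m, w], marker="o", color="tab:orange")
ax.set_xticks([0, 1])
ax.set_xticklabels(["Monday", "Wednesday"])
ax.set_ylabel("weight (g)")
plt.show()
```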

Anyway, a win-win. We got to practice the bootstrap with real data and got to do a paired analysis, which was not even something I had thought to get to in this course.

The Data

For those of us who just feel better if we can see the numbers. Weights in grams:

Mon  Weds 
75.2 73.0 
83.0 81.3 
82.6 80.3 
84.7 82.8 
82.2 80.4 
80.5 78.3 
83.1 81.3 
85.2 82.7 
90.4 88.5 
88.7 86.3

Other connections

“Can you do a paired test in Fathom, a normal, orthodox t-test, not this weird bootstrapping stuff?” Sure. Check out this post.

“I didn’t get that bootstrap thing, but it looks intriguing. Got more?” You bet. Here’s a video I did for the kids:

Author: Tim Erickson

Math-science ed freelancer and sometime math and science teacher. Currently working on various projects.

One thought on “Tangerines: Why Do Paired Tests?”

  1. Okay, this is not from me, it’s from Bob Hayden, eminent stats guru:

    This is VERY nice! I have one concern and one suggestion.

    The concern is with

    “We would see whose tangerines were outside the 90% interval, muse about how our impression of the mean weight of tangerines had changed since we weighed our own tangerines, and discuss how it was possible that more than 10% of the fruit was outside the 90% interval.”

    I think the CI is for the mean, and because it is a common student misunderstanding to interpret it as some kind of tolerance on individual observations, I hate to do anything that might encourage that.

    The suggestion is to do the sign test here, which is not an AP topic but is an easy application (and good review) of what students (in AP, anyway) learn about the binomial. Basically, for ten pairs and a null of no weight change, the chance of all ten showing a loss is (1/2)^10, or about 0.001. (Of course this is a paired test, too.)

    To which I say, good point! I actually thought about the sign test here, but chickened out.

    With regard to musing about how many of the original tangerines were outside the 90% interval for the mean: that was exactly what I meant. I wanted to confront this misconception head-on.

    How can it be (I asked them) that most of our measurements are outside the interval if it ought to enclose about 90% of them? After discussion, we can answer the question: Because that’s not what the interval is about; it’s about our expectations for the mean, not our expectations for individual measurements.

    One thing I did was to project the graph of the data (the penultimate graph in the post) on the whiteboard and draw the mean-intervals on it. Then we could see that, yeah, it looks like the mean pretty much has to be between these lines, even though many of the measurements are outside.
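    And for the record, checking Bob’s sign-test arithmetic is a one-liner (a sketch, not anything we did in class):

    ```python
    # Under the null of no weight change, each tangerine is equally likely
    # to gain or to lose, so all ten losing has probability (1/2)^10.
    print(0.5 ** 10)  # 0.0009765625, i.e. about 0.001
    ```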
