Was the run-up to the recent election an example of failed statistics? Pundits have been saying how bad the polling was. Sure, there might have been some things pollsters could have done better, but consider: FiveThirtyEight, on the morning of the election, gave Trump a 28.6% chance of winning.

And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen *all the time*.

This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.

In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.

(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still *ahead*, right?”)

Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or *P*-value.

My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are *both* ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.

Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.

Okay, Ph.D. students. That’s a good nugget for a dissertation.

Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:

- Do you take an umbrella?
- If it doesn’t rain, do you think, “the prediction was wrong?”

---

(Also, get a micrometer on eBay and a sweet 0.1 gram food scale. They’re about $15 now.)

Long ago, I wrote about coins and said I would write about hexnuts. I wrote a book chapter, but never did the post. So here we go. What prompted me was thinking about different kinds of models.

I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?

A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The *more-or-less cubic* part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge: the hexnuts are geometrically similar, and they’re made from the same material, so they all have the same density.

Most students, however, won’t have that instant insight. So we begin by asking them to predict. They will draw graphs that increase, which is a good start, and many student graphs will curve upwards. Great! Then when they measure, students see something like the graph. (Here is a Desmos document with the data.)

They will look at this graph, and many will say it looks like a parabola. Which it does. Kinda. But if you put a quadratic of the form *y* = *kx*² on the graph, you will never get it to fit well, no matter what value of *k* you choose.

We’re doing this quickly, so let’s skip ahead to say that when you try *y* = *kx*³, things go much better (though they aren’t perfect). Then you can explore why you think it “goes like” *x* cubed. You might get there by discussing the quarter-inch nut and the half-inch nut, and pass out samples, and ask, why isn’t the half-inch *twice* the weight of the quarter-inch. And so forth.
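If you’d rather check this outside of Desmos, here is a quick sketch in Python. It fits the one-parameter models *y* = *kx*² and *y* = *kx*³ by least squares (the closed-form solution for a one-parameter power model) to the measurements that appear in the table later in this post:

```python
# Bolt sizes (inches) and masses (grams) from the table later in the post.
sizes = [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75]
masses = [3.09, 4.72, 6.73, 12.77, 15.85, 31.06, 48.8]

def fit_power(xs, ys, p):
    """Least-squares k for the one-parameter model y = k * x**p."""
    k = sum(x ** p * y for x, y in zip(xs, ys)) / sum(x ** (2 * p) for x in xs)
    sse = sum((y - k * x ** p) ** 2 for x, y in zip(xs, ys))
    return k, sse

k2, sse2 = fit_power(sizes, masses, 2)  # the parabola
k3, sse3 = fit_power(sizes, masses, 3)  # the cubic
print(f"quadratic: k = {k2:.1f}, SSE = {sse2:.1f}")
print(f"cubic:     k = {k3:.1f}, SSE = {sse3:.1f}")  # much smaller residuals
```

However you choose *k*, the cubic’s residuals come out far smaller than the parabola’s.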

Now the discussion can go in two interesting and different directions. One goes to the wonderful snook data where you can do a similar exploration about fish, and (especially if you do a log transform) you discover that the best exponent for these fish is about 3.24 rather than 3.00. And you gotta wonder how that could be.

But we’re not going there. Instead we start with a great question that astute readers must have asked, but we have avoided until now: what do we mean by the *size* of a hexnut? The answer is, the diameter of the bolt that it fits. Right? A quarter-inch nut is the one that fits a quarter-inch bolt. It’s bigger (duh) than a quarter inch across.

So let’s consider modeling the geometry of the situation. A hexnut is a hexagonal prism, right? And it has a circular hole.

We can measure more than the weight. Suppose we measure the distance across the faces (*f*) and the thickness (*t*) as well. Then the area of the relevant hexagon is (√3/2)*f*², and the resulting volume of the solid—taking the hole into account—is

*V* = [(√3/2)*f*² − (π/4)*d*²] · *t*,

where *d* is the diameter of the hole, that is, the bolt size.

So we can model the mass with *m* = ρ*V*, where ρ is the density of the material. The next graph, from Fathom, shows this calculated model mass against the measured mass, along with a line and a residual plot. We have already slid the **rho** slider to a decent value.

I have positioned the slider so the first few points are flat. When I do that, they are not centered at zero. Also, the larger nuts deviate from the line more and more. This is an indication that there are systematic effects that our model doesn’t account for. I leave it to you to think about what those might be!

The value of rho? For this graph, 8.22 grams per cc, which is not bad (just a little high) for the density of steel.

And the modeling point? If you agree with my earlier assertion that simplification and abstraction are the hallmarks of modeling, then this is modeling even though we didn’t make a fancy function. We assumed that the nut is truly shaped like a hexagonal prism with a cylindrical hole cut in it. That’s patently false, but it’s a completely decent approximation, and much easier to deal with than the ugly, messy truth. What’s more, we can use the ways reality deviates from the simple, abstract model to decide whether our model values might be a little high or low.

Here are the data from that graph. Lengths are inches, mass is in grams:

| boltSize | flat   | thick  | mass  |
|----------|--------|--------|-------|
| 0.25     | 0.4298 | 0.2148 | 3.09  |
| 0.3125   | 0.4951 | 0.2651 | 4.72  |
| 0.375    | 0.5555 | 0.3236 | 6.73  |
| 0.4375   | 0.6807 | 0.3802 | 12.77 |
| 0.5      | 0.7407 | 0.4253 | 15.85 |
| 0.625    | 0.9223 | 0.5450 | 31.06 |
| 0.75     | 1.0920 | 0.6337 | 48.8  |
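You can run the geometric model against these numbers yourself. Here is a sketch in Python; it assumes the hole diameter equals the nominal bolt size (a simplification) and converts cubic inches to cubic centimeters:

```python
import math

# boltSize (in), flat (in), thick (in), mass (g) — the table above.
rows = [
    (0.25,   0.4298, 0.2148,  3.09),
    (0.3125, 0.4951, 0.2651,  4.72),
    (0.375,  0.5555, 0.3236,  6.73),
    (0.4375, 0.6807, 0.3802, 12.77),
    (0.5,    0.7407, 0.4253, 15.85),
    (0.625,  0.9223, 0.5450, 31.06),
    (0.75,   1.0920, 0.6337, 48.8),
]

CC_PER_CUBIC_INCH = 16.387

densities = []
for bolt, flat, thick, mass in rows:
    hexagon = (math.sqrt(3) / 2) * flat ** 2  # hexagon area from across-flats distance
    hole = math.pi * bolt ** 2 / 4            # assume hole diameter = nominal bolt size
    volume_cc = (hexagon - hole) * thick * CC_PER_CUBIC_INCH
    densities.append(mass / volume_cc)        # grams per cubic centimeter

print([round(rho, 2) for rho in densities])   # each one close to the density of steel
```

Every nut comes out near 8 grams per cc, consistent with the slider value above.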

There is a lot more to say, but hey, it’s been almost four hours, so it’s way past time to stop.

---

The Data and Story Library, originally hosted at Carnegie-Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.

Here is the link. If you teach stats, make a bookmark: http://dasl.datadesk.com/

The site includes scores of data sets organized by content topic (e.g., sports, the environment) and by statistical technique (e.g., linear regression, ANOVA). It also includes famous data sets such as Hubble’s data on the radial velocity of distant galaxies.

One small hitch for Fathom users:

In the old days of DASL, you would simply drag the URL mini-icon from the browser’s address field into the Fathom document and amaze your friends with how Fathom parsed the page and converted the table of data on the web page into a table in Fathom. Ah, progress! The snazzy new and more sophisticated format for DASL puts the data inside a scrollable field — and as a result, the drag gesture no longer works in DASL.

Fear not, though: @gasstationwithoutpumps (comment below) realized you could drag the download *button* directly into Fathom. Here is a picture of a button on a typical DASL “datafile” page. Just drag it over your Fathom document and drop:

In addition, here are two workarounds:

**Plan A:**

- Place your cursor in that scrollable box. Select All. Copy.
- Switch to Fathom. Create a new, empty collection by dragging the collection icon off the shelf.
- With that empty collection selected, Paste. Done!

**Plan B:**

- Use their Download button to download the .txt file.
- Drag that file into your Fathom document.

Note: Plan B works for CODAP as well.

---

So. This is a book of 42 activities that connect geometry to functions through data. There are a lot of different ways to describe it, and in the course of finishing the book, the emotional roller-coaster took me from great pride in what a great idea this was to despair over how incredibly stupid I’ve been.

I’m obviously too close to the project.

For an idea of what drove some of the book, check out the posts on the “Chord Star.”

But you can also see the basic idea in the book cover. See the spiral made of triangles? Imagine measuring the hypotenuses of those triangles, and plotting the lengths as a function of “triangle number.” That’s the graph you see. What’s a good function for modeling that data?

If we’re experienced in these things, we say, oh, it’s exponential, and the base is the square root of 2. But if we’re less experienced, there are a lot of connections to be made.

We might think it *looks* exponential, and use sliders to fit a curve (for example, in Desmos or Fathom. Here is a Desmos document with the data you can play with!) and discover that the base is close to 1.4. Why should it be 1.4? Maybe we notice that if we skip a triangle, the size seems to double. And that might lead us to think that 2 is involved, and gradually work it out that root 2 will help.
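The doubling observation is easy to check numerically. Here is a tiny sketch with a made-up starting length (the actual lengths on the cover don’t matter; only the ratios do):

```python
import math

# Hypothetical spiral: each hypotenuse is the previous one times sqrt(2).
h = [1.0]
for _ in range(10):
    h.append(h[-1] * math.sqrt(2))

print([round(x, 3) for x in h[:5]])
print(h[2] / h[0], h[4] / h[2])  # skip a triangle and the length roughly doubles
```

If the ratio between consecutive hypotenuses is √2, then skipping a triangle multiplies by (√2)² = 2, which is exactly the “doubling” students notice.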

Or we might start geometrically, and reason about similar triangles. And from there gradually come to realize that the *a*/*b* = *c*/*d* trope we’ve used for years, in this situation, leads to an exponential function, which doesn’t look at all like setting up a proportion.

In either case, we get to make new connections about parts of math we’ve been learning about, and we get to see that (a) you can find functions that fit data and (b) often, there’s a good, underlying, understandable *reason* why that function is the one that works.

I will gradually enhance the pages on the eeps site to give more examples. And of course you can buy the book on Amazon! Just click the cover image above.

---

Suppose we take our data and instead of looking at all the individual times, count how many people pass me every 10 seconds? Then I’ll have one number for every 10 seconds; what will the distribution of those numbers look like?

Here is that distribution for the 58 ten-second bins. This is “p” data only, that is, people walking right to left. Anticipating what we will need, we’ll also show the same distribution for randomly-generated data, and plot the means. We also plot the variances, which is weird, but lets us see the numerical values:

The means are identical, which is good, because we’ve intentionally made the random data to have the same number of cases over the same period of time.

But the two distributions look different, in the way we expect if we have clumping: the real data has more of the heavily-populated bins (6, 7, 8) where the clumps are, and more of the lightly-populated bins (0, 1) where the longer gaps are. This is less dramatic than it was with the stars. But we have much less data, and the data are real, which always makes things harder.

At any rate, more light bins and more heavy bins means that it makes sense to characterize clumpiness by using a measure of *spread*; if we pick variance as we did before (following the star reasoning and the lead from Neyman and Scott), we see that the real data have a variance of 3.25, which is much bigger than the mean of 2.17. The random data have a variance of 1.86, which is relatively close to the mean—and what we expect, that is, they should be Poisson distributed, though you don’t need to know that to do this analysis.

Of course, the random data will be different every time we re-randomize. Here are 100 variances for 100 runs of random data, with the “real” data’s variance plotted as well:

Yay! Does a variance of 3.25 give us evidence that the real data are not random? Yes. We can’t tell from this alone that it is because of clumping, but it is in the direction consistent with clumping.

We can do as we did before, and define an index of clumpiness the same way: the variance of the bin counts divided by the mean. For these data, and this bin size, we get 3.246/2.172, or about 1.5.
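If you want to play with this index yourself, here is a sketch in Python with made-up data at roughly the scale of the airport numbers (580 seconds, 126 pedestrians, 10-second bins — my stand-ins, not the real data):

```python
import random

random.seed(1)

def clumpiness_index(times, total_time, bin_size):
    """Variance of the bin counts divided by their mean (Poisson data gives about 1)."""
    n_bins = int(total_time // bin_size)
    counts = [0] * n_bins
    for t in times:
        counts[min(int(t // bin_size), n_bins - 1)] += 1
    mean = sum(counts) / n_bins
    variance = sum((c - mean) ** 2 for c in counts) / n_bins
    return variance / mean

T, N = 580, 126  # my stand-ins for the length of the watch and the number of pedestrians

uniform = [random.uniform(0, T) for _ in range(N)]

# Clumped data: pedestrians arrive in groups of 1-3, spread about a second apart.
clumped = []
while len(clumped) < N:
    center = random.uniform(0, T)
    for _ in range(random.choice([1, 2, 3])):
        clumped.append(min(max(center + random.gauss(0, 1), 0), T - 0.01))
clumped = clumped[:N]

i_uniform = clumpiness_index(uniform, T, 10)
i_clumped = clumpiness_index(clumped, T, 10)
print(round(i_uniform, 2), round(i_clumped, 2))  # the clumped index is larger
```

The uniform data land near 1, as Poisson reasoning predicts; the grouped data push the index well above 1.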

Can we apply this idea to the made-up heads and tails that we started this whole sequence with in our first post about clumpiness? Of course. Next post!

Gasstation… commented wisely that it may matter what your cell size is. You bet. It’s the same issue as when you make a histogram, or do any gridding or binning; the size you use may create or obscure patterns inadvertently. If you’re a student thinking about studying clumpiness as a project, this is a *great* topic to explore: how do the size and placement of your grid affect the results of your investigation?

On our first time through this line of reasoning, however, I did not address these issues. They’re really interesting when you’re doing the investigation yourself, but will probably be pretty tedious and distracting if you’re just reading about it.

But just so we see, I ran the analysis with different bin sizes:

The clumpiness becomes increasingly unstable with large bins, probably because the number of bins declines, that is, there are fewer and fewer data points to work with.

Bin size is not the only issue, however. We can also pay attention to where the bins *start*. That is, the “picket fence” of the bins can slide left and right—and that will change the counts in the bins. The “offset” could range from 0 to 10 seconds for our 10-second bins. We can calculate the index of clumpiness for each offset:

Isn’t THAT interesting! How should we use this? It’s not clear. Informally, we could say that an index of about 1.5 looks reasonable as a summary of the possible values. There’s another issue as well: as you slide the picket fence of binning over data, different numbers of bins cover the domain. Also, therefore, the two *end* bins will often have artificially low counts, since they extend into a data-free zone. So we should probably discard the end bins—but that’s too much trouble for us today.
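Here is a sketch of the picket-fence slide, again with made-up uniform (non-clumpy) data, just to show that one data set gives ten different answers depending on the offset:

```python
import math
import random

random.seed(2)

def index_with_offset(times, bin_size, offset):
    """Variance/mean of bin counts, with the picket fence shifted by `offset`."""
    labels = [math.floor((t - offset) / bin_size) for t in times]
    counts = [labels.count(b) for b in range(min(labels), max(labels) + 1)]
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance / mean

times = [random.uniform(0, 580) for _ in range(126)]  # made-up, non-clumpy data
indices = [index_with_offset(times, 10, off) for off in range(10)]
print([round(i, 2) for i in indices])  # one data set, ten different answers
```

Note that this version keeps the partial end bins, which is exactly the artifact discussed above.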

---

- In the first, we discussed the problem in general and used a measure of clumpiness created by taking the mean of the distances from the stars to their nearest neighbors. The smaller this number, the clumpier the field.
- In the second, we divided the field up into bins (“cells”) and found the variance of the counts in the bins. The larger this number, the clumpier the field.

Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.

We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by *reducing its dimension* from 2 to 1. This is often a good strategy for making things more understandable.

Where do we see one-dimensional clumpiness? Here’s an example:

One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.

I watched people going from my right to my left. And I watched people going left to right. ¡*Que oportunidad*! This was data! I pulled out my computer, started up Fathom, and created a Fathom “experiment” so that I could record the time that a person passed my location simply by pressing a key. I used different keys for left-to-right and for right-to-left, and whether there was a cart involved. (These are the IAH carts that whisk people the vast distances they need to travel. Essential service. Obnoxious implementation.)

Here’s the graph:

It sure looks as if there is clumping. But is there really? And can we detect it? This is in time instead of space, but that doesn’t matter. It’s really the same issue. And the people-see-patterns-in-randomness problem is the same.

Before we do any calculations, let’s also think about the context. These are people walking between gates in an airport. Some are traveling alone, but many are probably traveling with colleagues, in couples, or in families. Another issue is whether fast walkers get stuck behind slow walkers, creating a clump because of traffic, not affiliation. Then, at a longer time scale, there should be an overall increase in traffic after a plane arrives. In any case, we have reason to believe that there will be clumping, so we hope we can detect it easily in the data.

As before, let’s first look at individual cases rather than bins. It’s easiest (in Fathom) to compute the time from the *previous* pedestrian rather than the time to the *nearest* pedestrian. Hoping that doesn’t make too much of a difference (famous last words…), here are graphs of the observed distribution of time gaps, called **gap**, and a distribution of gaps from a set of random times:

These are definitely different distributions—and in ways that imply clumping! The pile of short-interval points seems like a smoking gun.

We need a single number to characterize clumpiness. What would be good? The graphs show the mean times, which are identical. We might explain this with the idea that, in the clumpy group, there are more short gaps (reducing the mean), but also, because the people are closer together, more long gaps (increasing the mean). And we see that in the graph. But the real reason is that we have a given amount of time and a given number of people. In that case, the average time between people is the time divided by *N*. So we might in fact do better taking the trouble to find the minimum gap rather than the previous gap as we did with the stars.

That does show a difference, but instead let’s stick with the plain, “previous” gaps and look at the *median* of that distribution. That’s better than the mean in this case. (Think: why?)

In the graphs, we see that the “real” median is less than the random median, which is what we expect from clumping. Let’s do randomization-based inference again! We redo the random case many times, and build up a sampling distribution of that median gap. Is our observed median of 3.217 *plausible* if we assume the random, null hypothesis? Nope. Here is a set of 100 median gaps from random data:

We’ve chosen the median for two reasons: it makes sense, and it’s the first thing we tried that worked. What other measures might be just as good, or better, at characterizing the clumpiness? What else can we learn from the distribution? Here’s one telling question: suppose every traveler was in a couple, and no one traveled alone. What would the distribution of gaps look like? What would its median be? Sometimes, looking at extreme cases like this is a powerful tool.
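The whole randomization scheme is short enough to sketch in Python. The numbers here are my own stand-ins (580 seconds, 126 pedestrians, with the clumped data built from pairs about a second apart), not the airport data:

```python
import random

random.seed(3)

T, N = 580, 126  # my stand-ins: seconds observed, pedestrians counted

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def gaps(times):
    s = sorted(times)
    return [b - a for a, b in zip(s, s[1:])]

# Sampling distribution of the median gap under the random (null) model.
null_medians = [median(gaps([random.uniform(0, T) for _ in range(N)]))
                for _ in range(100)]

# Clumped data: every pedestrian is half of a pair about a second apart.
centers = [random.uniform(0, T) for _ in range(N // 2)]
clumped = centers + [c + abs(random.gauss(0, 1)) for c in centers]
observed = median(gaps(clumped))

print(f"null median gaps: ({min(null_medians):.2f}, {max(null_medians):.2f})")
print(f"clumped median gap: {observed:.2f}")  # well below typical null values
```

Notice that this also answers the all-pairs thought experiment: with everyone paired, roughly half the gaps are tiny within-pair gaps, which drags the median way down.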

When we did the stars, a completely different approach—putting the data into bins—worked well. Let’s do that for this one-dimensional, real data set. Next post.

---

What other measures could we use?

It turns out that the Professionals have some. I bet there are a lot of them, but the one I dimly remembered from my undergraduate days was the “index of clumpiness,” made popular—at least among astronomy students—by Neyman (*that* Neyman), Scott, and Shane in the mid-50s. They were working with Shane (& Wirtanen)’s catalog of galaxies, studying the galaxies’ clustering. We are simply asking, is there clustering? They went much further, and asked, how much clustering is there, and what are its characteristics?

They are the Big Dogs in this park, so we will take lessons from them. They began with a lovely idea: instead of looking at the galaxies (or stars) as individuals, divide up the sky into smaller regions, and count how many fall in each region.

We can do that. We divide that square of sky into a 10-by-10 grid. 100 cells. Now, instead of dealing with 1000 ordered pairs of numbers, we have 100 integers, the numbers of dots in each cell. Much easier.

This shows a star field with *K* = 0, divided into a grid of 100 cells. The graph is the distribution of counts from those cells. Notice how the distribution centers at about 10. This should make sense: we have 1000 stars distributed among 100 cells: the average number of stars in each cell is (exactly) 10.

Now let’s look at the same things but for *K* = 1.0, that is, very clumped:

Wow. The distribution sure is different; if you think about it, it’s clear why. The densest cells are lots denser than in the uniform *K* = 0 case, and to make up for that, there are a huge number of cells with very few stars.

To do inference on this, though, we need a single number (or measure, or statistic) that characterizes the distribution. Should we use mean, as we did with minimum distance? NO! The mean is 10 in both cases!

There are many possible measures to take, such as the maximum count, or, maybe the 10th-highest count (the 90th percentile). Those might work, and you should try them.

But the big dogs—Neyman and Scott—used a *measure of spread*. You could use standard deviation or IQR. But they used *variance*, for a really sweet reason, which we’ll get to later.

Meanwhile, recall that the variance is the square of the standard deviation. It’s the *mean square deviation* from the overall mean.

In our case, we have 100 numbers: 100 counts of how many stars fell in a particular cell. The overall mean is 10, as we have discussed. To compute the variance, go to each cell; figure out how far that count is from 10; square that amount; add up the 100 squares; then divide by 100 (or 99, if you’re that way).

Let’s look at *K* = 0. Of course, every random star field will have a different variance. So we made many (200) random star fields, and for each one, counted the stars in the 100 cells to get a variance for each field. The “typical” variance was about 10. In the 200 fields, about 5% had a variance below 7.7, and 5% were above 12.4. That is, 90% of all *K* = 0.0 random star fields had a variance in the interval [7.7, 12.4].
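That simulation is easy to reproduce. Here is a sketch in Python (stars uniform on a unit square, a 10-by-10 grid, 200 fields — the same setup described above):

```python
import random

random.seed(4)

def field_variance(n_stars=1000, grid=10):
    """Drop stars uniformly on the unit square, count them in grid*grid cells,
    and return the (population) variance of those counts."""
    counts = [0] * (grid * grid)
    for _ in range(n_stars):
        x, y = random.random(), random.random()
        counts[int(x * grid) * grid + int(y * grid)] += 1
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

variances = sorted(field_variance() for _ in range(200))
p5, p95 = variances[10], variances[189]
print(f"90% of K = 0 fields had variance in about ({p5:.1f}, {p95:.1f})")
```

With a different random seed your interval will wiggle a bit, but it should land close to the [7.7, 12.4] window quoted above.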

We computed this interval for many values of *K*.

You can also see that a typical variance at *K* = 1.0 is above 200, which makes sense looking back at the super skewed and spread out distribution above. (Remember, that 200+ is the *square* of the standard deviation.)

Let’s look in detail at an intermediate case. In fact, let’s look at *K* = 0.14, the value that, last time, we could not distinguish from randomness using our mean-minimum technique.

Here is a star field at *K* = 0.14:

And now, its distribution of counts, with the distribution for a *K* = 0.0 field for comparison:

You can see that there are four cells on the right that are higher than anything at *K* = 0.0, and if you squint, you might believe that the peak—a typical number of stars for most of the cells—is a little lower, maybe 9 instead of 10. (And that makes sense; if you drain off 40-plus “extra” stars into the center of the cluster, the cells on the outside will be depleted.)

And if we compute the variance? 12.9, which is outside that [7.7, 12.4] “90%” window we computed from doing lots of iterations with *K* = 0.

That is, with this measure, *we can detect the clumpiness*, whereas we could not with the one we invented in the previous post.

Whew. That had some hard ideas. As teachers, we look for easier approaches. And good old Pólya often suggested looking for lower dimensionality. Great idea! Let’s do this in one dimension instead of two. Next time.

Meanwhile:

Now. Why is variance cool here? Turns out that if you place things randomly into bins, and look at the distribution of counts—which is what we did here for *K* = 0.0—the distribution of numbers follows a Poisson distribution. Here’s the formula for the distribution for a Poisson-ly distributed random variable *X*:

P(*X* = *k*) = λ^*k* e^(−λ) / *k*!

In our case, λ = 10. So the probability of getting 12 (say) in a cell is

P(*X* = 12) = 10¹² e^(−10) / 12! ≈ 0.095.

You can check that it makes sense in the distribution above. But what you really need to know is that the mean of this distribution is λ, and *so is the variance*. That is, we know what the variance is supposed to be if the stars are random, and that variance is just the mean: the total number of stars divided by the number of cells. In our case, 10.
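Both claims—the probability of a 12-count cell, and mean-equals-variance—are quick to check numerically:

```python
import math

lam = 10  # 1000 stars / 100 cells

# Probability that a cell holds exactly 12 stars.
p12 = lam ** 12 * math.exp(-lam) / math.factorial(12)
print(round(p12, 4))  # 0.0948

# The mean and the variance of Poisson(lam) are both lam.
# (Build the pmf iteratively: p_k = p_{k-1} * lam / k.)
pmf = [math.exp(-lam)]
for k in range(1, 200):
    pmf.append(pmf[-1] * lam / k)
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
print(round(mean, 6), round(variance, 6))  # 10.0 10.0
```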

And this means that for any field of 1000 stars cut into 100 cells, where the counts in the cells are *n*₁, *n*₂, …, *n*₁₀₀, we can define the *index of clumpiness* as a ratio:

index = (variance of the counts) / (mean of the counts).

If a pattern is non-clumpy and random, i.e., Poisson, this index will be close to 1.0; when it’s really clumpy, the index will be large.

To generalize: if we have *N* stars in *n* cells, the average count is *N*/*n*. With some algebra, the general formula becomes:

index = (1/*N*) · Σ (*n*ᵢ − *N*/*n*)², summing over all *n* cells.
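A quick check that the variance-over-mean ratio and the simplified form agree, using some made-up cell counts:

```python
# Made-up counts for n = 10 cells.
counts = [13, 7, 9, 14, 3, 10, 8, 11, 6, 10]
n = len(counts)
N = sum(counts)  # total number of "stars"
mean = N / n
variance = sum((c - mean) ** 2 for c in counts) / n  # population variance (divide by n)

index_as_ratio = variance / mean                            # variance over mean
index_general = sum((c - N / n) ** 2 for c in counts) / N   # the simplified form
print(index_as_ratio, index_general)                        # the same, up to rounding
```

Note the population variance (divide by *n*, not *n* − 1) is the one that makes the algebra come out exactly.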

Although this is not part of the introductory stats curriculum, it’s pretty interesting, and maybe more accessible, with all our technology, than it used to be. I’d be curious to know more applications. Scientists must look at clumpiness in all sorts of contexts. Traffic, queueing, ecological models, who knows? (You do. Let me know.)

---

There really is such a thing. Some background: The illustration shows a random collection of 1000 dots. Each coordinate (*x* and *y*) is a (pseudo-)random number in the range [0, 1) — multiplied by 300 to get a reasonable number of pixels.

The point is that we can all see patterns in it. Me, I see curves and channels and little clumps. If they were stars, I’d think the clumps were star *clusters*, gravitationally bound to each other.

But they’re not. They’re random. The patterns we see are self-deception. This is related to an activity many stats teachers have used, in which the students are to secretly record a set of 100 coin flips, in order, and also *make up* a set of 100 random coin flips. The teacher returns to the room and can instantly tell which is the real one and which is the fake. It’s a nice trick, but easy: students usually make the coin flips too uniform. There aren’t enough streaks. Real randomness tends to have things that look non-random.
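You can see the too-few-streaks effect in a quick simulation. This sketch finds the longest run of identical faces in real (simulated) sequences of 100 flips:

```python
import random

random.seed(7)

def longest_run(flips):
    """Length of the longest streak of identical flips in the sequence."""
    best = run = 1
    for a, b in zip(flips, flips[1:]):
        run = run + 1 if a == b else 1
        best = max(best, run)
    return best

runs = sorted(longest_run([random.choice("HT") for _ in range(100)])
              for _ in range(1000))
print(runs[len(runs) // 2])  # typical longest run — longer than most people fake
```

Real sequences of 100 flips routinely contain a streak of six or seven; students faking data rarely write one longer than four or five, which is what gives them away.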

Here is a snap from a classroom activity:

That whole activity is worth some discussion, but not today. The question is, suppose we’ve learned our lesson: We know that streakiness can be random. We know that the stars can show clumps even when they’re random.

Now suppose there really is some clumping in the pattern of stars. How would we tell?

Could we do a test? Sure. But in order to do a traditional statistical test, we need a single number that characterizes how clumpy the pattern is.

If we had such a number (a *measure*, a *statistic*) we could then do the randomization dance: Make a lot of truly random patterns and compute the statistic. Assemble those stats into a *sampling distribution*. Then compute that same quantity for our actual pattern. If that *test statistic* falls outside the sampling distribution we made, it’s implausible that the distribution is random.

To decide what statistic would make sense, let’s look at some clumpy star fields. There are a lot of ways to make them clumpy, so I chose to make a single clump, right in the middle. I control the clumpiness with a parameter I call *K* (for clump), which is zero for no clumpiness, and 1 for total clumpiness (i.e., all stars are in the cluster). Here are *K* = zero, one-half, and one:

What statistic could you use? As usual, you should go off and think about this. But because I’m trying to record my thinking, I’m going to tell you what I came up with.

Here’s one idea: compute, for every star, the distance to the next closest star. Then I would look at that distribution—the distances to the nearest neighbors. It stands to reason that if the stars are clumped, the nearest neighbors would be *closer*, so the distributions would be centered lower. Maybe the mean of that distribution would work as a measure of clumpiness.

Sure enough! The means go down as the clumpiness goes up. Look at *K* = 0. Does that mean that, in a uniform random field, the mean distance to a neighbor is always 1.65 units? No: because there is randomness, the means will fluctuate. So we did this 100 times and recorded the means. And we found the 5th and 95th percentiles. This gives us a 90% “plausibility interval” for *K* = 0, which is roughly from 1.55 to 1.65. That is, our example (at 1.65) is a little unusual.
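Here is a sketch of the nearest-neighbor measure in Python. The field size, number of stars, and the Gaussian cluster are my own choices, so the numbers won’t match the ones above, but the clumped field should come out with the smaller mean:

```python
import math
import random

random.seed(5)

def mean_nearest_neighbor(points):
    """Average distance from each point to its closest neighbor (brute force)."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        total += min(math.hypot(x1 - x2, y1 - y2)
                     for j, (x2, y2) in enumerate(points) if j != i)
    return total / len(points)

SIZE, N = 300, 300
uniform = [(random.uniform(0, SIZE), random.uniform(0, SIZE)) for _ in range(N)]

# Clumped field: half the stars uniform, half in a Gaussian cluster in the middle.
cluster = [(random.gauss(SIZE / 2, 10), random.gauss(SIZE / 2, 10))
           for _ in range(N // 2)]
clumped = uniform[:N // 2] + cluster

mnn_u = mean_nearest_neighbor(uniform)
mnn_c = mean_nearest_neighbor(clumped)
print(round(mnn_u, 2), round(mnn_c, 2))  # the clumped field's mean is smaller
```

The brute-force double loop is fine at this size; for thousands of stars you’d want a spatial data structure such as a k-d tree.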

Skipping the step where we actually *see* those sampling distributions (if you’re doing this for a school project, don’t skip this step!), we can get the next graph, which shows the 5th and 95th percentiles for more values of *K*. Reading this graph, you can see that if *K* = 0.5, you’re nearly 90% certain to be outside that plausibility interval. (Oooh! Power!)

At a more modest value of *K*, however, it’s not so clear. The next illustration shows a field where *K* = 0.14. Its mean minimum distance is about 1.61—which is in the middle of the plausibility interval for *K* = 0, the random (null) case. According to our procedure, it’s completely plausible that this field is random.

If you know what to look for, you can kinda sorta see the cluster. But our statistic can’t find it. And if you didn’t know what to look for—like if you didn’t know the cluster was right in the middle—this field would not look any different than the *K* = 0 example up above.

Can we do better? You bet. Next post.

---

Seeing how the two approaches fit together, yet are so different, helps illuminate why confidence intervals can be so tricky.

Anyway, I promised a **Very Compelling Real-Life Application of This Technique**. I had thought about talking to fisheries people, but even though capture/recapture somehow is nearly always introduced in a fish context, of course it doesn’t have to be. Here we go:

I’ve just recently been introduced to an outfit called the Human Rights Data Analysis Group. Can’t beat them for statistics that matter, and I really have to say, a lot of the explanations and writing on their site is excellent. If you’re looking for Post-AP ideas, as well as caveats about data for everyone, this is a great place to go.

One of the things they do is try to figure out how many people get killed in various trouble areas and in particular events. You get one estimate from some left-leaning NGO. You get another from the Catholics. Information is hard to get, and lists of the dead are incomplete. So it’s not surprising that different groups get different estimates. Whom do you believe?

Suppose the LLNGO thinks that 20 civilians were killed during a protest. They give you a list of the victims. At the same time, the Catholics think that 25 civilians were killed. They have a list too.

Now the key thing: *you compare the lists*. Suppose the lists have five names in common. Now you’re super confident that at least 40 people were killed, because you have 40 names. But when you think about it, this is exactly the same as capture/recapture—and you can calculate that it’s likely that 100 people were killed.
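The arithmetic here is the classic Lincoln–Petersen estimate: multiply the two list sizes and divide by the overlap. A sketch, with hypothetical name lists:

```python
def lincoln_petersen(list_a, list_b):
    """Estimate total population size from two overlapping, incomplete lists."""
    overlap = len(set(list_a) & set(list_b))
    return len(list_a) * len(list_b) / overlap

# Hypothetical victim lists: 20 names and 25 names, with 5 names in common.
ngo_list = [f"person{i}" for i in range(20)]           # persons 0..19
catholic_list = [f"person{i}" for i in range(15, 40)]  # persons 15..39
print(lincoln_petersen(ngo_list, catholic_list))       # 100.0
```

The logic: if the Catholics found 5 of the LLNGO’s 20 victims, they apparently find about a quarter of everyone, so their 25 names suggest roughly 100 deaths in all.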

And this, at root, is how these folks actually make better estimates for the effects of tyranny, torture, war crimes, and other institutionalized misconduct around the world. But only at root. The techniques get more sophisticated in interesting ways that I’m only starting to understand.

For example, one problem with what I just described is the same as when we get realistic about the fish. Why should we assume that every fish in the lake, tagged or untagged, has the same probability of getting caught?

In the human-rights context, this issue is called *list independence*. That is, do the LLNGO and the Catholics have the same chance of listing each person killed? Of course not. Relatives who might talk to the Catholics might not even speak to the LLNGO, and vice versa. Or they might have been doing their counts in different geographical parts of the protest.

It turns out that, using techniques that seem to be called MSE for Multiple Systems Estimation, you can try to account for list independence and other potential problems, provided you have *three or more lists*. I’m intrigued to study up on this and learn more, and you can too! Here is the first of a series of blog posts by Amelia Hoover Green. Follow the links to the next chapters. See what you think.

Meanwhile, think about one of the overarching problems from the last post: that the interval for the population estimate was so wide. I think part of the problem is that when my brain is stuck in the fish context, I think it’s impractical to imagine tagging, say, half of the fish in the lake. But you can see from the graphs (and common sense) that if the fraction tagged is small, and the numbers are kind of small, the resulting distributions will be wide and sparse.

But in this context, we probably start out with the (wrong) assumption that we have almost all of the deaths; so our estimates will be better.

This graph shows estimates of population with, again, a true population of 100, but this time with half of the fish—50 cases—captured and tagged, and 50 recaptured. This is equivalent to each (independent) list having 50 casualties, with varying numbers of names (centered on 25) overlapping.

Notice how the interval has shrunk: it’s now (80, 135) instead of (60, 200).
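If you want to check that at home, here’s one quick way. This is a sketch, not the slider procedure from the last post: it just takes the middle 90% of simulated estimates, which lands in the same ballpark:

```python
import random

random.seed(4)
P_TRUE, T, R, TRIALS = 100, 50, 50, 10_000   # tag half, recapture half

# Q ~ Binomial(R, T/P_TRUE); the q > 0 guard is just paranoia at p = 0.5
qs = [sum(random.random() < T / P_TRUE for _ in range(R)) for _ in range(TRIALS)]
estimates = sorted(T * R / q for q in qs if q > 0)

n = len(estimates)
lo, hi = estimates[int(0.05 * n)], estimates[int(0.95 * n)]
print(round(lo), round(hi))   # about (81, 132): far narrower than (60, 200)
```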

---

If you’ve been awake and paying attention to stats education, you must have come across capture/recapture and associated classroom activities.

The idea is that you catch 20 fish in a lake and tag them. The next day, you catch 25 fish and note that 5 are tagged. The question is, how many fish are in the lake? The canonical answer is 100: having 5 tagged in the 25 suggests that 1/5 of all fish are tagged; if 20 fish are tagged, then the total number must be 100. Right?

Sort of. After all, we’ve made a lot of assumptions, such as that the fish instantly and perfectly mix, and that when you fish you catch a random sample of the fish in the lake. Not likely. But even supposing that were true, there must be sampling variability: if 20 out of 100 fish were tagged and you catch 25, you will not always catch 5 tagged ones. And then, looking at it the twisted, Bayesian-smelling other way: if you did catch 5, there are lots of other plausible numbers of fish there might be in the lake.

Let’s do those simulations.

We assume there are actually 100 fish in the lake, of which 20 are tagged. We sample 25 and count the tagged fish, then infer the size of the population.

Let *T* be the total number of tagged fish (20), *R* the number that you recapture (25), *P* the population, and *Q* the number of tagged fish in the recaptured set. Our estimate for *P* is:

*P* = *T* × *R* / *Q*

So if *Q* = 5, our estimate is 100.

In the simulation, then, we’ll let *Q* be a random binomial, choosing from *R* events with a probability of (*T*/100). Here are 100 estimates for the population *P*:
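In case you want to reproduce it, here’s a sketch of that simulation in stdlib Python (seeded for reproducibility; your 100 estimates will differ in detail):

```python
import random

random.seed(1)
P_TRUE, T, R = 100, 20, 25   # true population, tagged fish, recapture sample

def one_estimate():
    # Q ~ Binomial(R, T/P_TRUE) -- the shortcut described above; strictly,
    # sampling without replacement would make Q hypergeometric.
    q = sum(random.random() < T / P_TRUE for _ in range(R))
    return T * R / q if q > 0 else None   # q = 0 yields no estimate

estimates = [e for e in (one_estimate() for _ in range(100)) if e is not None]
print(min(estimates), max(estimates))
```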

I look at this and think, crikey, that’s *terrible*! It’s really easy to be off by 50% or more.

Of course, things change if you change the setup. For example, if you tag a large fraction of the fish in the lake, the estimates get better and better. But the point is to be able to estimate the population well without actually counting them, right? (Well, that partly misses the point, as we will see in the next post. But that’s what I thought for a long time.)

Now we do it the other way. Suppose we know the population *P* and the two sample sizes *T* and *R*. That is, suppose we know there are 100 fish in the lake, that we will tag 20 of them, then, later, catch 25. We’ll get some number of tagged fish in the 25. We expect about 5, but it will vary.

That’s what you see in the figure: 100 examples of the number of tagged fish you will “recapture” under our initial assumption that there are 100 fish in the lake. The vertical lines at 2 and 8 are at the 5th and 95th percentiles. Notice that the expected value, 5, is in the middle of the distribution.
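A sketch of that forward simulation in Python (same binomial shortcut as before; with enough trials the percentiles land at, or right next to, 2 and 8):

```python
import random

random.seed(2)
P, T, R, TRIALS = 100, 20, 25, 10_000

# Q = tagged fish among the R recaptured, when T of the P fish carry tags
draws = sorted(sum(random.random() < T / P for _ in range(R))
               for _ in range(TRIALS))

p5, p95 = draws[int(0.05 * TRIALS)], draws[int(0.95 * TRIALS)]
print(p5, p95)   # about 2 and 8, like the vertical lines in the figure
```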

We wonder (*à la* doing a confidence interval): What is the range of populations for which getting 5 tagged fish out of 25 is *plausible*? We’ll vary the true population (*P* will vary up and down from 100) and see where that “5” is in the distribution.

We could figure it out exactly using the binomial distribution (or, stricter still, the hypergeometric, since we sample without replacement), but if we just simulate it, we get an answer of “between about 60 and 200,” which is also terrible. (I put the population on a slider, and slid it back and forth until the 5th and 95th percentiles stayed more or less at 5 fish. That gives me a 90% confidence interval.)
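You can automate the slider hunt: sweep candidate populations and keep the ones where the observed *Q* = 5 sits inside the middle 90% of simulated outcomes. A sketch (the coarse step size keeps it quick, and the endpoints wobble a bit from run to run):

```python
import random

random.seed(3)
T, R, Q_OBSERVED, TRIALS = 20, 25, 5, 5_000

def band(pop):
    """Simulated 5th and 95th percentiles of Q for a candidate population."""
    draws = sorted(sum(random.random() < T / pop for _ in range(R))
                   for _ in range(TRIALS))
    return draws[int(0.05 * TRIALS)], draws[int(0.95 * TRIALS)]

# A population is "plausible" if the observed Q = 5 falls inside its band.
plausible = []
for pop in range(30, 400, 10):
    lo, hi = band(pop)
    if lo <= Q_OBSERVED <= hi:
        plausible.append(pop)

print(min(plausible), max(plausible))   # a wide interval, in the 60-to-200-ish ballpark
```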

The next two illustrations show the distribution of *Q*, the number of tagged fish in the sample, for populations *P* of 60 and 200.

So sure, playing with and eating goldfish crackers in class is great, and you certainly could do it several times and average as in this fine post from ispeakmath. And no question that the proportional reasoning here is just the kind of thing we want kids to do. But doesn’t it bother you that a real fisheries person would probably not repeat this procedure and average in order to get a better estimate?

I was tempted to find out what real fisheries people DO do, but never got around to it. But then, a couple days ago, I came across a Very Compelling Real-Life Application Of This Technique. Stay tuned.

---