If you’ve been awake and paying attention to stats education, you must have come across capture/recapture and associated classroom activities.
The idea is that you catch 20 fish in a lake and tag them. The next day, you catch 25 fish and note that 5 are tagged. The question is, how many fish are in the lake? The canonical answer is 100: having 5 tagged in the 25 suggests that 1/5 of all fish are tagged; if 20 fish are tagged, then the total number must be 100. Right?
Sort of. After all, we’ve made a lot of assumptions, such as that the fish instantly and perfectly mix, and that when you fish you catch a random sample of the fish in the lake. Not likely. But even supposing that were true, there must be sampling variability: if there were 20 out of 100 tagged, and you catch 25, you will not always catch 5 tagged fish; and then, looking at it the twisted, Bayesian-smelling other way, if you did catch 5, there are lots of other plausible numbers of fish there might be in the lake.
Let’s do those simulations.
Easy one first
We assume there are actually 100 fish in the lake, of which 20 are tagged. We sample 25 and count the tagged fish, then infer the size of the population.
Let T be the total number of tagged fish (20), R the number that you recapture (25), P the population, and Q the number of tagged fish in the recaptured set. Our estimate for P is P = TR/Q: the tagged fraction Q/R of the sample should match the tagged fraction T/P of the population. With T = 20 and R = 25, that's 500/Q.
So if Q = 5, our estimate is 100.
In the simulation, then, we’ll let Q be a random binomial, choosing from R events with a probability of (T/100). Here are 100 estimates for the population P:
I look at this and think, crikey, that’s terrible! It’s really easy to be off by 50% or more.
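If you want to reproduce that picture yourself, here is a minimal sketch of the simulation, using the post's binomial model (each recaptured fish is tagged with probability T/P) rather than the exact hypergeometric; the variable names are mine:

```python
import random

random.seed(1)

T = 20        # tagged fish
R = 25        # size of the recapture sample
P_TRUE = 100  # true population, assumed known for the simulation

def estimate_population():
    # The post's model: Q ~ Binomial(R, T/P_TRUE),
    # i.e., each recaptured fish is tagged with probability 20/100.
    q = sum(random.random() < T / P_TRUE for _ in range(R))
    # Estimate P = T*R/Q; undefined when q == 0 (rare here: about 0.4%)
    return T * R / q if q > 0 else None

estimates = [e for e in (estimate_population() for _ in range(100))
             if e is not None]
print(min(estimates), max(estimates))  # the spread is typically large
```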
Of course, things change if you change the setup. For example, if you tag a large fraction of the fish in the lake, the estimates get better and better. But the point is to be able to estimate the population well without actually counting them, right? (Well, that partly misses the point, as we will see in the next post. But that’s what I thought for a long time.)
Now the Harder Version
Now we do it the other way. Suppose we know the population P and the two sample sizes T and R. That is, suppose we know there are 100 fish in the lake, that we will tag 20 of them, then, later, catch 25. We’ll get some number of tagged fish in the 25. We expect about 5, but it will vary.
That’s what you see in the figure: 100 examples of the number of tagged fish you will “recapture” under our initial assumption that there are 100 fish in the lake. The vertical lines at 2 and 8 are at the 5th and 95th percentiles. Notice that the expected value, 5, is in the middle of the distribution.
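The simulation behind that figure can be sketched the same way, again under the post's binomial model (names and sample count are my choices):

```python
import random

random.seed(2)

T, R, P = 20, 25, 100
N = 10_000  # number of simulated recapture days

# Q ~ Binomial(R, T/P): tagged count in each simulated recapture
qs = sorted(sum(random.random() < T / P for _ in range(R)) for _ in range(N))

# empirical 5th and 95th percentiles of Q
lo, hi = qs[int(0.05 * N)], qs[int(0.95 * N)]
print(lo, hi)  # should land close to the 2 and 8 marked in the figure
```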
We wonder (à la doing a confidence interval): What is the range of populations for which getting 5 tagged fish out of 25 is plausible? We’ll vary the true population (P will vary up and down from 100) and see where that “5” is in the distribution.
We could figure it out exactly using the binomial distribution, but if we just simulate it, we get an answer of “between about 60 and 200,” which is also terrible. (I put the population on a slider, and slid it back and forth until the 5th and 95th percentiles stayed more or less at 5 fish. That gives me a 90% confidence interval.)
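Instead of a slider, the same search can be scripted: sweep candidate populations and keep the ones for which the observed 5 falls inside the middle 90% of simulated Q values. A rough sketch; the sweep range, step size, and exact inclusion rule are my choices, and the endpoints are fuzzy, so expect only the same ballpark as the slider experiment:

```python
import random

random.seed(3)

T, R = 20, 25
OBSERVED = 5  # tagged fish we actually saw in the recapture sample
N = 2000      # simulated recaptures per candidate population

def middle_90(P):
    """Empirical 5th and 95th percentiles of Q when the population is P."""
    qs = sorted(sum(random.random() < T / P for _ in range(R))
                for _ in range(N))
    return qs[int(0.05 * N)], qs[int(0.95 * N)]

plausible = []
for P in range(30, 301, 10):
    lo, hi = middle_90(P)
    if lo <= OBSERVED <= hi:
        plausible.append(P)

print(plausible[0], plausible[-1])  # roughly the post's "about 60 to 200"
```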
The next two illustrations show the distribution of Q, the number of tagged fish in the sample, for populations P of 60 and 200.
So sure, playing with and eating goldfish crackers in class is great, and you certainly could do it several times and average as in this fine post from ispeakmath. And no question that the proportional reasoning here is just the kind of thing we want kids to do. But doesn’t it bother you that a real fisheries person would probably not repeat this procedure and average in order to get a better estimate?
I was tempted to find out what real fisheries people DO do, but never got around to it. But then, a couple days ago, I came across a Very Compelling Real-Life Application Of This Technique. Stay tuned.