In the last two posts, we talked about clumpiness in two-dimensional “star fields.”
In the first, we discussed the problem in general and used a measure of clumpiness created by taking the mean of the distances from the stars to their nearest neighbors. The smaller this number, the clumpier the field.
In the second, we divided the field up into bins (“cells”) and found the variance of the counts in the bins. The larger this number, the clumpier the field.
Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.
We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by reducing its dimension from 2 to 1. This is often a good strategy for making things more understandable.
Where do we see one-dimensional clumpiness? Here’s an example:
One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.
Last time, we discussed random and not-so-random star fields, and saw how we could use the mean of the minimum distances between stars as a measure of clumpiness. The smaller the mean minimum distance, the more clumpy.
What other measures could we use?
It turns out that the Professionals have some. I bet there are a lot of them, but the one I dimly remembered from my undergraduate days was the “index of clumpiness,” made popular—at least among astronomy students—by Neyman (that Neyman), Scott, and Shane in the mid-50s. They were studying Shane (& Wirtanen)’s catalog of galaxies and studying the galaxies’ clustering. We are simply asking, is there clustering? They went much further, and asked, how much clustering is there, and what are its characteristics?
They are the Big Dogs in this park, so we will take lessons from them. They began with a lovely idea: instead of looking at the galaxies (or stars) as individuals, divide up the sky into smaller regions, and count how many fall in each region.
Trying to get yesterday’s post out quickly, I touched only lightly on how to set up the various simulations. So consider them exercises for the intermediate-level simulation maker. I find it interesting how, right after a semester of teaching this stuff, I still have to stop and think how it needs to work. What am I varying? What distribution am I looking at? What does it represent?
Seeing how the two approaches fit together, yet are so different, helps illuminate why confidence intervals can be so tricky.
Anyway, I promised a Very Compelling Real-Life Application of This Technique. I had thought about talking to fisheries people, but even though capture/recapture somehow is nearly always introduced in a fish context, of course it doesn’t have to be. Here we go:
Human Rights and Capture/Recapture
I’ve just recently been introduced to an outfit called the Human Rights Data Analysis Group. Can’t beat them for statistics that matter, and I really have to say, a lot of the explanations and writing on their site is excellent. If you’re looking for Post-AP ideas, as well as caveats about data for everyone, this is a great place to go.
One of the things they do is try to figure out how many people get killed in various trouble areas and in particular events. You get one estimate from some left-leaning NGO. You get another from the Catholics. Information is hard to get, and lists of the dead are incomplete. So it’s not surprising that different groups get different estimates. Whom do you believe?
If you’ve been awake and paying attention to stats education, you must have come across capture/recapture and associated classroom activities.
The idea is that you catch 20 fish in a lake and tag them. The next day, you catch 25 fish and note that 5 are tagged. The question is, how many fish are in the lake? The canonical answer is 100: having 5 tagged in the 25 suggests that 1/5 of all fish are tagged; if 20 fish are tagged, then the total number must be 100. Right?
Sort of. After all, we’ve made a lot of assumptions, such as that the fish instantly and perfectly mix, and that when you fish you catch a random sample of the fish in the lake. Not likely. But even supposing that were true, there must be sampling variability: if there were 20 out of 100 tagged, and you catch 25, you will not always catch 5 tagged fish; and then, looking at it the twisted, Bayesian-smelling other way, if you did catch 5, there are lots of other plausible numbers of fish there might be in the lake.
It’s such a joy when my daughter asks for help with math. It used to happen all the time; it’s rare now. She just started medical school, and had come home for the weekend to get a quiet space for concentrated study.
“Dad, I have a statistics question.” Be still, my heart!
“It’s asking, if you have a random mRNA sequence with 2000 base pairs, how many times do you expect the stop codon AUG to appear? How do you figure that out?”
I got her to explain enough about messenger RNA so that I could picture this random sequence of 2000 characters, each one A, U, G, or C, and remembered from somewhere that a codon was a chunk of three of these.
“I think it’s more of a probability, or combinatoric question than stats…” I said. (I was wrong about that; interval estimates come up later. Read on.)