In the last two posts, we talked about clumpiness in two-dimensional “star fields.”
- In the first, we discussed the problem in general and used a measure of clumpiness created by taking the mean of the distances from the stars to their nearest neighbors. The smaller this number, the clumpier the field.
- In the second, we divided the field up into bins (“cells”) and found the variance of the counts in the bins. The larger this number, the clumpier the field.
Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.
We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by reducing its dimension from 2 to 1. This is often a good strategy for making things more understandable.
Where do we see one-dimensional clumpiness? Here’s an example:
One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.
I watched people going from my right to my left. And I watched people going left to right. ¡Qué oportunidad! This was data! I pulled out my computer, started up Fathom, and created a Fathom “experiment” so that I could record the time that a person passed my location simply by pressing a key. I used different keys for left-to-right and for right-to-left traffic, and for whether a cart was involved. (These are the IAH carts that whisk people the vast distances they need to travel. Essential service. Obnoxious implementation.)
Here’s the graph:
It sure looks as if there is clumping. But is there really? And can we detect it? This is in time instead of space, but that doesn’t matter. It’s really the same issue. And the people-see-patterns-in-randomness problem is the same.
Before we do any calculations, let’s also think about the context. These are people walking between gates in an airport. Some are traveling alone, but many are probably traveling with colleagues, in couples, or in families. Another issue is whether fast walkers get stuck behind slow walkers, creating a clump because of traffic, not affiliation. Then, at a longer time scale, there should be an overall increase in traffic after a plane arrives. In any case, we have reason to believe that there will be clumping, so we hope we can detect it easily in the data.
As before, let’s first look at individual cases rather than bins. It’s easiest (in Fathom) to compute the time from the previous pedestrian rather than the time to the nearest pedestrian. Hoping that doesn’t make too much of a difference (famous last words…), here are graphs of the observed distribution of time gaps, called gap, and a distribution of gaps from a set of random times:
These are definitely different distributions—and in ways that imply clumping! The pile of short-interval points seems like a smoking gun.
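For readers without Fathom, here is a minimal Python sketch of the same comparison. The arrival times and the session length below are made up for illustration (they are not the airport data), and the function names are my own; the random case assumes arrival times drawn uniformly over the session.

```python
import random
import statistics

def gaps_from_previous(times):
    """Time from each arrival back to the one before it (the first has no gap)."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def random_times(n, duration, rng):
    """n arrival times drawn independently and uniformly on [0, duration]."""
    return [rng.uniform(0, duration) for _ in range(n)]

rng = random.Random(1)

# Hypothetical stand-in for the recorded times, in seconds (NOT the real data).
observed = [2.0, 3.1, 3.9, 40.5, 41.0, 95.2, 96.0, 97.3, 150.4]
duration = 160.0  # assumed length of the observing session

observed_gaps = gaps_from_previous(observed)
null_gaps = gaps_from_previous(random_times(len(observed), duration, rng))

print("observed median gap:", statistics.median(observed_gaps))
print("random median gap:  ", statistics.median(null_gaps))
```

Plotting `observed_gaps` and `null_gaps` as dot plots or histograms gives the kind of side-by-side comparison described above.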
We need a single number to characterize clumpiness. What would be good? The graphs show the mean gaps, which are identical. We might try to explain this with the idea that, in the clumpy group, there are more short gaps (reducing the mean) but also, because the people are bunched together, more long gaps between the bunches (increasing the mean). And we see that in the graph. But the real reason is that we have a given amount of time and a given number of people, so the average time between people is essentially the total time divided by N, clumpy or not. So we might in fact do better taking the trouble to find the minimum gap (the time to the nearest pedestrian, before or after), as we did with the stars, rather than just the previous gap.
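A small sketch makes the point about the mean concrete. The two schedules below are invented: they share the same first time, last time, and head count, so their previous-gap means must agree, while the nearest-neighbor gaps do not. The helper names are hypothetical.

```python
import statistics

def gaps_from_previous(times):
    """Gaps between consecutive arrivals, in time order."""
    ordered = sorted(times)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def nearest_gaps(times):
    """For each arrival, the smaller of the gaps to its two neighbors."""
    gaps = gaps_from_previous(times)
    # The first and last arrivals have only one neighbor each;
    # interior arrivals take the minimum of the gap before and the gap after.
    interior = [min(before, after) for before, after in zip(gaps, gaps[1:])]
    return [gaps[0]] + interior + [gaps[-1]]

# Two made-up schedules: same span (0 to 100) and same number of people.
clumpy = [0.0, 1.0, 2.0, 50.0, 51.0, 100.0]
even = [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]

for name, times in [("clumpy", clumpy), ("even", even)]:
    print(name,
          "| mean previous gap:", statistics.mean(gaps_from_previous(times)),
          "| mean nearest gap:", statistics.mean(nearest_gaps(times)))
```

The previous gaps always add up to the last time minus the first, so their mean cannot tell the two schedules apart; the nearest-neighbor version can.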
That does show a difference, but instead let’s stick with the plain, “previous” gaps and look at the median of that distribution. That’s better than the mean in this case. (Think: why?)
In the graphs, we see that the “real” median is less than the random median, which is what we expect from clumping. Let’s do randomization-based inference again! We redo the random case many times, and build up a sampling distribution of that median gap. Is our observed median of 3.217 plausible if we assume the random, null hypothesis? Nope. Here is a set of 100 median gaps from random data:
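The randomization procedure itself can be sketched in a few lines of Python. Everything numeric here is an assumption: the "observed" data are synthetic clumps generated on the spot (groups of three over a 600-second session), not the airport recording, and the helper names are my own. The null model is uniform random arrival times over the same session with the same number of people.

```python
import random
import statistics

def median_gap(times):
    """Median of the gaps between consecutive arrivals."""
    ordered = sorted(times)
    return statistics.median([b - a for a, b in zip(ordered, ordered[1:])])

def uniform_times(n, duration, rng):
    """Null model: n arrival times, uniform on [0, duration]."""
    return [rng.uniform(0, duration) for _ in range(n)]

def clumpy_times(n_groups, group_size, duration, spread, rng):
    """Synthetic clumps: each group starts at a random time; members trail by up to `spread`."""
    times = []
    for _ in range(n_groups):
        start = rng.uniform(0, duration - spread)
        times += [start + rng.uniform(0, spread) for _ in range(group_size)]
    return times

rng = random.Random(2)
duration = 600.0  # assumed session length in seconds

# Stand-in for the observed data (synthetic, for illustration only).
observed = clumpy_times(n_groups=20, group_size=3, duration=duration, spread=2.0, rng=rng)
observed_median = median_gap(observed)

# Sampling distribution of the median gap under the null hypothesis.
reps = 100
null_medians = [median_gap(uniform_times(len(observed), duration, rng))
                for _ in range(reps)]

# How often does pure randomness give a median at least as small as ours?
fraction = sum(m <= observed_median for m in null_medians) / reps
print("observed median gap:", observed_median)
print("smallest null median:", min(null_medians))
print("fraction of null medians at or below observed:", fraction)
```

A fraction near zero means the observed median sits far out in the lower tail of the null distribution, which is the sense in which the random hypothesis is not plausible.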
We’ve chosen the median for two reasons: it makes sense, and it’s the first thing we tried that worked. What other measures might be just as good, or better, at characterizing the clumpiness? What else can we learn from the distribution? Here’s one telling question: suppose every traveler was in a couple, and no one traveled alone. What would the distribution of gaps look like? What would its median be? Sometimes, looking at extreme cases like this is a powerful tool.
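If you would rather experiment than reason it out, here is one hedged way to set up the all-couples case in Python. The model is an assumption of mine: each couple arrives at a uniformly random time, and the partner follows within `max_lag` seconds. The number of couples, the session length, and the lag are arbitrary illustrative values.

```python
import random
import statistics

def couples_times(n_couples, duration, max_lag, rng):
    """Each couple: one arrival time plus a partner up to max_lag seconds later."""
    times = []
    for _ in range(n_couples):
        first = rng.uniform(0, duration - max_lag)
        times += [first, first + rng.uniform(0, max_lag)]
    return sorted(times)

def gaps_from_previous(times):
    """Gaps between consecutive arrivals, in time order."""
    ordered = sorted(times)
    return [b - a for a, b in zip(ordered, ordered[1:])]

rng = random.Random(3)
max_lag = 1.0  # assumed: partners pass within one second of each other
times = couples_times(n_couples=30, duration=600.0, max_lag=max_lag, rng=rng)
gaps = gaps_from_previous(times)

short = sum(g <= max_lag for g in gaps)
print(f"{short} of {len(gaps)} gaps are at most {max_lag} s")
print("median gap:", statistics.median(gaps))
```

Try changing `max_lag` and the number of couples, and watch where the median lands relative to the short, within-couple gaps.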
When we did the stars, a completely different approach, putting the data into bins, worked well. Let’s do that for this one-dimensional, real data set. Next post.