In the last three posts we’ve discussed clumpiness. Last time we studied people walking down a concourse at the big Houston airport, IAH, and found that they were clumped. We used the gaps in time between these people as our variable. Now, as we did two posts ago with stars, we’ll look at the same data, but by putting them in bins. To remind you, the raw data:

Suppose we take our data and, instead of looking at all the individual times, count how many people pass me every 10 seconds. Then I’ll have one number for every 10 seconds; what will the distribution of those numbers look like?
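A minimal sketch of that binning step. The arrival times here are made up for illustration; the real IAH data isn’t shown in this post:

```python
# Count arrivals in consecutive 10-second bins.
# These arrival times (in seconds) are invented, NOT the real IAH data.
arrivals = [1.2, 3.5, 4.1, 12.0, 15.7, 16.2, 16.9, 28.3, 41.0, 44.4, 47.8, 59.9]

BIN_WIDTH = 10  # seconds per bin
n_bins = int(max(arrivals) // BIN_WIDTH) + 1

counts = [0] * n_bins
for t in arrivals:
    counts[int(t // BIN_WIDTH)] += 1

print(counts)  # one count per 10-second bin
```

Each entry of `counts` is one of those “one number for every 10 seconds” values; the distribution of those entries is what the histograms below show.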

Here is that distribution for the 58 ten-second bins. This is “p” data only, that is, people walking right to left. Anticipating what we will need, we’ll also show the same distribution for randomly-generated data, and plot the means. We also plot the variances, which is weird, but lets us see the numerical values:

The means are identical, which is good, because we’ve intentionally constructed the random data to have the same number of cases over the same period of time.

But the two distributions look different, in the way we expect if we have clumping: the real data have more of the heavily populated bins (6, 7, 8) where the clumps are, and more of the lightly populated bins (0, 1) where the longer gaps are. This is less dramatic than it was with the stars. But we have much less data, and the data are real, which always makes things harder.

At any rate, more light bins and more heavy bins means that it makes sense to characterize clumpiness by using a measure of *spread*. If we pick variance as we did before (following the star reasoning and the lead from Neyman and Scott), we see that the real data have a variance of 3.25, which is much bigger than the mean of 2.17. The random data have a variance of 1.86, relatively close to the mean. That is what we expect: random arrival counts should be (roughly) Poisson distributed, and for a Poisson distribution the variance equals the mean, though you don’t need to know that to do this analysis.
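A sketch of that variance-versus-mean check, again with made-up bin counts (the post’s real counts give a mean of 2.17 and a variance of 3.25):

```python
import statistics

# Toy bin counts, invented for illustration (not the real 58-bin IAH counts)
counts = [3, 4, 1, 0, 3, 1]

mean = statistics.mean(counts)
var = statistics.variance(counts)  # sample variance; for Poisson data, var ≈ mean

print(mean, var)  # if var is much bigger than mean, suspect clumping
```

Note that `statistics.variance` uses the sample (n−1) variance; `statistics.pvariance` would give the population version, and which one the post used isn’t stated.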

Of course, the random data will be different every time we re-randomize. Here are 100 variances for 100 runs of random data, with the “real” data’s variance plotted as well:

Yay! Does a variance of 3.25 give us evidence that the real data are not random? Yes. We can’t tell from this alone that it is because of clumping, but it is in the direction consistent with clumping.

We can do as we did before, and define an index of clumpiness the same way: the variance of the bin counts divided by the mean. For these data, and this bin size, we get 3.246/2.172, or about 1.5.
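That index is a one-liner; here is a sketch, with the post’s own numbers plugged in:

```python
import statistics

def clumpiness_index(counts):
    """Variance of the bin counts divided by their mean.
    Random (Poisson-ish) data give a value near 1; clumping pushes it above 1."""
    return statistics.variance(counts) / statistics.mean(counts)

# Using the post's values: variance 3.246, mean 2.172
print(round(3.246 / 2.172, 2))  # about 1.5
```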

Can we apply this idea to the made-up heads and tails that we started this whole sequence with in our first post about clumpiness? Of course. Next post!

**Extra Topic: What about Bin Sizes?**

Gasstation… commented wisely that it may matter what your cell size is. You bet. It’s the same issue as when you make a histogram, or do any gridding or binning; the size you use may create or obscure patterns inadvertently. If you’re a student thinking about studying clumpiness as a project, this is a *great* topic to explore: how do the size and placement of your grid affect the results of your investigation?

On our first time through this line of reasoning, however, I did not address these issues. They’re really interesting when you’re doing the investigation yourself, but will probably be pretty tedious and distracting if you’re just reading about it.

But just so we see, I ran the analysis with different bin sizes:

The clumpiness becomes increasingly unstable with large bins, probably because the number of bins declines; that is, there are fewer and fewer bin counts to compute the variance from.
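A sketch of that bin-size sweep, using the same made-up arrival times as before (not the real IAH data):

```python
import statistics

def clumpiness_index(times, bin_width):
    """Bin the arrival times, then return variance / mean of the bin counts."""
    n_bins = int(max(times) // bin_width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int(t // bin_width)] += 1
    return statistics.variance(counts) / statistics.mean(counts)

# Invented arrival times, for illustration only
times = [1.2, 3.5, 4.1, 12.0, 15.7, 16.2, 16.9, 28.3, 41.0, 44.4, 47.8, 59.9]

for w in [5, 10, 20, 30]:
    print(w, round(clumpiness_index(times, w), 2))
```

With only a handful of bins at the largest widths, you can see why the index jumps around: the variance is being estimated from very few numbers.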

Bin size is not the only issue, however. We can also pay attention to where the bins *start*. That is, the “picket fence” of the bins can slide left and right—and that will change the counts in the bins. The “offset” could range from 0 to 10 seconds for our 10-second bins. We can calculate the index of clumpiness for each offset:

Isn’t THAT interesting! How should we use this? It’s not clear. Informally, we could say that an index of about 1.5 looks reasonable as a summary of the possible values. There’s another issue as well: as you slide the picket fence of binning over the data, different numbers of bins cover the domain. Therefore the two *end* bins will often have artificially low counts as well, since they extend into a data-free zone. So we should probably discard the end bins, but that’s too much trouble for us today.
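The offset sweep above can be sketched like this, again with made-up times. Shifting the start of the first bin left by `offset` seconds slides the whole picket fence (and, as noted, the end bins are kept here even though they may be artificially light):

```python
import statistics

def clumpiness_index(times, bin_width, offset):
    """Bin the times with the first bin's left edge shifted left by `offset` seconds,
    then return variance / mean of the bin counts."""
    start = -offset
    n_bins = int((max(times) - start) // bin_width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int((t - start) // bin_width)] += 1
    return statistics.variance(counts) / statistics.mean(counts)

# Invented arrival times, for illustration only
times = [1.2, 3.5, 4.1, 12.0, 15.7, 16.2, 16.9, 28.3, 41.0, 44.4, 47.8, 59.9]

for offset in range(0, 10, 2):
    print(offset, round(clumpiness_index(times, 10, offset), 2))
```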