The Index of Clumpiness, Part Two

Last time, we discussed random and not-so-random star fields, and saw how we could use the mean of the minimum distances between stars as a measure of clumpiness. The smaller the mean minimum distance, the more clumpy.

Star fields of different clumpiness, from K = 0.0 (no stars are in the clump; they’re all random) to K = 0.5 to K = 1.0 (all stars are in the big clump)

What other measures could we use?

It turns out that the Professionals have some. I bet there are a lot of them, but the one I dimly remembered from my undergraduate days was the “index of clumpiness,” made popular, at least among astronomy students, by Neyman (that Neyman), Scott, and Shane in the mid-1950s. They were using Shane (& Wirtanen)’s catalog of galaxies to study how the galaxies cluster. We are simply asking, is there clustering? They went much further, and asked, how much clustering is there, and what are its characteristics?

They are the Big Dogs in this park, so we will take lessons from them. They began with a lovely idea: instead of looking at the galaxies (or stars) as individuals, divide up the sky into smaller regions, and count how many fall in each region.

We can do that. We divide that square of sky into a 10-by-10 grid. 100 cells. Now, instead of dealing with 1000 ordered pairs of numbers, we have 100 integers, the numbers of dots in each cell. Much easier.
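If you want to try the counting step yourself, here is a minimal sketch in Python with NumPy. It assumes the sky is the unit square and that the stars are just arrays of x and y coordinates; the function name `cell_counts` is made up for this illustration and is not from the original analysis.

```python
import numpy as np

def cell_counts(x, y, grid=10):
    """Count the points in each cell of a grid-by-grid partition of the unit square."""
    counts, _, _ = np.histogram2d(x, y, bins=grid, range=[[0, 1], [0, 1]])
    return counts.astype(int).ravel()  # flatten the grid into one list of counts

# A K = 0 field: 1000 stars placed uniformly at random (assumed unit square).
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

counts = cell_counts(x, y)
print(counts.shape, counts.sum(), counts.mean())
```

Whatever the seed, the 100 counts have to add up to 1000, so their mean is 10.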

This shows a star field with K = 0, divided into a grid of 100 cells. The graph is the distribution of counts from those cells. Notice how the distribution centers at about 10. This should make sense: we have 1000 stars distributed among 100 cells: the average number of stars in each cell is (exactly) 10.

Now let’s look at the same things but for K = 1.0, that is, very clumped:

Wow. The distribution sure is different; if you think about it, it’s clear why. The densest cells are lots denser than in the uniform K = 0 case, and to make up for that, there are a huge number of cells with very few stars.

To do inference on this, though, we need a single number (or measure, or statistic) that characterizes the distribution. Should we use the mean, as we did with minimum distance? NO! The mean is 10 in both cases!

There are many possible measures to take, such as the maximum count or maybe the 10th-highest count (roughly the 90th percentile). Those might work, and you should try them.
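If you do want to try those, a sketch might look like the following. The helper names `max_count` and `tenth_highest` are invented here, and the field is again assumed to be 1000 uniform random stars on the unit square.

```python
import numpy as np

def max_count(counts):
    """Largest number of stars in any one cell."""
    return int(np.max(counts))

def tenth_highest(counts):
    """The 10th-largest count; with 100 cells, roughly the 90th percentile."""
    return int(np.sort(np.ravel(counts))[-10])

# Assumed setup: a K = 0 field of 1000 uniform stars on a 10-by-10 grid.
rng = np.random.default_rng(1)
counts, _, _ = np.histogram2d(rng.random(1000), rng.random(1000),
                              bins=10, range=[[0, 1], [0, 1]])
counts = counts.ravel()

print(max_count(counts), tenth_highest(counts))
```

To use either one for inference you would still need its distribution over many random fields, just as with any other statistic.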

But the big dogs—Neyman and Scott—used a measure of spread. You could use standard deviation or IQR. But they used variance, for a really sweet reason, which we’ll get to later.

Meanwhile, recall that the variance is the square of the standard deviation. It’s the mean square deviation from the overall mean.

In our case, we have 100 numbers: 100 counts of how many stars fell in a particular cell. The overall mean is 10, as we have discussed. To compute the variance, go to each cell; figure out how far that count is from 10; square that amount; add up the 100 squares; then divide by 100 (or by 99, if you’re a sample-variance person).
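That recipe translates almost word for word into code. This is only an illustration on a tiny made-up list of counts; the name `count_variance` and the `ddof` switch (0 to divide by the number of cells, 1 to divide by one fewer) are choices made for this sketch.

```python
import numpy as np

def count_variance(counts, ddof=0):
    """Mean squared deviation of the counts from their own mean."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()                        # 10 for 1000 stars in 100 cells
    squared_deviations = (counts - mean) ** 2   # how far each cell is from the mean, squared
    return squared_deviations.sum() / (len(counts) - ddof)

example = [8, 12, 10, 10]                # a toy "field" with only four cells
print(count_variance(example))           # divide by 4
print(count_variance(example, ddof=1))   # divide by 3
```

With `ddof=0` this agrees with `np.var`, so in practice you can just call that.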

Let’s look at K = 0. Of course, every random star field will have a different variance. So we made many (200) random star fields, and for each one, counted the stars in the 100 cells to get a variance for each field. The “typical” variance was about 10. In the 200 fields, about 5% had a variance below 7.7, and 5% were above 12.4. That is, 90% of all K = 0.0 random star fields had a variance in the interval [7.7, 12.4].
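Here is one way you might rerun that experiment. It is a sketch, not the original simulation: it assumes uniform stars on the unit square, a 10-by-10 grid, and the divide-by-100 variance, and the function name `field_variance` and the seed are arbitrary.

```python
import numpy as np

def field_variance(rng, n_stars=1000, grid=10):
    """Make one K = 0 field and return the variance of its cell counts."""
    x = rng.random(n_stars)
    y = rng.random(n_stars)
    counts, _, _ = np.histogram2d(x, y, bins=grid, range=[[0, 1], [0, 1]])
    return counts.var()  # divides by the number of cells

rng = np.random.default_rng(2014)
variances = np.array([field_variance(rng) for _ in range(200)])

# 5th percentile, median, and 95th percentile of the 200 variances.
lo, mid, hi = np.percentile(variances, [5, 50, 95])
print(f"5th: {lo:.1f}  median: {mid:.1f}  95th: {hi:.1f}")
```

With only 200 fields the endpoints wobble a bit from run to run, so expect something in the neighborhood of the interval quoted above rather than an exact match.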

We computed this interval for many values of K.

5th and 95th percentiles of the variance for various values of K. On the left, the whole graph; on the right, we zoom in to small values of K.

You can also see that a typical variance at K = 1.0 is above 200, which makes sense looking back at the super skewed and spread out distribution above. (Remember, that 200+ is the square of the standard deviation.)

Let’s look in detail at an intermediate case. In fact, let’s look at K = 0.14, the value that, last time, we could not distinguish from randomness using our mean-minimum technique.

Here is a star field at K = 0.14:


And now, its distribution of counts, with the distribution for a K = 0.0 field for comparison:


You can see that there are four cells on the right that are higher than anything at K = 0.0, and if you squint, you might believe that the peak, a typical number of stars for most of the cells, is a little lower, maybe 9 instead of 10. (And that makes sense; if you drain off 40+ “extra” stars for the center of the cluster, the cells on the outside will be depleted.)

And if we compute the variance? 12.9, which is outside that [7.7, 12.4] “90%” window we computed from doing lots of iterations with K = 0.

That is, with this measure, we can detect the clumpiness, whereas we could not with the one we invented in the previous post.
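To play with this kind of detection you need some way to make a clumped field. The post does not say how the clump is shaped, so the sketch below invents one: a fraction K of the stars are drawn from a round Gaussian blob in the middle of the unit square, and the rest are uniform. The blob's center and width (`sigma=0.1`), and the names `clumped_field` and `count_variance`, are all assumptions, so the variances you get will not match the numbers quoted here. Only the [7.7, 12.4] window comes from the post.

```python
import numpy as np

def clumped_field(rng, k, n_stars=1000, center=(0.5, 0.5), sigma=0.1):
    """Hypothetical model: a fraction k of stars in a Gaussian clump, the rest uniform."""
    n_clump = int(round(k * n_stars))
    background = rng.random((n_stars - n_clump, 2))
    clump = np.empty((0, 2))
    while len(clump) < n_clump:
        draw = rng.normal(center, sigma, size=(n_clump, 2))
        inside = np.all((draw >= 0) & (draw < 1), axis=1)  # keep only stars that land in the square
        clump = np.vstack([clump, draw[inside]])
    return np.vstack([background, clump[:n_clump]])

def count_variance(points, grid=10):
    """Variance of the cell counts on a grid-by-grid partition of the unit square."""
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=grid, range=[[0, 1], [0, 1]])
    return counts.var()

rng = np.random.default_rng(14)
for k in (0.0, 0.14, 1.0):
    v = count_variance(clumped_field(rng, k))
    verdict = "outside" if (v < 7.7 or v > 12.4) else "inside"
    print(f"K = {k:.2f}: variance {v:.1f} is {verdict} the K = 0 window")
```

How easily a given K gets flagged depends heavily on how tight the assumed clump is, so try a few values of `sigma`.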

Whew. That had some hard ideas. As teachers, we look for easier approaches. And good old Pólya often suggested looking for lower dimensionality. Great idea! Let’s do this in one dimension instead of two. Next time.


Coda: Poisson Distribution and Some Formulas

Now. Why is variance cool here? Turns out that if you place things randomly into bins and look at the distribution of counts (which is what we did here for K = 0.0), the counts follow, to a very good approximation, a Poisson distribution. Here’s the formula for the distribution for a Poisson-ly distributed random variable X:

f(k, \lambda) = \mathrm{Pr}(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

In our case, \lambda = 10. So the probability of getting 12 (say) in a cell is

f(12, 10) = \mathrm{Pr}(X = 12) = \frac{10^{12} e^{-10}}{12!} \approx 0.09.

You can check that it makes sense in the distribution above. But what you really need to know is that the mean of this distribution is \lambda and so is the variance. That is, we know what the variance is supposed to be if the stars are random, and that variance is just the mean: the total number of stars divided by the number of cells. In our case, 10.
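If you would rather check those claims numerically than take them on faith, a few lines of plain Python will do it. The function name `poisson_pmf` is just a label for this sketch, and the sums stop at k = 59 because the probabilities beyond that are negligible when \lambda = 10.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson random variable with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 10
print(round(poisson_pmf(12, lam), 4))  # chance of exactly 12 stars in a cell

# Mean and variance straight from the definition, using a truncated sum.
ks = range(60)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(round(mean, 6), round(variance, 6))
```

Both the mean and the variance come out at \lambda, which is what makes the variance such a convenient yardstick.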

And this means that for any field of 1000 stars cut into 100 cells, where the counts in the cells are c_i, we can define the index of clumpiness \mathcal{R} as a ratio:

\mathcal{R} = \frac{\textrm{the variance of those numbers}}{\textrm{the theoretical variance with no clumping}} = \frac{\mathrm{Var}(c_i)}{10}.

If a pattern is non-clumpy and random, i.e., Poisson, this index \mathcal{R} will be close to 1.0; when it’s really clumpy, \mathcal{R} will be large.

To generalize: if we have N stars in n cells, the average count is \bar{c} = N / n. With some algebra, the general formula becomes:

\mathcal{R} = \frac{\mathrm{Var}(c_i)}{\bar{c}} = \frac{1}{N} \sum_{i=1}^n{(c_i-\bar{c})^2}.
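As a sketch of how the index could be computed from a list of cell counts: the function names below are made up, and the variance is the divide-by-n version, which is what makes the two forms of the formula agree.

```python
import numpy as np

def clumpiness_index(counts):
    """Variance of the cell counts divided by their mean."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()  # var() divides by n, the number of cells

def clumpiness_index_sum_form(counts):
    """Same index, written as (1/N) times the sum of squared deviations."""
    counts = np.asarray(counts, dtype=float)
    total_stars = counts.sum()           # N
    return ((counts - counts.mean()) ** 2).sum() / total_stars

# Extreme case: all 1000 stars land in a single cell out of 100.
one_cell = [1000] + [0] * 99
print(clumpiness_index(one_cell))            # 990.0
print(clumpiness_index_sum_form(one_cell))   # 990.0
```

For a random (Poisson-like) field the same function should return something near 1.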

More Thoughts

Although this is not part of the introductory stats curriculum, it’s pretty interesting, and maybe more accessible, with all our technology, than it used to be. I’d be curious to know more applications. Scientists must look at clumpiness in all sorts of contexts. Traffic, queueing, ecological models, who knows? (You do. Let me know.)


Published by

Tim Erickson

Math-science ed freelancer and sometime math teacher. In 2014–15, at Mills College in Oakland, California.

5 thoughts on “The Index of Clumpiness, Part Two”

  1. This index only works well if the clumps are much larger than the grid. You probably need to compute the index with several different grid sizes to detect patterns like periodic clumping (which may look like random scatter if the grid is matched to the period).

  2. Depends on how you set up the simulation, that is, how strong the signal is. My “K” is the fraction of stars that are in the clump. If the clump is small, and they all fall in one square, you get a big variance, big enough to reject.
