# The Index of Clumpiness, Part One

1000 points. All random. The colors indicate how close the nearest neighbor is.

There really is such a thing. Some background: The illustration shows a random collection of 1000 dots. Each coordinate (x and y) is a (pseudo-)random number in the range [0, 1) — multiplied by 300 to get a reasonable number of pixels.
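A field like this takes only a few lines to generate. Here is a minimal Python sketch; the function name `random_field` and the seed are made up for illustration, and this is not necessarily how the picture above was produced.

```python
import random

def random_field(n=1000, size=300, seed=None):
    """Return n (x, y) points; each coordinate is a uniform number
    in [0, 1) multiplied by size to get pixel coordinates."""
    rng = random.Random(seed)
    return [(rng.random() * size, rng.random() * size) for _ in range(n)]

points = random_field(seed=1)  # the seed only makes the run repeatable
print(len(points), points[0])
```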

The point is that we can all see patterns in it. Me, I see curves and channels and little clumps. If they were stars, I’d think the clumps were star clusters, gravitationally bound to each other.

But they’re not. They’re random. The patterns we see are self-deception. This is related to an activity many stats teachers have used, in which the students are to secretly record a set of 100 coin flips, in order, and also make up a set of 100 random coin flips. The teacher returns to the room and can instantly tell which is the real one and which is the fake. It’s a nice trick, but easy: students usually make the coin flips too uniform. There aren’t enough streaks. Real randomness tends to have things that look non-random.
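One way to see how streaky real flips are is to simulate them and measure the longest run of identical outcomes. This is a small sketch of that idea, not part of the classroom activity itself; `longest_streak` is a hypothetical helper name.

```python
import random

def longest_streak(flips):
    """Length of the longest run of identical consecutive outcomes."""
    if not flips:
        return 0
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

rng = random.Random(0)
# Longest streak in each of 1000 simulated sets of 100 fair coin flips
streaks = [longest_streak([rng.choice("HT") for _ in range(100)])
           for _ in range(1000)]
mean_streak = sum(streaks) / len(streaks)
print(mean_streak)
```

Comparing the longest streak in each student list against numbers like these is one plausible way to spot the made-up one.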

Here is a snap from a classroom activity:

Scheaffer et al., 1996. Activity-Based Statistics, p. 66.

That whole activity is worth some discussion, but not today. The question is, suppose we’ve learned our lesson: We know that streakiness can be random. We know that the stars can show clumps even when they’re random.

Now suppose there really is some clumping in the pattern of stars. How would we tell?

Could we do a test? Sure. But in order to do a traditional statistical test, we need a single number that characterizes how clumpy the pattern is.

If we had such a number (a measure, a statistic) we could then do the randomization dance: Make a lot of truly random patterns and compute the statistic for each one. Assemble those stats into a sampling distribution. Then compute that same quantity for our actual pattern. If that test statistic falls outside the sampling distribution we made, it’s implausible that the pattern is random.
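The dance itself does not depend on which statistic we pick, so it can be sketched generically. In the Python below, `sampling_distribution` and `fraction_at_or_below` are hypothetical names, and the "data" is a toy example (numbers that are deliberately not uniform, with the plain mean as the statistic), not a star field.

```python
import random
from statistics import mean

def sampling_distribution(simulate_null, statistic, n_sims=500):
    """Compute the statistic on n_sims simulated null patterns."""
    return sorted(statistic(simulate_null()) for _ in range(n_sims))

def fraction_at_or_below(null_stats, observed):
    """Share of the null statistics that are <= the observed one."""
    return sum(s <= observed for s in null_stats) / len(null_stats)

rng = random.Random(2)

# Null model for the toy: 50 uniform random numbers in [0, 1)
null_stats = sampling_distribution(
    lambda: [rng.random() for _ in range(50)], mean)

# Toy "actual pattern": cubing pushes the values toward zero
observed = mean([rng.random() ** 3 for _ in range(50)])

print(observed, fraction_at_or_below(null_stats, observed))
```

A fraction very near 0 or 1 says the observed statistic sits at or beyond the edge of the sampling distribution.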

To decide what statistic would make sense, let’s look at some clumpy star fields. There are a lot of ways to make them clumpy, so I chose to make a single clump, right in the middle. I control the clumpiness with a parameter I call K (for clump), which is zero for no clumpiness, and 1 for total clumpiness (i.e., all stars are in the cluster).

Here are K = 0, 0.5, and 1:

What statistic could you use? As usual, you should go off and think about this. But because I’m trying to record my thinking, I’m going to tell you what I came up with.
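The post does not spell out how the clump is built, so the following Python sketch rests on an assumption: each star independently lands in the cluster with probability K, and cluster stars are scattered around the center with a Gaussian spread. The name `clumpy_field` and the `cluster_sd` parameter are invented for illustration.

```python
import random

def clumpy_field(n=1000, k=0.5, size=100.0, cluster_sd=5.0, seed=None):
    """n stars in a size-by-size field. With probability k a star is
    placed in a central Gaussian clump (an assumed shape); otherwise
    it is placed uniformly at random."""
    rng = random.Random(seed)
    center = size / 2
    stars = []
    for _ in range(n):
        if rng.random() < k:  # k = 0: never clumped; k = 1: always clumped
            stars.append((rng.gauss(center, cluster_sd),
                          rng.gauss(center, cluster_sd)))
        else:
            stars.append((rng.uniform(0, size), rng.uniform(0, size)))
    return stars

fields = {k: clumpy_field(k=k, seed=3) for k in (0.0, 0.5, 1.0)}
print({k: len(stars) for k, stars in fields.items()})
```

With a different clump shape or size the pictures would change, but K would keep the same meaning: the share of stars that belong to the cluster.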

Here’s one idea: compute, for every star, the distance to its nearest neighbor, that is, the closest other star. Then I would look at the distribution of those nearest-neighbor distances. It stands to reason that if the stars are clumped, the nearest neighbors would be closer, so the distributions would be centered lower. Maybe the mean of that distribution would work as a measure of clumpiness.

Distributions of distances to nearest neighbors, for K = 0, 0.65, and 1.0. 1000 stars, dimensions of the field are 100 by 100.
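Computing that statistic is straightforward if you are willing to check every pair of stars. This Python sketch uses brute force, which is fine for 1000 stars; the function and variable names are made up for illustration.

```python
import math
import random
from statistics import mean

def nearest_neighbor_distances(stars):
    """For each star, the distance to its nearest other star.
    Brute force over all pairs; needs at least two stars."""
    return [
        min(math.dist(p, q) for j, q in enumerate(stars) if j != i)
        for i, p in enumerate(stars)
    ]

# Tiny check by hand: (3, 4) and (3, 5) are 1 apart, (0, 0) is 5 from (3, 4)
print(nearest_neighbor_distances([(0, 0), (3, 4), (3, 5)]))  # [5.0, 1.0, 1.0]

# A uniform random field: 1000 stars, 100 by 100
rng = random.Random(5)
uniform_field = [(rng.uniform(0, 100), rng.uniform(0, 100))
                 for _ in range(1000)]
clumpiness_stat = mean(nearest_neighbor_distances(uniform_field))
print(clumpiness_stat)
```

As a sanity check, a uniform field with density ρ stars per unit area has an expected nearest-neighbor distance of 1/(2√ρ) if you ignore edge effects. Here ρ = 0.1, which gives about 1.58.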

Sure enough! The means go down as the clumpiness goes up. Look at K = 0. Does that mean that, in a uniform random field, the mean distance to the nearest neighbor is always 1.65 units? No: because there is randomness, the means will fluctuate. So we did this 100 times and recorded the means. And we found the 5th and 95th percentiles. This gives us a 90% “plausibility interval” for K = 0, which is roughly from 1.55 to 1.65. That is, our example (at 1.65) is a little unusual.

Skipping the step where we actually see those sampling distributions (if you’re doing this for a school project, don’t skip this step!), we can get the next graph, which shows the 5th and 95th percentiles for more values of K. Reading this graph, you can see that if K = 0.5, you’re nearly 90% certain to be outside that plausibility interval. (Oooh! Power!)
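The plausibility interval for K = 0 can be sketched in Python like this. The names are hypothetical, and the demo uses 200 stars instead of 1000 only so the pure-Python brute force finishes quickly, so its interval will not match the 1.55 to 1.65 quoted above.

```python
import math
import random
from statistics import quantiles

def mean_nn_distance(stars):
    """Mean distance from each star to its nearest other star."""
    total = 0.0
    for i, p in enumerate(stars):
        total += min(math.dist(p, q) for j, q in enumerate(stars) if j != i)
    return total / len(stars)

def plausibility_interval(n_stars, size=100.0, n_sims=100, seed=None):
    """5th and 95th percentiles of the mean nearest-neighbor distance
    over n_sims uniform random (K = 0) fields."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        field = [(rng.uniform(0, size), rng.uniform(0, size))
                 for _ in range(n_stars)]
        means.append(mean_nn_distance(field))
    cuts = quantiles(means, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return cuts[0], cuts[-1]

# 200 stars rather than 1000, only to keep the run short
low, high = plausibility_interval(n_stars=200, seed=4)
print(low, high)
```

Running the same loop with clumpy fields instead of uniform ones, once per value of K, would give the percentile bands in the graph.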

At a more modest value of K, however, it’s not so clear. The next illustration shows a field where K = 0.14. Its mean minimum distance is about 1.61, which is in the middle of the plausibility interval for K = 0, the random (null) case. According to our procedure, it’s completely plausible that this field is random.

Field for K = 0.14. Here, the mean minimum distance is about 1.61.

If you know what to look for, you can kinda sorta see the cluster. But our statistic can’t find it. And if you didn’t know what to look for—like if you didn’t know the cluster was right in the middle—this field would not look any different than the K = 0 example up above.

Can we do better? You bet. Next post. 