Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.
So there ought to be data science education. But what should we teach, and how should we teach it?
Let me address the second question first. There are at least three approaches to take:
I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:
We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.
That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.
You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these:
World trade |
Crime in Oakland |
Popularity of names beginning with “Max” |
What distinguishes this obvious data science from a good school data activity? Consider the graph (at right) of data from physics. Students rolled a cue ball down a ramp from various distances and measured the speed of the ball at the foot of the ramp.
The student even has a model and the corresponding residual plot. Great data. But not data science. Why not?
To answer that question, let’s look at one way some people have defined data science. The illustrations below, from Conway 2010 and Finzer 2013, show the thrust behind several definitions, namely, that data science lies in a region where content understanding, math and stats, and computer chops (Conway calls it “hacking”) overlap.
If we apply the left Venn diagram to the physics example, we can see that it belongs in “traditional research”: the work uses and illuminates physics principles and concepts from mathematics and statistics (the residuals, the curve-fitting) but does not require substantial computational skills.
But saying that is somehow not enough. The physics data example fails what we might call a “sniff test” for data science, but what, more specifically, does this sniff test entail? What are the ingredients that separate “regular” data from data science?
In our work over the last year and a half, we have begun to identify some of the telltale ingredients that make an activity smell like data science. And they come in (at least) three categories:
Although this formulation is certainly incomplete, it may be useful. Let’s look at each of these in turn.
Our obvious data science examples share a sense of being awash in data. The word brings up the image of a possibly tempestuous sea of data, or even waves of data breaking over us. This is subjective but still an evocative litmus test. More specifically, being awash might include:
It’s useful to ask what data science students should be doing with data. Again, an incomplete list:
What aspects of a dataset are characteristic of situations where we use data science? This is related to “awash” but more specific:
If we look again at the physics graph, we can see where it falls short. What did the students have to do? Make a graph of the two variables against one another. Plot a model. There was no search for a pattern, no sense of being overwhelmed by data, no subsets, no reorganization. The data are decidedly ruly.
That is not to say that working with this dataset is a bad idea! It’s great for understanding physics. But it’s not data science. And it points out the essential meaning of our sniff test: in this lab, students work with data, but they don’t have to do anything with the data to make sense of it. It’s science, but not data science.
Let’s alter that rolling-down-a-ramp activity. The next figure shows a graph from the same sort of data, but with an important change.
This spray of points has a very different feel (or smell) than the original one. We can see that, in general, the farther you have rolled, the faster you go, but the relationship is not clean. What could be making the difference? A look at the dataset shows that we now have more attributes, including wheel radius, the mass of the wheels, the mass of the cart, and, importantly, the angle of the ramp.
The ramps are at different angles? Maybe that makes a difference. Suppose we color the points by the angle of the ramp:
Now we see that the ramps range from about 5 to 45 degrees, and that, as we might expect, the steeper the ramp, the faster you go. We might be able to make predictions from this, but it would be nice to construct a graph like the original one. So we make a second graph, of just angle, and select the cases where the ramp angle is close to 20°.
Notice how the selected points show the square-rooty pattern of the original cue-ball graph. We can make a model for that specific ramp angle and plot it:
Let us re-apply our sniff test.
Awash in data? Look back at the first plot in this sequence with the eye of a first-year physics student. Confusing? Hard to tell where to start? You bet.
What are the data moves? We added a third dimension—color—to a plot to make a visualization we might not ever have seen before, and had to make sense of it. Then we found a way to look at a subset of the data, finding a clearer pattern there than in the dataset as a whole.
And data properties? We have more than two attributes, more than a couple dozen points. And depending on your experience, the data seem a bit unruly.
This may not reek of data science, but it has more than just a whiff.
But wait: at the end, we got a graph with a function, a good model, just like we had originally. That’s a lot of extra work just to get the same graph.
Reasonable teachers ask, is it worth the class time to learn all that computer stuff in addition to the physics? In fact, the way we set up a typical lab is designed precisely to avoid the problems of the messy data. Why have students struggle with a situation where the variables are not controlled? Why not just teach them to control variables as they should—and get better-organized data, intentionally?
Let me give a few responses:
It’s tempting to treat computer-data skills as an optional extra. Doing so creates an equity problem, because some people—people who look like me, mostly white and Asian boys—play with this stuff in our free time. Don’t let the playing field stay tilted.
That’s enough for now, but let us foreshadow where to go next:
If something else bothers you about this example, you’re not alone. I will describe what bothered me, and how that bothered Andee Rubin in a different way, soon.
Conway, Drew. 2010. Blog post at http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
Finzer, William. 2013. “The Data Science Education Dilemma.” Technology Innovations in Statistics Education 7, no. 2. http://escholarship.org/uc/item/7gv0q9dc.
Jones, Seth. 2017. Discussion paper for DSET 2017. https://docs.google.com/document/d/1DbgPq8mBwHOXmagHKQ7kNLP49PujyPq-tKip5_HeD8E
And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen all the time.
This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.
In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.
(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.
My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.
Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.
Okay, Ph.D. students. That’s a good nugget for a dissertation.
Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:
(Also, get a micrometer on eBay and a sweet 0.1 gram food scale. They’re about $15 now.)
Long ago, I wrote about coins and said I would write about hexnuts. I wrote a book chapter, but never did the post. So here we go. What prompted me was thinking different kinds of models.
I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?
A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The more-or-less cubic part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge: namely, that the hexnuts are geometrically similar and made from the same material, so they all have the same density.
Most students, however, won’t have that instant insight. So we begin by asking them to predict. They will draw graphs that increase, which is a good start, and many student graphs will curve upwards. Great! Then when they measure, students see something like the graph. (Here is a Desmos document with the data.)
They will look at this graph, and many will say it looks like a parabola. Which it does. Kinda. But if you put a quadratic of the form y = kx² on the graph, you will never get it to fit well, no matter what value of k you choose.
We’re doing this quickly, so let’s skip ahead to say that when you try y = kx³, things go much better (though they aren’t perfect). Then you can explore why you think it “goes like” x cubed. You might get there by discussing the quarter-inch nut and the half-inch nut, and passing out samples, and asking, why isn’t the half-inch nut twice the weight of the quarter-inch nut? And so forth.
Now the discussion can go in two interesting and different directions. One goes to the wonderful snook data where you can do a similar exploration about fish, and (especially if you do a log transform) you discover that the best exponent for these fish is about 3.24 rather than 3.00. And you gotta wonder how that could be.
But we’re not going there. Instead we start with a great question that astute readers must have asked, but we have avoided until now: what do we mean by the size of a hexnut? The answer is, the diameter of the bolt that it fits. Right? A quarter-inch nut is the one that fits a quarter-inch bolt. It’s bigger (duh) than a quarter inch across.
So let’s consider modeling the geometry of the situation. A hexnut is a hexagonal prism, right? And it has a circular hole.
We can measure more than the weight. Suppose we measure the distance across the faces (f) and the thickness (t) as well. Then the area of the relevant hexagon is (√3/2)f², and the resulting volume of the solid—taking the hole into account—is

V = [ (√3/2)f² − π(d/2)² ] · t,

where d is the diameter of the hole, which we take to be the bolt size.
So we can model the mass with m = ρV, where ρ is the density of the material. The next graph, from Fathom, shows this calculated model mass plotted against the measured mass, along with a line and a residual plot. We have already slid the rho slider to a decent value.
I have positioned the slider so the first few points are flat. When I do that, they are not centered at zero. Also, the larger nuts deviate from the line more and more. This is an indication that there are systematic effects that our model doesn’t account for. I leave it to you to think about what those might be!
The value of rho? For this graph, 8.22 grams per cc, which is not bad (just a little high) for the density of steel.
And the modeling point? If you agree with my earlier assertion that simplification and abstraction are the hallmarks of modeling, then this is modeling even though we didn’t make a fancy function. We assumed that the nut is truly shaped like a hexagonal prism with a cylindrical hole cut in it. That’s patently false, but it’s a completely decent approximation, and much easier to deal with than the ugly, messy truth. What’s more, we can use the ways reality deviates from the simple, abstract model to decide whether our model values might be a little high or low.
Here are the data from that graph. Lengths are inches, mass is in grams:
boltSize  flat    thick   mass
0.25      0.4298  0.2148   3.09
0.3125    0.4951  0.2651   4.72
0.375     0.5555  0.3236   6.73
0.4375    0.6807  0.3802  12.77
0.5       0.7407  0.4253  15.85
0.625     0.9223  0.5450  31.06
0.75      1.0920  0.6337  48.8
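As a numerical check on the model, here is a sketch in Python (not part of the original post). It assumes the hole diameter equals the nominal bolt size, computes each nut's model volume, and backs out the implied density:

```python
import math

# Hexnut data from the post: bolt size, across-flats distance f,
# thickness t (all inches), and mass (grams).
nuts = [
    (0.25,   0.4298, 0.2148,  3.09),
    (0.3125, 0.4951, 0.2651,  4.72),
    (0.375,  0.5555, 0.3236,  6.73),
    (0.4375, 0.6807, 0.3802, 12.77),
    (0.5,    0.7407, 0.4253, 15.85),
    (0.625,  0.9223, 0.5450, 31.06),
    (0.75,   1.0920, 0.6337, 48.80),
]

IN3_TO_CC = 2.54 ** 3  # cubic inches to cubic centimeters

def model_volume_cc(bolt, f, t):
    """Hexagonal prism of across-flats width f and thickness t,
    minus a cylindrical hole of diameter bolt."""
    hex_area = (math.sqrt(3) / 2) * f ** 2
    hole_area = math.pi * (bolt / 2) ** 2
    return (hex_area - hole_area) * t * IN3_TO_CC

# Implied density (g/cc) for each nut: measured mass / model volume.
densities = [m / model_volume_cc(b, f, t) for b, f, t, m in nuts]
```

With these measurements the implied densities land around 8 g/cc, consistent with the slider value in the Fathom graph.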
There is a lot more to say, but hey, it’s been almost four hours, so it’s way past time to stop.
The Data and Story Library, originally hosted at Carnegie-Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.
Here is the link. If you teach stats, make a bookmark: http://dasl.datadesk.com/
The site includes scores of data sets organized by content topic (e.g., sports, the environment) and by statistical technique (e.g., linear regression, ANOVA). It also includes famous data sets such as Hubble’s data on the radial velocity of distant galaxies.
One small hitch for Fathom users:
In the old days of DASL, you would simply drag the URL mini-icon from the browser’s address field into the Fathom document and amaze your friends with how Fathom parsed the page and converted the table of data on the web page into a table in Fathom. Ah, progress! The snazzy new and more sophisticated format for DASL puts the data inside a scrollable field — and as a result, the drag gesture no longer works in DASL.
Fear not, though: @gasstationwithoutpumps (comment below) realized you could drag the download button directly into Fathom. Here is a picture of a button on a typical DASL “datafile” page. Just drag it over your Fathom document and drop:
In addition, here are two workarounds:
Plan A:
Plan B:
Note: Plan B works for CODAP as well.
So. This is a book of 42 activities that connect geometry to functions through data. There are a lot of different ways to describe it, and in the course of finishing the book, the emotional roller-coaster took me from great pride in what a great idea this was to despair over how incredibly stupid I’ve been.
I’m obviously too close to the project.
For an idea of what drove some of the book, check out the posts on the “Chord Star.”
But you can also see the basic idea in the book cover. See the spiral made of triangles? Imagine measuring the hypotenuses of those triangles, and plotting the lengths as a function of “triangle number.” That’s the graph you see. What’s a good function for modeling that data?
If we’re experienced in these things, we say, oh, it’s exponential, and the base of the exponent is the square root of 2. But if we’re less experienced, there are a lot of connections to be made.
We might think it looks exponential, and use sliders to fit a curve (for example, in Desmos or Fathom. Here is a Desmos document with the data you can play with!) and discover that the base is close to 1.4. Why should it be 1.4? Maybe we notice that if we skip a triangle, the size seems to double. And that might lead us to think that 2 is involved, and gradually work it out that root 2 will help.
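Here is a tiny sketch of that fitting idea in Python, with idealized spiral data standing in for measurements (the book's actual data are not reproduced here):

```python
import math

# Idealized hypotenuse lengths for the spiral of triangles: each
# hypotenuse is sqrt(2) times the previous one, so the data are
# exactly exponential in the triangle number.
h0 = 1.0
lengths = [h0 * math.sqrt(2) ** n for n in range(10)]

# Estimate the base of the exponential from the data alone, as a
# student fitting y = a * b^n might: average consecutive ratios.
ratios = [b / a for a, b in zip(lengths, lengths[1:])]
base = sum(ratios) / len(ratios)   # should be close to 1.414...

# The clue that 2 is involved: skipping a triangle doubles the length.
doubling = lengths[2] / lengths[0]
```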
Or we might start geometrically, and reason about similar triangles. And from there gradually come to realize that the a/b = c/d trope we’ve used for years, in this situation, leads to an exponential function, which doesn’t look at all like setting up a proportion.
In either case, we get to make new connections about parts of math we’ve been learning about, and we get to see that (a) you can find functions that fit data and (b) often, there’s a good, underlying, understandable reason why that function is the one that works.
I will gradually enhance the pages on the eeps site to give more examples. And of course you can buy the book on Amazon! Just click the cover image above.
Suppose we take our data and instead of looking at all the individual times, count how many people pass me every 10 seconds? Then I’ll have one number for every 10 seconds; what will the distribution of those numbers look like?
Here is that distribution for the 58 ten-second bins. This is “p” data only, that is, people walking right to left. Anticipating what we will need, we’ll also show the same distribution for randomly-generated data, and plot the means. We also plot the variances, which is weird, but lets us see the numerical values:
The means are identical, which is good, because we’ve intentionally made the random data to have the same number of cases over the same period of time.
But the two distributions look different, in the way we expect if we have clumping: the real data have more of the heavily populated bins (6, 7, 8), where the clumps are, and more of the lightly populated bins (0, 1), where the longer gaps are. This is less dramatic than it was with the stars. But we have much less data, and the data are real, which always makes things harder.
At any rate, more light bins and more heavy bins means that it makes sense to characterize clumpiness by using a measure of spread; if we pick variance as we did before (following the star reasoning and the lead from Neyman and Scott), we see that the real data have a variance of 3.25, which is much bigger than the mean of 2.17. The random data have a variance of 1.86, which is relatively close to the mean—and what we expect, that is, they should be Poisson distributed, though you don’t need to know that to do this analysis.
Of course, the random data will be different every time we re-randomize. Here are 100 variances for 100 runs of random data, with the “real” data’s variance plotted as well:
Yay! Does a variance of 3.25 give us evidence that the real data are not random? Yes. We can’t tell from this alone that it is because of clumping, but it is in the direction consistent with clumping.
We can do as we did before, and define an index of clumpiness the same way: the variance of the bin counts divided by the mean. For these data, and this bin size, we get 3.246/2.172, or about 1.5.
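The index computation is easy to sketch in code. Here is a Python version using randomly generated stand-in times with the post's totals (126 people over 580 seconds, so 58 ten-second bins with mean count about 2.17); since these times are random, the index should come out near 1 rather than the real data's 1.5:

```python
import random

random.seed(2)

# Stand-in data: 126 events at uniform-random times over 580 seconds.
T, width = 580.0, 10.0
times = [random.uniform(0, T) for _ in range(126)]

# Count events in each ten-second bin.
n_bins = int(T / width)          # 58 bins
counts = [0] * n_bins
for t in times:
    counts[min(int(t / width), n_bins - 1)] += 1

mean = sum(counts) / n_bins      # 126/58, about 2.17
var = sum((c - mean) ** 2 for c in counts) / n_bins
index = var / mean               # the index of clumpiness
```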
Can we apply this idea to the made-up heads and tails that we started this whole sequence with in our first post about clumpiness? Of course. Next post!
Gasstation… commented wisely that it may matter what your cell size is. You bet. It’s the same issue as when you make a histogram, or do any gridding or binning; the size you use may create or obscure patterns inadvertently. If you’re a student thinking about studying clumpiness as a project, this is a great topic to explore: how do the size and placement of your grid affect the results of your investigation?
On our first time through this line of reasoning, however, I did not address these issues. They’re really interesting when you’re doing the investigation yourself, but will probably be pretty tedious and distracting if you’re just reading about it.
But just so we see, I ran the analysis with different bin sizes:
The clumpiness index becomes increasingly unstable with large bins, probably because the number of bins declines; that is, there are fewer and fewer bin counts to work with.
Bin size is not the only issue, however. We can also pay attention to where the bins start. That is, the “picket fence” of the bins can slide left and right—and that will change the counts in the bins. The “offset” could range from 0 to 10 seconds for our 10-second bins. We can calculate the index of clumpiness for each offset:
Isn’t THAT interesting! How should we use this? It’s not clear. Informally, we could say that an index of about 1.5 looks reasonable as a summary of the possible values. There’s another issue as well: as you slide the picket fence of binning over the data, different numbers of bins cover the domain. The two end bins, therefore, will often have artificially low counts, since they extend into a data-free zone. So we should probably discard the end bins—but that’s too much trouble for us today.
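Here is one way to sketch the offset sweep in Python, again with random stand-in times rather than the real pedestrian data:

```python
import random

random.seed(5)

# Stand-in event times: 126 events over 580 seconds.
T, width = 580.0, 10.0
times = sorted(random.uniform(0, T) for _ in range(126))

def clumpiness_index(times, width, offset):
    """Variance/mean of bin counts for a grid whose picket fence
    starts `offset` seconds before the first event."""
    lo = min(times) - offset
    hi = max(times)
    n_bins = int((hi - lo) / width) + 1
    counts = [0] * n_bins
    for t in times:
        counts[int((t - lo) / width)] += 1
    mean = sum(counts) / n_bins
    var = sum((c - mean) ** 2 for c in counts) / n_bins
    return var / mean

# Slide the picket fence in one-second steps from 0 to 9 seconds.
indices = [clumpiness_index(times, width, off) for off in range(10)]
```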
Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.
We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by reducing its dimension from 2 to 1. This is often a good strategy for making things more understandable.
Where do we see one-dimensional clumpiness? Here’s an example:
One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.
I watched people going from my right to my left. And I watched people going left to right. What an opportunity! This was data! I pulled out my computer, started up Fathom, and created a Fathom “experiment” so that I could record the time that a person passed my location simply by pressing a key. I used different keys for left-to-right and for right-to-left, and for whether there was a cart involved. (These are the IAH carts that whisk people the vast distances they need to travel. Essential service. Obnoxious implementation.)
Here’s the graph:
It sure looks as if there is clumping. But is there really? And can we detect it? This is in time instead of space, but that doesn’t matter. It’s really the same issue. And the people-see-patterns-in-randomness problem is the same.
Before we do any calculations, let’s also think about the context. These are people walking between gates in an airport. Some are traveling alone, but many are probably traveling with colleagues, in couples, or in families. Another issue is whether fast walkers get stuck behind slow walkers, creating a clump because of traffic, not affiliation. Then, at a longer time scale, there should be an overall increase in traffic after a plane arrives. In any case, we have reason to believe that there will be clumping, so we hope we can detect it easily in the data.
As before, let’s first look at individual cases rather than bins. It’s easiest (in Fathom) to compute the time from the previous pedestrian rather than the time to the nearest pedestrian. Hoping that doesn’t make too much of a difference (famous last words…), here are graphs of the observed distribution of time gaps, called gap, and a distribution of gaps from a set of random times:
These are definitely different distributions—and in ways that imply clumping! The pile of short-interval points seems like a smoking gun.
We need a single number to characterize clumpiness. What would be good? The graphs show the mean times, which are identical. We might explain this with the idea that, in the clumpy group, there are more short gaps (reducing the mean), but also, because the people are closer together, more long gaps (increasing the mean). And we see that in the graph. But the real reason is that we have a given amount of time and a given number of people. In that case, the average time between people is the time divided by N. So we might in fact do better taking the trouble to find the minimum gap rather than the previous gap as we did with the stars.
That does show a difference, but instead let’s stick with the plain, “previous” gaps and look at the median of that distribution. That’s better than the mean in this case. (Think: why?)
In the graphs, we see that the “real” median is less than the random median, which is what we expect from clumping. Let’s do randomization-based inference again! We redo the random case many times, and build up a sampling distribution of that median gap. Is our observed median of 3.217 plausible if we assume the random, null hypothesis? Nope. Here is a set of 100 median gaps from random data:
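Here is a minimal sketch of how such a sampling distribution can be built, with hypothetical totals (126 people over 580 seconds) standing in for the real session:

```python
import random

random.seed(11)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def median_gap(times):
    """Median time from each person to the previous one."""
    times = sorted(times)
    return median([b - a for a, b in zip(times, times[1:])])

# Hypothetical totals standing in for the real session.
T, n_people = 580.0, 126

# Sampling distribution of the median gap under the null hypothesis
# that people pass at independent, uniform-random times.
null_medians = sorted(
    median_gap([random.uniform(0, T) for _ in range(n_people)])
    for _ in range(100)
)

# Compare the observed median to this distribution: if it falls below
# (say) the 5th percentile, randomness is implausible.
fifth_percentile = null_medians[5]
```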
We’ve chosen the median for two reasons: it makes sense, and it’s the first thing we tried that worked. What other measures might be just as good, or better, at characterizing the clumpiness? What else can we learn from the distribution? Here’s one telling question: suppose every traveler was in a couple, and no one traveled alone. What would the distribution of gaps look like? What would its median be? Sometimes, looking at extreme cases like this is a powerful tool.
When we did the stars, a completely different approach—putting the data into bins—worked well. Let’s do that for this one-dimensional, real data set. Next post.
What other measures could we use?
It turns out that the Professionals have some. I bet there are a lot of them, but the one I dimly remembered from my undergraduate days was the “index of clumpiness,” made popular—at least among astronomy students—by Neyman (that Neyman), Scott, and Shane in the mid-50s. They were studying the clustering of galaxies in Shane and Wirtanen’s catalog. We are simply asking, is there clustering? They went much further and asked, how much clustering is there, and what are its characteristics?
They are the Big Dogs in this park, so we will take lessons from them. They began with a lovely idea: instead of looking at the galaxies (or stars) as individuals, divide up the sky into smaller regions, and count how many fall in each region.
We can do that. We divide that square of sky into a 10-by-10 grid. 100 cells. Now, instead of dealing with 1000 ordered pairs of numbers, we have 100 integers, the numbers of dots in each cell. Much easier.
This shows a star field with K = 0, divided into a grid of 100 cells. The graph is the distribution of counts from those cells. Notice how the distribution centers at about 10. This should make sense: we have 1000 stars distributed among 100 cells: the average number of stars in each cell is (exactly) 10.
Now let’s look at the same things but for K = 1.0, that is, very clumped:
Wow. The distribution sure is different; if you think about it, it’s clear why. The densest cells are lots denser than in the uniform K = 0 case, and to make up for that, there are a huge number of cells with very few stars.
To do inference on this, though, we need a single number (or measure, or statistic) that characterizes the distribution. Should we use mean, as we did with minimum distance? NO! The mean is 10 in both cases!
There are many possible measures to take, such as the maximum count, or, maybe the 10th-highest count (the 90th percentile). Those might work, and you should try them.
But the big dogs—Neyman and Scott—used a measure of spread. You could use standard deviation or IQR. But they used variance, for a really sweet reason, which we’ll get to later.
Meanwhile, recall that the variance is the square of the standard deviation. It’s the mean square deviation from the overall mean.
In our case, we have 100 numbers: 100 counts of how many stars fell in a particular cell. The overall mean is 10, as we have discussed. To compute the variance, go to each cell; figure out how far that count is from 10; square that amount; add up the 100 squares; then divide by 100 (or 99, if you’re that way).
Let’s look at K = 0. Of course, every random star field will have a different variance. So we made many (200) random star fields, and for each one, counted the stars in the 100 cells to get a variance for each field. The “typical” variance was about 10. In the 200 fields, about 5% had a variance below 7.7, and 5% were above 12.4. That is, 90% of all K = 0.0 random star fields had a variance in the interval [7.7, 12.4].
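That simulation is easy to reproduce in Python; the 5th and 95th percentiles of 200 simulated variances should land near the post's [7.7, 12.4]:

```python
import random

random.seed(3)

def field_variance(n_stars=1000, grid=10):
    """Drop n_stars uniformly in the unit square, count them in a
    grid x grid mesh, and return the variance of the cell counts."""
    counts = [0] * (grid * grid)
    for _ in range(n_stars):
        cx = int(random.random() * grid)
        cy = int(random.random() * grid)
        counts[cy * grid + cx] += 1
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 for c in counts) / len(counts)

# 200 random (K = 0) star fields, as in the post.
variances = sorted(field_variance() for _ in range(200))

# The middle 90% of the simulated variances.
low, high = variances[10], variances[189]
```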
We computed this interval for many values of K.
You can also see that a typical variance at K = 1.0 is above 200, which makes sense looking back at the super skewed and spread out distribution above. (Remember, that 200+ is the square of the standard deviation.)
Let’s look in detail at an intermediate case. In fact, let’s look at K = 0.14, the value for which, last time, we could not distinguish from randomness using our mean-minimum technique.
Here is a star field at K = 0.14:
And now, its distribution of counts, with the distribution for a K = 0.0 field for comparison:
You can see that there are four cells on the right that are higher than anything at K = 0.0, and if you squint, you might believe that the peak—a typical number of stars for most of the cells—is a little lower, maybe 9 instead of 10. (And that makes sense; if you drain off 40+ “extra” stars into the center of the cluster, the cells on the outside will be depleted.)
And if we compute the variance? 12.9, which is outside that [7.7, 12.4] “90%” window we computed from doing lots of iterations with K = 0.
That is, with this measure, we can detect the clumpiness, whereas we could not with the one we invented in the previous post.
Whew. That had some hard ideas. As teachers, we look for easier approaches. And good old Pólya often suggested looking for lower dimensionality. Great idea! Let’s do this in one dimension instead of two. Next time.
Meanwhile:
Now. Why is variance cool here? Turns out that if you place things randomly into bins, and look at the distribution of counts—which is what we did here for K = 0.0—the distribution of numbers follows a Poisson distribution. Here’s the formula for the distribution for a Poisson-ly distributed random variable X:

P(X = k) = λᵏ e^(−λ) / k!

In our case, λ = 10. So the probability of getting 12 (say) in a cell is

P(X = 12) = 10¹² e^(−10) / 12! ≈ 0.095.
You can check that it makes sense in the distribution above. But what you really need to know is that the mean of this distribution is λ, and so is the variance. That is, we know what the variance is supposed to be if the stars are random, and that variance is just the mean: the total number of stars divided by the number of cells. In our case, 10.
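A few lines of Python confirm both the probability and the fact that the mean is λ:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# With lambda = 10 (1000 stars in 100 cells), the chance that a
# given cell holds exactly 12 stars:
p12 = poisson_pmf(12, 10)

# Check the mean numerically by summing k * P(k) over a wide range;
# it should come out to lambda = 10.
mean = sum(k * poisson_pmf(k, 10) for k in range(100))
```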
And this means that for any field of 1000 stars cut into 100 cells, where the counts in the cells are x₁, x₂, …, x₁₀₀, we can define the index of clumpiness as a ratio:

index = variance / mean = [ (1/100) Σ (xᵢ − 10)² ] / 10.
If a pattern is non-clumpy and random, i.e., Poisson, this index will be close to 1.0; when it’s really clumpy, the index will be large.
To generalize: if we have N stars in n cells, the average count is N/n. With some algebra, the general formula becomes:

index = (1/N) Σ (xᵢ − N/n)².
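As a function, with two extreme test cases:

```python
def index_of_clumpiness(counts):
    """Variance of the cell counts divided by their mean.
    Near 1 for a random (Poisson) pattern; large when clumpy."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    return var / mean

# A perfectly even pattern: index 0, no clumps at all.
even = index_of_clumpiness([10] * 100)

# All 1000 stars in one cell: extremely clumpy.
clumped = index_of_clumpiness([1000] + [0] * 99)
```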
Although this is not part of the introductory stats curriculum, it’s pretty interesting, and maybe more accessible, with all our technology, than it used to be. I’d be curious to know more applications. Scientists must look at clumpiness in all sorts of contexts. Traffic, queueing, ecological models, who knows? (You do. Let me know.)
There really is such a thing. Some background: The illustration shows a random collection of 1000 dots. Each coordinate (x and y) is a (pseudo-)random number in the range [0, 1) — multiplied by 300 to get a reasonable number of pixels.
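Generating such a field takes only a couple of lines, following the description above (the seed is arbitrary):

```python
import random

random.seed(0)
# 1000 dots; each coordinate is uniform in [0, 1), scaled to 300 pixels.
points = [(300 * random.random(), 300 * random.random())
          for _ in range(1000)]
```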
The point is that we can all see patterns in it. Me, I see curves and channels and little clumps. If they were stars, I’d think the clumps were star clusters, gravitationally bound to each other.
But they’re not. They’re random. The patterns we see are self-deception. This is related to an activity many stats teachers have used, in which the students are to secretly record a set of 100 coin flips, in order, and also make up a set of 100 random coin flips. The teacher returns to the room and can instantly tell which is the real one and which is the fake. It’s a nice trick, but easy: students usually make the coin flips too uniform. There aren’t enough streaks. Real randomness tends to have things that look non-random.
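You can check the “not enough streaks” claim by simulation. This sketch counts the longest streak in simulated sets of 100 fair flips; the streak-of-5 threshold is my own illustrative choice, not a rule from the activity:

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical outcomes."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(2)
trials = [longest_run([random.choice("HT") for _ in range(100)])
          for _ in range(1000)]
# In the vast majority of trials, the longest streak reaches at least 5 --
# longer than most people dare to fake.
frac_at_least_5 = sum(r >= 5 for r in trials) / len(trials)
```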
Here is a snap from a classroom activity:
That whole activity is worth some discussion, but not today. The question is, suppose we’ve learned our lesson: We know that streakiness can be random. We know that the stars can show clumps even when they’re random.
Now suppose there really is some clumping in the pattern of stars. How would we tell?
Could we do a test? Sure. But in order to do a traditional statistical test, we need a single number that characterizes how clumpy the pattern is.
If we had such a number (a measure, a statistic) we could then do the randomization dance: Make a lot of truly random patterns and compute the statistic. Assemble those stats into a sampling distribution. Then compute that same quantity for our actual pattern. If that test statistic falls outside the sampling distribution we made, it’s implausible that our pattern is random.
To decide what statistic would make sense, let’s look at some clumpy star fields. There are a lot of ways to make them clumpy, so I chose to make a single clump, right in the middle. I control the clumpiness with a parameter I call K (for clump), which is zero for no clumpiness, and 1 for total clumpiness (i.e., all stars are in the cluster). Here are K = zero, one-half, and one:
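The text doesn’t spell out how the cluster is generated, so here is one plausible mechanism, stated as an assumption: each star joins a Gaussian blob at the center with probability K, and otherwise lands uniformly at random. The blob’s spread is an arbitrary choice:

```python
import random

def star_field(n=1000, K=0.0, size=300, cluster_spread=15):
    """n stars; each joins a central Gaussian cluster with probability K,
    otherwise lands uniformly. (This cluster mechanism is an assumption;
    the post does not specify one.)"""
    stars = []
    for _ in range(n):
        if random.random() < K:
            stars.append((random.gauss(size / 2, cluster_spread),
                          random.gauss(size / 2, cluster_spread)))
        else:
            stars.append((size * random.random(), size * random.random()))
    return stars
```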
What statistic could you use? As usual, you should go off and think about this. But because I’m trying to record my thinking, I’m going to tell you what I came up with.
Here’s one idea: compute, for every star, the distance to the next closest star. Then I would look at that distribution—the distances to the nearest neighbors. It stands to reason that if the stars are clumped, the nearest neighbors would be closer, so the distributions would be centered lower. Maybe the mean of that distribution would work as a measure of clumpiness.
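That idea is straightforward to compute, if slow (it compares every pair of stars, so it is O(n²) for n stars):

```python
import math

def mean_nearest_neighbor(stars):
    """Mean, over all stars, of the distance from each star to its
    nearest neighbor. Lower values suggest clumping."""
    total = 0.0
    for i, (x1, y1) in enumerate(stars):
        nearest = min(math.hypot(x1 - x2, y1 - y2)
                      for j, (x2, y2) in enumerate(stars) if j != i)
        total += nearest
    return total / len(stars)
```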
Sure enough! The means go down as the clumpiness goes up. Look at K = 0. Does that mean that, in a uniform random field, the mean distance to a neighbor is always 1.65 units? No: because there is randomness, the means will fluctuate. So we did this 100 times and recorded the means. And we found the 5th and 95th percentiles. This gives us a 90% “plausibility interval” for K = 0, which is roughly from 1.55 to 1.65. That is, our example (at 1.65) is a little unusual.
Skipping the step where we actually see those sampling distributions (if you’re doing this for a school project, don’t skip this step!), we can get the next graph, which shows the 5th and 95th percentiles for more values of K. Reading this graph, you can see that if K = 0.5, you’re nearly 90% certain to be outside that plausibility interval. (Oooh! Power!)
At a more modest value of K, however, it’s not so clear. The next illustration shows a field where K = 0.14. Its mean minimum distance is about 1.61—which is in the middle of the plausibility interval for K = 0, the random (null) case. According to our procedure, it’s completely plausible that this field is random.
If you know what to look for, you can kinda sorta see the cluster. But our statistic can’t find it. And if you didn’t know what to look for—like if you didn’t know the cluster was right in the middle—this field would not look any different from the K = 0 example up above.
Can we do better? You bet. Next post.
Seeing how the two approaches fit together, yet are so different, helps illuminate why confidence intervals can be so tricky.
Anyway, I promised a Very Compelling Real-Life Application of This Technique. I had thought about talking to fisheries people, but even though capture/recapture somehow is nearly always introduced in a fish context, of course it doesn’t have to be. Here we go:
I’ve just recently been introduced to an outfit called the Human Rights Data Analysis Group. Can’t beat them for statistics that matter, and I really have to say, a lot of the explanations and writing on their site are excellent. If you’re looking for Post-AP ideas, as well as caveats about data for everyone, this is a great place to go.
One of the things they do is try to figure out how many people get killed in various trouble areas and in particular events. You get one estimate from some left-leaning NGO. You get another from the Catholics. Information is hard to get, and lists of the dead are incomplete. So it’s not surprising that different groups get different estimates. Whom do you believe?
Suppose the LLNGO (that left-leaning NGO) thinks that 20 civilians were killed during a protest. They give you a list of the victims. At the same time, the Catholics think that 25 civilians were killed. They have a list too.
Now the key thing: you compare the lists. Suppose the lists have five names in common. Now you’re super confident that at least 40 people were killed, because you have 40 distinct names (20 + 25 − 5). But when you think about it, this is exactly the same as capture/recapture—and you can calculate that it’s likely that about 100 people were killed.
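The calculation behind that 100 is the classic Lincoln–Petersen capture/recapture estimate: if the overlap fraction m/n₂ of the second list matches the “tagged” fraction n₁/N of the whole population, then N ≈ n₁·n₂/m:

```python
def lincoln_petersen(n1, n2, m):
    """Capture/recapture estimate of total population size.
    n1: size of first list, n2: size of second list, m: names on both."""
    if m == 0:
        raise ValueError("no overlap: the estimate is unbounded")
    return n1 * n2 / m

estimate = lincoln_petersen(20, 25, 5)  # the example in the text: 100.0
```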
And this, at root, is how these folks actually make better estimates for the effects of tyranny, torture, war crimes, and other institutionalized misconduct around the world. But only at root. The techniques get more sophisticated in interesting ways that I’m only starting to understand.
For example, one problem with what I just described is the same as when we get realistic about the fish. Why should we assume that every fish in the lake, tagged or untagged, has the same probability of getting caught?
In the human-rights context, this issue is called list independence. That is, do the LLNGO and the Catholics have the same chance of listing each person killed? Of course not. Relatives who might talk to the Catholics might not even speak to the LLNGO, and vice versa. Or they might have been doing their counts in different geographical parts of the protest.
It turns out that, using techniques that seem to be called MSE for Multiple Systems Estimation, you can try to account for list independence and other potential problems, provided you have three or more lists. I’m intrigued to study up on this and learn more, and you can too! Here is the first of a series of blog posts by Amelia Hoover Green. Follow the links to the next chapters. See what you think.
Meanwhile, think about one of the overarching problems from the last post: that the population estimate was so wide. I think that part of the problem is that when my brain is stuck in the fish context, I think it’s impractical to imagine tagging, say, half of the fish in the lake. But you can see from the graphs (and common sense) that if the fraction tagged is small, and the numbers are kind of small, the resulting distributions will be wide and sparse.
But in this context, we probably start out with the (wrong) assumption that we have almost all of the deaths; so our estimates will be better.
This graph shows estimates of population with, again, a true population of 100, but this time with half of the fish—50 cases—captured and tagged, and 50 recaptured. This is equivalent to each (independent) list having 50 casualties, with varying numbers of names (centered on 25) overlapping.
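A simulation along the same lines (true population 100, 50 tagged, 50 recaptured) shows the same shrinking; the exact interval endpoints depend on the random draws and on how you read off the percentiles:

```python
import random

def simulate_estimates(N=100, tagged=50, recaptured=50, reps=2000):
    """Sampling distribution of the capture/recapture estimate
    when the true population size is N."""
    pop = list(range(N))
    tags = set(pop[:tagged])           # first `tagged` fish get tags
    estimates = []
    for _ in range(reps):
        sample = random.sample(pop, recaptured)
        m = sum(1 for fish in sample if fish in tags)
        estimates.append(tagged * recaptured / m)
    return sorted(estimates)

random.seed(5)
ests = simulate_estimates()
# Rough 5th and 95th percentiles of the estimates:
lo, hi = ests[len(ests) // 20], ests[-len(ests) // 20]
```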
Notice how the interval has shrunk: it’s now (80, 135) instead of (60, 200).