If you’ve been following along, and reading my mind over the past six months while I have been mostly not posting, you know I’m thinking a lot about data science education (as opposed to data science). In particular, I wonder what sorts of things we could do at K–12—especially at high school—to help students think like data scientists.
To that end, the good people at The Concord Consortium are hosting a webinar series. And I’m hosting the third of these sessions Tuesday July 25 at 9 AM Pacific time.
Click this link to go to EventBrite to tell us you’re coming.
The main thing I’d like to do is to present some of our ideas about “data moves”—things students can learn to do with data that tend not to be taught in a statistics class, or anywhere, but might be characteristic of the sorts of things that underpin data science ideas—and let you, the participants, actually do them. Then we can discuss what happened and see whether you think these really do “smell like” data science, or not.
You could also think of this as trying to decide whether using some of these data skills, such as filtering a data set, or reorganizing its hierarchy, might also be examples of computational thinking.
The webinar (my first ever, crikey) is free, of course, and we will use CODAP, the Common Online Data Analysis Platform, which is web-based and also free and brought to you by Concord and by you, the taxpayer. Thanks, NSF!
We’ll explore data from NHANES, a national health survey, and from BART, the Bay Area Rapid Transit District. And whatever else I shoehorn in as I plan over the next day.
The data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.
This is a sign that there is something to the Venn diagram definitions of data science. That is, it seems that the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.
I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.
At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking.
Aside: Rob tends towards R Studio as his weapon of choice, that is, have students use a real tool, but in a suitably scaffolded environment. There is much to be said for this, but I can imagine a lot of grappling with R’s syntax among many students I have taught; I think it would be better to focus on ideas such as “identifying and looking at a relevant subset of the data” in a more gesture-based environment (e.g., CODAP) and connecting that to the data, the data stories, claims, and questions—and then, later, doing it in code. To be sure, with coding you can do anything. Using CODAP will limit what you can do. But it might expand what you will do, and what you will understand.
We discussed this some more, and he made a great suggestion: see how our nascent “data moves” correspond to what the big dogs in the R world think. Hadley Wickham’s dplyr R package implements a “grammar of data manipulation” with a goal of identifying “the most important data manipulation verbs.” Verbs! These are moves! So: what are dplyr’s most important verbs?
That’s easy: group_by, summarise, mutate, filter, select, and arrange. So let’s start with these.
Use group_by to convert a table into a “grouped” table. You tell R the variable you want to group by, and R separates the table according to the values of that variable. For example, you can group by sex, and subsequent calculations will treat females and males separately.
In the CODAP world:

- If you drop the sex attribute into a graph, the points get colored according to the sex of the cases they represent.
- If you drop the sex attribute onto an axis of a graph, the graph separates into two—one for females, one for males—and you see parallel univariate graphs such as dot plots.
- If you drag sex leftwards in the table, CODAP restructures the table hierarchically into sub-tables for females and males, making sex a “parent” attribute. New attributes at that parent level (with values defined by aggregate functions such as mean(height)) have values for each group.

So we see three different takes on grouping. The third is the most general, but arguably the hardest conceptually. Students easily understand coloring points by sex (or by whatever) and don’t get stuck in any syntax trouble.
“Summarise multiple values to a single value.”

That is, do an aggregate calculation. If we had a table of people with sex and height, a pseudo-R command such as

summarise(group_by(people, sex), mean(height))

would produce a table showing the mean heights of the two sexes.
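Since we’re already in pseudo-code territory, here is the same group-then-summarise move sketched in plain Python rather than R; the tiny people table and its numbers are invented for illustration:

```python
# Group rows by one variable, then collapse each group to a single
# aggregate value (here, the mean): the group_by + summarise move.
from collections import defaultdict

# A hypothetical, tiny stand-in for a table of people (heights in cm).
people = [
    {"sex": "F", "height": 160.0},
    {"sex": "F", "height": 165.0},
    {"sex": "M", "height": 175.0},
    {"sex": "M", "height": 180.0},
]

def summarise_by(rows, group_key, value_key):
    """Split rows into groups by group_key; return each group's mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

mean_heights = summarise_by(people, "sex", "height")
# {"F": 162.5, "M": 177.5}
```

The point is not the syntax, but the move: split the cases into groups, then aggregate each group.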
As suggested above, in the CODAP world, you would create a new attribute at the “parent” level of a table, and give it that mean(height) formula. On the screen, it would look like this:
Use mutate to add a new variable to an R table, typically with values computed by a formula. This is something like summarise, but for the case-level data. For example, in the CODAP table above, where height is in centimeters, we could compute it in inches with pseudo-R like this:

mutate(NHANES, Height_in = Height / 2.54)

and we would get a new column in the table.
Curiously, CODAP makes no distinction between making an aggregate computation, as we did above to compute the mean height, and a case-by-case calculation, where R uses mutate. In CODAP, whether you want to mutate or summarise, just add a column and write the formula. You must, however, add the column at the right “level” of the hierarchical table to get the result you need.
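A sketch of the mutate move in plain Python (the two-row NHANES-ish table is made up), showing a case-by-case computed column:

```python
# The mutate move: add a new column whose value is computed,
# case by case, from each row's existing values.

def mutate(rows, new_key, fn):
    """Return a copy of rows with an extra computed column."""
    return [{**row, new_key: fn(row)} for row in rows]

# Hypothetical rows with heights in centimeters.
nhanes = [{"Height": 152.4}, {"Height": 177.8}]

with_inches = mutate(nhanes, "Height_in", lambda r: r["Height"] / 2.54)
# 152.4 cm is 60 inches; 177.8 cm is 70 inches
```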
“Return rows with matching conditions.”
That is, focus on a subset of the data in R. If we just wanted the mean height of females, we might write:

summarise(filter(NHANES, Sex == "Female"), mean(Height))

The key thing is that the student writes a Boolean expression (Sex == "Female") that represents the subset—and gets all the parentheses right. Now, I’m not against learning syntax or how to write a Boolean expression. And in fact, I’m pushing for Boolean-expression filtering in CODAP right now. But it would be great to experience how useful filtering is in order to see why you would ever want to learn that.
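The filter-then-summarise chain looks like this in plain Python (again with invented rows standing in for NHANES):

```python
# The filter move: keep only the rows matching a Boolean condition,
# then aggregate just those rows.

nhanes = [
    {"Sex": "Female", "Height": 160.0},
    {"Sex": "Male",   "Height": 178.0},
    {"Sex": "Female", "Height": 165.0},
]

def keep(rows, predicate):
    """Return only the rows for which predicate(row) is True."""
    return [row for row in rows if predicate(row)]

females = keep(nhanes, lambda row: row["Sex"] == "Female")
mean_female_height = sum(r["Height"] for r in females) / len(females)
# (160.0 + 165.0) / 2 = 162.5
```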
To that end, CODAP lets you filter using selection. That is, you select the cases (in R-speak, “rows”) in any representation of the data—table or graph—and then ask CODAP to display only the selected cases, or only the unselected ones. Here are data showing soil temperature and air temperature at a site in a California forest. It’s one year of data, taken every 30 minutes:
We can see that, in general, the higher the air temperature (vertical axis; maybe it would be better the other way…) the higher the soil temperature. That makes sense, but it’s a lot of data and the details are all mushed together. Now suppose we go to our table and select one day in July:
Now we do a “Hide unselected cases”—that is, filter the data to show only that one day—and rescale the graph:
Wow. What a difference. Looking at this graph, we see a much more detailed story from the data.
At the moment, in CODAP, this only works for graphs. You can’t, for example, filter a table to show only that one day; you have to actually delete the cases you don’t want to see. Maybe this will change.
Suppose you have a dataset in R—a table—with many variables, many columns. The select function lets you make a copy of the table with only the columns you want. So this is not about selecting cases (rows) as we did in CODAP above.
CODAP does not have an equivalent of this function, but we have been talking about ways to help users de-clutter their tables, somehow letting them hide attributes (columns) without deleting them.
In R, arrange sorts the rows of a table; “use desc to sort a variable in descending order.” CODAP (finally!) has a sort capability as a menu item in the table.
So it seems that, on the surface at least, CODAP can mimic a lot of the basic functionality that you will find in dplyr. There is a lot more to the package than we have discussed here, but if dplyr is a grammar of data manipulation, something like CODAP can at least order a meal and find a laundromat.
I’m not trying to say that you (or Rob Gould) should use CODAP instead of R. Sure, you can do cool things by dragging in CODAP that you have to do by remembering syntax in R. But R is a programming language, and you (for now, anyway) need all that syntax in order to express the things that you want to do in a save-able, repeatable, modular sequence of operations.
But wouldn’t it be great if before they had to learn the syntax, students understood why filtering is a good idea? Or what it means to compute some aggregate value?
And frankly, dear readers, how many times have we all tried to learn some new programming or data-analysis system? Given my tiny mind, at least, I have a devil of a time making that first scatter plot or drawing that first blue rectangle on the screen. Sure, once you know R Studio, it’s easy to do in R Studio. But how many people give up in those first few hours, or don’t give up but never quite know what’s going on? Maybe CODAP, or something like it, is a path to getting more people to stick around long enough to see how cool this all is.
Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.
So there ought to be data science education. But what should we teach, and how should we teach it?
Let me address the second question first. There are at least three approaches to take:
I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:
We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.
That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.
You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these:
World trade | Crime in Oakland | Popularity of names beginning with “Max”
What distinguishes this obvious data science from a good school data activity? Consider the graph (at right) of data from physics. Students rolled a cue ball down a ramp from various distances and measured the speed of the ball at the foot of the ramp.
The student even has a model and the corresponding residual plot. Great data. But not data science. Why not?
To answer that question, let’s look at one way some people have defined data science. The illustrations below, from Conway 2010 and Finzer 2013, show the thrust behind several definitions, namely, that data science lies in a region where content understanding, math and stats, and computer chops (Conway calls it “hacking”) overlap.
If we apply the left Venn diagram to the physics example, we can see that it belongs in “traditional research”: the work uses and illuminates physics principles and concepts from mathematics and statistics (the residuals, the curve-fitting) but does not require substantial computational skills.
But saying that is somehow not enough. The physics data example fails what we might call a “sniff test” for data science, but what, more specifically, does this sniff test entail? What are the ingredients that separate “regular” data from data science?
In our work over the last year and a half, we have begun to identify some of the telltale ingredients that make an activity smell like data science. And they come in (at least) three categories:
Although this formulation is certainly incomplete, it may be useful. Let’s look at each of these in turn.
Our obvious data science examples share a sense of being awash in data. The word brings up the image of a possibly tempestuous sea of data, or even waves of data breaking over us. This is subjective but still an evocative litmus test. More specifically, being awash might include:
It’s useful to ask what data science students should be doing with data. Again, an incomplete list:
What aspects of a dataset are characteristic of situations where we use data science? This is related to “awash” but more specific:
If we look again at the physics graph, we can see where it falls short. What did the students have to do? Make a graph of the two variables against one another. Plot a model. There was no search for a pattern, no sense of being overwhelmed by data, no subsets, no reorganization. The data are decidedly ruly.
That is not to say that working with this dataset is a bad idea! It’s great for understanding physics. But it’s not data science. And it points out the essential meaning of our sniff test: in this lab, students work with data, but they don’t have to do anything with the data to make sense of it. It’s science, but not data science.
Let’s alter that rolling-down-a-ramp activity. The next figure shows a graph from the same sort of data, but with an important change.
This spray of points has a very different feel (or smell) than the original one. We can see that, in general, the farther you have rolled, the faster you go, but the relationship is not clean. What could be making the difference? A look at the dataset shows that we now have more attributes, including wheel radius, the mass of the wheels, the mass of the cart, and, importantly, the angle of the ramp.
The ramps are at different angles? Maybe that makes a difference. Suppose we color the points by the angle of the ramp:
Now we see that the ramps range from about 5 to 45 degrees, and that, as we might expect, the steeper the ramp, the faster you go. We might be able to make predictions from this, but it would be nice to construct a graph like the original one. So we make a second graph, of just angle, and select the cases where the ramp angle is close to 20°.
Notice how the selected points show the square-rooty pattern of the original cue-ball graph. We can make a model for that specific ramp angle and plot it:
Let us re-apply our sniff test.
Awash in data? Look back at the first plot in this sequence with the eye of a first-year physics student. Confusing? Hard to tell where to start? You bet.
What are the data moves? We added a third dimension—color—to a plot to make a visualization we might not ever have seen before, and had to make sense of it. Then we found a way to look at a subset of the data, finding a clearer pattern there than in the dataset as a whole.
And data properties? We have more than two attributes, more than a couple dozen points. And depending on your experience, the data seem a bit unruly.
This may not reek of data science, but it has more than just a whiff.
But wait: at the end, we got a graph with a function, a good model, just like we had originally. That’s a lot of extra work just to get the same graph.
Reasonable teachers ask, is it worth the class time to learn all that computer stuff in addition to the physics? In fact, the way we set up a typical lab is designed precisely to avoid the problems of the messy data. Why have students struggle with a situation where the variables are not controlled? Why not just teach them to control variables as they should—and get better-organized data, intentionally?
Let me give a few responses:
It’s tempting to treat computer-data skills as an optional extra. Doing so creates an equity problem, because some people—people who look like me, mostly white and Asian boys—play with this stuff in our free time. Don’t let the playing field stay tilted.
That’s enough for now, but let us foreshadow where to go next:
If something else bothers you about this example, you’re not alone. I will describe what bothered me, and how that bothered Andee Rubin in a different way, soon.
Conway, Drew. 2010. Blog post at http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
Finzer, William. 2013. “The Data Science Education Dilemma.” Technology Innovations in Statistics Education 7, no. 2. http://escholarship.org/uc/item/7gv0q9dc.
Jones, Seth. 2017. Discussion paper for DSET 2017. https://docs.google.com/document/d/1DbgPq8mBwHOXmagHKQ7kNLP49PujyPq-tKip5_HeD8E
And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen all the time.
This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.
In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.
(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
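For concreteness, here is roughly where a margin of error like that comes from: the familiar 95% formula, sketched in Python. The sample size n = 600 is invented; it just happens to give a margin close to 4 points.

```python
# 95% margin of error for a polled proportion p with sample size n:
# MoE ≈ 1.96 * sqrt(p * (1 - p) / n). (n = 600 is a made-up sample size.)
import math

def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(0.51, 600)   # a 51% result among 600 respondents
# about 0.040 -- the "4% margin of error"
```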
Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.
My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.
Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.
Okay, Ph.D. students. That’s a good nugget for a dissertation.
Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:
(Also, get a micrometer on eBay and a sweet 0.1 gram food scale. They’re about $15 now.)
Long ago, I wrote about coins and said I would write about hexnuts. I wrote a book chapter, but never did the post. So here we go. What prompted me was thinking about different kinds of models.
I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?
A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The more-or-less cubic part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge, namely, the hexnuts are geometrically similar, and they’re made from the same material, so they’ll have the same density.
Most students, however, won’t have that instant insight. So we begin by asking them to predict. They will draw graphs that increase, which is a good start, and many student graphs will curve upwards. Great! Then when they measure, students see something like the graph. (Here is a Desmos document with the data.)
They will look at this graph, and many will say it looks like a parabola. Which it does. Kinda. But if you put a quadratic of the form y = kx² on the graph, you will never get it to fit well, no matter what value of k you choose.
We’re doing this quickly, so let’s skip ahead to say that when you try y = kx³, things go much better (though they aren’t perfect). Then you can explore why you think it “goes like” x cubed. You might get there by discussing the quarter-inch nut and the half-inch nut, and pass out samples, and ask: why isn’t the half-inch twice the weight of the quarter-inch? And so forth.
Now the discussion can go in two interesting and different directions. One goes to the wonderful snook data where you can do a similar exploration about fish, and (especially if you do a log transform) you discover that the best exponent for these fish is about 3.24 rather than 3.00. And you gotta wonder how that could be.
But we’re not going there. Instead we start with a great question that astute readers must have asked, but we have avoided until now: what do we mean by the size of a hexnut? The answer is, the diameter of the bolt that it fits. Right? A quarter-inch nut is the one that fits a quarter-inch bolt. It’s bigger (duh) than a quarter inch across.
So let’s consider modeling the geometry of the situation. A hexnut is a hexagonal prism, right? And it has a circular hole.
We can measure more than the weight. Suppose we measure the distance across the faces (f) and the thickness (t) as well. Then the area of the relevant hexagon is (√3/2)f², and the resulting volume of the solid—taking the hole into account—is

V = t((√3/2)f² − (π/4)d²),

where d is the diameter of the hole.

So we can model the mass with m = ρV, where ρ is the density of the material. The next graph, from Fathom, shows this calculated model mass against the measured mass, along with a residual plot. We have already slid the rho slider to a decent value.
I have positioned the slider so the first few points are flat. When I do that, they are not centered at zero. Also, the larger nuts deviate from the line more and more. This is an indication that there are systematic effects that our model doesn’t account for. I leave it to you to think about what those might be!
The value of rho? For this graph, 8.22 grams per cc, which is not bad (just a little high) for the density of steel.
And the modeling point? If you agree with my earlier assertion that simplification and abstraction are the hallmarks of modeling, then this is modeling even though we didn’t make a fancy function. We assumed that the nut is truly shaped like a hexagonal prism with a cylindrical hole cut in it. That’s patently false, but it’s a completely decent approximation, and much easier to deal with than the ugly, messy truth. What’s more, we can use the ways reality deviates from the simple, abstract model to decide whether our model values might be a little high or low.
Here are the data from that graph. Lengths are inches, mass is in grams:
boltSize   flat     thick    mass
0.25       0.4298   0.2148    3.09
0.3125     0.4951   0.2651    4.72
0.375      0.5555   0.3236    6.73
0.4375     0.6807   0.3802   12.77
0.5        0.7407   0.4253   15.85
0.625      0.9223   0.5450   31.06
0.75       1.0920   0.6337   48.8
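You can check the prism-with-a-hole model against the table numerically. Here is a sketch in Python, taking the hole diameter to be the bolt size (an approximation—the threaded hole isn’t exactly that) and converting cubic inches to cubic centimeters:

```python
# Model mass of a hexnut: hexagonal prism (across-flats f, thickness t)
# minus a cylindrical hole, times the density.
import math

CC_PER_CUBIC_INCH = 2.54 ** 3   # lengths are in inches, density in g/cc
RHO = 8.22                      # g/cc, the slider value from the post

def model_mass(bolt, f, t, rho=RHO):
    hex_area = (math.sqrt(3) / 2) * f ** 2   # hexagon, across-flats f
    hole_area = math.pi * bolt ** 2 / 4      # hole diameter ~ bolt size
    volume_cc = t * (hex_area - hole_area) * CC_PER_CUBIC_INCH
    return rho * volume_cc

quarter_inch = model_mass(0.25, 0.4298, 0.2148)
# about 3.2 g for the quarter-inch nut, vs. 3.09 g measured:
# close, and a little high, consistent with the systematic deviations
```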
There is a lot more to say, but hey, it’s been almost four hours, so it’s way past time to stop.
The Data and Story Library, originally hosted at Carnegie-Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.
Here is the link. If you teach stats, make a bookmark: http://dasl.datadesk.com/
The site includes scores of data sets organized by content topic (e.g., sports, the environment) and by statistical technique (e.g., linear regression, ANOVA). It also includes famous data sets such as Hubble’s data on the radial velocity of distant galaxies.
One small hitch for Fathom users:
In the old days of DASL, you would simply drag the URL mini-icon from the browser’s address field into the Fathom document and amaze your friends with how Fathom parsed the page and converted the table of data on the web page into a table in Fathom. Ah, progress! The snazzy new and more sophisticated format for DASL puts the data inside a scrollable field — and as a result, the drag gesture no longer works in DASL.
Fear not, though: @gasstationwithoutpumps (comment below) realized you could drag the download button directly into Fathom. Here is a picture of a button on a typical DASL “datafile” page. Just drag it over your Fathom document and drop:
In addition, here are two workarounds:
Plan A:
Plan B:
Note: Plan B works for CODAP as well.
So. This is a book of 42 activities that connect geometry to functions through data. There are a lot of different ways to describe it, and in the course of finishing the book, the emotional roller-coaster took me from great pride in what a great idea this was to despair over how incredibly stupid I’ve been.
I’m obviously too close to the project.
For an idea of what drove some of the book, check out the posts on the “Chord Star.”
But you can also see the basic idea in the book cover. See the spiral made of triangles? Imagine measuring the hypotenuses of those triangles, and plotting the lengths as a function of “triangle number.” That’s the graph you see. What’s a good function for modeling that data?
If we’re experienced in these things, we say, oh, it’s exponential, and the base of the exponent is the square root of 2. But if we’re less experienced, there are a lot of connections to be made.
We might think it looks exponential, and use sliders to fit a curve (for example, in Desmos or Fathom. Here is a Desmos document with the data you can play with!) and discover that the base is close to 1.4. Why should it be 1.4? Maybe we notice that if we skip a triangle, the size seems to double. And that might lead us to think that 2 is involved, and gradually work it out that root 2 will help.
Or we might start geometrically, and reason about similar triangles. And from there gradually come to realize that the a/b = c/d trope we’ve used for years, in this situation, leads to an exponential function, which doesn’t look at all like setting up a proportion.
In either case, we get to make new connections about parts of math we’ve been learning about, and we get to see that (a) you can find functions that fit data and (b) often, there’s a good, underlying, understandable reason why that function is the one that works.
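If you want to generate the data rather than measure the cover, the spiral’s pattern can be sketched in a few lines of Python—assuming (my reading of the cover) that each triangle is a right isosceles triangle whose legs equal the previous hypotenuse:

```python
# Hypotenuse lengths in the triangle spiral, assuming each triangle is
# a right isosceles triangle built on the previous hypotenuse, so each
# hypotenuse is sqrt(2) times the one before: an exponential pattern.
import math

def hypotenuses(first=1.0, count=8):
    lengths = [first]
    for _ in range(count - 1):
        leg = lengths[-1]                     # both legs = old hypotenuse
        lengths.append(math.hypot(leg, leg))  # new one is sqrt(2) longer
    return lengths

h = hypotenuses()
doubling = h[2] / h[0]   # skip one triangle and the length doubles
```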
I will gradually enhance the pages on the eeps site to give more examples. And of course you can buy the book on Amazon! Just click the cover image above.
Suppose we take our data and instead of looking at all the individual times, count how many people pass me every 10 seconds? Then I’ll have one number for every 10 seconds; what will the distribution of those numbers look like?
Here is that distribution for the 58 ten-second bins. This is “p” data only, that is, people walking right to left. Anticipating what we will need, we’ll also show the same distribution for randomly-generated data, and plot the means. We also plot the variances, which is weird, but lets us see the numerical values:
The means are identical, which is good, because we’ve intentionally made the random data to have the same number of cases over the same period of time.
But the two distributions look different, in the way we expect if we have clumping: the real data have more of the heavily-populated bins (6, 7, 8) where the clumps are, and more of the lightly-populated bins (0, 1) where the longer gaps are. This is less dramatic than it was with the stars. But we have much less data, and the data are real, which always makes things harder.
At any rate, more light bins and more heavy bins means that it makes sense to characterize clumpiness by using a measure of spread; if we pick variance as we did before (following the star reasoning and the lead from Neyman and Scott), we see that the real data have a variance of 3.25, which is much bigger than the mean of 2.17. The random data have a variance of 1.86, which is relatively close to the mean—and what we expect, that is, they should be Poisson distributed, though you don’t need to know that to do this analysis.
Of course, the random data will be different every time we re-randomize. Here are 100 variances for 100 runs of random data, with the “real” data’s variance plotted as well:
Yay! Does a variance of 3.25 give us evidence that the real data are not random? Yes. We can’t tell from this alone that it is because of clumping, but it is in the direction consistent with clumping.
We can do as we did before, and define an index of clumpiness the same way: the variance of the bin counts divided by the mean. For these data, and this bin size, we get 3.246/2.172, or about 1.5.
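The whole computation is small enough to sketch in Python. The arrival times below are invented, just to show that bunched-up times drive the index above what evenly spread times give:

```python
# Index of clumpiness: bin the event times, then divide the variance
# of the bin counts by their mean. (Invented times, one unit per bin.)
def clumpiness_index(times, bin_size, t_min, t_max):
    n_bins = int((t_max - t_min) / bin_size)
    counts = [0] * n_bins
    for t in times:
        i = min(int((t - t_min) / bin_size), n_bins - 1)
        counts[i] += 1
    mean = sum(counts) / n_bins
    variance = sum((c - mean) ** 2 for c in counts) / n_bins
    return variance / mean

bunched = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3, 5.4, 9.9]   # clumpy arrivals
spread  = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]   # evenly spaced

bunched_index = clumpiness_index(bunched, 1.0, 0.0, 10.0)  # ~2.45
spread_index  = clumpiness_index(spread,  1.0, 0.0, 10.0)  # ~0.2
```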
Can we apply this idea to the made-up heads and tails that we started this whole sequence with in our first post about clumpiness? Of course. Next post!
Gasstation… commented wisely that it may matter what your cell size is. You bet. It’s the same issue as when you make a histogram, or do any gridding or binning; the size you use may create or obscure patterns inadvertently. If you’re a student thinking about studying clumpiness as a project, this is a great topic to explore: how do the size and placement of your grid affect the results of your investigation?
On our first time through this line of reasoning, however, I did not address these issues. They’re really interesting when you’re doing the investigation yourself, but will probably be pretty tedious and distracting if you’re just reading about it.
But just so we see, I ran the analysis with different bin sizes:
The clumpiness becomes increasingly unstable with large bins, probably because the number of bins declines, that is, there are fewer and fewer data points to work with.
Bin size is not the only issue, however. We can also pay attention to where the bins start. That is, the “picket fence” of the bins can slide left and right—and that will change the counts in the bins. The “offset” could range from 0 to 10 seconds for our 10-second bins. We can calculate the index of clumpiness for each offset:
Isn’t THAT interesting! How should we use this? It’s not clear. Informally, we could say that an index of about 1.5 looks reasonable as a summary of the possible values. There’s another issue as well: as you slide the picket fence of binning over data, different numbers of bins cover the domain. Also, therefore, the two end bins will often have artificially low counts, since they extend into a data-free zone. So we should probably discard the end bins—but that’s too much trouble for us today.
Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.
We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by reducing its dimension from 2 to 1. This is often a good strategy for making things more understandable.
Where do we see one-dimensional clumpiness? Here’s an example:
One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.
I watched people going from my right to my left. And I watched people going left to right. ¡Qué oportunidad! This was data! I pulled out my computer, started up Fathom, and created a Fathom “experiment” so that I could record the time that a person passed my location simply by pressing a key. I used different keys for left-to-right and for right-to-left, and for whether there was a cart involved. (These are the IAH carts that whisk people the vast distances they need to travel. Essential service. Obnoxious implementation.)
Here’s the graph:
It sure looks as if there is clumping. But is there really? And can we detect it? This is in time instead of space, but that doesn’t matter. It’s really the same issue. And the people-see-patterns-in-randomness problem is the same.
Before we do any calculations, let’s also think about the context. These are people walking between gates in an airport. Some are traveling alone, but many are probably traveling with colleagues, in couples, or in families. Another issue is whether fast walkers get stuck behind slow walkers, creating a clump because of traffic, not affiliation. Then, at a longer time scale, there should be an overall increase in traffic after a plane arrives. In any case, we have reason to believe that there will be clumping, so we hope we can detect it easily in the data.
As before, let’s first look at individual cases rather than bins. It’s easiest (in Fathom) to compute the time from the previous pedestrian rather than the time to the nearest pedestrian. Hoping that doesn’t make too much of a difference (famous last words…), here are graphs of the observed distribution of time gaps, called gap, and a distribution of gaps from a set of random times:
These are definitely different distributions—and in ways that imply clumping! The pile of short-interval points seems like a smoking gun.
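Computing those gaps is just a successive-difference. The arrival times below are simulated stand-ins for the airport data (pairs of travelers walking together, with invented counts and spacings), compared against the same number of people scattered at random:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for the airport data: 100 hypothetical pairs of travelers,
# the second member of each pair a couple of seconds behind the first.
pair_centers = rng.uniform(0, 1800, 100)
observed = np.sort(np.concatenate(
    [pair_centers, pair_centers + rng.exponential(2, 100)]))

# Gap from the previous pedestrian is just the successive difference.
gaps = np.diff(observed)

# Null comparison: same number of people, same time span, no clumping.
random_times = np.sort(rng.uniform(0, 1800, 200))
random_gaps = np.diff(random_times)
```

The pile of short gaps in the paired data pulls its median well below the median of the random gaps.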
We need a single number to characterize clumpiness. What would be good? The graphs show the mean times, which are identical. We might explain this with the idea that, in the clumpy group, there are more short gaps (reducing the mean), but also, because the people are closer together, more long gaps (increasing the mean). And we see that in the graph. But the real reason is that we have a given amount of time and a given number of people. In that case, the average time between people is the time divided by N. So we might in fact do better taking the trouble to find the minimum (nearest-neighbor) gap, as we did with the stars, rather than the previous gap.
That does show a difference, but instead let’s stick with the plain, “previous” gaps and look at the median of that distribution. That’s better than the mean in this case. (Think: why?)
In the graphs, we see that the “real” median is less than the random median, which is what we expect from clumping. Let’s do randomization-based inference again! We redo the random case many times, and build up a sampling distribution of that median gap. Is our observed median of 3.217 plausible if we assume the random, null hypothesis? Nope. Here is a set of 100 median gaps from random data:
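A sketch of that randomization test: scatter the same number of arrival times uniformly over the observation window, record the median gap, and repeat. The total time span and pedestrian count below are hypothetical (the post doesn’t give them); only the observed median of 3.217 comes from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

T, N = 1800.0, 200            # time span and head count: invented for this sketch
observed_median = 3.217       # the median gap reported above

# Sampling distribution of the median gap under the null hypothesis:
# N arrival times scattered uniformly at random over the interval.
null_medians = []
for _ in range(1000):
    t = np.sort(rng.uniform(0, T, N))
    null_medians.append(np.median(np.diff(t)))
null_medians = np.array(null_medians)

# Empirical p-value: how often does pure randomness produce
# a median gap as small as the one we observed?
p_value = np.mean(null_medians <= observed_median)
```

With these (made-up) totals, essentially no random field gets a median gap that small, which is the “Nope” in the text.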
We’ve chosen the median for two reasons: it makes sense, and it’s the first thing we tried that worked. What other measures might be just as good, or better, at characterizing the clumpiness? What else can we learn from the distribution? Here’s one telling question: suppose every traveler was in a couple, and no one traveled alone. What would the distribution of gaps look like? What would its median be? Sometimes, looking at extreme cases like this is a powerful tool.
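One way to play with that extreme case is a tiny simulation. Everything here is invented: 100 couples over a half-hour window, with each partner half a second behind the other.

```python
import numpy as np

rng = np.random.default_rng(11)

# Extreme case: every traveler is half of a couple walking side by side.
couple_times = np.sort(rng.uniform(0, 1800, 100))
times = np.sort(np.concatenate(
    [couple_times, couple_times + 0.5]))   # partner passes 0.5 s later

gaps = np.diff(times)
median_gap = np.median(gaps)
```

With 200 people producing 199 gaps, at least 100 of those gaps are the tiny within-couple ones, so the median gets dragged all the way down to the couple spacing.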
When we did the stars, a completely different approach—putting the data into bins—worked well. Let’s do that for this one-dimensional, real data set. Next post.
What other measures could we use?
It turns out that the Professionals have some. I bet there are a lot of them, but the one I dimly remembered from my undergraduate days was the “index of clumpiness,” made popular—at least among astronomy students—by Neyman (that Neyman), Scott, and Shane in the mid-50s. They were studying the clustering of the galaxies in Shane (& Wirtanen)’s catalog. We are simply asking, is there clustering? They went much further, and asked, how much clustering is there, and what are its characteristics?
They are the Big Dogs in this park, so we will take lessons from them. They began with a lovely idea: instead of looking at the galaxies (or stars) as individuals, divide up the sky into smaller regions, and count how many fall in each region.
We can do that. We divide that square of sky into a 10-by-10 grid. 100 cells. Now, instead of dealing with 1000 ordered pairs of numbers, we have 100 integers, the numbers of dots in each cell. Much easier.
This shows a star field with K = 0, divided into a grid of 100 cells. The graph is the distribution of counts from those cells. Notice how the distribution centers at about 10. This should make sense: we have 1000 stars distributed among 100 cells: the average number of stars in each cell is (exactly) 10.
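Binning the stars takes only a couple of lines. This sketch uses simulated uniform positions for the K = 0 case, over a unit square rather than whatever coordinates the original star fields used:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1000 simulated "stars" scattered uniformly over a unit square (K = 0)
x = rng.uniform(0, 1, 1000)
y = rng.uniform(0, 1, 1000)

# Count the stars in each cell of a 10-by-10 grid
counts, _, _ = np.histogram2d(x, y, bins=10, range=[[0, 1], [0, 1]])
counts = counts.ravel()   # 100 numbers, one count per cell
```

All 1000 stars land somewhere, so the cell counts average exactly 10.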
Now let’s look at the same things but for K = 1.0, that is, very clumped:
Wow. The distribution sure is different; if you think about it, it’s clear why. The densest cells are lots denser than in the uniform K = 0 case, and to make up for that, there are a huge number of cells with very few stars.
To do inference on this, though, we need a single number (or measure, or statistic) that characterizes the distribution. Should we use mean, as we did with minimum distance? NO! The mean is 10 in both cases!
There are many possible measures to take, such as the maximum count, or maybe the 10th-highest count (the 90th percentile). Those might work, and you should try them.
But the big dogs—Neyman and Scott—used a measure of spread. You could use standard deviation or IQR. But they used variance, for a really sweet reason, which we’ll get to later.
Meanwhile, recall that the variance is the square of the standard deviation. It’s the mean square deviation from the overall mean.
In our case, we have 100 numbers: 100 counts of how many stars fell in a particular cell. The overall mean is 10, as we have discussed. To compute the variance, go to each cell; figure out how far that count is from 10; square that amount; add up the 100 squares; then divide by 100 (or 99, if you’re that way).
Let’s look at K = 0. Of course, every random star field will have a different variance. So we made many (200) random star fields, and for each one, counted the stars in the 100 cells to get a variance for each field. The “typical” variance was about 10. In the 200 fields, about 5% had a variance below 7.7, and 5% were above 12.4. That is, 90% of all K = 0.0 random star fields had a variance in the interval [7.7, 12.4].
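That simulation is easy to sketch: make many uniform star fields, grid each one, and keep the variance of its 100 cell counts. The seed and the unit-square coordinates are arbitrary choices, so the percentile cutoffs will wobble a bit around the [7.7, 12.4] quoted above.

```python
import numpy as np

rng = np.random.default_rng(2)

def field_variance(rng):
    # One random K = 0 field: scatter 1000 stars uniformly,
    # count them in a 10-by-10 grid, return the variance of the 100 counts.
    x = rng.uniform(0, 1, 1000)
    y = rng.uniform(0, 1, 1000)
    counts, _, _ = np.histogram2d(x, y, bins=10, range=[[0, 1], [0, 1]])
    return counts.ravel().var()

variances = np.array([field_variance(rng) for _ in range(200)])

# The middle 90% of the variances, like the [7.7, 12.4] interval above
lo, hi = np.percentile(variances, [5, 95])
```

The typical variance sits near 10, matching the mean count, which is the Poisson fact the post gets to below.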
We computed this interval for many values of K.
You can also see that a typical variance at K = 1.0 is above 200, which makes sense looking back at the super skewed and spread out distribution above. (Remember, that 200+ is the square of the standard deviation.)
Let’s look in detail at an intermediate case. In fact, let’s look at K = 0.14, the value that, last time, we could not distinguish from randomness using our mean-minimum technique.
Here is a star field at K = 0.14:
And now, its distribution of counts, with the distribution for a K = 0.0 field for comparison:
You can see that there are four cells on the right that are higher than anything at K = 0.0, and if you squint, you might believe that the peak—a typical number of stars for most of the cells—is a little lower, maybe 9 instead of 10. (And that makes sense; if you drain off 40+ “extra” stars into the center of the cluster, the cells on the outside will be depleted.)
And if we compute the variance? 12.9, which is outside that [7.7, 12.4] “90%” window we computed from doing lots of iterations with K = 0.
That is, with this measure, we can detect the clumpiness, whereas we could not with the one we invented in the previous post.
Whew. That had some hard ideas. As teachers, we look for easier approaches. And good old Pólya often suggested looking for lower dimensionality. Great idea! Let’s do this in one dimension instead of two. Next time.
Meanwhile:
Now. Why is variance cool here? Turns out that if you place things randomly into bins, and look at the distribution of counts—which is what we did here for K = 0.0—the distribution of numbers follows a Poisson distribution. Here’s the formula for the distribution for a Poisson-ly distributed random variable X:

P(X = k) = λᵏ e^(−λ) / k!,  for k = 0, 1, 2, …
In our case, λ = 10. So the probability of getting 12 (say) in a cell is

P(X = 12) = 10¹² e^(−10) / 12! ≈ 0.095.
You can check that it makes sense in the distribution above. But what you really need to know is that the mean of this distribution is λ, and so is the variance. That is, we know what the variance is supposed to be if the stars are random, and that variance is just the mean: the total number of stars divided by the number of cells. In our case, 10.
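The Poisson probabilities are easy to check numerically. This sketch recomputes that probability of 12, and confirms that the mean and variance of the distribution both come out to λ = 10:

```python
import math

lam = 10  # mean stars per cell: 1000 stars / 100 cells

def poisson_pmf(k, lam):
    # P(X = k) = lam**k * e**(-lam) / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

p12 = poisson_pmf(12, lam)   # probability a cell holds exactly 12 stars

# Mean and variance of the distribution (summing to 100 captures
# essentially all of the probability when lam = 10)
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - lam) ** 2 * poisson_pmf(k, lam) for k in range(100))
```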
And this means that for any field of 1000 stars cut into 100 cells, where the counts in the cells are x₁, x₂, …, x₁₀₀, we can define the index of clumpiness I as a ratio: the variance of the counts over their mean, that is,

I = s² / x̄ = s² / 10.
If a pattern is non-clumpy and random, i.e., Poisson, this index will be close to 1.0; when it’s really clumpy, I will be large.
To generalize: if we have N stars in n cells, the average count is x̄ = N/n. With some algebra, the general formula becomes:

I = s² / x̄ = (1/N) Σ xᵢ² − N/n.
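A quick numeric check that the variance-to-mean ratio and its algebraic short form agree. The cell counts here are simulated by dropping N stars into n equally likely cells, which is the non-clumpy null case:

```python
import numpy as np

rng = np.random.default_rng(5)

N, n = 1000, 100                          # stars, cells
counts = rng.multinomial(N, [1 / n] * n)  # random (non-clumpy) cell counts

xbar = N / n
# Index of clumpiness: variance of the counts over their mean...
index = counts.var() / xbar
# ...which, after the algebra, is (1/N) * sum(x_i^2) - N/n
index_short = (counts ** 2).sum() / N - N / n
```

Both forms give the same number, and for a random field it lands near 1, as advertised.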
Although this is not part of the introductory stats curriculum, it’s pretty interesting, and maybe more accessible, with all our technology, than it used to be. I’d be curious to know more applications. Scientists must look at clumpiness in all sorts of contexts. Traffic, queueing, ecological models, who knows? (You do. Let me know.)