or, What I should have emphasized more at NCTM
I’m just back from NCTM 2018 in Washington, DC, where I gave a brief workshop that introduced ideas in data science education and the use of CODAP to a very nice group. The room was another matter: NCTM and the Marriott Marquis were doing their best, but we really need a different way of doing technology at these big conferences.
Anyway: at the end of a fairly wide-ranging presentation in which my main goal was for participants to get their hands dirty—get into the data, get a feel for the tools, have data science on their radar—it was inevitable that I would feel:
- that I talked too much; and
- that there were important things I should have said.
Sigh. Let’s address the latter. Here is a take-away I wish I had set up better:
In data science, things are often too complicated. So one step is to simplify things; and some data moves, by their nature, simplify.
Complication is related to being awash in data (see this post…); it can come from the sheer quantity of data, or from the data being multivariate or otherwise just containing a lot of stuff we’re not interested in right now.
To cut through that complication, we often filter or summarize, and to do those, we often group. To give some examples, I will look again at the data that appeared in the cards metaphor post, but with a different slant.
Here we go: NHANES data on height, age, and sex. At the end of the process, we will see this graph:
And the graph tells a compelling story: boys and girls are roughly the same height—OK, girls are a little taller at ages 10–12—but starting at about age 13, girls’ heights level off, while the boys continue growing for about two more years.
We arrived at this after a bunch of analysis. But how did we start?
It began with 800 cases, in a table. (Here is a link to a fresh CODAP document with the data, if you want to play.) If we have the vague task of exploring the data, we might first make plots of various attributes such as Height. And because any bivariate plot is at least ten times as interesting as a univariate plot, we might separate the data by Sex:
Now we see that, as a group, males are taller than females. We quickly realize, though, that this is not a fair comparison, because age is too important to leave out. So we put Age on the horizontal axis instead of Sex, which gives us a very interesting plot, but loses Sex. So we drag Sex into the middle of the graph to see that difference (ah: three dimensions!) and get the second graph below.
That second graph has all of the information, but is now so busy and complicated, and has so many other problems, that it doesn’t tell the story as clearly as the one we began with.
One problem is that the dots overlap. Our ages do not have decimals, so the points are on top of one another. It looks as if the boys get plotted first, so when you plot the girls, a bunch of the boys are invisible. There are graphical things we might do about this, so that we can see both distributions at each age, but that’s another topic—and it would still be messy.
We can, however, simplify this graph in various ways.
One approach is to look at only a subset of the data: to filter the dataset. For example, we could make separate plots for the boys and girls. When we try this, they are hard to compare.
Another approach is to look at the ages one at a time. Suppose we start with age 19:
The left-hand graph shows what happens when we impose that filter: it looks just like the previous graph, with most of the points invisible. Of course, we wanted to see the distribution of height for the two sexes, so now that we don’t need Age (they’re all age 19), we put Sex back on the horizontal axis in place of Age to get the graph on the right. We’ve added lines for the means.
This graph is much simpler. It shows the relationship between only two variables, Sex and Height, and has a more manageable number of points (68 instead of 800). With a little experience with data, we are no longer awash; this setup we understand.
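For readers who think in code rather than in CODAP, here is a minimal sketch of that filter move in plain Python. The records are invented toy values, not the NHANES data, and the helper names `filter_by_age` and `mean_height_by_sex` are my own for illustration; the attribute names Age, Sex, and Height are the only things taken from the dataset.

```python
from statistics import mean

# Invented toy records, NOT the NHANES values (heights in cm).
cases = [
    {"Age": 19, "Sex": "Female", "Height": 162.0},
    {"Age": 19, "Sex": "Male", "Height": 176.0},
    {"Age": 19, "Sex": "Female", "Height": 166.0},
    {"Age": 19, "Sex": "Male", "Height": 180.0},
    {"Age": 10, "Sex": "Female", "Height": 140.0},
    {"Age": 10, "Sex": "Male", "Height": 138.0},
]


def filter_by_age(rows, age):
    """Keep only the cases with the given age (the filter move)."""
    return [row for row in rows if row["Age"] == age]


def mean_height_by_sex(rows):
    """Mean height for each sex among the given cases."""
    heights = {}
    for row in rows:
        heights.setdefault(row["Sex"], []).append(row["Height"])
    return {sex: mean(values) for sex, values in heights.items()}


age_19 = filter_by_age(cases, 19)
means_at_19 = mean_height_by_sex(age_19)

print(len(cases), "cases before filtering,", len(age_19), "after")
print(means_at_19)
```

Once the filter is applied, Age no longer appears anywhere in the result. That is the simplification, and it is also what we give up.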
But every simplification comes at a cost. In this case, we have lost the age information. We can tell a clear story about height at age 19, but we can no longer talk about growth over time. To do an age comparison, we have to do this for every age, for example, ages 5 and 10…
…and then compare those graphs. Each graph, individually, is relatively simple. Together, they pose all sorts of visual difficulties; for example, the graphs all look kind of the same until you check the vertical scales. But you could, eventually, uncover the story that the first graph told.
Summarizing (a.k.a. aggregation)
Making 15 separate comparison graphs is a pain, but that is exactly what we did at Lick-Wilmerding High School in the Applied Math class with Ernie Chen and 20 thoughtful students. Each group got a different age, and wrote the mean heights for the males and females on the whiteboard; then they typed in these values and made the plot with which we began this post.
The new element is the mean. The students wanted to put the means on the single-age graphs, and it’s not a bad idea. It gives a point of comparison. So it was natural to record the means as the end results of each group’s investigation of a single age.
Once we had done that, I showed them how to reorganize the CODAP table so that they could make the computer do all those calculations at once, and get all of the means, for all ages and both sexes, in a form they could use to make the graph.
And the graph is simpler still: 30 points instead of 68 or 800, and more importantly, it tells the story quickly and elegantly.
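Outside CODAP, the same group-then-summarize move might look something like the sketch below. It is not what CODAP does internally. The rows are invented toy values (not NHANES), and `mean_height_by_group` is a hypothetical helper name. The idea is to group the cases by Age and Sex together, then replace each group with its mean.

```python
from collections import defaultdict
from statistics import mean

# Invented toy rows, NOT NHANES values: (Age, Sex, Height in cm).
cases = [
    (12, "Female", 152.0), (12, "Female", 156.0),
    (12, "Male", 149.0), (12, "Male", 153.0),
    (15, "Female", 160.0), (15, "Female", 164.0),
    (15, "Male", 168.0), (15, "Male", 174.0),
]


def mean_height_by_group(rows):
    """Group by (Age, Sex), then summarize each group with its mean."""
    groups = defaultdict(list)
    for age, sex, height in rows:
        groups[(age, sex)].append(height)
    return {key: mean(values) for key, values in sorted(groups.items())}


summary = mean_height_by_group(cases)

# One line per (Age, Sex) group: these are the points for the final graph.
for (age, sex), average in summary.items():
    print(age, sex, average)
```

Eight rows become four summary points here. In the real dataset, 800 cases become 30.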
If simplification comes at a cost, what is it this time? Here, it comes from the aggregation: when you compute the mean, you are finding a single number to stand in for the entire group’s data. You lose the distribution. You are more likely to say that girls are taller at age 12 and boys are taller at age 15, and not keep in mind that there are taller girls and shrimpier boys at every age.
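A tiny numerical sketch makes that cost concrete. The heights below are invented for a single hypothetical age, not taken from NHANES. The girls' mean is higher, yet some boys are taller than it and some girls are shorter than the boys' mean.

```python
from statistics import mean

# Invented heights in cm for one hypothetical age, NOT NHANES values.
girls = [146.0, 150.0, 154.0, 158.0, 162.0]
boys = [143.0, 147.0, 151.0, 155.0, 159.0]

girls_mean = mean(girls)
boys_mean = mean(boys)

# The means give a clean comparison...
print("girls' mean:", girls_mean, "boys' mean:", boys_mean)

# ...but the individual cases overlap in both directions.
boys_above_girls_mean = [h for h in boys if h > girls_mean]
girls_below_boys_mean = [h for h in girls if h < boys_mean]
print("boys taller than the girls' mean:", boys_above_girls_mean)
print("girls shorter than the boys' mean:", girls_below_boys_mean)
```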
I think I’m onto something here: that simplification is one of the engines that drives what you do in data science, and that you simplify in order to tell a compelling story.
How shall we teach students about simplification? One way might be through comparing different visualizations with different types of simplification, and asking them to decide what pictures tell what stories, and how effectively. This is about developing good taste in graphics, and paying attention to the communication part of data science.
Another way might be through reflecting on data moves, which, in this example, were filtering and summarizing, both supported by grouping. We can ask students to think further about the costs of simplification and how to soften the blow. For example, what could we do to our nice clean graph to show information about the distributions without making the graph too busy? What about losing the dots and plotting bars that show just the IQR? That’s the sort of combination of thinking about data, communication, and construction of visualizations that I would like to see more students experience.
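As a sketch of what the IQR idea would need, here is one way to compute the bar ends for each group in plain Python. The heights are invented, not NHANES, and `iqr_bounds` is a hypothetical helper. Note also that `statistics.quantiles` uses its default "exclusive" method here; other tools may place the quartiles slightly differently.

```python
from statistics import quantiles

# Invented heights in cm for one hypothetical age, NOT NHANES values.
heights_by_sex = {
    "Female": [150.0, 152.0, 154.0, 156.0, 158.0, 160.0, 162.0],
    "Male": [146.0, 149.0, 152.0, 155.0, 158.0, 161.0, 164.0],
}


def iqr_bounds(values):
    """Return (Q1, Q3), the bottom and top of an IQR bar."""
    q1, _median, q3 = quantiles(values, n=4)
    return q1, q3


bars = {sex: iqr_bounds(values) for sex, values in heights_by_sex.items()}

for sex, (low, high) in bars.items():
    print(sex, "IQR bar from", low, "to", high)
```

Each group is still reduced to a couple of numbers, so the graph stays clean, but now the bars can overlap, which the means alone could never show.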
And one more thing: didn’t we say, somewhere, that simplification is a hallmark of modeling? Remember Box: all models are wrong, but some are useful.