In the Data Science Games project, we started talking, early, about what we called data moves. We weren’t quite sure what they were exactly, but we recognized some when we did them.
In CODAP, for example (as in Fathom), there is a thing we learn to do: you select points in one graph, and because a selection in CODAP applies everywhere, the same points are selected in all other graphs, and you can see patterns that were otherwise hidden.
You can use that same selection idea to hide the selected or unselected points, thereby filtering the data that you’re seeing. Anyway, that felt like a data move, a tool in our data toolbox. We could imagine pointing them out to students as a frequently-useful action to take.
I’ve mentioned the idea in a couple of posts because it seemed to me that data moves were characteristic of data science, or at least the proto-data-science that we have been trying to do: we use data moves to make sense of rich data where things can get confusing; we use data moves to help when we are awash in data. In traditional intro stats, you don’t need data moves because you generally are given exactly the data you need.
Anyway, we listed a bunch of them—filtering, grouping, making aggregate measures, merging datasets, making hierarchies, making new attributes, stacking, etc. It seemed that all of these had something essential in common, something about the structure of the data, or the values, or about changing the organization of the data set. But every time I thought up a definition, it was lacking something, or could easily be misunderstood.
And then there was the issue of making new visualizations, which is also characteristic of data science. I had it in my original list from February 2017, and then Rob Gould (bless his heart) said that he didn’t think that was a data move, but more of a “data analysis move.” Naturally I thought for weeks that he was just splitting hairs until I came to my senses. And now I agree it’s not. You do data moves in order to prepare to make visualizations. With technology, data moves and making graphs might seem to happen at the same time, but conceptually, the data moves come before.
And with that we come to the point of this post: a metaphor about data moves that helps me, at least, think about what’s a data move and what is not. But first
Consider the NHANES data…
We have 800 kids from NHANES in 2003, aged 5–19, with (among other attributes) height, sex, and age. Suppose we want to make a graph that shows how children grow, and how that differs for boys and girls. How would we do that?
A great way to think about it is to draw the graph you think you will get. In this case, I will show you the answer:
In CODAP, the procedure is to drag Sex and Age—the attribute names—to the left to “promote” them in a hierarchical table. Then create a new attribute (mean_ht) at that level, and give it the formula mean(Height). Then just drag Age to the horizontal axis of a new graph, mean_ht to the vertical, and drop Sex in the middle. Done.
So the data moves are: group by sex and age, then compute the aggregate measures, the means. Then, when you go to make the graph, there is a different set of moves—the choices you make about how you’re going to represent the different attributes you have available. You probably made the data moves in anticipation of the graphing moves you would make. They are tightly linked.
Of course, you don’t have to use CODAP to do this. If you were using RStudio with ggplot2, you might write

```r
summ <- nhanes %>% group_by(Age, Sex) %>% summarise(mean(Height, na.rm = TRUE))
names(summ)[3] <- "mean_ht"
qplot(summ$Age, summ$mean_ht, color = summ$Sex)
```
The first line has the data moves: grouping (group_by) and finding aggregate values (summarise). In CODAP, this is equivalent to dragging attributes in the table to make hierarchies and then making a new attribute with a formula. The second line of R is housekeeping. The last line is all about the graphics, telling what goes on which axis and what determines the color.
The point being that there is an underlying Platonic ideal of a data move, with different implementations depending on the tool. There is also an underlying Platonic graphing thing (it’s called a Grammar of Graphics: Wilkinson 2005; Hadley Wickham et al.’s ggplot2 is a practical and easier-to-understand implementation), but I claim that it’s different from a data move, and I will not treat it here.
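To make the “different implementations of the same Platonic move” point concrete, here is the same pair of moves—group by Age and Sex, then compute mean Height—in pandas. This is a sketch with a handful of invented rows standing in for the 800 NHANES cases; the column names are assumed to match the extract above.

```python
import pandas as pd

# A tiny invented stand-in for the NHANES extract (the real one has ~800 rows).
nhanes = pd.DataFrame({
    "Age":    [5, 5, 6, 6],
    "Sex":    ["F", "F", "M", "M"],
    "Height": [110.0, 108.0, 118.0, 120.0],
})

# The two data moves: group by Age and Sex, then compute the aggregate mean.
summ = (nhanes
        .groupby(["Age", "Sex"], as_index=False)["Height"]
        .mean()
        .rename(columns={"Height": "mean_ht"}))

print(summ)
```

Same grouping move, same aggregation move; only the spelling differs from the group_by/summarise pipeline.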
Okay, so a data move is about data, not representation. It’s about how the data are organized, and it’s also about calculating things like aggregate values. That is, it creates and organizes the numbers that you’re going to use to make a graphic.
Finally, the cards
That’s still kind of slippery, so try this one on:
Imagine that your dataset is a deck of little cards, one for each case. In the NHANES data, that’s 800 cards, one for each kid. Each card has Sex, Age, Height, and other stuff.
Remember how we drew the graph we wanted? Now imagine the steps you would go through to make that graph—using the cards. It’s actually pretty easy:
- Separate the cards into two piles, males and females.
- Separate those piles into piles by age, so you now have 30 piles—from age 5 to age 19 for 2 sexes.
- For each pile, make a label so you know what’s in the pile. It will read something like Sex : Female, Age : 12.
- For each pile, look through the cards and compute the mean of Height. Better write it down, too. Put the mean height on the label.
And you’re done! You now have all the numbers (and letters) you need to make your graph.
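The card-sorting steps can even be simulated literally in plain Python, with each card as a dict and each pile as a labeled list. The handful of cards below is invented for illustration:

```python
from statistics import mean

# Each card is one case; the values are made up.
cards = [
    {"Sex": "F", "Age": 12, "Height": 150.0},
    {"Sex": "F", "Age": 12, "Height": 154.0},
    {"Sex": "M", "Age": 12, "Height": 149.0},
    {"Sex": "M", "Age": 13, "Height": 158.0},
]

# The "cards" move: separate the cards into piles by Sex and Age.
piles = {}
for card in cards:
    label = (card["Sex"], card["Age"])   # what goes on the pile's label
    piles.setdefault(label, []).append(card)

# The "pen" move: write an aggregate value (mean Height) on each pile's label.
labels = {label: {"mean_ht": mean(c["Height"] for c in pile)}
          for label, pile in piles.items()}

print(labels)
```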
Notice: we have discovered that there are two fundamental types of data move: those that are about rearranging cards, and those that need a pen. There is one central “combo” move, grouping, where when you make a pile you need a new card (and a pen) to make a label.
Anything beyond that, such as plotting a point, that’s not a data move.
Let’s push on this operational metaphor. Suppose you now want to classify each kid—each card—by whether he or she is tall. We’ll define tall as “having a height greater than the mean for your age and sex.”
So we look through all of the cards, and for each one, we compare its Height to mean_ht (from the label of its group). If Height > mean_ht, we’ll write tall : true on the kid’s card. Otherwise we’ll write tall : false. By our metaphor, this is a “pen” data move. We’re calculating a new attribute value, but this time for every card, as opposed to the aggregate value we calculated and wrote on the label.
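Continuing the same plain-Python sketch (invented cards again), the per-card “pen” move might look like this, with mean_ht already figured for each pile:

```python
from statistics import mean

# Invented cards, as before.
cards = [
    {"Sex": "F", "Age": 12, "Height": 150.0},
    {"Sex": "F", "Age": 12, "Height": 154.0},
    {"Sex": "M", "Age": 12, "Height": 149.0},
]

# mean_ht for each (Sex, Age) pile, as written on the labels earlier.
piles = {}
for card in cards:
    piles.setdefault((card["Sex"], card["Age"]), []).append(card)
mean_ht = {label: mean(c["Height"] for c in pile) for label, pile in piles.items()}

# The per-card "pen" move: write tall on every card by comparing
# the card's Height to the label of its own pile.
for card in cards:
    card["tall"] = card["Height"] > mean_ht[(card["Sex"], card["Age"])]

print([c["tall"] for c in cards])
```

In pandas the same move would roughly be a groupby(...).transform("mean") followed by a comparison; in dplyr, a mutate after a group_by.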
Let’s do a tougher investigation. Are students who play basketball more likely to be “tall” than other students? We need more data.
1. Assume that our height/sex/age cards also have student names.
2. Get the database of sports enrollments (a new deck of cards; these are blue). Each card has a student name and a sport.
3. Sort the new sports deck into ones with basketball and ones without. Label the two piles basketball and other. (The grouping move.)
4. Go through all of the student cards and check the names against all of the names in the basketball sports pile. If there is a match, put the student card in a pile next to the “basketball” pile; otherwise, put it in a pile next to the “other” pile.
5. Now go through the student cards in each pile and calculate the proportion of cards that have tall : true. Put that proportion on the label. (Aggregation. A “pen” data move.)
The new move is in step 4. It’s a “cards” move we are calling merging for now, but you may recognize it as a join.
Suppose we wanted to see how this difference in proportions depends on age. Easy: interpose a new grouping move after step 4. For the new move, split the two piles by Age. You will need new labels where you can write the age-specific proportions.
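A sketch of the basketball investigation in pandas, with invented names and values: matching the student cards against the basketball pile is the merging move, and the proportions are the aggregation move.

```python
import pandas as pd

# Student cards (invented), already carrying the tall attribute.
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cal", "Dee"],
    "tall": [True, False, True, False],
})

# The sports deck (invented): one card per enrollment.
sports = pd.DataFrame({
    "Name":  ["Ana", "Ben", "Eli"],
    "Sport": ["basketball", "soccer", "basketball"],
})

# The merging move: check each student card against the basketball pile.
bball_names = set(sports.loc[sports["Sport"] == "basketball", "Name"])
students["group"] = students["Name"].map(
    lambda n: "basketball" if n in bball_names else "other")

# The aggregation ("pen") move: proportion of tall cards on each pile's label.
prop_tall = students.groupby("group")["tall"].mean()
print(prop_tall)
```

Splitting further by Age, as suggested above, would just add Age to the groupby.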
Why I care
You could say that dplyr or CODAP or some Python library has all of this functionality, and you would be correct. So why do we need to recognize data moves? Why not just think about selecting and dragging, or about using group_by?
I’d love to hear your responses in comments, but I’ll start with what I think:
- I doubt that it would be useful to beginners if we make a big deal about learning some Platonic abstraction. But it still might be good for them at least to get an inkling that there is an underlying action that’s different from the computer command or gesture.
- In CODAP, at least (I barely know any R) there is usually more than one way to accomplish something. It might help students to hear us talk about making aggregate measures and pointing out that we can do it by calculation or (for example) by adding a feature to a graph. It might reduce some cognitive load to realize that there are really only about seven things you do instead of twenty.
- Even if it’s not useful to students, it might be useful to us as instructors:
  - As we talk with confused students, thinking about data moves instead of a particular computer implementation might help us diagnose where a student is misunderstanding.
  - As we design curricula and individual lessons, thinking about data moves might help us draw connections to other lessons, adjust the balance of tasks, or avoid overloading a lesson with too many different things.
- The cards metaphor itself may disconfuse some students. (Credit: it’s inspired by the extended use of imaginary cards in Freedman, Pisani, and Purves, Statistics. Any edition will do.)
That’s way too much to be writing, and this has taken way too long, but I think it helps me to get it on the page. If you’re still reading, well done, you glutton. Let me know what you think.
And if you can, think and write about this: I’ve implied that some students have trouble partly because, although we assume they understand the underlying meaning of data moves, in fact they don’t. It’s not that data moves are hard, but they may be alien, and we pile a lot of data moves together—implicitly, without acknowledging them—at the same time as we’re asking students to make sense of the actual data context and to cope with learning computer commands. I’d like to think that taking a breath and shining a light on data moves could help some students who might have been feeling stupid, doggedly trying to follow instructions, hoping that one day it will all make sense.
This conjecture is largely based on intuition. Do you think it’s true?
6 thoughts on “Data Moves: the cards metaphor”
Before reading this, it seemed that anything I did as part of an investigation could be seen as a data move. I really value that you’ve focussed data moves in the realm of preparing the data for visualization and analysis.
Thanks; though as I learn more about the tidyverse, I learn that this territory is well-trodden. Although there are differences, we’re talking about what Grolemund and Wickham (2017. R for Data Science. O’Reilly, but also http://r4ds.had.co.nz/) identify as “tidying” and “transforming,” or, collectively, “munging.” Visualizing and modeling they leave out, much as I do here.
So another case of independently discovering or constructing something that other smart people have already gotten into print 😛 (That said, I’m trying to figure out whether any of my thinking is useful for R users, or is it all covered by Hadley…? And whether users of any tool should just read G&W.)
tidyverse user here, with an interest in developing materials to help others better learn R for data science by integrating best practices from K-12 education.
the thinking is incredibly useful – I see people struggling with data wrangling and making sense of what the various functions do all the time, and I think a lot of it is because the majority of data science and programming education relies on a “read this and follow along” approach. there’s a heavy emphasis on *how* to do things, but very rarely do we see information or insights into *why* we should do things.
I’m envisioning a tutorial where a user can print out cards at home, and follow along with a video and/or blog post to help create their knowledge and understanding of how the various dplyr verbs work. this could align with both CODAP and R, giving the learner multi-modal opportunities to deepen their understanding.
Very cool idea. I wonder if we could come up with a dataset or two whose cards could be used for a wide array of data moves/verbs, so the learner could reuse it in many different lessons, and return to it when things got confusing (like when trying to figure out which join_ to use…)
One thing I’ve found is that it often works better to have students start by separating into groups (so doing dplyr filter instead of group_by). Then calculate the statistics. (This is like the showing/hiding some of the data idea.) Then seeing that putting the subset results next to each other would give you a more compact table and make it easier to compare, so trying the same thing with group_by. One place I’ve tried something like this is to address the “which way to calculate the percent” problem students are confronted with when given already constructed contingency tables. Also proportions themselves are really based on a group_by and aggregate that is implicit in your discussion and can get especially hard for students when there are more than two attributes.
I really like the idea of doing physical cards that match an actual set of real data. I’d think about maybe having a geographic identifier to do a join on for many to one.