I oversimplify, of course, but this is what I’m thinking about, and it came out of attending an advisory meeting about a cool project called Data Clubs. And as usual for this blog, we are using CODAP.
Elsewhere (especially in Awash in Data) I have talked about being awash because there are so many cases. When there are a lot of rows in the table and you make a graph, there may be so many dots that you can’t see the pattern. One remedy to try is to use data moves such as filtering or grouping-and-summarizing (making aggregates), both of which reduce how many dots appear in your plots.
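For readers who think better in code than in CODAP tables, here is a rough sketch of those two moves in plain Python. The rows and the attribute names (`activity`, `age`, `days_missed`) are invented for illustration and are not the real survey data; the helper names are mine too.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cases; attribute names and values are made up for this sketch.
rows = [
    {"activity": "soccer", "age": 12, "days_missed": 2},
    {"activity": "soccer", "age": 15, "days_missed": 4},
    {"activity": "biking", "age": 12, "days_missed": 1},
    {"activity": "biking", "age": 40, "days_missed": 7},
    {"activity": "hiking", "age": 33, "days_missed": 3},
]

def filter_rows(rows, predicate):
    """Filtering: keep only the cases that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def summarize_by(rows, group_attr, value_attr):
    """Grouping-and-summarizing: one aggregate record per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_attr]].append(r[value_attr])
    return {g: {"count": len(v), "mean": mean(v)} for g, v in groups.items()}

# Fewer dots because we look at a subset of cases...
minors = filter_rows(rows, lambda r: r["age"] < 18)

# ...or fewer dots because each group collapses to a single summary.
summary = summarize_by(rows, "activity", "days_missed")
```

Either way, a plot of `minors` or of `summary` has fewer points than a plot of `rows`, which is the whole idea.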
Another kind of awash happens when you have a lot of attributes. You have trouble figuring out what you even want to plot because there are so many choices. Sometimes you want to essentially filter columns in the table, or, alternatively, find a way to combine multiple columns into one. We developed a remedy for the former in work on the Writing Data Stories project in the form of a CODAP plug-in called Choosy that lets you hide columns easily. For the latter, write a formula to combine many columns into one—and then use Choosy to hide the many. (I call this a calculation data move, mutate for those of you in the tidyverse.)
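In CODAP you would do this with a formula and with Choosy; as a loose analogy only, here is what "combine many columns into one, then hide the many" might look like in plain Python. The yes/no attributes (`hurt_arm` and friends) are hypothetical, and `add_combined` and `hide` are names I made up for this sketch, not anything from CODAP or Choosy.

```python
# Hypothetical cases with several 0/1 attributes; all names are invented.
rows = [
    {"id": 1, "hurt_arm": 1, "hurt_leg": 0, "hurt_head": 0},
    {"id": 2, "hurt_arm": 1, "hurt_leg": 1, "hurt_head": 0},
    {"id": 3, "hurt_arm": 0, "hurt_leg": 0, "hurt_head": 0},
]

PARTS = ["hurt_arm", "hurt_leg", "hurt_head"]

def add_combined(row, source_attrs, new_attr):
    """Calculation move: a new attribute computed from several others."""
    new_row = dict(row)  # copy, so the original case is left untouched
    new_row[new_attr] = sum(row[a] for a in source_attrs)
    return new_row

def hide(row, hidden_attrs):
    """Column 'filter': drop the attributes we no longer want to look at."""
    return {k: v for k, v in row.items() if k not in hidden_attrs}

combined = [hide(add_combined(r, PARTS, "n_parts_hurt"), PARTS) for r in rows]
```

Each case in `combined` now has just `id` and `n_parts_hurt`, so there is one thing to plot instead of three.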
But today, I saw something I had seen before, but never in quite this way—a third way of being awash.
We were looking at this dataset from the National Health Interview Survey. It’s about injuries sustained at leisure or during sports. And here is a typical kind of CODAP graph some students might make:
I kind of hate this kind of graph. It’s really hard to see anything in it, and it’s hard to use it to help make an argument. There are just too many cells and too many labels to read. Why? Because it’s two categoricals, and each one has a bunch of categories. What’s worse, some novice students seem to like these graphs a lot, maybe because they look very official and complicated. Michelle set me straight about part of this, though: although making sense of the numbers of dots in the cells is a fool’s errand, it’s interesting to see patterns in which cells have anything and which have nothing. So I don’t hate them anymore. They’re not useless.
Still, looking at it, I’m certainly awash, but in this new way: not from too many cases (rows) or attributes (columns), but from too many categories.
We don’t have great tools for this. One possibility is to do a (different) calculation data move, recoding the data into fewer categories (here is an example from Awash, so I guess I have thought about this, just not this way). But recoding is not always a good idea, and it may be a bridge too far for the middle-school students who are the target population for the Data Clubs project.
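To make the recoding idea concrete, here is a minimal sketch in plain Python. The category labels and the groupings in `RECODE` are invented for illustration; they are not the survey's actual categories, and choosing the groupings is exactly the judgment call that makes recoding tricky.

```python
from collections import Counter

# Hypothetical category values, one per case; labels are made up.
activities = [
    "soccer", "basketball", "biking", "hiking",
    "skateboarding", "soccer", "yoga", "biking",
]

# The recoding scheme: many original categories map to a few broader ones.
RECODE = {
    "soccer": "team sport",
    "basketball": "team sport",
    "biking": "wheels",
    "skateboarding": "wheels",
}

def recode(value, mapping, default="other"):
    """Return the broader category, lumping unmapped values into a default."""
    return mapping.get(value, default)

recoded = [recode(a, RECODE) for a in activities]

before = Counter(activities)  # counts per original category
after = Counter(recoded)      # counts per recoded category
```

Here six original categories collapse to three. The cost is that `hiking` and `yoga` vanish into `other`, which may or may not be what you want.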
So I write this to remind myself of this insight, and maybe to inspire someone to figure out other ways to cope with a surfeit of categories.