I oversimplify, of course, but this is what I’m thinking about, and it came out of attending an advisory meeting about a cool project called Data Clubs. And as usual for this blog, we are using CODAP.
Elsewhere (especially in Awash in Data) I have talked about being awash because there are so many cases. When the table has a lot of rows, a graph of it may have so many dots that you can’t see the pattern. One remedy is to use data moves such as filtering or grouping-and-summarizing (making aggregates), both of which reduce how many dots appear in your plots.
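For readers who think in code rather than in CODAP, here is a minimal sketch of those two data moves in plain Python. The rows and the attribute names (`activity`, `body_part`) are made up for illustration and are not taken from any real dataset.

```python
from collections import Counter

# Made-up rows for illustration only; not real survey data.
cases = [
    {"activity": "basketball", "body_part": "ankle"},
    {"activity": "basketball", "body_part": "knee"},
    {"activity": "soccer", "body_part": "ankle"},
    {"activity": "cycling", "body_part": "wrist"},
    {"activity": "soccer", "body_part": "ankle"},
]

# Filtering: keep only the cases that meet a condition.
ankle_cases = [c for c in cases if c["body_part"] == "ankle"]

# Grouping and summarizing: one aggregate (here, a count) per activity.
counts_by_activity = Counter(c["activity"] for c in cases)

print(len(cases), "cases ->", len(ankle_cases), "after filtering")
print(dict(counts_by_activity))
```

Either way, there are fewer things to plot: a subset of the original dots after filtering, or one summary value per group after aggregating.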
Another kind of awash happens when you have a lot of attributes. You have trouble figuring out what you even want to plot because there are so many choices. Sometimes you want to essentially filter columns in the table, or, alternatively, find a way to combine multiple columns into one. We developed a remedy for the former in work on the Writing Data Stories project in the form of a CODAP plug-in called Choosy that lets you hide columns easily. For the latter, write a formula to combine many columns into one—and then use Choosy to hide the many. (I call this a calculation data move, mutate for those of you in the tidyverse.)
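As a rough sketch of that combine-then-hide idea outside CODAP, the Python below collapses three yes/no columns into a single count and then keeps only the new column. The column names (`wore_helmet` and friends) and the helper `combine_gear` are hypothetical, and the rows are invented.

```python
# Made-up table: three yes/no attributes per case (hypothetical names).
rows = [
    {"id": 1, "wore_helmet": True, "wore_pads": False, "wore_guard": True},
    {"id": 2, "wore_helmet": False, "wore_pads": False, "wore_guard": False},
    {"id": 3, "wore_helmet": True, "wore_pads": True, "wore_guard": True},
]
gear_columns = ["wore_helmet", "wore_pads", "wore_guard"]


def combine_gear(row):
    """Calculation move: collapse several columns into one new value."""
    return sum(1 for col in gear_columns if row[col])


# Add the new column and leave out the originals, which is roughly
# the job that hiding columns does in a table.
slim_rows = [{"id": r["id"], "gear_count": combine_gear(r)} for r in rows]
print(slim_rows)
```

The formula is the calculation move; dropping the original columns afterward is what keeps the list of plotting choices short.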
But today, I saw something I had seen before, but never in quite this way—a third way of being awash.
We were looking at this dataset from the National Health Interview Survey. It’s about injuries sustained at leisure or during sports. And here is a typical kind of CODAP graph some students might make:
I kind of hate this kind of graph. It’s really hard to see anything in it, and it’s hard to use it to help make an argument. There are just too many cells and too many labels to read. Why? Because it’s two categoricals, and each one has a bunch of categories. What’s worse, some novice students seem to like these graphs a lot, maybe because they look very official and complicated. Michelle set me straight about part of this, though: although making sense of the numbers of dots in the cells is a fool’s errand, it’s interesting to see patterns in which cells have anything and which have nothing. So I don’t hate them anymore. They’re not useless.
Still, looking at it, I’m certainly awash, but in this new way: not from too many cases (rows in the data table) or attributes (columns in the table), but from too many categories—which produce rows and columns in the graph.
We don’t have great tools for this. One possibility is to do a (different) calculation data move, recoding the data into fewer categories (here is an example from Awash, so I guess I have thought about this, just not this way). But recoding is not always a good idea, and it may be a bridge too far for the middle-school students who are the target population for the Data Clubs project.
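To make the recoding idea concrete, here is a small Python sketch, not anything CODAP does for you. The mapping in `RECODE` is an illustrative choice, not a recommended grouping, and the list of body parts is made up.

```python
from collections import Counter

# Hypothetical recoding table; these groupings are illustrative choices.
RECODE = {
    "ankle": "lower body", "knee": "lower body", "foot": "lower body",
    "wrist": "upper body", "shoulder": "upper body", "elbow": "upper body",
}


def recode(body_part):
    """Map a detailed category to a broader one; unknowns become 'other'."""
    return RECODE.get(body_part, "other")


# Made-up values, one per case.
body_parts = ["ankle", "knee", "wrist", "head", "foot", "shoulder", "ankle"]
recoded = [recode(p) for p in body_parts]

print(Counter(body_parts))  # six distinct categories
print(Counter(recoded))     # three distinct categories
```

The sketch also shows why recoding is not always a good idea: someone has to decide the groupings, and the detail that gets lumped together (or swept into "other") is no longer visible in the graph.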
So I write this to remind myself of this insight, and maybe to inspire someone to figure out other ways to cope with a surfeit of categories.
4 thoughts on “How can you be awash in data? Let me count the ways.”
I think the graph would be more useful if the categories were sorted by total number of cases instead of alphabetical. That way you can see combinations that deviate from the general trend.
Sorting, as Tessa suggested, is one way to improve the presentation. Another is to provide 1D projections onto each axis (row and column totals).
I’ve no idea what the vertical position of each dot in each cell is intended to convey, so I’ve no suggestions on how to convey whatever that is more clearly. There is also some mysterious color code going on with one blue dot.
Ha! Yeah, sorry, the points in the cells are just piled up, so a higher pile indicates more points. There are so many, they overlap. If the graph were much taller, you would see them as distinct points, one per case.
And the one blue dot had been selected.
One more thing: CODAP does let you put the count in each cell (so there would be 160 numbers on the graph) as well as row, column, or overall percentages. So the (middle-grades) students looking at this can and do make some of these additions. They help, but they don’t cut through this kind of confusion as powerfully as, say, filtering does when you have a large number of cases.
These can overlap, of course! One student decided to focus on ankle injuries, so filtered for those, eliminating all the rest of the categories. This reduced the number of cases and made the body-part categories irrelevant.
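Several suggestions in this thread (cell counts, row and column totals, and ordering categories by total rather than alphabetically) can be sketched together in a few lines of Python. This is only an illustration with made-up pairs, not a description of how CODAP computes anything.

```python
from collections import Counter

# Made-up (activity, body part) pairs; not real survey data.
pairs = [
    ("soccer", "ankle"), ("soccer", "knee"), ("soccer", "ankle"),
    ("cycling", "wrist"), ("basketball", "ankle"), ("basketball", "knee"),
    ("soccer", "head"),
]

cell_counts = Counter(pairs)                # one count per cell
row_totals = Counter(a for a, _ in pairs)   # totals per activity
col_totals = Counter(b for _, b in pairs)   # totals per body part

# Order categories by total, largest first, instead of alphabetically.
activities = [a for a, _ in row_totals.most_common()]
body_parts = [b for b, _ in col_totals.most_common()]

print("columns:", body_parts)
for a in activities:
    cells = [cell_counts.get((a, b), 0) for b in body_parts]
    print(a, cells, "total:", row_totals[a])
```

With the biggest categories first, the busy cells cluster toward one corner of the table, so a combination that breaks the general trend is easier to spot.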