Thinking about yesterday’s post, I was struck by an idea that may be obvious to many readers and has doubtless been well explored. But it was new to me (or I had forgotten it), so here I go, writing to help me think and remember:
The post touched on the notion that communication is an important part of data science, and that simplicity aids in communication. Furthermore, simplification is part of modelmaking.
That is, we look at unruly data with a purpose: to understand some phenomenon or to answer a question. And often, the next step is to communicate that understanding or answer to a client, be it the person who is paying us or just ourselves. “Communicating the understanding” means, essentially, encapsulating what we have found out so that we don’t have to go through the entire process again.
So we might boil the data down and make a really cool, elegant visualization. We hold onto that graphic and carry it with us mentally in order to understand the underlying phenomenon. For example, we might keep that graph of mean height by sex and age as an internal idea (a model) of sex differences in human growth.
But every model leaves something out. In this case, we don’t see the spread in heights at each age, and we don’t see the overlap between females and males. So we could go further, and include more data in the graph, but eventually we would get a graph that was so unwieldy that we couldn’t use it to maintain that same ease of understanding. It would require more study every time we needed it. Of course, the appropriate level of detail depends on the context, the stakes, and the audience.
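Here is a minimal sketch in Python of what a means-only graph leaves out. The heights are invented for illustration (they are not real measurements), and the names `heights`, `summarize`, and `overlap_fraction` are my own, not from any library.

```python
from statistics import mean, stdev

# Hypothetical heights in cm for a single age group; invented for illustration.
heights = {
    "female": [150, 155, 158, 160, 162, 165, 170],
    "male": [152, 157, 161, 164, 166, 170, 176],
}


def summarize(values):
    """Reduce a list of heights to a handful of summary numbers."""
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }


summary = {sex: summarize(values) for sex, values in heights.items()}

# The overlap: females who are taller than the male mean.
# A graph of means alone cannot show this.
male_mean = summary["male"]["mean"]
taller = [h for h in heights["female"] if h > male_mean]
overlap_fraction = len(taller) / len(heights["female"])

for sex, stats in summary.items():
    print(
        f"{sex}: mean={stats['mean']:.1f} sd={stats['sd']:.1f} "
        f"range={stats['min']}-{stats['max']}"
    )
print(f"females taller than the male mean: {overlap_fraction:.0%}")
```

The two means are the "clear" model. The standard deviations, ranges, and overlap fraction are extra detail: more faithful, but each one is another number the reader has to hold in mind.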
So there’s a tradeoff. As we make our analysis more complex, it becomes more faithful to the original data and to the world, but it also becomes harder to understand.
Which suggests this graphic:
I know, this simple model-about-modeling is high on clarity and low on fidelity. But it might help us understand why we do what we do.
For example, a data rube (or a politician) might use a model at the left edge of this spectrum: “boys are taller!” “winter is cold!” “the world is flat!”, providing clarity from the simplest use of the data: gross, undifferentiated averages; anecdote; feelings about how things are. But someone whose nose is too close to the data—who can’t even see the trees for the pine needles on the forest floor—is on the right edge: they see that Esmerelda is taller than Sven, even though she is younger, so they are reluctant to make choices based on a generalization when the details might be important to the individuals.
The tradeoff, then, is to get as much clarity as possible at a minimum cost in fidelity. This comes from good choices in data analysis and representation. Let’s change our diagram so that clarity and fidelity are the axes. Then we might represent better data science as having more of both, as in the next diagram:
In terms of data moves, we want students not only to learn how to do data moves, but also to choose their data moves judiciously. A data analysis process starts in the lower-right corner of this new diagram, with all the data and no clarity. Summarizing and filtering simplify—by smoothing things out, by reducing the number of data points. They generally move us up and to the left in our diagram. The angle depends on how good our choice is: how much clarity do we get from that summary or that specific filter? How much fidelity do we lose? Could we choose a different data move where the tradeoff was better?
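To make "up and to the left" a bit more concrete, here is a rough sketch in Python of the two data moves. Everything in it is an assumption on my part: the records are invented, the function names are hypothetical, and the mean absolute error is only one crude stand-in for "fidelity lost," not a definition of it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (age in years, sex, height in cm); invented for illustration.
records = [
    (10, "F", 138), (10, "F", 142), (10, "M", 137), (10, "M", 141),
    (14, "F", 158), (14, "F", 162), (14, "M", 160), (14, "M", 168),
]


def filter_rows(rows, sex):
    """Filtering: keep only the rows for one sex."""
    return [row for row in rows if row[1] == sex]


def summarize_by_age(rows):
    """Summarizing: replace each age group with its mean height."""
    groups = defaultdict(list)
    for age, _sex, height in rows:
        groups[age].append(height)
    return {age: mean(group) for age, group in groups.items()}


def mean_abs_error(rows, means):
    """Average distance between each real height and the summary value
    that replaces it (a crude proxy for lost fidelity)."""
    return mean([abs(height - means[age]) for age, _sex, height in rows])


all_means = summarize_by_age(records)
girls = filter_rows(records, "F")
girl_means = summarize_by_age(girls)

print(len(records), "points ->", len(all_means), "means; error",
      mean_abs_error(records, all_means))
print(len(girls), "points ->", len(girl_means), "means; error",
      mean_abs_error(girls, girl_means))
```

Each move cuts the number of values we have to look at, and the error figure is a hint at what that simplification cost. Comparing that figure across candidate moves is one way to ask whether a different move would give a better tradeoff.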
The same goes for making visualizations. Different visualizations provide different amounts of clarity at different costs in fidelity. Gapminder graphs, for example, show fairly precise values of many variables (= high fidelity) with remarkable clarity—at least for their specific purposes.
One more musing before we go: I like the words “fidelity” and “clarity” here, but they might not be the best choices. By fidelity I mean being closer to the actual phenomenon, as represented by the data. By clarity I mean being easy to understand. But those are not really the only issues. For example, let’s ask, why would you not use any and all available data in order to make a decision? There are at least three reasons:
- The representation is too hard to understand. Clarity is low. This is what we’ve been talking about. But also:
- It’s too impractical or costly to get the data you need. For example, suppose you want to predict heights of kids for some reason. You can do a pretty good job using their age and sex. You could do a better job if you had their upper arm lengths—but getting that data is as hard as just getting the heights.
- Finally, if you use the data in too much detail, you’re in danger of overfitting. This is where sample size matters: with too few cases, the data are so idiosyncratic that the structure we see is misleading and no longer represents the phenomenon.
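The overfitting point can be sketched in a few lines of Python. This is a toy under stated assumptions, not a real analysis: the ages and heights are invented, chosen so that the effect is easy to see, and `fit_line` and `fit_exact_polynomial` are hypothetical helpers I wrote for the illustration. The "too much detail" model is a polynomial that passes through every training point exactly (Lagrange interpolation); the simple model is a least-squares straight line.

```python
from statistics import mean

# Invented data: five kids to learn from, four more to check against.
train_ages, train_heights = [8, 10, 12, 14, 16], [129, 134, 155, 158, 177]
test_ages, test_heights = [9, 11, 13, 15], [133, 143, 157, 167]


def fit_line(xs, ys):
    """Simple model: ordinary least-squares straight line."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return lambda x: y_bar + slope * (x - x_bar)


def fit_exact_polynomial(xs, ys):
    """Detailed model: the polynomial through every training point
    (Lagrange interpolation), so it reproduces the sample perfectly."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mean_abs_error(model, xs, ys):
    """Average size of the model's miss, in cm."""
    return mean([abs(model(x) - y) for x, y in zip(xs, ys)])


line = fit_line(train_ages, train_heights)
poly = fit_exact_polynomial(train_ages, train_heights)

for name, model in [("line", line), ("exact polynomial", poly)]:
    print(
        f"{name}: train error {mean_abs_error(model, train_ages, train_heights):.2f}, "
        f"test error {mean_abs_error(model, test_ages, test_heights):.2f}"
    )
```

With these made-up numbers, the polynomial matches the five training kids exactly but misses the new kids by more than the straight line does. It has faithfully learned the quirks of a tiny sample rather than the phenomenon.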
Enough! Maybe I can remember this now. See you later. Oh: if you know where Bears of More Brain than I have already talked about this, let me know.
P.S.: I also like the epiphany that simplification is a sign that doing data science with a communication purpose is a kind of modelmaking. Note that this is an excuse for anyone using the Common Core or NGSS to include data science in their curriculum: it’s modeling! (Thanks, Wayne Nirode!)