Fidelity versus Clarity

Thinking about yesterday’s post, I was struck by an idea that may be obvious to many readers, and has doubtless been well-explored, but it was new to me (or I had forgotten it) so here I go, writing to help me think and remember:

The post touched on the notion that communication is an important part of data science, and that simplicity aids in communication. Furthermore, simplification is part of modelmaking.

That is, we look at unruly data with a purpose: to understand some phenomenon or to answer a question. And often, the next step is to communicate that understanding or answer to a client, be it the person who is paying us or just ourselves. “Communicating the understanding” means, essentially, encapsulating what we have found out so that we don’t have to go through the entire process again.

[Figure: Mean height by sex and age; 800 cases aged 5–19. NHANES, 2003.]

So we might boil the data down and make a really cool, elegant visualization. We hold onto that graphic and carry it with us mentally in order to understand the underlying phenomenon. For example, we might keep that graph of mean height by sex and age as an internal idea (a model) of sex differences in human growth.

But every model leaves something out. In this case, we don’t see the spread in heights at each age, and we don’t see the overlap between females and males. So we could go further, and include more data in the graph, but eventually we would get a graph that was so unwieldy that we couldn’t use it to maintain that same ease of understanding. It would require more study every time we needed it. Of course, the appropriate level of detail depends on the context, the stakes, and the audience.
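To make concrete what a graph of means leaves out, here is a minimal Python sketch. The heights are invented for illustration (they are not NHANES values), and the names `heights_cm` and `summarize` are hypothetical.

```python
from statistics import mean, stdev

# Invented heights (cm) for a single age group; NOT real NHANES data.
heights_cm = {
    "female": [150.0, 155.0, 160.0, 165.0, 170.0],
    "male": [152.0, 157.0, 162.0, 167.0, 172.0],
}

def summarize(values):
    """Return the mean plus the details that a means-only graph drops."""
    return {
        "mean": mean(values),
        "sd": stdev(values),  # spread that the mean alone hides
        "min": min(values),
        "max": max(values),
    }

summaries = {sex: summarize(vals) for sex, vals in heights_cm.items()}

# The simple model: one number per group, and a gap between them.
mean_gap = summaries["male"]["mean"] - summaries["female"]["mean"]

# The more faithful model: the two groups' ranges overlap.
overlap = summaries["male"]["min"] < summaries["female"]["max"]

print(mean_gap, overlap)
```

The two means are easy to carry around in your head. The spread and the overlap are more faithful to the data, but they are also more to keep track of.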

So there’s a tradeoff. As we make our analysis more complex, it becomes more faithful to the original data and to the world, but it also becomes harder to understand.

Which suggests this graphic:

[Figure: The data science design tradeoff. As complexity increases, clarity goes down, but fidelity goes up.]


Data Moves and Simplification

or, What I should have emphasized more at NCTM

I’m just back from NCTM 2018 in Washington DC where I gave a brief workshop that introduced ideas in data science education and the use of CODAP to a very nice group in a room that—well, NCTM and the Marriott Marquis were doing their best, but we really need a different way of doing technology at these big conferences.

Anyway: at the end of a fairly wide-ranging presentation in which my main goal was for participants to get their hands dirty—get into the data, get a feel for the tools, have data science on their radar—it was inevitable that I would feel:

  • that I talked too much; and
  • that there were important things I should have said.

Sigh. Let’s address the latter. Here is a take-away I wish I had set up better:

In data science, things are often too complicated. So one step is to simplify things; and some data moves, by their nature, simplify.

Complication is related to being awash in data (see this post…); it can come from the sheer quantity of data as well as things like being multivariate or otherwise just containing a lot of stuff we’re not interested in right now.

To cut through that complication, we often filter or summarize, and to do those, we often group. To give some examples, I will look again at the data that appeared in the cards metaphor post, but with a different slant.
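As a rough sketch of how filtering, grouping, and summarizing fit together, here is a small Python example. The cases are invented (they are not the NHANES data), the field names are hypothetical, and this is only one of many ways to do it.

```python
from collections import defaultdict
from statistics import mean

# Invented cases for illustration; NOT the NHANES data used in the post.
cases = [
    {"sex": "F", "age": 12, "height_cm": 151.0},
    {"sex": "F", "age": 12, "height_cm": 155.0},
    {"sex": "M", "age": 12, "height_cm": 149.0},
    {"sex": "M", "age": 12, "height_cm": 153.0},
    {"sex": "F", "age": 16, "height_cm": 162.0},
    {"sex": "M", "age": 16, "height_cm": 172.0},
    {"sex": "M", "age": 40, "height_cm": 176.0},  # outside the age range
]

# Filter: keep only the cases we care about right now (ages 5 to 19).
young = [c for c in cases if 5 <= c["age"] <= 19]

# Group: collect heights under each (sex, age) combination.
groups = defaultdict(list)
for c in young:
    groups[(c["sex"], c["age"])].append(c["height_cm"])

# Summarize: reduce each group to a single number, its mean height.
mean_height = {key: mean(vals) for key, vals in groups.items()}

for (sex, age), h in sorted(mean_height.items()):
    print(sex, age, h)
```

Each step throws something away: the filter drops cases, the grouping ignores every attribute except sex and age, and the summary replaces many heights with one. That is exactly why these moves simplify.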

Here we go: NHANES data on height, age, and sex. At the end of the process, we will see this graph:

[Figure: Mean height by sex and age; 800 cases aged 5–19. NHANES, 2003.]

And the graph tells a compelling story: boys and girls are roughly the same height—OK, girls are a little taller at ages 10–12—but starting at about age 13, girls’ heights level off, while the boys continue growing for about two more years.

We arrived at this after a bunch of analysis. But how did we start?


Data Moves: the cards metaphor

In the Data Science Games project, we started talking, early, about what we called data moves. We weren’t quite sure what they were exactly, but we recognized some when we did them.

In CODAP, for example (as in Fathom), there is this thing we learn to do: you select points in one graph and, because data selected in CODAP are selected everywhere, the same points are selected in all other graphs, so you can see patterns that were otherwise hidden.

You can use that same selection idea to hide the selected or unselected points, thereby filtering the data that you’re seeing. Anyway, that felt like a data move, a tool in our data toolbox. We could imagine pointing it out to students as a frequently useful action to take.
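Here is a minimal Python sketch of that selection idea. It is not how CODAP is implemented; it only assumes that a selection is a set of case IDs shared by every view. The cases and the `view` helper are invented for illustration.

```python
# Invented cases keyed by ID; the fields are hypothetical.
cases = {
    1: {"age": 8, "height_cm": 128.0, "sex": "F"},
    2: {"age": 12, "height_cm": 152.0, "sex": "M"},
    3: {"age": 16, "height_cm": 163.0, "sex": "F"},
    4: {"age": 17, "height_cm": 175.0, "sex": "M"},
}

# Select in one "graph": the cases taller than 160 cm.
selected = {cid for cid, c in cases.items() if c["height_cm"] > 160.0}

def view(field, hide_unselected=False):
    """Return (case id, value, is_selected) rows for one view of the data."""
    rows = []
    for cid, c in cases.items():
        if hide_unselected and cid not in selected:
            continue  # hiding the unselected cases acts as a filter
        rows.append((cid, c[field], cid in selected))
    return rows

# The same selection shows up in a different view of the same cases...
age_view = view("age")

# ...and hiding the unselected cases filters what that view shows.
filtered_sex_view = view("sex", hide_unselected=True)

print(age_view)
print(filtered_sex_view)
```

Because every view consults the same `selected` set, selecting by height in one place reveals that the tall cases are also the older ones in another.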

I’ve mentioned the idea in a couple of posts because it seemed to me that data moves were characteristic of data science, or at least the proto-data-science that we have been trying to do: we use data moves to make sense of rich data where things can get confusing; we use data moves to help when we are awash in data. In traditional intro stats, you don’t need data moves because you generally are given exactly the data you need.


More about Data Moves—and R

In the previous post (Smelling Like Data Science) we said that one characteristic of doing data science might be the kinds of things you do with data. We called these “data moves,” and they include things such as filtering data, transposing it, or reorganizing it in some way. The moves we’re talking about are not, typically, ones that get covered in much depth, if at all, in a traditional stats course; perhaps we consider them too trivial or beside the point. In stats, we’re more interested in focusing on distribution and variability, or on stats moves such as creating estimates or tests, or even, in these enlightened times, doing resampling and probability modeling.

Instead, the data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.

[Figure: DS Graphic]

This is a sign that there is something to the Venn diagram definitions of data science. That is, the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.

I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.

At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking.