## Data Moves and Simplification

or, What I should have emphasized more at NCTM

I’m just back from NCTM 2018 in Washington DC where I gave a brief workshop that introduced ideas in data science education and the use of CODAP to a very nice group in a room that—well, NCTM and the Marriott Marquis were doing their best, but we really need a different way of doing technology at these big conferences.

Anyway: at the end of a fairly wide-ranging presentation in which my main goal was for participants to get their hands dirty—get into the data, get a feel for the tools, have data science on their radar—it was inevitable that I would feel:

• that I talked too much; and
• that there were important things I should have said.

Sigh. Let’s address the latter. Here is a take-away I wish I had set up better:

In data science, things are often too complicated. So one step is to simplify things; and some data moves, by their nature, simplify.

Complication is related to being awash in data (see this post…); it can come from the sheer quantity of data, from the data being multivariate, or from the data simply containing a lot of stuff we’re not interested in right now.

To cut through that complication, we often filter or summarize, and to do those, we often group. To give some examples, I will look again at the data that appeared in the cards metaphor post, but with a different slant.
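For readers who think in code, here is a minimal sketch of those three moves (filter, group, summarize) in plain Python. CODAP does all of this without coding; the rows, the field names, and the age cutoffs below are made up for illustration and are not real NHANES values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows in the spirit of NHANES; these are NOT real survey values.
people = [
    {"age": 10, "sex": "F", "height_cm": 141.0},
    {"age": 10, "sex": "M", "height_cm": 139.0},
    {"age": 16, "sex": "F", "height_cm": 162.0},
    {"age": 16, "sex": "F", "height_cm": 160.0},
    {"age": 16, "sex": "M", "height_cm": 173.0},
    {"age": 40, "sex": "M", "height_cm": 176.0},
]

def mean_height_by_age_and_sex(rows, min_age=5, max_age=19):
    # Filter: keep only the ages we care about right now (cutoffs are assumed).
    young = [r for r in rows if min_age <= r["age"] <= max_age]
    # Group: collect heights under each (age, sex) pair.
    groups = defaultdict(list)
    for r in young:
        groups[(r["age"], r["sex"])].append(r["height_cm"])
    # Summarize: one mean per group replaces many individual rows.
    return {key: mean(heights) for key, heights in groups.items()}

summary = mean_height_by_age_and_sex(people)
```

Each step throws something away on purpose: the filter drops the adults, the grouping ignores everything except age and sex, and the summary replaces individuals with a single number per group. That is the simplification.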

Here we go: NHANES data on height, age, and sex. At the end of the process, we will see this graph:

And the graph tells a compelling story: boys and girls are roughly the same height—OK, girls are a little taller at ages 10–12—but starting at about age 13, girls’ heights level off, while the boys continue growing for about two more years.

We arrived at this after a bunch of analysis. But how did we start?

## Data Moves: the cards metaphor

In the Data Science Games project, we started talking, early, about what we called data moves. We weren’t quite sure what they were exactly, but we recognized some when we did them.

In CODAP, for example (as in Fathom), we learn to select points in one graph. Because selecting data in CODAP selects it everywhere, the same points are highlighted in all other graphs, and you can see patterns that were otherwise hidden.

You can use that same selection idea to hide the selected or unselected points, thereby filtering the data that you’re seeing. Anyway, that felt like a data move, a tool in our data toolbox. We could imagine pointing it out to students as a frequently useful action to take.

I’ve mentioned the idea in a couple of posts because it seemed to me that data moves were characteristic of data science, or at least the proto-data-science that we have been trying to do: we use data moves to make sense of rich data where things can get confusing; we use data moves to help when we are awash in data. In traditional intro stats, you don’t need data moves because you generally are given exactly the data you need.

## Trees. And. Diagnosis. (Live!)

I’ve been invited to give a webinar about our work on trees; it will include material from the previous two posts.

Here’s the blurb:

Data, Decisions, and Trees

We often say that we want to make decisions “based on data.” What does that really mean? We’ll look at a simple approach to data-based decision-making using a representation we might not use every day: the tree. In this webinar, you’ll use data to make trees, and then use the trees to diagnose diseases.

On the surface, trees are very simple. But for some reason — perhaps because we’re less familiar with using trees — people (and by that we mean us) have more trouble than we expect. Anticipate having a couple of “wait a second, let me think about this!” moments.

## Trees. And. Diagnosis. (Part two)

Last time we introduced decision trees and a tool we’ve made to explore them. With that tool, embedded in a simple game (Arbor), you can generate data from alien creatures with a simulated malady, figure out its predictors, and make a decision tree that will let you automate its diagnosis. (Here is the link to that not-quite-game.)

Your job was to get through the diseases ague and botulosis. Today I want to reflect on those two scenarios.

## Ague

Ague is ridiculously simple, and with that ridiculous simplicity, the user is supposed to be able to learn the basics of the game, that is, how to “drive” the tools. One way to figure out the disease is to sort the table by health and see what matches health. Here is what the sorted table looks like:

Just scanning the various columns, you can see that health is associated with hair color.  Pink means sick, blue means well. With that insight, you can go on to diagnose individual creatures and then make a simple tree, which looks like this:

Although there is a lot of information in the tree, users can generally figure it out. If they (or you) have trouble, they can get additional information by hovering over the boxes or the links.
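If you would rather see the ague logic spelled out, here is a rough Python sketch of "scan the columns and see what matches health," followed by the one-question tree. This is not how Arbor works internally; the creatures, the `eyes` attribute, and the function names are all invented for illustration. Only the pink-means-sick rule comes from the scenario.

```python
# Hypothetical creatures; attribute names and values are made up.
creatures = [
    {"hair": "pink", "eyes": "green", "health": "sick"},
    {"hair": "blue", "eyes": "green", "health": "well"},
    {"hair": "pink", "eyes": "brown", "health": "sick"},
    {"hair": "blue", "eyes": "brown", "health": "well"},
]

def perfect_predictors(rows, target="health"):
    """Return attributes whose every value goes with exactly one target value."""
    found = []
    attributes = [k for k in rows[0] if k != target]
    for attr in attributes:
        outcomes = {}
        for r in rows:
            # Record which health values show up alongside each attribute value.
            outcomes.setdefault(r[attr], set()).add(r[target])
        if all(len(seen) == 1 for seen in outcomes.values()):
            found.append(attr)
    return found

def diagnose_ague(creature):
    # The one-question tree: pink hair means sick, otherwise well.
    return "sick" if creature["hair"] == "pink" else "well"
```

On this tiny table, `perfect_predictors(creatures)` picks out hair and rejects eyes, which is the same conclusion you reach by sorting and scanning.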

## Trees. And. Diagnosis. (Part one)

(This is part one. Link to part two.)

In the Data Science Games project, we have recently been exploring decision trees. It’s been great fun, and it’s time to post about it so you dear readers (all three or so of you) can play as well. There is even a working online not-quite-game you can play, and its URL will probably endure even as the software gets upgraded, so in a year it might even still work.

Here’s the genesis of all this: my German colleague Laura Martignon has been doing research on trees and learning, related to work by Gerd Gigerenzer at the Harding Center for Risk Literacy. A typical context is that of a doctor making a diagnosis. The doctor asks a series of questions; each question gets a binary, yes-no answer, which leads either to a diagnosis or a further question. The diagnosis could be either positive (the doc thinks you have the disease) or negative (the doc thinks you don’t).

The risk comes in because the doctor might be wrong. The diagnosis could be a false positive or a false negative. Furthermore, these two forms of failure are generally not equivalent.

Anyway, you can represent the sequence of questions as a decision tree, a kind of flowchart to follow as you diagnose a patient. And it’s a special kind of tree: all branchings are binary—there are always two choices—and all of the ends—the leaves, the “terminal nodes”—are one of two types: positive or negative.
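One way to picture that special kind of tree is as a small data structure. The sketch below is an assumption about how you might represent it in Python, not a description of our software; the class names and the fever-and-rash symptoms are invented.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    positive: bool  # True means "the doc thinks you have the disease"

@dataclass
class Question:
    ask: Callable[[dict], bool]        # a yes/no question about one patient
    if_yes: Union["Question", Leaf]    # where to go on "yes"
    if_no: Union["Question", Leaf]     # where to go on "no"

def diagnose(node, patient):
    # Follow the yes/no answers until we land on a leaf.
    while isinstance(node, Question):
        node = node.if_yes if node.ask(patient) else node.if_no
    return node.positive

# Hypothetical two-question tree; the symptoms are invented.
tree = Question(
    ask=lambda p: p["fever"],
    if_yes=Question(ask=lambda p: p["rash"], if_yes=Leaf(True), if_no=Leaf(False)),
    if_no=Leaf(False),
)
```

Every branching has exactly two exits and every leaf is either positive or negative, which is all the structure the definition asks for.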

The task is to design the tree. There are fancy ways (such as CART and Random Forest) to do this automatically using machine learning techniques. These techniques use a “training set”—a collection of cases where you know the correct diagnosis—to produce the tree according to some optimization criteria (such as how bad false positives and false negatives are relative to one another).  So it’s a data science thing.

But in data science education, a question arises: what if you don’t really understand what a tree is? How can you learn?

That’s where our game comes in. It lets you build trees by hand, starting with simple situations. Your trees will not in general be optimal, but that’s not the point. You get to mess around with the tree and see how well it works on the training set, using whatever criteria you like to judge the tree. Then, in the game, you can let the tree diagnose a fresh set of cases and see how it does.
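To make "see how well it works on the training set" concrete, here is a hedged sketch of one possible scoring scheme: count the four outcomes, then weight the two kinds of mistake. The function name, the training cases, and the cost weights are arbitrary assumptions; the game lets you use whatever criteria you like.

```python
def score(predict, cases, fn_cost=5.0, fp_cost=1.0):
    """Count the four outcomes and a weighted cost (the weights are arbitrary)."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for features, truly_sick in cases:
        said_sick = predict(features)
        if said_sick and truly_sick:
            counts["TP"] += 1
        elif said_sick:
            counts["FP"] += 1   # false positive: said sick, actually well
        elif truly_sick:
            counts["FN"] += 1   # false negative: said well, actually sick
        else:
            counts["TN"] += 1
    cost = fn_cost * counts["FN"] + fp_cost * counts["FP"]
    return counts, cost

# Hypothetical training set: (features, known diagnosis).
training = [
    ({"fever": True}, True),
    ({"fever": True}, False),
    ({"fever": False}, True),
    ({"fever": False}, False),
    ({"fever": False}, False),
]

# A one-question "tree": diagnose positive whenever there is a fever.
counts, cost = score(lambda f: f["fever"], training)
```

Changing `fn_cost` relative to `fp_cost` is the crude version of saying that the two forms of failure are not equivalent.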

That’s enough for now. Your job is to play around with the tool. It will look like this to start:

The first few scenarios are designed so that it’s possible to make perfect diagnoses. No false positives, no false negatives. So it’s all about logic, and not about risk or statistics. But even that much is really interesting. As you mess around, think about the representation, and how amazingly hard it can be to think about what’s going on.

There are instructions on the left in the tan-colored “tile” labeled ArborWorkshop. Start with those. There is also a help panel in the tree tile on the right. It may not be up to date. All of the software is under development.

The first disease scenario, ague, is very simple. The next one, botulosis, is almost as simple, and worth reflecting on. That will happen soon, I hope after you have tried it.

Note: if you are unfamiliar with this platform, CODAP, go to the link, then to the “hamburger” menu. Upper left. Choose New. Then Open Document or Browse Examples. Then Getting Started with CODAP. That should be enough for now.

## A Calculus Rant (with stats at the end)

Let’s look at a simple optimization problem. Bear with me, because the point is not the problem itself, but in what we have to know and do in order to solve it. Here we go:

Suppose you have a string 12 cm long. You form it into the shape of a rectangle. What shape gives you the maximum area?

Traditionally, how do we expect students to solve this in a calculus class? Here is one of several approaches, in excruciating detail: Continue reading A Calculus Rant (with stats at the end)
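For reference, here is a compressed version of one common textbook route. It may not match, step for step, the approach the full post walks through. Call the sides $x$ and $y$ (in cm); the symbols are my own choice.

$$
\begin{aligned}
2x + 2y &= 12 \quad\Rightarrow\quad y = 6 - x, \qquad 0 < x < 6 \\
A(x) &= x\,(6 - x) = 6x - x^{2} \\
A'(x) &= 6 - 2x = 0 \quad\Rightarrow\quad x = 3 \\
A''(x) &= -2 < 0 \quad\text{(so this is a maximum)} \\
y &= 6 - 3 = 3, \qquad A(3) = 9\ \text{cm}^2
\end{aligned}
$$

So the rectangle with the largest area is the 3 cm by 3 cm square. Notice how much machinery even the short version needs: a constraint, a substitution, a derivative, and a second-derivative check.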

## Dare I Even Suggest…?

If you’ve been following along, and reading my mind over the past six months while I have been mostly not posting, you know I’m thinking a lot about data science education (as opposed to data science). In particular, I wonder what sorts of things we could do in K–12, especially in high school, to help students think like data scientists.

To that end, the good people at The Concord Consortium are hosting a webinar series. And I’m hosting the third of these sessions on Tuesday, July 25, at 9 AM Pacific time.

The main thing I’d like to do is to present some of our ideas about “data moves”—things students can learn to do with data that tend not to be taught in a statistics class, or anywhere, but might be characteristic of the sorts of things that underpin data science ideas—and let you, the participants, actually do them. Then we can discuss what happened and see whether you think these really do “smell like” data science, or not.

You could also think of this as trying to decide whether using some of these data skills, such as filtering a data set, or reorganizing its hierarchy, might also be examples of computational thinking.
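Here is a small sketch of what "reorganizing its hierarchy" might look like if you did write code: the same flat records nested under one attribute, then under another. In CODAP you would do this by dragging an attribute, not by programming. The records, the field names, and the `promote` function are all hypothetical.

```python
# Hypothetical flat records; station names and rider counts are invented.
flat = [
    {"station": "A", "hour": 8, "riders": 120},
    {"station": "A", "hour": 9, "riders": 90},
    {"station": "B", "hour": 8, "riders": 40},
]

def promote(rows, parent):
    """Nest rows under each value of `parent`, as if it were a parent level."""
    nested = {}
    for r in rows:
        # The child keeps every field except the one promoted to parent.
        child = {k: v for k, v in r.items() if k != parent}
        nested.setdefault(r[parent], []).append(child)
    return nested

by_station = promote(flat, "station")   # one group per station
by_hour = promote(flat, "hour")         # same data, one group per hour
```

Nothing about the data changes between `by_station` and `by_hour`; only the organization does, and with it the questions that are easy to ask.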

The webinar (my first ever, crikey) is free, of course, and we will use CODAP, the Common Online Data Analysis Platform, which is web-based and also free and brought to you by Concord and by you, the taxpayer. Thanks, NSF!

We’ll explore data from NHANES, a national health survey, and from BART, the Bay Area Rapid Transit District. And whatever else I shoehorn in as I plan over the next day.

## More about Data Moves—and R

In the previous post (Smelling Like Data Science) we said that one characteristic of doing data science might be the kinds of things you do with data. We called these “data moves,” and they include things such as filtering data, transposing it, or reorganizing it in some way. The moves we’re talking about are not, typically, ones that get covered in much depth, if at all, in a traditional stats course; perhaps we consider them too trivial or beside the point. In stats, we’re more interested in focusing on distribution and variability, or on stats moves such as creating estimates or tests, or even, in these enlightened times, doing resampling and probability modeling.

Instead, the data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.

This is a sign that there is something to the Venn diagram definitions of data science. That is, the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.

I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.

At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking.

## Smelling Like Data Science

(Adapted from a panel after-dinner talk in the opening session of DSET 2017)

Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.

So there ought to be data science education. But what should we teach, and how should we teach it?

Let me address the second question first. There are at least three approaches to take:

• students use data tools (i.e., pre-data-science)
• students use data science data products
• students do data science

I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:

We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.

That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.

You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these: Continue reading Smelling Like Data Science

## Reflection on 538, Trump, and Bayes

Was the run-up to the recent election an example of failed statistics? Pundits have been saying how bad the polling was. Sure, there might have been some things pollsters could have done better, but consider: FiveThirtyEight, on the morning of the election, gave Trump a 28.6% chance of winning.

And things with a probability of 1 in 4 (or, in this case, 2 in 7:) happen all the time.

This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.

In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter.  We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.

(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
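As a reminder of where a margin of error like that comes from, here is a minimal sketch using the usual normal approximation for a proportion. The sample size of 600 is my assumption, chosen because it gives roughly a 4-point margin; real polls adjust for design and weighting, which this ignores.

```python
from math import sqrt

def margin_of_error(p_hat, n, z=1.96):
    # Normal-approximation half-width of a 95% interval for a proportion.
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 51% support among 600 respondents (sample size is assumed).
p_hat, n = 0.51, 600
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe   # range of plausible values
```

With these numbers the plausible range runs from about 47% to about 55%. It includes 50%, which is what the pundits mean by a dead heat, even though most of the range sits above it.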

Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.

My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.

Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.

Okay, Ph.D. students. That’s a good nugget for a dissertation.

Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:

• Do you take an umbrella?
• If it doesn’t rain, do you think, “the prediction was wrong?”