In the Data Science Games project, we started talking, early, about what we called data moves. We weren’t quite sure what they were exactly, but we recognized some when we did them.
In CODAP, for example (like in Fathom), there is this thing we learn to do where you select points in one graph and, since when you select data in CODAP, the data are selected everywhere, the same points are selected in all other graphs—and you can see patterns that were otherwise hidden.
You can use that same selection idea to hide the selected or unselected points, thereby filtering the data that you’re seeing. Anyway, that felt like a data move, a tool in our data toolbox. We could imagine pointing them out to students as a frequently-useful action to take.
I’ve mentioned the idea in a couple of posts because it seemed to me that data moves were characteristic of data science, or at least the proto-data-science that we have been trying to do: we use data moves to make sense of rich data where things can get confusing; we use data moves to help when we are awash in data. In traditional intro stats, you don’t need data moves because you generally are given exactly the data you need.
In the previous post (Smelling Like Data Science) we said that one characteristic of doing data science might be the kinds of things you do with data. We called these “data moves,” and they include things such as filtering data, transposing it, or reorganizing it in some way. The moves we’re talking about are not, typically, ones that get covered in much depth, if at all, in a traditional stats course; perhaps we consider them too trivial or beside the point. In stats, we’re more interested in focusing on distribution and variability, or on stats moves such as creating estimates or tests, or even, in these enlightened times, doing resampling and probability modeling.
Instead, the data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.
This is a sign that there is something to the Venn diagram definitions of data science. That is, it seems that the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.
I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.
At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking. Continue reading More about Data Moves—and R
(Adapted from a panel after-dinner talk for the in the opening session to DSET 2017)
Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.
So there ought to be data science education. But what should we teach, and how should we teach it?
Let me address the second question first. There are at least three approaches to take:
students use data tools (i.e., pre-data-science)
students use data science data products
students do data science
I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:
We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.
That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.
Was the run-up to the recent election an example of failed statistics? Pundits have been saying how bad the polling was. Sure, there might have been some things pollsters could have done better, but consider: FiveThirtyEight, on the morning of the election, gave Trump a 28.6% chance of winning.
And things with a probability of 1 in 4 (or, in this case, 2 in 7:) happen all the time.
This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.
In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.
(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.
My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.
Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.
Okay, Ph.D. students. That’s a good nugget for a dissertation.
Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:
Do you take an umbrella?
If it doesn’t rain, do you think, “the prediction was wrong?”
There really is such a thing. Some background: The illustration shows a random collection of 1000 dots. Each coordinate (x and y) is a (pseudo-)random number in the range [0, 1) — multiplied by 300 to get a reasonable number of pixels.
The point is that we can all see patterns in it. Me, I see curves and channels and little clumps. If they were stars, I’d think the clumps were star clusters, gravitationally bound to each other.
But they’re not. They’re random. The patterns we see are self-deception. This is related to an activity many stats teachers have used, in which the students are to secretly record a set of 100 coin flips, in order, and also make up a set of 100 random coin flips. The teacher returns to the room and can instantly tell which is the real one and which is the fake. It’s a nice trick, but easy: students usually make the coin flips too uniform. There aren’t enough streaks. Real randomness tends to have things that look non-random.
Trying to get yesterday’s post out quickly, I touched only lightly on how to set up the various simulations. So consider them exercises for the intermediate-level simulation maker. I find it interesting how, right after a semester of teaching this stuff, I still have to stop and think how it needs to work. What am I varying? What distribution am I looking at? What does it represent?
Seeing how the two approaches fit together, yet are so different, helps illuminate why confidence intervals can be so tricky.
Anyway, I promised a Very Compelling Real-Life Application of This Technique. I had thought about talking to fisheries people, but even though capture/recapture somehow is nearly always introduced in a fish context, of course it doesn’t have to be. Here we go:
Human Rights and Capture/Recapture
I’ve just recently been introduced to an outfit called the Human Rights Data Analysis Group. Can’t beat them for statistics that matter, and I really have to say, a lot of the explanations and writing on their site is excellent. If you’re looking for Post-AP ideas, as well as caveats about data for everyone, this is a great place to go.
One of the things they do is try to figure out how many people get killed in various trouble areas and in particular events. You get one estimate from some left-leaning NGO. You get another from the Catholics. Information is hard to get, and lists of the dead are incomplete. So it’s not surprising that different groups get different estimates. Whom do you believe?