Smelling Like Data Science

(Adapted from an after-dinner panel talk at the opening session of DSET 2017)

Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.

So there ought to be data science education. But what should we teach, and how should we teach it?

Let me address the second question first. There are at least three approaches to take:

  • students use data tools (i.e., pre-data-science)
  • students use data science data products 
  • students do data science

I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:

We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.

That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.

You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these:

Continue reading Smelling Like Data Science

Reflection on 538, Trump, and Bayes

Was the run-up to the recent election an example of failed statistics? Pundits have been saying how bad the polling was. Sure, there might have been some things pollsters could have done better, but consider: FiveThirtyEight, on the morning of the election, gave Trump a 28.6% chance of winning.

And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen all the time.

Prediction by FiveThirtyEight on the morning of election day.

This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.

In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter.  We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.

(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
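For what it's worth, that margin of error is just the usual 95% interval for a sample proportion wearing street clothes. Here is a minimal sketch of the arithmetic, assuming a simple random sample and the normal approximation; the sample size of 600 is my own illustrative number, chosen so the margin comes out near four points, not a figure from any actual poll.

```python
import math

p_hat = 0.51   # the leader's share in a hypothetical 51-49 poll
n = 600        # illustrative sample size (my assumption, not from a real poll)

# 95% margin of error for a sample proportion, normal approximation
moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"margin of error: about {100 * moe:.1f} percentage points")   # roughly 4.0
```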

Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.

My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.

Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.

Okay, Ph.D. students. That’s a good nugget for a dissertation.

Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:

  • Do you take an umbrella?
  • If it doesn’t rain, do you think, “the prediction was wrong”?

The Index of Clumpiness, Part One

1000 points. All random. The colors indicate how close the nearest neighbor is.

There really is such a thing. Some background: The illustration shows a random collection of 1000 dots. Each coordinate (x and y) is a (pseudo-)random number in the range [0, 1) — multiplied by 300 to get a reasonable number of pixels.
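If you want to make a picture like this yourself, here is a minimal sketch in Python with numpy and matplotlib. It is my own reconstruction of the idea (random coordinates, colored by distance to the nearest neighbor), not the code that produced the figure above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()
n = 1000
pts = rng.random((n, 2)) * 300          # pseudo-random coordinates in [0, 300)

# distance from each point to its nearest neighbor
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)         # ignore each point's distance to itself
nearest = dists.min(axis=1)

plt.scatter(pts[:, 0], pts[:, 1], c=nearest, s=8)
plt.colorbar(label="distance to nearest neighbor")
plt.gca().set_aspect("equal")
plt.show()
```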

The point is that we can all see patterns in it. Me, I see curves and channels and little clumps. If they were stars, I’d think the clumps were star clusters, gravitationally bound to each other.

But they’re not. They’re random. The patterns we see are self-deception. This is related to an activity many stats teachers have used, in which the students are to secretly record a set of 100 coin flips, in order, and also make up a set of 100 random coin flips. The teacher returns to the room and can instantly tell which is the real one and which is the fake. It’s a nice trick, but easy: students usually make the coin flips too uniform. There aren’t enough streaks. Real randomness tends to have things that look non-random.
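That "not enough streaks" claim is easy to check by simulation. Here is a sketch that looks at the longest streak in sets of 100 genuine random flips; the numbers in the comments are approximate, and the streak length of 6 is just a convenient cutoff, not anything official.

```python
import numpy as np

rng = np.random.default_rng()

def longest_run(flips):
    """Length of the longest streak of identical outcomes."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# longest streaks in many sets of 100 genuine coin flips
runs = np.array([longest_run(rng.integers(0, 2, 100)) for _ in range(10_000)])
print("mean longest streak:", runs.mean())                   # typically close to 7
print("fraction with a streak of 6+:", (runs >= 6).mean())   # most real sequences have one
```

Faked sequences, by contrast, tend to alternate too eagerly and rarely show a streak that long.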

Here is a snap from a classroom activity:

Continue reading The Index of Clumpiness, Part One

Capture/Recapture Part Two

Trying to get yesterday’s post out quickly, I touched only lightly on how to set up the various simulations. So consider them exercises for the intermediate-level simulation maker. I find it interesting how, right after a semester of teaching this stuff, I still have to stop and think how it needs to work. What am I varying? What distribution am I looking at? What does it represent?

Seeing how the two approaches fit together, yet are so different, helps illuminate why confidence intervals can be so tricky.

Anyway, I promised a Very Compelling Real-Life Application of This Technique. I had thought about talking to fisheries people, but even though capture/recapture is somehow nearly always introduced in a fish context, it doesn’t have to be, of course. Here we go:

Human Rights and Capture/Recapture

I’ve just recently been introduced to an outfit called the Human Rights Data Analysis Group. Can’t beat them for statistics that matter, and I really have to say, a lot of the explanation and writing on their site is excellent. If you’re looking for post-AP ideas, as well as caveats about data for everyone, this is a great place to go.

One of the things they do is try to figure out how many people get killed in various trouble areas and in particular events. You get one estimate from some left-leaning NGO. You get another from the Catholics. Information is hard to get, and lists of the dead are incomplete. So it’s not surprising that different groups get different estimates. Whom do you believe?

Continue reading Capture/Recapture Part Two

Capture/Recapture Part One

Kids doing capture/recapture. From Dan Meyer.

If you’ve been awake and paying attention to stats education, you must have come across capture/recapture and associated classroom activities.

The idea is that you catch 20 fish in a lake and tag them. The next day, you catch 25 fish and note that 5 are tagged. The question is, how many fish are in the lake? The canonical answer is 100: having 5 tagged in the 25 suggests that 1/5 of all fish are tagged; if 20 fish are tagged, then the total number must be 100. Right?

Sort of. After all, we’ve made a lot of assumptions, such as that the fish instantly and perfectly mix, and that when you fish you catch a random sample of the fish in the lake. Not likely. But even supposing that were true, there must be sampling variability: if there were 20 out of 100 tagged, and you catch 25, you will not always catch 5 tagged fish; and then, looking at it the twisted, Bayesian-smelling other way, if you did catch 5, there are lots of other plausible numbers of fish that might be in the lake.

Let’s do those simulations.
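Here is one way to set those up, sketched in Python rather than in whatever tool you prefer; the hypergeometric draw stands in for catching 25 fish without replacement from a well-mixed lake.

```python
import numpy as np

rng = np.random.default_rng()
tagged, caught = 20, 25

# Direction 1: if the lake really holds 100 fish (20 of them tagged), how many
# tagged fish turn up in a catch of 25?
seen = rng.hypergeometric(tagged, 100 - tagged, caught, size=10_000)
print("P(exactly 5 tagged):", (seen == 5).mean())   # the most likely count, but far from certain

# Direction 2: for which lake sizes N is "5 tagged out of 25" plausible?
for N in range(60, 201, 20):
    seen = rng.hypergeometric(tagged, N - tagged, caught, size=10_000)
    print(N, (seen == 5).mean())
```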

Continue reading Capture/Recapture Part One

Talking is so not enough

We’re careening toward the end of the semester in calculus, and I know I’m mostly posting about stats, but this just happened in calc and it applies everywhere.

We’ve been doing related rate problems, and had one of those classic calculus-book problems that involves a cone. Sand is being added to a pile, and we’re given that the radius of the pile is increasing at 3 inches per minute. The current radius is 3 feet; the height is 4/3 the radius; at what rate is sand being added to the pile?
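For reference, here is the textbook computation, assuming the pile really does stay a cone with h = (4/3)r, and converting that 3-foot radius to 36 inches:

```latex
V = \frac{1}{3}\pi r^{2}h = \frac{1}{3}\pi r^{2}\cdot\frac{4}{3}r = \frac{4}{9}\pi r^{3}
\quad\Longrightarrow\quad
\frac{dV}{dt} = \frac{4}{3}\pi r^{2}\,\frac{dr}{dt}
= \frac{4}{3}\pi (36\ \mathrm{in})^{2}(3\ \mathrm{in/min})
= 5184\pi\ \mathrm{in^{3}/min} \approx 9.4\ \mathrm{ft^{3}/min}.
```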

Never mind that no pile of sand is shaped like that—on Earth, anyway. I gave them a sheet of questions about the pile to introduce the angle of repose, etc. I think it’s interesting and useful to be explicitly critical of problems and use that to provoke additional calculation and figuring stuff out. But I digress.

Continue reading Talking is so not enough

Coming (Back) to Our Census

As I reflect on the continuing, unexpected, and frustrating malaise that is Math 102, Probability and Statistics, one of my ongoing problems has been the deterioration of Fathom. It shouldn’t matter that much that we can’t get Census data any more, but I find that I miss it a great deal; and I think that it was a big part of what made stats so engaging at Lick.

So I’ve tried to make it accessible in kinda the same way I did the NHANES data years ago.

This time we have Census data instead of health data. At this page here, you specify what variables you want to download; then you see a 10-case preview of the data to check that it’s what you want; and then you can get up to 1000 cases. I’m drawing them from a 21,000-case extract from the 2013 American Community Survey, all from California. (There are a lot more cases in the file I downloaded; I just took the first 21,000 or so, so we could get an idea of what’s going on.)
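Behind the scenes, the recipe is simple enough to sketch in a few lines of Python with pandas. The file name and column names below are hypothetical placeholders, not the actual ones the page uses, and the real page may not sample in exactly this way.

```python
import pandas as pd

# hypothetical file name for the ~21,000-case extract (2013 ACS, California)
acs = pd.read_csv("acs2013_ca_extract.csv")

# whichever variables the user checked off (hypothetical column names)
wanted = ["Age", "Sex", "MaritalStatus", "TotalIncome"]

print(acs[wanted].head(10))            # the 10-case preview
sample = acs[wanted].sample(n=1000)    # up to 1000 cases, drawn at random here
sample.to_csv("census_sample.csv", index=False)
```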

Continue reading Coming (Back) to Our Census