Dare I Even Suggest…?

April 15, 2015: BART ridership between Rockridge and Embarcadero stations, by hour.

If you’ve been following along, and reading my mind over the past six months while I have been mostly not posting, you know I’m thinking a lot about data science education (as opposed to data science). In particular, I wonder what sorts of things we could do at K–12, especially in high school, to help students think like data scientists.

To that end, the good people at The Concord Consortium are hosting a webinar series. And I’m hosting the third of these sessions on Tuesday, July 25, at 9 AM Pacific time.

Click this link to go to EventBrite to tell us you’re coming.

The main thing I’d like to do is to present some of our ideas about “data moves”—things students can learn to do with data that tend not to be taught in a statistics class, or anywhere, but might be characteristic of the sorts of things that underpin data science ideas—and let you, the participants, actually do them. Then we can discuss what happened and see whether you think these really do “smell like” data science, or not.

You could also think of this as trying to decide whether using some of these data skills, such as filtering a data set, or reorganizing its hierarchy, might also be examples of computational thinking.

The webinar (my first ever, crikey) is free, of course, and we will use CODAP, the Common Online Data Analysis Platform, which is web-based and also free and brought to you by Concord and by you, the taxpayer. Thanks, NSF!

We’ll explore data from NHANES, a national health survey, and from BART, the Bay Area Rapid Transit District. And whatever else I shoehorn in as I plan over the next day.

More about Data Moves—and R

In the previous post (Smelling Like Data Science) we said that one characteristic of doing data science might be the kinds of things you do with data. We called these “data moves,” and they include things such as filtering data, transposing it, or reorganizing it in some way. The moves we’re talking about are not, typically, ones that get covered in much depth, if at all, in a traditional stats course; perhaps we consider them too trivial or beside the point. In stats, we’re more interested in focusing on distribution and variability, or on stats moves such as creating estimates or tests, or even, in these enlightened times, doing resampling and probability modeling.

Instead, the data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.

This is a sign that there is something to the Venn diagram definitions of data science. That is, the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.
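For readers who do code, here is a minimal sketch of what two of these moves (filtering, and reorganizing a flat table into groups) might look like. The records are made up purely for illustration and are not real BART data; the helper names `filter_rows` and `group_by` are hypothetical, not part of CODAP or any library.

```python
from collections import defaultdict

# Made-up illustrative records (NOT real BART data): one row per station-hour.
rides = [
    {"station": "Rockridge", "hour": 8, "riders": 120},
    {"station": "Rockridge", "hour": 17, "riders": 40},
    {"station": "Embarcadero", "hour": 8, "riders": 30},
    {"station": "Embarcadero", "hour": 9, "riders": 50},
    {"station": "Embarcadero", "hour": 17, "riders": 200},
]

def filter_rows(rows, predicate):
    """Data move 1: keep only the rows that satisfy a condition."""
    return [row for row in rows if predicate(row)]

def group_by(rows, key):
    """Data move 2: reorganize a flat table into groups that share a value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Filter to the morning hours, then group what is left by station.
morning = filter_rows(rides, lambda row: row["hour"] < 12)
by_station = group_by(morning, "station")

# Summarize each group: total morning riders per station.
totals = {
    station: sum(row["riders"] for row in rows)
    for station, rows in by_station.items()
}
print(totals)
```

Neither move involves any statistics in the usual sense. Each one only changes which rows we are looking at and how they are organized.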

I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.

At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking.

Smelling Like Data Science

(Adapted from a panel after-dinner talk in the opening session of DSET 2017)

Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.

So there ought to be data science education. But what should we teach, and how should we teach it?

Let me address the second question first. There are at least three approaches to take:

  • students use data tools (i.e., pre-data-science)
  • students use data science data products 
  • students do data science

I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:

We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.

That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.

You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these: Continue reading Smelling Like Data Science

Reflection on 538, Trump, and Bayes

Was the run-up to the recent election an example of failed statistics? Pundits have been saying how bad the polling was. Sure, there might have been some things pollsters could have done better, but consider: FiveThirtyEight, on the morning of the election, gave Trump a 28.6% chance of winning.

And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen all the time.

Prediction by FiveThirtyEight on the morning of election day.

This post is not about what the pollsters could have done better, but rather about a different question: how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.

In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.

(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
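To see where a figure like 4% might come from, here is a minimal sketch of the textbook margin-of-error calculation for a single proportion. It assumes a simple random sample and a 95% confidence level (z = 1.96). The sample size of 600 is a hypothetical number chosen for illustration, not taken from any actual poll.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Textbook margin of error for a sample proportion.

    Assumes a simple random sample of size n; z=1.96 corresponds
    to a 95% confidence level.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

n = 600          # hypothetical sample size, for illustration only
p_hat = 0.51     # the leading candidate's share in the example above
moe = margin_of_error(p_hat, n)

low, high = p_hat - moe, p_hat + moe
print(f"margin of error: {moe:.3f}")
print(f"plausible range for the true proportion: {low:.3f} to {high:.3f}")
```

With these assumed numbers the margin comes out at about 0.04, so the range of plausible values for the leader's share runs from roughly 47% to 55%. That range includes values below 50%, which is the sense in which a 51–49 lead is not decisive.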

Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.

My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.

Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.

Okay, Ph.D. students. That’s a good nugget for a dissertation.

Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:

  • Do you take an umbrella?
  • If it doesn’t rain, do you think, “the prediction was wrong?”

Modeling Hexnut Mass

Let me encourage you to go to your hardware store and get some hexnuts. You won’t regret it. Now let’s see if I can write a post about it in under, like, four hours.

(Also, get a micrometer on eBay and a sweet 0.1 gram food scale. They’re about $15 now.)

Long ago, I wrote about coins and said I would write about hexnuts. I wrote a book chapter, but never did the post. So here we go. What prompted me was thinking about different kinds of models.

I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?

A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The more-or-less cubic part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge, namely, the hexnuts are geometrically similar, and they’re made from the same material, so they’ll have the same density.
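One way to do the "we should check" part is to fit a power law on log-log axes and look at the exponent. The sketch below shows only the mechanics. The sizes and the constant `k` are hypothetical, and the masses are generated from an exact cube law rather than measured, so the fit recovers 3 by construction. With real hexnuts the exponent would be whatever the data say.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    If y = k * x**p, this slope is the exponent p.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return sxy / sxx

# Hypothetical sizes (mm) and a hypothetical constant k (grams per mm^3).
# The masses are NOT measurements: they follow an exact cube law, which is
# what "geometrically similar, same density" would predict.
sizes_mm = [6, 8, 10, 12, 16]
k = 0.004
masses_g = [k * s ** 3 for s in sizes_mm]

exponent = loglog_slope(sizes_mm, masses_g)
print(f"fitted exponent: {exponent:.3f}")
```

Replacing `masses_g` with weights from an actual scale turns this into the real check: an exponent noticeably different from 3 would suggest that one of the assumptions (similar shapes, same material) does not hold.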

DASL Updated. Mostly improved.

Smoking and cancer graph.
Data from DASL, graph from CODAP. LUNG is lung cancer deaths per 100,000. CIG is number of cigarettes sold (hundreds per person). Data from 1960.

The Data and Story Library, originally hosted at Carnegie-Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.

Here is the link. If you teach stats, make a bookmark: http://dasl.datadesk.com/

The site includes scores of data sets organized by content topic (e.g., sports, the environment) and by statistical technique (e.g., linear regression, ANOVA). It also includes famous data sets such as Hubble’s data on the radial velocity of distant galaxies.

One small hitch for Fathom users:

In the old days of DASL, you would simply drag the URL mini-icon from the browser’s address field into the Fathom document and amaze your friends with how Fathom parsed the page and converted the table of data on the web page into a table in Fathom. Ah, progress! The snazzy new and more sophisticated format for DASL puts the data inside a scrollable field — and as a result, the drag gesture no longer works in DASL.

Fear not, though: @gasstationwithoutpumps (comment below) realized you could drag the download button directly into Fathom. Here is a picture of a button on a typical DASL “datafile” page. Just drag it over your Fathom document and drop:


In addition, here are two workarounds:

Plan A:

  • Place your cursor in that scrollable box. Select All. Copy.
  • Switch to Fathom. Create a new, empty collection by dragging the collection icon off the shelf.
  • With that empty collection selected, Paste. Done!

Plan B:

  • Use their Download button to download the .txt file.
  • Drag that file into your Fathom document.

Note: Plan B works for CODAP as well.

Model Shop! One volume done!

Hooray, I have finally finished what used to be called EGADs and is now the first volume of The Model Shop. Calling it the first volume is, of course, a treacherous decision.

So. This is a book of 42 activities that connect geometry to functions through data. There are a lot of different ways to describe it, and in the course of finishing the book, the emotional roller-coaster took me from great pride in what a great idea this was to despair over how incredibly stupid I’ve been.

I’m obviously too close to the project.

For an idea of what drove some of the book, check out the posts on the “Chord Star.”

But you can also see the basic idea in the book cover. See the spiral made of triangles? Imagine measuring the hypotenuses of those triangles, and plotting the lengths as a function of “triangle number.” That’s the graph you see. What’s a good function for modeling that data?

If we’re experienced in these things, we say, oh, it’s exponential, and the base of the exponent is the square root of 2. But if we’re less experienced, there are a lot of connections to be made.

We might think it looks exponential, and use sliders to fit a curve (for example, in Desmos or Fathom; here is a Desmos document with the data you can play with!) and discover that the base is close to 1.4. Why should it be 1.4? Maybe we notice that if we skip a triangle, the size seems to double. And that might lead us to think that 2 is involved, and gradually work out that root 2 will help.

Or we might start geometrically, and reason about similar triangles. And from there gradually come to realize that the a/b = c/d trope we’ve used for years, in this situation, leads to an exponential function, which doesn’t look at all like setting up a proportion.

In either case, we get to make new connections about parts of math we’ve been learning about, and we get to see that (a) you can find functions that fit data and (b) often, there’s a good, underlying, understandable reason why that function is the one that works.
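Both routes can be checked numerically. The sketch below assumes the spiral is built from isosceles right triangles, each using the previous triangle's hypotenuse as its leg, starting from a leg of length 1. That construction is my reading of the cover, not something stated above, and the function name `hypotenuses` is just for illustration.

```python
import math

def hypotenuses(count, first_leg=1.0):
    """Hypotenuse lengths for a spiral of isosceles right triangles.

    Assumption: each new triangle uses the previous hypotenuse as its leg.
    """
    lengths = []
    leg = first_leg
    for _ in range(count):
        hyp = math.hypot(leg, leg)   # sqrt(leg**2 + leg**2)
        lengths.append(hyp)
        leg = hyp
    return lengths

hyps = hypotenuses(8)

# Ratio between neighbors: the base of the exponential.
ratios = [b / a for a, b in zip(hyps, hyps[1:])]

# Ratio when we skip a triangle: the "seems to double" observation.
skip_ratios = [c / a for a, c in zip(hyps, hyps[2:])]

# Closed form under the same assumption: hypotenuse number n is sqrt(2)**n.
model = [math.sqrt(2) ** n for n in range(1, len(hyps) + 1)]

print([round(r, 4) for r in ratios])
print([round(r, 4) for r in skip_ratios])
```

Under this assumption every neighbor ratio is the square root of 2 and every skip-one ratio is 2, which is the link between the slider value near 1.4 and the doubling we noticed.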

I will gradually enhance the pages on the eeps site to give more examples. And of course you can buy the book on Amazon! Just click the cover image above.