Ack! I don’t have time to do justice to this right now, but any readers need to know if you don’t already that the geniuses at Desmos seem to be making a matrix calculator: https://www.desmos.com/matrix.
Having read that, you might rightly say, I can’t get to everything in my curriculum as it is, why are you bringing up matrices? (You might also say, Tim, I thought you were a data guy, what does this have to do with data?)
Let me address that first question (and forget the second): I’m about to go do a week of inservice in a district that, for reasons known only to them, have put matrices in their learning goals for high-school math. Their goal seems to be to learn procedures for using matrices to solve systems of linear equations.
I look at that and think, surely there are more interesting things to do with matrices. And there are!
Long ago (2007) Bryan Cooley and I wrote a set of physics labs; in one of them we had students bounce a ping-pong ball. You know the sound; it’s like this:
For the lab, we had students record the sound at 1000 points per second using a Vernier microphone. Using the resulting data, students could identify the times of the “pocks” and then see how the times between the pocks — the “interpock intervals” — decreased exponentially. This is a cool take on the old Algebra 2/Precalculus activity about bouncing balls where you measure drop heights; using sound and the technology, you can get more bounces and more accuracy.
A typical graph of the sound looks like this:
And a graph of the interpock intervals looks like this:
The concentration of CO2 in the atmosphere is rising, and we have good data on that from, among other sources, atmospheric measurements that have been taken near the summit of Mauna Loa, in Hawaii, for decades.
Each of the 726 dots in the graph represents the average value for one month of data.
What do we have to do—what data moves can we make—to make better sense of the data? One thing that any beginning stats person might do is to fit a line to the data. I won’t do that here, but you can imagine what happens: the data curve upward, so the line is a poor model, but the positive slope of the line (about 1.5, which is in ppm per year) is a useful average rate of increase over the interval we’re looking at. You could consider fitting a curve, or a sequence of line segments, but we won’t do that either.
Instead, let’s point out that the swath of points is wide. There are lots of overlapping points. We should zoom in and see if there is a pattern.
I’m just back from NCTM 2018 in Washington DC where I gave a brief workshop that introduced ideas in data science education and the use of CODAP to a very nice group in a room that—well, NCTM and the Marriott Marquis were doing their best, but we really need a different way of doing technology at these big conferences.
Anyway: at the end of a fairly wide-ranging presentation in which my main goal was for participants to get their hands dirty—get into the data, get a feel for the tools, have data science on their radar—it was inevitable that I would feel:
that I talked too much; and
that there were important things I should have said.
Sigh. Let’s address the latter. Here is a take-away I wish I had set up better:
In data science, things are often too complicated. So one step is to simplify things; and some data moves, by their nature, simplify.
Complication is related to being awash in data (see this post…); it can come from the sheer quantity of data as well as things like being multivariate or otherwise just containing a lot of stuff we’re not interested in right now.
To cut through that complication, we often filter or summarize, and to do those, we often group. To give some examples, I will look again at the data that appeared in the cards metaphor post, but with a different slant.
Here we go: NHANES data on height, age, and sex. At the end of the process, we will see this graph:
And the graph tells a compelling story: boys and girls are roughly the same height—OK, girls are a little taller at ages 10–12—but starting at about age 13, girls’ heights level off, while the boys continue growing for about two more years.
We arrived at this after a bunch of analysis. But how did we start?
Last time we introduced decision trees and a tool we’ve made to explore them. With that tool, embedded in a simple game (Arbor), you can generate data from alien creatures with a simulated malady, figure out its predictors, and make a decision tree that will let you automate its diagnosis. (Here is the link to that not-quite-game.)
Your job was to get through the diseases ague and botulosis. Today I want to reflect on those two scenarios.
Ague is ridiculously simple, and with that ridiculous simplicity, the user is supposed to be able to learn the basics of the game, that is, how to “drive” the tools. One way to figure out the disease is to sort the table by health and see what matches health. Here is what the sorted table looks like:
Just scanning the various columns, you can see that health is associated with hair color. Pink means sick, blue means well. With that insight, you can go on to diagnose individual creatures and then make a simple tree, which looks like this:
Although there is a lot of information in the tree, users can generally figure it out. If they (or you) have trouble, they can get additional information by hovering over the boxes or the links.
(Adapted from a panel after-dinner talk for the in the opening session to DSET 2017)
Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.
So there ought to be data science education. But what should we teach, and how should we teach it?
Let me address the second question first. There are at least three approaches to take:
students use data tools (i.e., pre-data-science)
students use data science data products
students do data science
I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:
We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.
That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.
I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?
A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The more-or-less cubic part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge, namely, the hexnuts are geometrically similar, and they’re made from the same material, so they’ll have the same density. Continue reading Modeling Hexnut Mass