I need to write up this Very Small Thought in order to get it off my to-do list. The basic thesis is: when we ask students to do rich, open-ended projects, we often insist that they write “research questions.” Sometimes this is a terrible idea.
Don’t get me wrong: asking students to come up with research questions can be important. Many frameworks for how science works have “formulate a research question” as an early step. Furthermore, when you grow up, some grant proposal RFPs insist that you specify your research questions.
Long ago (2007) Bryan Cooley and I wrote a set of physics labs; in one of them we had students bounce a ping-pong ball. You know the sound; it’s like this:
For the lab, we had students record the sound at 1000 points per second using a Vernier microphone. Using the resulting data, students could identify the times of the “pocks” and then see how the times between the pocks — the “interpock intervals” — decreased exponentially. This is a cool take on the old Algebra 2/Precalculus activity about bouncing balls where you measure drop heights; using sound and the technology, you can get more bounces and more accuracy.
A typical graph of the sound looks like this:
And a graph of the interpock intervals looks like this:
The concentration of CO2 in the atmosphere is rising, and we have good data on that from, among other sources, atmospheric measurements that have been taken near the summit of Mauna Loa, in Hawaii, for decades.
Each of the 726 dots in the graph represents the average value for one month of data.
What do we have to do—what data moves can we make—to make better sense of the data? One thing that any beginning stats person might do is to fit a line to the data. I won’t do that here, but you can imagine what happens: the data curve upward, so the line is a poor model, but the positive slope of the line (about 1.5, which is in ppm per year) is a useful average rate of increase over the interval we’re looking at. You could consider fitting a curve, or a sequence of line segments, but we won’t do that either.
Instead, let’s point out that the swath of points is wide. There are lots of overlapping points. We should zoom in and see if there is a pattern.
One day, over 50 years ago, we were visiting Lake Tahoe as a family, and dad went across the border to play keno. He came back elated: he had hit seven out of eight on one of his tickets, and won eleven hundred dollars. He proudly laid out fifty twenties and two fifties on the kitchen table. It was a magnificent sight.
The details of keno are unimportant here, except to note that keno is not a game of skill. Of course the house has an edge. In the long run, you will lose money playing keno no matter how you do it. Even my dad, who over the years has played a lot of keno, and won even bigger payouts, would probably admit that he might have a net lifetime loss.
So why do people play? There are lots of reasons, I’m sure, but one of them must be connected to that heartwarming anecdote: fifty years later, I remember the event clearly, as one of joy and wonder.
Let’s explore that using roulette, which is much simpler than keno. A roulette wheel has 18 red and 18 black numbered slots, plus a smaller number of green slots (often two). You can make many different bets, but we will stick with red and black. If you place a $1 bet on red, and it comes up red, you get $2 back (winning $1); if it comes up black or green, you lose your dollar.
Thinking about yesterday’s post, I was struck with an idea that may be obvious to many readers, and has doubtless been well-explored, but it was new to me (or I had forgotten it) so here I go, writing to help me think and remember:
The post touched on the notion that communication is an important part of data science, and that simplicity aids in communication. Furthermore, simplification is part of modelmaking.
That is, we look at unruly data with a purpose: to understand some phenomenon or to answer a question. And often, the next step is to communicate that understanding or answer to a client, be it the person who is paying us or just ourselves. “Communicating the understanding” means, essentially, encapsulating what we have found out so that we don’t have to go through the entire process again.
So we might boil the data down and make a really cool, elegant visualization. We hold onto that graphic, and carry it with us mentally in order to understand the underlying phenomenon, for example, that graph of mean height by sex and age in order to have an internal idea—a model—for sex differences in human growth.
But every model leaves something out. In this case, we don’t see the spread in heights at each age, and we don’t see the overlap between females and males. So we could go further, and include more data in the graph, but eventually we would get a graph that was so unwieldy that we couldn’t use it to maintain that same ease of understanding. It would require more study every time we needed it. Of course, the appropriate level of detail depends on the context, the stakes, and the audience.
So there’s a tradeoff. As we make our analysis more complex, it becomes more faithful to the original data and to the world, but it also becomes harder to understand.
I’m just back from NCTM 2018 in Washington DC where I gave a brief workshop that introduced ideas in data science education and the use of CODAP to a very nice group in a room that—well, NCTM and the Marriott Marquis were doing their best, but we really need a different way of doing technology at these big conferences.
Anyway: at the end of a fairly wide-ranging presentation in which my main goal was for participants to get their hands dirty—get into the data, get a feel for the tools, have data science on their radar—it was inevitable that I would feel:
that I talked too much; and
that there were important things I should have said.
Sigh. Let’s address the latter. Here is a take-away I wish I had set up better:
In data science, things are often too complicated. So one step is to simplify things; and some data moves, by their nature, simplify.
Complication is related to being awash in data (see this post…); it can come from the sheer quantity of data as well as things like being multivariate or otherwise just containing a lot of stuff we’re not interested in right now.
To cut through that complication, we often filter or summarize, and to do those, we often group. To give some examples, I will look again at the data that appeared in the cards metaphor post, but with a different slant.
Here we go: NHANES data on height, age, and sex. At the end of the process, we will see this graph:
And the graph tells a compelling story: boys and girls are roughly the same height—OK, girls are a little taller at ages 10–12—but starting at about age 13, girls’ heights level off, while the boys continue growing for about two more years.
We arrived at this after a bunch of analysis. But how did we start?