In CODAP, for example (as in Fathom), there is a move we learn to make: you select points in one graph and, because selecting data in CODAP selects it everywhere, the same points light up in every other graph, revealing patterns that were otherwise hidden. You can use that same selection idea to hide the selected or unselected points, thereby filtering the data that you’re seeing. Anyway, that felt like a data move, a tool in our data toolbox. We could imagine pointing such moves out to students as frequently useful actions to take.
I’ve mentioned the idea in a couple of posts because it seemed to me that data moves were characteristic of data science, or at least the proto-data-science that we have been trying to do: we use data moves to make sense of rich data where things can get confusing; we use data moves to help when we are awash in data. In traditional intro stats, you don’t need data moves because you generally are given exactly the data you need.
Anyway, we listed a bunch of them—filtering, grouping, making aggregate measures, merging datasets, making hierarchies, making new attributes, stacking, etc. It seemed that there was something essential common to all of these: something about the structure of the data, or the values, or about changing the organization of the data set. But every time I thought up a definition, it lacked something, or could easily be misunderstood.
And then there was the issue of making new visualizations, which is also characteristic of data science. I had it in my original list from February 2017, and then Rob Gould (bless his heart) said that he didn’t think that was a data move, but more of a “data analysis move.” Naturally I thought for weeks that he was just splitting hairs until I came to my senses. And now I agree it’s not. You do data moves in order to prepare to make visualizations. With technology, data moves and making graphs might seem to happen at the same time, but conceptually, the data moves come before.
And with that we come to the point of this post: a metaphor about data moves that helps me, at least, think about what is a data move and what is not. But first, an example.
We have 800 kids from NHANES in 2003, aged 5–19, with (among other attributes) height, sex, and age. Suppose we want to make a graph that shows how children grow, and how that differs for boys and girls. How would we do that?
A great way to think about it is to draw the graph you think you will get. In this case, I will show you the answer:
In CODAP, the procedure is to drag Sex and Age—the attribute names—to the left to “promote” them in a hierarchical table. Then create a new attribute (mean_ht) at that level, and give it the formula mean(Height). Then just drag Age to the horizontal axis of a new graph, and mean_ht to the vertical, and drop Sex in the middle. Done.
So the data moves are: group by sex and age, then compute the aggregate measures, the means. Then, when you go to make the graph, there is a different set of moves—the choices you make about how you’re going to represent the different attributes you have available. You probably made the data moves in anticipation of the graphing moves you would make. They are tightly linked.
Of course, you don’t have to use CODAP to do this. If you were using RStudio with dplyr and ggplot2, you might write

    summ <- nhanes %>% group_by(Age, Sex) %>% summarise(mean(Height, na.rm = TRUE))
    names(summ)[3] = "mean_ht"
    qplot(summ$Age, summ$mean_ht, color = summ$Sex)

The first line has the data moves: grouping (group_by) and finding aggregate values (summarise). In CODAP, this is equivalent to dragging attributes in the table to make hierarchies and then making a new attribute with a formula. The second line of R is housekeeping. The last line is all about the graphics, telling what goes on which axis and what determines the color.
The point being that there is an underlying Platonic ideal of a data move, with different implementations depending on the tool. There is also an underlying Platonic graphing thing (it’s called a Grammar of Graphics: Wilkinson 2005; Hadley Wickham et al.’s ggplot2 is a practical, easier-to-understand implementation), but I claim that it’s different from a data move, and I will not treat it here.
Okay, so a data move is about data, not representation. It’s about how the data are organized, and it’s also about calculating things like aggregate values. That is, it creates and organizes the numbers that you’re going to use to make a graphic.
That’s still kind of slippery, so try this one on:
Imagine that your dataset is a deck of little cards, one for each case. In the NHANES data, that’s 800 cards, one for each kid. Each card has Sex, Age, Height, and other stuff.
Remember how we drew the graph we wanted? Now imagine the steps you would go through to make that graph—using the cards. It’s actually pretty easy:

1. Sort the cards into two piles by Sex.
2. Within each pile, sort the cards into smaller piles by Age.
3. Make a label card for each pile, recording its Sex and Age.
4. For each pile, compute the mean of Height and write it on the label.

And you’re done! You now have all the numbers (and letters) you need to make your graph.
Notice: we have discovered that there are two fundamental types of data move: those that are about rearranging cards, and those that need a pen. There is one central “combo” move, grouping, where when you make a pile you need a new card (and a pen) to make a label.
Anything beyond that, such as plotting a point, that’s not a data move.
Let’s push on this operational metaphor. Suppose you now want to classify each kid—each card—by whether he or she is tall. We’ll define tall as “having a height greater than the mean for your age and sex.”
So we look through all of the cards, and for each one, we compare Height with mean_ht (from the label of the group). If Height > mean_ht, we’ll write tall : true on the kid’s card. Otherwise we’ll write tall : false. By our metaphor, this is a “pen” data move. We’re calculating a new attribute value, but this time for every card, as opposed to the aggregate value we calculated and wrote on the label.
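A minimal sketch of this “pen” move in Python, with invented cards and a made-up label value standing in for the real NHANES groups:

```python
# Invented cards and a made-up label value, for illustration only.
kids = [
    {"Sex": "F", "Age": 10, "Height": 146},
    {"Sex": "F", "Age": 10, "Height": 138},
]
mean_ht = {(10, "F"): 142.0}  # the aggregate written on the pile's label

# The "pen" move: write a tall value on every card, one card at a time.
for kid in kids:
    kid["tall"] = kid["Height"] > mean_ht[(kid["Age"], kid["Sex"])]

print([kid["tall"] for kid in kids])  # [True, False]
```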
Let’s do a tougher investigation. Are students who play basketball more likely to be “tall” than other students? We need more data.
The last step: for each pile, find the proportion of cards marked tall : true. Put that proportion on the label. (Aggregation. A “pen” data move.)

The new move is in step 4. It’s a “cards” move we are calling merging for now, but you may recognize it as a join.
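For illustration, here is a hypothetical Python sketch of that merging move; the names and the roster are invented, and a set stands in for the second deck of cards:

```python
# The main cards, plus a second, invented dataset: who plays basketball.
kids = [
    {"Name": "Ana", "tall": True},
    {"Name": "Ben", "tall": False},
    {"Name": "Cal", "tall": True},
]
basketball_roster = {"Ana", "Cal"}  # stands in for the second deck of cards

# The merging ("cards") move: match on Name, write the result on each card.
for kid in kids:
    kid["basketball"] = kid["Name"] in basketball_roster

print([kid["basketball"] for kid in kids])  # [True, False, True]
```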
Suppose we wanted to see how this difference in proportions depends on age. Easy: interpose a new grouping move after step 4. For the new move, split the two piles by Age. You will need new labels where you can write the age-specific proportions.
You could say that dplyr or CODAP or some Python library have all of this functionality, and you would be correct. So why do we need to recognize data moves? Why not just think about selecting and dragging, or using group_by and summarise?
I’d love to hear your responses in comments, but I’ll start with what I think:
That’s way too much to be writing, and this has taken way too long, but I think it helps me to get it on the page. If you’re still reading, well done, you glutton. Let me know what you think.
And if you can, think and write about this: I’ve implied that some students have trouble partly because, although we assume they understand the underlying meaning of data moves, in fact they don’t. It’s not that data moves are hard, but they may be alien, and we pile a lot of data moves together—implicitly, without acknowledging them—at the same time as we’re asking students to make sense of the actual data context and to cope with learning computer commands. I’d like to think that taking a breath and shining a light on data moves could help some students who might have been feeling stupid, doggedly trying to follow instructions, hoping that one day it will all make sense.
This conjecture is largely based on intuition. Do you think it’s true?
Here’s the Eventbrite link. Get your free ticket!
Here’s the blurb:
Data, Decisions, and Trees
We often say that we want to make decisions “based on data.” What does that really mean? We’ll look at a simple approach to data-based decisionmaking using a representation we might not use every day: the tree. In this webinar, you’ll use data to make trees, and then use the trees to diagnose diseases.
On the surface, trees are very simple. But for some reason — perhaps because we’re less familiar with using trees — people (and by that we mean us) have more trouble than we expect. Anticipate having a couple of “wait a second, let me think about this!” moments.
Your job was to get through the diseases ague and botulosis. Today I want to reflect on those two scenarios.
Ague is ridiculously simple, and with that ridiculous simplicity, the user is supposed to be able to learn the basics of the game, that is, how to “drive” the tools. One way to figure out the disease is to sort the table by health and see what matches health. Here is what the sorted table looks like:
Just scanning the various columns, you can see that health is associated with hair color. Pink means sick, blue means well. With that insight, you can go on to diagnose individual creatures and then make a simple tree, which looks like this:
Although there is a lot of information in the tree, users can generally figure it out. If they (or you) have trouble, they can get additional information by hovering over the boxes or the links.
Sorting works well in this case, but it gets unwieldy with more complicated diseases and larger data sets. Graphs are more versatile, and since we’re in CODAP, graphs are easy to make. Here is a graph of health (the dependent variable, after all) against hair:
This is not a typical school graph, but with very little practice, students (we hope) can see that it shows a clear separation between the healthy and sick creatures, and that hair color is characteristic of that separation. We could also say that hair color perfectly predicts health—at least with this initial data set of 10 cases.
Making a graph is, in general, a great way to start exploring data. In fact, I think that making a graph early is good evidence of some kind of data smarts. Putting the dependent variable—the thing you’re trying to predict—on the graph is another indicator, showing that you know what you’re looking for. Yet in general, we do not see students instinctively making graphs as an early step in their analysis; I hope that experience with activities like this will help prompt that strategy.
Let’s do the next-harder disease, botulosis. We make a graph, put health on the vertical axis, and then try various other variables (predictors such as hair, eyes, weight…) on the horizontal axis.
To make things easy, this disease is also related to hair; that first graph we made for ague looks like this for botulosis (on the left):
There is a clear separation here, but it’s not as complete. What can we tell from this graph? It looks as if all blue-haired creatures are well—but we can’t tell about pink-haired creatures. Where do we go from here? There are a number of paths to take. Here’s one:
Suppose we start making a tree, leaving the pink branch incomplete (illustration above right). We’re now done with the blue group and want to restrict our analysis to the pink-haired creatures. So we click on the pink box (lower left in the tree) to select those cases in the graph. Then, in the graph, we use the “eyeball” menu and choose Hide unselected cases. (We could also just select them in the graph.)
With only the pink cases remaining, we can try other variables on the horizontal axis. When we try eye color, we see this:
Aha! A complete separation—for the pink subset. Orange eyes means well, purple means sick. So we add another branching to the tree, only to the pink side (left):
This tree will correctly diagnose all future cases. You could also have done eyes first, and gotten a different, equivalent tree (right).
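Written as code, the left-hand tree is just nested conditionals. Here is a sketch in Python of the rule described above (blue hair: well; pink hair with orange eyes: well; pink hair with purple eyes: sick); the function name and attribute keys are mine, not part of the Arbor tool:

```python
def diagnose(creature):
    """Follow the botulosis tree: branch on hair, then (for pink) on eyes."""
    if creature["hair"] == "blue":
        return "well"
    # Pink-haired creatures: branch on eye color.
    if creature["eyes"] == "orange":
        return "well"
    return "sick"  # pink hair, purple eyes

print(diagnose({"hair": "pink", "eyes": "purple"}))  # sick
```

The equivalent tree that asks about eyes first would simply swap the order of the two tests.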
What other graph could we have made? One great one is to put eyes on one axis and hair on the other, and drop health into the middle, making a colored legend:
We have seen users reason with this graph in two ways.
We made these very very very simple situations because when we used the tool on more complicated situations, even though we designed the tool and have decades of experience with data and technical things, we got easily confused. It was amazing how we ground to a halt and had to say, “wait—what does this mean?”
And I imagine that all of you, when you played with botulosis (okay, except for gasstationwithoutpumps, who gets everything immediately) had a similar experience. There was at least a pause where you had to think, what does this graph mean? Or how do I express my rule in the tree?
Part of this is limitations in the tool. But I think a lot of it comes from the amazing importance of alternative representations, and how great it is to wrestle with unfamiliar ones.
Let’s elaborate on the theme of representations to think about a connection to computational thinking. As coders, we look at the Boolean statement above—sick creatures have purple eyes and pink hair—and think, “easy peasy, it’s an and statement.” But there is a pause when we see the graph and try to map that graph onto our understanding of and, even though the graph itself looks exactly like the truth tables we saw ages ago. And for me, there is an even longer pause when I try to map the and onto the tree. For me, conceptually, and is symmetrical, but the tree is not. (And then I realize, crikey, deep down inside, and is probably implemented asymmetrically.) I know that the tree is how we might represent the doctor’s procedure for diagnosing these diseases—but before playing with the trees, it didn’t look like and to me.
And if we are wrestling with deeper understanding of a Boolean and, what is it like for a student who has not put in his or her 10,000 hours?
I think this is an example of a spot in data science education where non-coders can get tripped up. Therefore it’s a perfect place to slow down and point out how our understanding of the colloquial and maps onto the Boolean and, and how the “and” concept appears in other representations such as trees and graphs.
(As an aside, note that the two user renderings of the botulosis rule—the and and the or versions—are an example of de Morgan’s laws, the essential connection between those two operators.)
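That aside can be verified by brute force over all truth assignments; a quick Python check (written for this post, not part of the activity):

```python
from itertools import product

# sick  <=>  (pink hair) and (purple eyes)
# well  <=>  (not pink hair) or (not purple eyes)   -- de Morgan
for pink, purple in product([True, False], repeat=2):
    sick = pink and purple
    well = (not pink) or (not purple)
    assert well == (not sick)  # the two renderings are complementary

print("de Morgan holds in all four cases")
```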
There are many things to do with the trees and the Arbor game beyond this. Some examples:
So go forth and play! See what you think!
In the Data Science Games project, we have recently been exploring decision trees. It’s been great fun, and it’s time to post about it so you dear readers (all three or so of you) can play as well. There is even a working online not-quite-game you can play, and its URL will probably endure even as the software gets upgraded, so in a year it might even still work.
Here’s the genesis of all this: my German colleague Laura Martignon has been doing research on trees and learning, related to work by Gerd Gigerenzer at the Harding Center for Risk Literacy. A typical context is that of a doctor making a diagnosis. The doctor asks a series of questions; each question gets a binary, yes-no answer, which leads either to a diagnosis or a further question. The diagnosis could be either positive (the doc thinks you have the disease) or negative (the doc thinks you don’t).
The risk comes in because the doctor might be wrong. The diagnosis could be a false positive or a false negative. Furthermore, these two forms of failure are generally not equivalent.
Anyway, you can represent the sequence of questions as a decision tree, a kind of flowchart to follow as you diagnose a patient. And it’s a special kind of tree: all branchings are binary—there are always two choices—and all of the ends—the leaves, the “terminal nodes”—are one of two types: positive or negative.
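One way to represent such a tree in code, purely as a sketch, is a nested structure in which each internal node holds a yes-no question and each leaf holds a diagnosis. The names here are invented, and the example rule happens to match the hair-and-eyes diseases in the game:

```python
# Each node is either a leaf (a diagnosis string) or a dict holding a
# yes-no question and the two subtrees that follow from the answer.
tree = {
    "question": ("hair", "pink"),      # "Is hair pink?"
    "yes": {
        "question": ("eyes", "purple"),
        "yes": "positive",             # the doc thinks you have it
        "no": "negative",
    },
    "no": "negative",
}

def follow_tree(node, case):
    """Walk from the root to a leaf, answering each question from the case."""
    while isinstance(node, dict):
        attribute, value = node["question"]
        node = node["yes"] if case[attribute] == value else node["no"]
    return node

print(follow_tree(tree, {"hair": "pink", "eyes": "purple"}))  # positive
```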
The task is to design the tree. There are fancy ways (such as CART and Random Forest) to do this automatically using machine learning techniques. These techniques use a “training set”—a collection of cases where you know the correct diagnosis—to produce the tree according to some optimization criteria (such as how bad false positives and false negatives are relative to one another). So it’s a data science thing.
But in data science education, a question arises: what if you don’t really understand what a tree is? How can you learn?
That’s where our game comes in. It lets you build trees by hand, starting with simple situations. Your trees will not in general be optimal, but that’s not the point. You get to mess around with the tree and see how well it works on the training set, using whatever criteria you like to judge the tree. Then, in the game, you can let the tree diagnose a fresh set of cases and see how it does.
That’s enough for now. Your job is to play around with the tool. It will look like this to start:
The first few scenarios are designed so that it’s possible to make perfect diagnoses. No false positives, no false negatives. So it’s all about logic, and not about risk or statistics. But even that much is really interesting. As you mess around, think about the representation, and how amazingly hard it can be to think about what’s going on.
There are instructions on the left in the tan-colored “tile” labeled ArborWorkshop. Start with those. There is also a help panel in the tree tile on the right. It may not be up to date. All of the software is under development.
Here is the link:
http://codap.concord.org/releases/latest/static/dg/en/cert/index.html#shared=31771.
The first disease scenario, ague, is very simple. The next one, botulosis, is almost as simple, and worth reflecting on. That will happen soon, I hope after you have tried it.
(Part two. About Ague and Botulosis.)
Note: if you are unfamiliar with this platform, CODAP, go to the link, then to the “hamburger” menu. Upper left. Choose New. Then Open Document or Browse Examples. Then Getting Started with CODAP. That should be enough for now.
Suppose you have a string 12 cm long. You form it into the shape of a rectangle. What shape gives you the maximum area?
Traditionally, how do we expect students to solve this in a calculus class? Here is one of several approaches, in excruciating detail:

1. Let x and y be the lengths of the sides of the rectangle.
2. The perimeter is the length of the string: 2x + 2y = 12.
3. Solve for y: y = 6 − x.
4. The area is A = xy.
5. Substitute, so that A is a function of one variable: A(x) = x(6 − x) = 6x − x².
6. We want the value of x that makes A as large as possible.
7. Take the derivative: dA/dx = 6 − 2x.
8. At the maximum, the derivative is zero, so 6 − 2x = 0.
9. Solve: x = 3.
10. Check that this is a maximum, not a minimum (x = 0 and x = 6 both give zero area).
11. Find the other side: y = 6 − 3 = 3.
12. Since x = y, the rectangle is a square, with area 9 cm².

Ignoring all of the other ways you could solve this problem more humanely, notice that of the 12 steps in this solution, only two have anything to do with calculus. In (7), we actually take a derivative, and in (8) we use the important calculus concept that the derivative is zero at an extreme.
We could say, “the rest is just algebra.” True. But looking more deeply, I think it’s a combination of two things:
I claim that many calculus students—mine, anyway—have much more trouble with the 10 non-calculus steps than the two calculus ones. They have trouble getting started. They’re not sure how many variables to use. They don’t know how to express the geometrical relationships. They can’t see what relationships they need in order to get what they want. And at the end of the problem, they have a value for x, and can’t close the deal: they appear to lose sight of the original goal—the shape of the rectangle—and can’t find it in the algebra. In this case, they need to find y and then notice that the result is a square.
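As a sanity check on the algebra, one can also search numerically for the best rectangle; a Python sketch under the same 12 cm constraint:

```python
# Perimeter 2x + 2y = 12 cm, so y = 6 - x and the area is A = x * (6 - x).
# Search a grid of side lengths from 0 to 6 cm in steps of 0.01.
best_x = max((i / 100 for i in range(601)), key=lambda x: x * (6 - x))
best_area = best_x * (6 - best_x)

print(best_x, best_area)  # 3.0 9.0
```

No calculus at all, which is part of the point: the hard part is setting up x and 6 − x in the first place.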
Aside: In calculus texts, we often see problems with cones. Typical: “A tank in the shape of an inverted right circular cone has a base radius of 8 cm and a height of 20 cm. It is filled to a depth of h cm. It is draining fluid at 15 cubic cm per second…” Just envisioning the setup requires a lot of geometry. And then, depending on the question, you probably have to visualize cross-sections of the cone and set up proportions that arise from similar triangles. Crikey. The actual derivative is a cakewalk in comparison.
This leads to another realization: In my own work, I rarely need calculus (except to teach calculus), but I need the lead-up to calculus and the wind-down—the mathematicization, the re-interpretation of results, the de-mathematicization, the modeling—all the time. That is, the geometry stuff, and the strategic thinking, are more important, and more useful, than the calculus itself.
So if students founder (or flounder) before they ever get to the calculus steps, and that’s the most useful math anyway, why don’t we spend more instructional effort on those parts of the problems?
I bet there are at least two reasons:
Both of these reasons are terrible, and dangerously elitist. They lead to an attitude that justifies high failure rates.
A better explanation, I think, is that calculus is a deep study of change, as expressed using mathematical functions. It takes years to really understand functions, and calculus provides a universe of situations in which to practice. As you do this practice, you accumulate understanding of calculus itself, a kind of next-level-up functional understanding and, not incidentally, one of the crowning achievements of the human intellect. Even if you never do calculus later, having done calculus gives you mastery of simpler understandings and techniques, and a broader perspective. (Kind of like, when you study music theory, you analyze Bach chorales. You may never write a chorale after Music 101, but the underlying principles seem to govern a wide swath of the musical landscape.)
Sounds good, huh? And yet the paternalism shines through. And it doesn’t explain why we maintain this nutty focus on getting through so much calculus when the students have such trouble in more foundational mathematics.
I think we should instead try to figure out two things (yes, another list of two things):
Finally, why, when this is a blog mostly about teaching stats, am I writing about calculus? Two reasons:
The parallel to the calculus-substitute Advanced Modeling? Maybe Modeling with Data, or, dare we suggest, Introduction to Data Science.
If you’ve been following along, and reading my mind over the past six months while I have been mostly not posting, you know I’m thinking a lot about data science education (as opposed to data science). In particular, I wonder what sorts of things we could do at K–12—especially at high school—to help students think like data scientists.
To that end, the good people at The Concord Consortium are hosting a webinar series. And I’m hosting the third of these sessions Tuesday July 25 at 9 AM Pacific time.
Click this link to go to Eventbrite to tell us you’re coming.
The main thing I’d like to do is to present some of our ideas about “data moves”—things students can learn to do with data that tend not to be taught in a statistics class, or anywhere, but might be characteristic of the sorts of things that underpin data science ideas—and let you, the participants, actually do them. Then we can discuss what happened and see whether you think these really do “smell like” data science, or not.
You could also think of this as trying to decide whether using some of these data skills, such as filtering a data set, or reorganizing its hierarchy, might also be examples of computational thinking.
The webinar (my first ever, crikey) is free, of course, and we will use CODAP, the Common Online Data Analysis Platform, which is web-based and also free and brought to you by Concord and by you, the taxpayer. Thanks, NSF!
We’ll explore data from NHANES, a national health survey, and from BART, the Bay Area Rapid Transit District. And whatever else I shoehorn in as I plan over the next day.
Instead, the data-science-y data moves are more about data manipulation. [By the way: I’m not talking about obtaining and cleaning the data right now, often called data wrangling, as important as it is. Let’s assume the data are clean and complete. There are still data moves to make.] And interestingly, these moves, these days, all require technology to be practical.
This is a sign that there is something to the Venn diagram definitions of data science. That is, it seems that the data moves we have collected all seem to require computational thinking in some form. You have to move across the arc into the Wankel-piston intersection in the middle.
I claim that we can help K–12, and especially 9–12, students learn about these moves and their underlying concepts. And we can do it without coding, if we have suitable tools. (For me, CODAP is, by design, a suitable tool.) And if we do so, two great things could happen: more students will have a better chance of doing well when they study data science with coding later on; and citizens who never study full-blown data science will better comprehend what data science can do for—or to—them.
At this point, Rob Gould pushed back to say that he wasn’t so sure that it was a good idea, or possible, to think of this without coding. It’s worth listening to Rob because he has done a lot of thinking and development about data science in high school, and about the role of computational thinking.
Aside: Rob tends towards RStudio as his weapon of choice; that is, have students use a real tool, but in a suitably scaffolded environment. There is much to be said for this, but I can imagine a lot of grappling with R’s syntax among many students I have taught; I think it would be better to focus on ideas such as “identifying and looking at a relevant subset of the data” in a more gesture-based environment (e.g., CODAP) and connecting that to the data, the data stories, claims, and questions—and then, later, doing it in code. To be sure, with coding you can do anything. Using CODAP will limit what you can do. But it might expand what you will do, and what you will understand.
We discussed this some more, and he made a great suggestion: see how our nascent “data moves” correspond to what the big dogs in the R world think. Hadley Wickham’s dplyr R package implements a “grammar of data manipulation” with a goal of identifying “the most important data manipulation verbs.” Verbs! These are moves! So: what are dplyr’s most important verbs?
That’s easy: group_by, summarise, mutate, filter, select, and arrange. So let’s start with these.
Use the group_by function to convert a table into a “grouped” table. You tell R the variable you want to group by, and R separates the table according to the values of that variable. For example, you can group by sex, and subsequent calculations will treat females and males separately.
In the CODAP world, there are three ways to group:

- Drag the sex attribute into a graph, and the points get colored according to the sex of the cases they represent.
- Drag the sex attribute to an axis of a graph, and the graph separates into two—one for females, one for males—and you see parallel univariate graphs such as dot plots.
- Drag sex leftwards in the table, and CODAP restructures the table hierarchically into sub-tables for females and males, making sex a “parent” attribute. New attributes at that parent level (with values defined by aggregate functions such as mean(height)) have values for each group.

So we see three different takes on grouping. The third is the most general, but arguably the hardest conceptually. Students easily understand coloring points by sex (or by whatever) and don’t get stuck in any syntax trouble.
“Summarise multiple values to a single value.”
That is, use the summarise function to do an aggregate calculation. If we had a table of people with sex and height, a pseudo-R command such as

    summarise(group_by(sex), mean(height))

would produce a table showing the mean heights of the two sexes.

As suggested above, in the CODAP world, you would create a new attribute at the “parent” level of a table and give it that mean(height) formula. On the screen, it would look like this:
Use the mutate function to add a new variable to an R table, typically with values computed by a formula. This is something like summarise, but for the case-level data. For example, in the CODAP table above, where height is in centimeters, we could compute it in inches with pseudo-R like this:

    mutate(NHANES, Height_in = Height/2.54)

and we would get a new column in the table.

Curiously, CODAP makes no distinction between making an aggregate computation, as we did above to compute the mean height, and a case-by-case calculation, where R uses mutate. In CODAP, whether you want to mutate or summarise, just add a column and write the formula. You must, however, add the column at the right “level” of the hierarchical table to get the result you need.
“Return rows with matching conditions.”
That is, use filter to focus on a subset of the data in R. If we just wanted the mean height of females, we might write:

    summarise(filter(NHANES, Sex == "Female"), mean(Height))

The key thing is that the student writes a Boolean expression (Sex == "Female") that represents the subset. And gets all the parentheses right. Now, I’m not against learning syntax or how to write a Boolean expression. And in fact, I’m pushing for Boolean-expression filtering in CODAP right now. But it would be great to experience how useful filtering is in order to see why you would ever want to learn that.
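The filtering move itself is tool-independent: a Boolean test applied case by case, followed (often) by an aggregate. A plain-Python sketch with invented cards:

```python
# Invented cards, for illustration.
kids = [
    {"Sex": "Female", "Height": 150},
    {"Sex": "Male", "Height": 152},
    {"Sex": "Female", "Height": 146},
]

# The filter move: keep only the cases where a Boolean test is true...
females = [kid for kid in kids if kid["Sex"] == "Female"]

# ...and then aggregate over the subset.
mean_height = sum(kid["Height"] for kid in females) / len(females)
print(mean_height)  # 148.0
```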
To that end, CODAP lets you filter using selection. That is, you select the cases (in R-speak, “rows”) in any representation of the data—table or graph—and then ask CODAP to display only the selected cases, or only the unselected ones. Here are data showing soil temperature and air temperature at a site in a California forest. It’s one year of data, taken every 30 minutes:
We can see that, in general, the higher the air temperature (vertical axis; maybe it would be better the other way…) the higher the soil temperature. That makes sense, but it’s a lot of data and the details are all mushed together. Now suppose we go to our table and select one day in July:
Now we do a “Hide unselected cases”—that is, filter the data to show only that one day—and rescale the graph:
Wow. What a difference. Looking at this graph, we see a much more detailed story from the data.
At the moment, in CODAP, this only works for graphs. You can’t, for example, filter a table to show only that one day; you have to actually delete the cases you don’t want to see. Maybe this will change.
Suppose you have a dataset in R—a table—with many variables, many columns. The select function lets you make a copy of the table with only the columns you want. So this is not about selecting cases (rows) as we did in CODAP above.
CODAP does not have an equivalent of this function, but we have been talking about ways to help users de-clutter their tables, somehow letting them hide attributes (columns) without deleting them.
In R, arrange sorts the rows of a table (“use desc to sort a variable in descending order”). CODAP (finally!) has a sort capability as a menu item in the table.
So it seems that, on the surface at least, CODAP can mimic a lot of the basic functionality that you will find in dplyr. There is a lot more to the package than we have discussed here, but if dplyr is a grammar of data manipulation, something like CODAP can at least order a meal and find a laundromat.
I’m not trying to say that you (or Rob Gould) should use CODAP instead of R. Sure, you can do cool things by dragging in CODAP that you have to do by remembering syntax in R. But R is a programming language, and you (for now, anyway) need all that syntax in order to express the things that you want to do in a save-able, repeatable, modular sequence of operations.
But wouldn’t it be great if before they had to learn the syntax, students understood why filtering is a good idea? Or what it means to compute some aggregate value?
And frankly, dear readers, how many times have we all tried to learn some new programming or data-analysis system? Given my tiny mind, at least, I have a devil of a time making that first scatter plot or drawing that first blue rectangle on the screen. Sure, once you know RStudio, it’s easy to do in RStudio. But how many people give up in those first few hours, or don’t give up but never quite know what’s going on? Maybe CODAP, or something like it, is a path to getting more people to stick around long enough to see how cool this all is.
Nobody knows what data science is, but it permeates our lives, and it’s increasingly clear that understanding data science, and its powers and limitations, is key to good citizenship. It’s how the 21st century finds its way. Also, there are lots of jobs—good jobs—where “data scientist” is the title.
So there ought to be data science education. But what should we teach, and how should we teach it?
Let me address the second question first. There are at least three approaches to take:
I think all three are important, but let’s focus on the third choice. It has a problem: students in school aren’t ready to do “real” data science. At least not in 2017. So I will make this claim:
We can design lessons and activities in which regular high-school students can do what amounts to proto-data-science. The situations and data might be simplified, and they might not require coding expertise, but students can actually do what they will later see as parts of sophisticated data science investigation.
That’s still pretty vague. What does this “data science lite” consist of? What “parts” can students do? To clarify this, let me admit that I have made any number of activities involving data and technology that, however good they may be—and I don’t know a better way to say this—do not smell like data science.
You know what I mean. Some things reek of data science. Google searches. Recommendation engines. The way a map app routes your car. Or dynamic visualizations like these:
World trade
Crime in Oakland
Popularity of names beginning with “Max”
What distinguishes this obvious data science from a good school data activity? Consider the graph (at right) of data from physics. Students rolled a cue ball down a ramp from various distances and measured the speed of the ball at the foot of the ramp.
The student even has a model and the corresponding residual plot. Great data. But not data science. Why not?
To answer that question, let’s look at one way some people have defined data science. The illustrations below, from Conway 2010 and Finzer 2013, show the thrust behind several definitions, namely, that data science lies in a region where content understanding, math and stats, and computer chops (Conway calls it “hacking”) overlap.
If we apply the left Venn diagram to the physics example, we can see that it belongs in “traditional research”: the work uses and illuminates physics principles and concepts from mathematics and statistics (the residuals, the curve-fitting) but does not require substantial computational skills.
But saying that is somehow not enough. The physics data example fails what we might call a “sniff test” for data science, but what, more specifically, does this sniff test entail? What are the ingredients that separate “regular” data from data science?
In our work over the last year and a half, we have begun to identify some of the telltale ingredients that make an activity smell like data science. And they come in (at least) three categories: being awash in data, the data moves we make, and the properties of the data themselves.
Although this formulation is certainly incomplete, it may be useful. Let’s look at each of these in turn.
Our obvious data science examples share a sense of being awash in data. The word brings up the image of a possibly tempestuous sea of data, or even waves of data breaking over us. This is subjective but still an evocative litmus test. More specifically, being awash might include:
It’s useful to ask what data science students should be doing with data. Again, an incomplete list:
What aspects of a dataset are characteristic of situations where we use data science? This is related to “awash” but more specific:
If we look again at the physics graph, we can see where it falls short. What did the students have to do? Make a graph of the two variables against one another. Plot a model. There was no search for a pattern, no sense of being overwhelmed by data, no subsets, no reorganization. The data are decidedly ruly.
That is not to say that working with this dataset is a bad idea! It’s great for understanding physics. But it’s not data science. And it points out the essential meaning of our sniff test: in this lab, students work with data, but they don’t have to do anything with the data to make sense of it. It’s science, but not data science.
Let’s alter that rolling-down-a-ramp activity. The next figure shows a graph from the same sort of data, but with an important change.
This spray of points has a very different feel (or smell) than the original one. We can see that, in general, the farther you have rolled, the faster you go, but the relationship is not clean. What could be making the difference? A look at the dataset shows that we now have more attributes, including wheel radius, the mass of the wheels, the mass of the cart, and, importantly, the angle of the ramp.
The ramps are at different angles? Maybe that makes a difference. Suppose we color the points by the angle of the ramp:
Now we see that the ramps range from about 5 to 45 degrees, and that, as we might expect, the steeper the ramp, the faster you go. We might be able to make predictions from this, but it would be nice to construct a graph like the original one. So we make a second graph, of just angle, and select the cases where the ramp angle is close to 20°.
Notice how the selected points show the square-rooty pattern of the original cue-ball graph. We can make a model for that specific ramp angle and plot it:
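The whole sequence above (messy multivariable data, then a subset at one angle, then a square-root model) can be imitated in a few lines of Python. To be clear, this is not the students' dataset; it is a simulated stand-in built from the physics of a rolling sphere, with invented noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
g = 9.8  # m/s^2

# Simulated rolls: many different ramp angles and release distances.
angle = rng.uniform(5, 45, n)      # degrees
dist = rng.uniform(0.2, 2.0, n)    # meters along the ramp

# Speed at the foot of the ramp. For a sphere rolling without
# slipping, v^2 = (10/7) * g * d * sin(theta); add measurement noise.
v = np.sqrt(10 / 7 * g * dist * np.sin(np.radians(angle)))
v += rng.normal(0, 0.02, n)

# The data move: select only the cases with angle near 20 degrees.
near20 = np.abs(angle - 20) < 2
d20, v20 = dist[near20], v[near20]

# Fit v = k * sqrt(d) to the subset (least squares on sqrt(d)).
k = np.sum(v20 * np.sqrt(d20)) / np.sum(d20)
print(f"{near20.sum()} cases near 20 degrees, k = {k:.2f}")
```

The full spray of 200 points looks hopeless; the 20-degree subset shows the clean square-rooty pattern, which is exactly the experience the filtered graph gives students.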
Let us re-apply our sniff test.
Awash in data? Look back at the first plot in this sequence with the eye of a first-year physics student. Confusing? Hard to tell where to start? You bet.
What are the data moves? We added a third dimension—color—to a plot to make a visualization we might not ever have seen before, and had to make sense of it. Then we found a way to look at a subset of the data, finding a clearer pattern there than in the dataset as a whole.
And data properties? We have more than two attributes, more than a couple dozen points. And depending on your experience, the data seem a bit unruly.
This may not reek of data science, but it has more than just a whiff.
But wait: at the end, we got a graph with a function, a good model, just like we had originally. That’s a lot of extra work just to get the same graph.
Reasonable teachers ask, is it worth the class time to learn all that computer stuff in addition to the physics? In fact, the way we set up a typical lab is designed precisely to avoid the problems of the messy data. Why have students struggle with a situation where the variables are not controlled? Why not just teach them to control variables as they should—and get better-organized data, intentionally?
Let me give a few responses:
It’s tempting to treat computer-data skills as an optional extra. Doing so creates an equity problem, because some people—people who look like me, mostly white and Asian boys—play with this stuff in our free time. Don’t let the playing field stay tilted.
That’s enough for now, but let us foreshadow where to go next:
If something else bothers you about this example, you’re not alone. I will describe what bothered me, and how that bothered Andee Rubin in a different way, soon.
Conway, Drew. 2010. Blog post at http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
Finzer, William. 2013. “The Data Science Education Dilemma.” Technology Innovations in Statistics Education 7, no. 2. http://escholarship.org/uc/item/7gv0q9dc.
Jones, Seth. 2017. Discussion paper for DSET 2017. https://docs.google.com/document/d/1DbgPq8mBwHOXmagHKQ7kNLP49PujyPq-tKip5_HeD8E
And things with a probability of 1 in 4 (or, in this case, 2 in 7) happen all the time.
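"Happen all the time" is easy to check by simulation; here is a minimal sketch:

```python
import random

random.seed(0)
p = 2 / 7  # about 28.6%, the probability in question
trials = 10_000

# Count how often the "unlikely" event actually occurs.
hits = sum(random.random() < p for _ in range(trials))
print(hits / trials)  # close to 0.286
```

Run enough afternoons, elections, or die rolls, and a 28.6% event shows up more than a quarter of the time, every time.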
This post is not about what the pollsters could have done better, but rather, how should we communicate uncertainty to the public? We humans seem to want certainty that isn’t there, so stats gives us ways of telling the consumer how much certainty there is.
In a traditional stats class, we learn about confidence intervals: a poll does not tell us the true population proportion, but we can calculate a range of plausible values for that unknown parameter. We attach that range to poll results as a margin of error: Hillary is leading 51–49, but there’s a 4% margin of error.
(Pundits say it’s a “statistical dead heat,” but that is somehow unsatisfying. As a member of the public, I still think, “but she is still ahead, right?”)
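The margin of error in that example is just the half-width of an approximate 95% confidence interval for a proportion. A quick Python sketch, where the sample size n = 600 is invented to make the arithmetic come out near 4%:

```python
import math

p_hat = 0.51   # observed proportion for the leading candidate
n = 600        # invented sample size, chosen to give roughly a 4% MOE

# 95% margin of error: z * sqrt(p(1-p)/n), with z = 1.96
moe = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat - moe:.3f} to {p_hat + moe:.3f}")  # interval straddles 0.5
```

Because the interval contains 0.5, the lead is within the margin of error, which is exactly what the pundits' "statistical dead heat" is trying to say.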
Bayesians might say that the 28.6% figure (a posterior probability, based on the evidence in the polls) represents what people really want to know, closer to human understanding than a confidence interval or P-value.
My “d’oh!” epiphany of a couple days ago was that the Bayesian percentage and the idea of a margin of error are both ways of expressing uncertainty in the prediction. They mean somewhat different things, but they serve that same purpose.
Yet which is better? Which way of expressing uncertainty is more likely to give a member of the public (or me) the wrong idea, and lead me to be more surprised than I should be? My gut feeling is that the probability formulation is less misleading, but that it is not enough: we still need to learn to interpret results of uncertain events and get a better intuition for what that probability means.
Okay, Ph.D. students. That’s a good nugget for a dissertation.
Meanwhile, consider: we read predictions for rain, which always come in the form of probabilities. Suppose they say there’s a 50% (or whatever) chance of rain this afternoon. Two questions:
(Also, get a micrometer on eBay and a sweet 0.1 gram food scale. They’re about $15 now.)
Long ago, I wrote about coins and said I would write about hexnuts. I wrote a book chapter, but never did the post. So here we go. What prompted me was thinking about different kinds of models.
I have been focusing on using functions to model data plotted on a Cartesian plane, so let’s start there. Suppose you go to the hardware store and buy hexnuts in different sizes. Now you weigh them. How will the size of the nut be related to the weight?
A super-advanced, from-the-hip answer we’d like high-schoolers to give is, “probably more or less cubic, but we should check.” The more-or-less cubic part (which less-experienced high-schoolers will not offer) comes from several assumptions we make, which it would be great to force advanced students to acknowledge, namely, the hexnuts are geometrically similar, and they’re made from the same material, so they’ll have the same density.
Most students, however, won’t have that instant insight. So we begin by asking them to predict. They will draw graphs that increase, which is a good start, and many student graphs will curve upwards. Great! Then when they measure, students see something like the graph. (Here is a Desmos document with the data.)
They will look at this graph, and many will say it looks like a parabola. Which it does. Kinda. But if you put a quadratic of the form y = kx² on the graph, you will never get it to fit well, no matter what value of k you choose.
We’re doing this quickly, so let’s skip ahead to say that when you try y = kx³, things go much better (though they aren’t perfect). Then you can explore why you think it “goes like” x cubed. You might get there by discussing the quarter-inch nut and the half-inch nut, and pass out samples, and ask, why isn’t the half-inch twice the weight of the quarter-inch. And so forth.
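You can see the difference numerically using the hexnut data from the end of this post (bolt size in inches, mass in grams): fit the best k for both a quadratic and a cubic model and compare the residuals. A Python sketch:

```python
import numpy as np

# Bolt sizes (inches) and masses (grams) from the data in this post.
size = np.array([0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75])
mass = np.array([3.09, 4.72, 6.73, 12.77, 15.85, 31.06, 48.8])

def fit_power(x, y, p):
    """Least-squares k for y = k * x**p, plus the residual sum of squares."""
    k = np.sum(y * x**p) / np.sum(x ** (2 * p))
    rss = np.sum((y - k * x**p) ** 2)
    return k, rss

k2, rss2 = fit_power(size, mass, 2)
k3, rss3 = fit_power(size, mass, 3)
print(f"quadratic: k = {k2:.1f}, RSS = {rss2:.1f}")
print(f"cubic:     k = {k3:.1f}, RSS = {rss3:.1f}")  # cubic fits much better
```

No matter how you tune k, the quadratic leaves residuals several times larger than the cubic's.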
Now the discussion can go in two interesting and different directions. One goes to the wonderful snook data where you can do a similar exploration about fish, and (especially if you do a log transform) you discover that the best exponent for these fish is about 3.24 rather than 3.00. And you gotta wonder how that could be.
But we’re not going there. Instead we start with a great question that astute readers must have asked, but we have avoided until now: what do we mean by the size of a hexnut? The answer is, the diameter of the bolt that it fits. Right? A quarter-inch nut is the one that fits a quarter-inch bolt. It’s bigger (duh) than a quarter inch across.
So let’s consider modeling the geometry of the situation. A hexnut is a hexagonal prism, right? And it has a circular hole.
We can measure more than the weight. Suppose we measure the distance across the faces (f) and the thickness (t) as well. Then the area of the relevant hexagon is (√3/2)f², and the resulting volume of the solid, taking the hole into account, is
V = t × ((√3/2)f² − π(d/2)²),
where d is the bolt diameter.
So we can model the mass with m = ρV, where ρ is the density of the material. The next graph, from Fathom, shows this calculated model mass against the measured mass, along with a residual plot. We have already slid the rho slider to a decent value.
I have positioned the slider so the first few points are flat. When I do that, they are not centered at zero. Also, the larger nuts deviate from the line more and more. This is an indication that there are systematic effects that our model doesn’t account for. I leave it to you to think about what those might be!
The value of rho? For this graph, 8.22 grams per cc, which is not bad (just a little high) for the density of steel.
And the modeling point? If you agree with my earlier assertion that simplification and abstraction are the hallmarks of modeling, then this is modeling even though we didn’t make a fancy function. We assumed that the nut is truly shaped like a hexagonal prism with a cylindrical hole cut in it. That’s patently false, but it’s a completely decent approximation, and much easier to deal with than the ugly, messy truth. What’s more, we can use the ways reality deviates from the simple, abstract model to decide whether our model values might be a little high or low.
Here are the data from that graph. Lengths are in inches; mass is in grams:

boltSize   flat     thick    mass
0.25       0.4298   0.2148    3.09
0.3125     0.4951   0.2651    4.72
0.375      0.5555   0.3236    6.73
0.4375     0.6807   0.3802   12.77
0.5        0.7407   0.4253   15.85
0.625      0.9223   0.5450   31.06
0.75       1.0920   0.6337   48.8
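If you want to check the prism-with-a-hole model without Fathom, here is a Python version using the table above; the only extra ingredient is the conversion from cubic inches to cubic centimeters.

```python
import math

# (boltSize, flat, thick, mass) from the table: inches and grams.
nuts = [
    (0.25,   0.4298, 0.2148,  3.09),
    (0.3125, 0.4951, 0.2651,  4.72),
    (0.375,  0.5555, 0.3236,  6.73),
    (0.4375, 0.6807, 0.3802, 12.77),
    (0.5,    0.7407, 0.4253, 15.85),
    (0.625,  0.9223, 0.5450, 31.06),
    (0.75,   1.0920, 0.6337, 48.80),
]

IN3_TO_CC = 2.54 ** 3  # cubic inches to cubic centimeters

densities = []
for d, f, t, m in nuts:
    # Volume of a hexagonal prism (across-flats f) minus the bolt hole.
    hexagon = (math.sqrt(3) / 2) * f**2
    hole = math.pi * (d / 2) ** 2
    volume_cc = t * (hexagon - hole) * IN3_TO_CC
    densities.append(m / volume_cc)
    print(f"{d:6.4f}-inch nut: density = {m / volume_cc:.2f} g/cc")
```

Every nut comes out near 8 grams per cc, consistent with the slider value in the post and plausible for steel, which is the point: the false-but-decent geometric model recovers a sensible physical constant.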
There is a lot more to say, but hey, it’s been almost four hours, so it’s way past time to stop.