The concentration of CO2 in the atmosphere is rising, and we have good data on that from, among other sources, atmospheric measurements that have been taken near the summit of Mauna Loa, in Hawaii, for decades.
Here is a link to monthly data through September 2018, as a CODAP document. There’s a clear upward trend.
Each of the 726 dots in the graph represents the average value for one month of data.
What do we have to do—what data moves can we make—to make better sense of the data? One thing that any beginning stats person might do is to fit a line to the data. I won’t do that here, but you can imagine what happens: the data curve upward, so the line is a poor model, but the positive slope of the line (about 1.5, which is in ppm per year) is a useful average rate of increase over the interval we’re looking at. You could consider fitting a curve, or a sequence of line segments, but we won’t do that either.
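If you do want to try the line fit, here is a minimal Python sketch of ordinary least squares. The numbers are toy values chosen to curve upward, not the Mauna Loa data, and the function name `fit_line` is made up for this illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that curves upward (y = x squared); NOT real CO2 values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

slope, intercept = fit_line(xs, ys)
# Leftover after subtracting the fitted line from each point.
leftovers = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("slope:", slope)
print("leftovers:", leftovers)
```

With these toy values the leftovers are positive at both ends and negative in the middle. That is the signature of a straight line fitted to curved data, even though the slope is still a sensible average rate of increase.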
Instead, let’s point out that the swath of points is wide. There are lots of overlapping points. We should zoom in and see if there is a pattern.
I’m going to pick 1990 and 1991. Here is what they look like:
Aha! There is an oscillating pattern—and it persists through all of the data. The concentration is highest in May and lowest in roughly September—but the whole cycle trends upwards from year to year.
From a data moves perspective, we have just used filtering: we’ve decided to look at a subset of the data in order to get a better idea of what’s going on. If we had a huge screen, we could have blown up the graph and seen this, but with the limited space of a blog post, this was effective. It’s also characteristic of data science as opposed to, say, school introductory statistics, where you tend to get all the data you need (and no more) to answer the question you’re facing, and where, if you make a graph of your data, you can see pretty much everything that can be seen.
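Outside CODAP, the same filtering move might look like the Python sketch below. The record layout (one dictionary per month with `year`, `month`, and `avg` keys) and the helper name `filter_years` are assumptions made for illustration, and the values are placeholders rather than real measurements.

```python
# Hypothetical records: one dict per month. The values below are made-up
# placeholders, not real Mauna Loa measurements.
records = [
    {"year": 1989, "month": 12, "avg": 352.0},
    {"year": 1990, "month": 1, "avg": 353.0},
    {"year": 1990, "month": 5, "avg": 356.0},
    {"year": 1991, "month": 9, "avg": 352.5},
    {"year": 1992, "month": 1, "avg": 355.5},
]


def filter_years(rows, first_year, last_year):
    """Keep only the rows whose year falls in [first_year, last_year]."""
    return [row for row in rows if first_year <= row["year"] <= last_year]


# Zoom in on 1990 and 1991, as in the graph above.
subset = filter_years(records, 1990, 1991)
print(len(subset), "of", len(records), "rows kept")
```

The filter leaves the original list untouched, so you can always go back to the full data set.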
Now suppose we want to characterize that oscillation. We do not need to guess the period (and use techniques that appear here); we know it’s one year. But what’s the overall shape? If we plot the values against month, we see this:
Hmm. The overall range of the data is so great that it obscures the signal we’re looking for. What we really want is how much the CO2 concentration differs from the underlying trend. Alas, that’s harder than I want to show right now, but here’s a simpler alternative: let’s compute how much the CO2 concentration differs from the mean for that year. This turns out to be subtle, and it requires some interesting data manipulation.
Our plan is this:
- For each year, compute the mean. We’ll call it meanCO2.
- For each month, compute the difference between the CO2 concentration and the mean value for that year. Call this residual.
- Plot those results: residual against month.
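For readers working outside CODAP, the first two steps of the plan can be sketched in plain Python. The names `meanCO2`, `residual`, and `avg` follow the text. The record layout and the function name `add_residuals` are assumptions, and the numbers are toy values chosen to make the arithmetic easy to check, not real CO2 data.

```python
from collections import defaultdict
from statistics import mean

# Toy monthly records (NOT real CO2 values): two "years" of two months each.
rows = [
    {"year": 2000, "month": 1, "avg": 10.0},
    {"year": 2000, "month": 2, "avg": 14.0},
    {"year": 2001, "month": 1, "avg": 20.0},
    {"year": 2001, "month": 2, "avg": 22.0},
]


def add_residuals(rows):
    """Group by year, compute each year's mean, then subtract it per month."""
    # Step 1: group the monthly values by year and summarize each group.
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row["avg"])
    mean_co2 = {year: mean(values) for year, values in by_year.items()}

    # Step 2: one residual per month, using the mean of that month's year.
    return [
        {
            **row,
            "meanCO2": mean_co2[row["year"]],
            "residual": row["avg"] - mean_co2[row["year"]],
        }
        for row in rows
    ]


result = add_residuals(rows)
for row in result:
    print(row["year"], row["month"], row["meanCO2"], row["residual"])
```

Note the two levels: `mean_co2` has one entry per year, while `result` has one entry per month. Keeping those two levels straight is the heart of the move.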
In CODAP, we do this by making our data table hierarchical. Essentially, we split the table into many (61) mini-tables, one for each year. Then, for each table, we compute the mean CO2. The CODAP table looks like this when we’re done:
This is a sophisticated data move, involving reorganizing the data (grouping it) and then summarizing each of the groups. By putting the meanCO2 variable on the left side of the table, we also ensure that we know which value goes with which year.
Now we compute the difference between each monthly value and the mean for its year. That will be a new column, but this time it has to go on the right side of the table, because it has a value for every month, whereas the annual mean has only one value per year. (Data move: calculating.) Here is the resulting table:
Again, 1989 is selected. Notice how the residual values reach a maximum in May (month = 5).
Now we can simply plot residual against month:
Now we can see more easily that a typical annual oscillation is about plus or minus 3 ppm, much larger than the 1.5 ppm per year trend.
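One hypothetical way to put a number on what the plot shows is to average the residuals for each month and take half the distance from the highest month to the lowest. This is a sketch, not something done in CODAP above. The function name `monthly_profile` is invented, and the residuals are toy values rather than results from the real data.

```python
from collections import defaultdict
from statistics import mean

# Toy residuals (NOT real results): two years' worth for three months.
residual_rows = [
    {"month": 5, "residual": 3.0},
    {"month": 5, "residual": 2.0},
    {"month": 9, "residual": -3.0},
    {"month": 9, "residual": -4.0},
    {"month": 1, "residual": 0.5},
    {"month": 1, "residual": -0.5},
]


def monthly_profile(rows):
    """Return {month: mean residual}, with months in ascending order."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["month"]].append(row["residual"])
    return {month: mean(values) for month, values in sorted(by_month.items())}


profile = monthly_profile(residual_rows)
peak_month = max(profile, key=profile.get)      # month with highest mean
trough_month = min(profile, key=profile.get)    # month with lowest mean
# Half the peak-to-trough distance: the "plus or minus" size of the swing.
half_range = (profile[peak_month] - profile[trough_month]) / 2

print(profile)
print("peak:", peak_month, "trough:", trough_month, "half range:", half_range)
```

This groups by month across all years, which is a different grouping from the by-year one used to compute meanCO2.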
But think about what we have done! We have chopped up the data into 61 years, overlaid those 61 plots, then adjusted them up and down automatically according to the average value on the individual plot. For a student, there are two principal challenges:
- Realizing that this is what they want to do, and
- Figuring out how to accomplish it given their particular tools.
(We should point out that there is another challenge that might be an important element of computational literacy: when you realize you can’t do what you want, figuring out some stand-in that you can do with the tools at hand. Or, another one: when you succeed in getting what you thought you wanted, recognizing on reflection that it’s not actually useful, and using that to make something better.)
In any case, I think the cognitively hard part of this is recognizing and effectively using the hierarchy (whether you use CODAP or not): keeping the “month” variables and the “year” variables clear in your head, and knowing how to maintain that distinction in the technology.
Another aside: This scheme we used, which essentially brings us into the realm of “seasonally adjusted values,” was impossible in Fathom. There was no way to calculate the mean of an attribute for a particular value of a different attribute; in this case, the mean of avg for each year. It’s surprisingly subtle, I think because it is easy to confuse variables with their values.