Welcome to the third in a soon-to-end series in which I figure out what I think about time series data, and how it is different from the data we usually encounter in school assignments. We’re exploring what tools and techniques we might use with time series that we don’t traditionally cover in stats or science, and wondering whether maybe we should. For me and my CODAP buddies, it also raises the question of whether a tool like CODAP should have more built-in support for time series.

## Smoothing

One of these tools is *smoothing*. There is smoothing (for example Loess smoothing) for other, non-function-y data, but smoothing is easier to understand with time series (or any other data that’s fundamentally a function; the previous post in this series explains this obsession with functions).

Since it’s December 2021, let’s stick with COVID and look at the number of new cases in the US by day (CODAP file here):

What a funny-looking graph! The question marks point to a region in about September 2021 where there seem to be several strands of case numbers. What’s going on? Have we put five different countries in here together by mistake?

No. Let’s zoom in and connect the dots:

Looks kinda periodic, right? With a period of…seven. (The phenomenon of seeing the data as separate sequences is called *aliasing*.)

Oh! It’s *weeks*. And if you dig deeper, you see that the two “low” days in each period are Saturday and Sunday—except for that first weekend in September, which seems to have three days….ohhhh! Labor Day. Looking at the details highlights the human nature of data. The numbers are *reported* cases, and some people go home over the weekend and file a pack of new reports on Monday.

But we’re interested in the bigger picture, the overall pattern of COVID cases over the course of months rather than the details within an individual week. Students can come up with ideas about how to display that, and might come up with two strategies, two different *data moves*:

- Do everything by the week instead of by the day. If you want an average daily count (rather than a weekly total), just add up all seven numbers from a week and divide by seven. This produces one data value per week.
- Do everything by day, but for each day, average that day with the six that follow it. This produces one value per day. This is more sophisticated, though not necessarily any better.
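To make the two data moves concrete, here is a small sketch in Python with NumPy. The daily counts are made up for illustration: a weekly pattern with a weekend dip, plus a slow upward trend so we can see that smoothing keeps the big picture while removing the wiggle.

```python
import numpy as np

# Made-up daily counts: a weekly pattern with a weekend dip, plus a slow trend.
daily = np.array([100, 110, 105, 108, 112, 60, 55] * 4, dtype=float)
daily += np.arange(28)  # upward trend: counts creep up by 1 per day

# Strategy 1: one value per week -- average the seven days in each week.
weekly_avg = daily.reshape(-1, 7).mean(axis=1)

# Strategy 2: one value per day -- average each day with the six that follow.
rolling = np.array([daily[i:i + 7].mean() for i in range(len(daily) - 6)])

print(weekly_avg)  # 4 values: the weekly wiggle is gone, the trend remains
print(rolling)     # 22 values: rises smoothly, 1 per day
```

Both results are smooth; the second just gives you a point for (almost) every day instead of one per week.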

For a small dataset, students can do these by hand and get a feel for how the smoothing works. But doing it with technology requires thinking about the tools.

#### Weekly Averages

In CODAP, for the weekly averages, they might create a new attribute for `week_number` (perhaps using `floor(caseIndex/7)`). Then group by that attribute (dragging it to the left) and create a new weekly average for each group. If you plot that against `date`, you get seven identical values per week. If you connect the dots and make them small, you see only the lines. Some options:

This is not as easy as I made it sound. At the end of 2021, CODAP’s powerful grouping capabilities (see *Awash in Data* for many more details) can handle this, but not (ha ha) smoothly. It’s designed for other situations, not for this kind of work with time series data. (If you try it, and I hope you do, you’ll need, in a number of steps, to eliminate the formulas while keeping their values. Here is a link to a CODAP file with the original data.)

#### Smoothing: the rolling mean

If you want the sophisticated “rolling mean” solution in CODAP, you’re in luck! There’s a function called `rollingMean`:

The difference between CODAP’s rolling mean and the one we described before is that its window is *centered*: instead of averaging the current value with the six that follow it, it averages the current value with the previous three and the following three.
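Here is a minimal Python sketch of a centered seven-point rolling mean in that spirit. How CODAP handles the ends of the series is not something I’m claiming here; shrinking the window at the edges is just one reasonable choice.

```python
def rolling_mean_centered(values, width=7):
    """Average each value with its (width - 1) / 2 neighbors on each side."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))  # window shrinks at edges
    return out

# A lone spike gets spread across the whole window instead of staying a spike.
print(rolling_mean_centered([0, 0, 0, 7, 0, 0, 0]))
```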

#### Reflecting on the two methods

I think the first is conceptually easier for kids to grok than the second: just plot it by week.

The curse of the second, rolling-mean strategy is that you get this nice smooth result using one relatively-obscure function *whether you understand the function or not*. But if students can handle it, the second has some interesting properties, and is “open at the top”: it leads elsewhere. In particular, I think understanding it expands your understanding of the mean.

Here’s what I mean by that: Suppose you wanted to explain what was going on. You might use a metaphor like this:

Imagine a line of people, all equally spaced. You want to weigh them and plot the weight against where they are in the line. One way is, you put a scale on a little cart and run it along the line of people. When the cart gets to a person, they step on, and the weight gets recorded. Now imagine that everybody is close to the same weight, except for one really heavy person called Blanziflor. When you get to Blanziflor, you’ll see a spike in the graph.

Now imagine a bigger cart, one that’s as wide as seven people, and the scale will weigh them all together. As the cart rolls along, when one person gets on the front, another steps off the back. So the total weight gets greater by the weight of the person getting on, and less by the weight of the one leaving.

Now when you get to Blanziflor, the total goes up by a lot—the same amount as before—and stays high for as long as Blanziflor is on the scale: seven spots. But the scale records the *total* weight of seven people, so when you record the *average* weight, you have to divide the total by 7. That is, the rise of the average when Blanziflor is on the scale is only 1/7 of what it was when we weighed people one at a time. However, that rise goes on for seven times as long. It all balances out.

The result is that fluctuations in the values we get for the average—the rolling mean—are *smaller* than if we just plotted each data point. Even so, the values still take all the underlying data into account.
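The metaphor checks out in arithmetic. Here is a toy version in Python; Blanziflor’s 70-unit excess weight is invented for the example:

```python
# One heavy value (Blanziflor) in otherwise flat data.
weights = [70.0 if i == 10 else 0.0 for i in range(21)]

window = 7
means = [sum(weights[i:i + window]) / window
         for i in range(len(weights) - window + 1)]

print(max(means))                      # the rise shrinks to 70/7 = 10
print(sum(1 for m in means if m > 0))  # ...but it lasts for 7 positions
print(sum(means))                      # and it all balances out: still 70
```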

(Picture the animation that shows this. I don’t have time to make it right now, but imagination is powerful!)

Stats teachers will also recognize a connection to sample size and sampling distributions: the spread of the “sample mean” decreases as *N* increases, even though these are far from being random, independent samples.

To go further, notice that our cart, our moving window on the data, is rectangular: you’re either in the interval or you’re not, and everybody in the interval counts the same. It doesn’t have to be that way. You could count people near the edges less than the ones in the middle, doing a weighted average of their values. Then when Blanziflor stepped on, it wouldn’t be such a big jump—it would ramp up more gradually, and then gradually fall off. This leads to the idea of *convolution*, especially when you make this weighting function continuous, and instead of thinking carefully about how much to add and what to divide by, you just wheel out the big guns and integrate that sucker.
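For the curious, here is what that weighted window looks like as a discrete convolution in Python. The triangular weights are just one example of a non-rectangular window, chosen for this sketch:

```python
import numpy as np

spike = np.zeros(21)
spike[10] = 70.0

rect = np.ones(7) / 7                        # the rectangular "cart"
tri = np.array([1, 2, 3, 4, 3, 2, 1], float)
tri /= tri.sum()                             # weights must sum to 1

print(np.convolve(spike, rect, mode="same"))  # flat plateau of height 10
print(np.convolve(spike, tri, mode="same"))   # gradual ramp up to 17.5, then down
```

With the triangular window, Blanziflor’s arrival is a ramp rather than a jump, exactly as the metaphor suggests.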

#### How many points should you average over?

We’ve been talking about data that have a weekly fluctuation, so we naturally decided to average over seven points. Think about the alternative: if you averaged over five (or ten), different windows would contain different numbers of weekend days, so you would still see the fluctuation.

If we look at monthly CO2 data from Hawaii, we would see an annual fluctuation, so we should average over 12. (This only works, by the way, if the rolling window is “rectangular” in the sense we talked about above. Which is great, because that’s simplest.)
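Here is a quick numerical check of that claim, using strictly periodic fake data:

```python
import numpy as np

# Fake data with an exact 7-day cycle.
daily = np.array([100, 110, 105, 108, 112, 60, 55] * 8, dtype=float)

def rolling(values, width):
    return np.array([values[i:i + width].mean()
                     for i in range(len(values) - width + 1)])

# A 7-wide rectangular window sees exactly one full cycle everywhere,
# so the smoothed series is perfectly flat. A 5-wide window is not.
print(rolling(daily, 7).std())  # 0.0
print(rolling(daily, 5).std())  # clearly nonzero: the weekly dip leaks through
```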

But taking care of periodicity is not the only reason to smooth! If your time series fluctuates a lot from point to point, any amount of smoothing will make the graph simpler, sacrificing the details of the change for a broader view of the data. As an example, the next graph shows daily high temperatures at San Francisco International Airport (SFO) for the first six months of 2020 and 2021. They vary a lot from day to day.

Now the same data, smoothed over 14 days:

## Folding

With periodic data, *folding* or *wave slicing* is a powerful technique for understanding that periodicity, for example, by figuring out the period. It turns out I’ve already posted about this technique here, and developed that into a paper you can read.

The basic idea is to chop up the data into chunks one period long and superimpose them. The trick is that this “period” might be wrong, so you put it on a slider and adjust it dynamically. When things line up, you have it right!

A cool part of the underlying math is that you do the slicing using the modulo function, which hardly ever appears in math class, but shows up in CS all the time. Here is a salient illustration from the article…
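In code, the fold is one modulo away. Here is a sketch with synthetic weekly data, where `period` plays the role of the slider; the spread measure is my own stand-in for eyeballing whether the folded points line up:

```python
import numpy as np

daily = np.array([100, 110, 105, 108, 112, 60, 55] * 8, dtype=float)
days = np.arange(len(daily))

def fold_spread(values, days, period):
    """How badly the folded data fail to line up: mean spread at each phase."""
    phase = days % period
    return np.mean([values[phase == p].std() for p in range(period)])

for period in (5, 6, 7, 8):
    print(period, fold_spread(daily, days, period))  # spread drops to 0 at 7
```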

I won’t spend more time on that here, except to say that this is another example of something that is not in introductory data analysis but that is useful when you analyze (periodic-ish) time series.

Next time: looking at time series when `time` is *not* on the horizontal axis. We’ll use phase-space diagrams to explore the effects of delay.
