Outliers in the NYT: Reflections on normality

Image from the NYT post

I need a good system to deal with those moments when you’re reading the news or listening to NPR and they bring up something that could fit into an actual lesson, connecting math to everyday life. This probably happens more when thinking about teaching stats than with other areas of math. Of course I have thought of clipping the articles, and I have several folders of them on my computer, but I can never find the one I want. Here is another attempt: blog about them! And we get a new category, Data in the News.

Onward! Yesterday’s NYT printed what appears online as a blog post by Carl Richards. It makes the point that we often assume erroneously that everything is normally distributed (yay!) and that this affects our expectations about, for example, investing. The outliers, he says, are much more salient than we think they would be. And then we get this delicious passage:

If you take the daily returns of the Dow from 1900 to 2008 and you subtract the 10 best days, you end up with about 60 percent less money than if you had stayed invested the entire time. I know that story has been told by the buy-and-hold crowd for years, but what you don’t hear very often is what happens if you were to miss the worst 10 days. Keep in mind that we are talking about 10 days out of 29,694. If you remove the worst 10 days from history, you would have ended up with three times more money.

This is interesting in itself, but in terms of my desire for kids to get data goggles and to look at claims and cry, “evidence please!” this is perfect. Because we can do just that: go online, get the data he’s talking about, load it into Fathom, and see if this claim is correct.

It turns out that the data—and a lot of interesting financial time-series data—is waiting to be plucked from MeasuringWorth.org. It takes a little munging to get right, but here, for example, is a graph of the log of the DJIA for that period.

Dow Jones (DJIA) from 1900 through 2008. In the log.

The choice of the log is in itself interesting. Why did Mr Erickson do that, precalc students? Because if you look at the plain old average over time, you see the following, on which the Great Depression isn’t even visible:

DJIA 1900–2008. No logs, no Depression.

Anyhow, we’re interested in the best and worst days. For that we need change. And to spare you the drama, a big question comes up: do we mean absolute change or percent change? Are we looking at difference or ratio? The New York Times didn’t tell us. But if we have the data and can analyze them ourselves, we can try it both ways.

I won’t spoil your fun. You can test Richards’s claims. But I will save you the munging trouble; I’ve posted a text file with the raw data on my site.
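If you would rather run the check in code than in Fathom, here is a minimal Python sketch of the procedure. It shows a method, not a result: the closing values are made up (a seeded random walk, not the real DJIA), and the names `growth_factor` and `extreme_days` are hypothetical, invented for illustration. It ranks days either by ratio or by point difference, then compounds the daily ratios while skipping the chosen days.

```python
import random

def growth_factor(closes, skip=()):
    """Compound the daily ratios close[i] / close[i-1], leaving out the days in `skip`."""
    skip = set(skip)
    total = 1.0
    for i in range(1, len(closes)):
        if i not in skip:
            total *= closes[i] / closes[i - 1]
    return total

def extreme_days(closes, n, best=True, use_ratio=True):
    """Indices of the n best (or worst) days, ranked by ratio or by point change."""
    def change(i):
        if use_ratio:
            return closes[i] / closes[i - 1]
        return closes[i] - closes[i - 1]
    ranked = sorted(range(1, len(closes)), key=change, reverse=best)
    return ranked[:n]

# Made-up closing values, NOT the real DJIA.
random.seed(1)
closes = [100.0]
for _ in range(2000):
    closes.append(closes[-1] * (1 + random.gauss(0.0003, 0.01)))

full = growth_factor(closes)
no_best = growth_factor(closes, skip=extreme_days(closes, 10, best=True))
no_worst = growth_factor(closes, skip=extreme_days(closes, 10, best=False))

print(f"stayed invested:       {full:.2f}x")
print(f"missed 10 best days:   {no_best:.2f}x")
print(f"missed 10 worst days:  {no_worst:.2f}x")
```

To test the actual claim, replace `closes` with the real daily values and try it again with `use_ratio=False` to see whether the choice between difference and ratio changes which days count as the best and worst.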

I will, however, comment on the normality of the data. Note: this is really an AP or college stats topic, but since we have the technology to address it directly, why the heck not? Suppose we want ratios; then we calculate the difference in the log from one day to the next. Here is that distribution:

The daily differences in the log of the DJIA, 1900–2008
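The link between ratios and differences in the log can be sketched in a few lines of Python. The closing values below are made up, the function name `log_diffs` is hypothetical, and base 10 is an assumption (the choice of base only rescales the values). The point is that the difference of two logs equals the log of their ratio, so the day-to-day differences add up to the log of the overall ratio.

```python
import math

def log_diffs(closes, base=10.0):
    """Day-to-day differences in the log of the closing values."""
    logs = [math.log(c, base) for c in closes]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]

closes = [100.0, 110.0, 99.0]  # made-up values: up 10 percent, then down 10 percent
diffs = log_diffs(closes)
ratios = [later / earlier for earlier, later in zip(closes, closes[1:])]

for d, r in zip(diffs, ratios):
    print(f"ratio {r:.3f}   difference in log {d:+.4f}   log of ratio {math.log10(r):+.4f}")

# The differences telescope: their sum is the log of last / first.
print(f"sum of differences {sum(diffs):+.4f}   log of overall ratio {math.log10(closes[-1] / closes[0]):+.4f}")
```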

See how different it looks from the “napkin” drawing, above? And see the serious low outlier at about –0.11? October 19, 1987. But how do we know it’s not normal? With experience, it’s pretty clear. But if you’re a newcomer to these things, what can convince you? I’m not sure, but let me show you two things. First, the same data but with a normal curve having the same mean and SD as the data:

Histogram of difference in log of DJIA, 1900–2008, with Normal curve

I have zoomed in on the central hump. Even though the hump looks more or less normal, the curve clearly has a bigger spread (and therefore a lower central density) than the data. The extreme days inflate the SD, so a normal curve matched to that SD comes out too wide for the bulk of the data.
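The same comparison can be made with numbers instead of a picture. This is a sketch under stated assumptions: the data are made up (mostly calm days with an occasional wild one, not the DJIA), and `share_within` is a hypothetical helper. It fits a normal distribution with the data's own mean and SD, then compares the share of the data within k SDs of the mean with the share the fitted normal predicts.

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(2)
# Made-up heavy-tailed "daily changes": about 5 percent of days are much wilder.
data = [random.gauss(0, 0.03 if random.random() < 0.05 else 0.004)
        for _ in range(20000)]

mu, sd = fmean(data), stdev(data)
fitted = NormalDist(mu, sd)  # normal curve with the same mean and SD as the data

def share_within(k):
    """Observed and normal-predicted share of values within k SDs of the mean."""
    lo, hi = mu - k * sd, mu + k * sd
    observed = sum(lo <= x <= hi for x in data) / len(data)
    predicted = fitted.cdf(hi) - fitted.cdf(lo)
    return observed, predicted

for k in (0.5, 1, 4):
    observed, predicted = share_within(k)
    print(f"within {k} SD: data {observed:.4f}, normal {predicted:.4f}")
```

With heavy-tailed data like these, more of the data sits close to the mean than the normal curve predicts, and more sits far out in the tails as well. That is the pattern to look for in the real log differences.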

Then, a normal quantile plot:

Normal Quantile Plot for daily difference in log of DJIA, 1900–2008

In case this kind of plot is unfamiliar: if the data were normally distributed, the points would fall roughly on a straight line. We seldom see a dataset that we might initially think is normal (it has a hump in the middle and trails off kinda symmetrically on both sides) that is so clearly not normal on this plot.
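For anyone curious how such a plot is built, here is a minimal sketch. The function names are hypothetical, the data are made up, and the plotting position (i + 0.5)/n is one common convention, not necessarily the one Fathom uses. Each sorted value is paired with the z-score a normal sample would have at that rank, and the correlation of those pairs serves as a rough measure of how straight the plot is.

```python
import random
from statistics import NormalDist, fmean

def normal_quantile_points(data):
    """Pair each sorted value with the standard-normal quantile for its rank."""
    n = len(data)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), x)
            for i, x in enumerate(sorted(data))]

def straightness(points):
    """Pearson correlation of the plotted points; 1.0 is a perfect straight line."""
    zs, xs = zip(*points)
    mz, mx = fmean(zs), fmean(xs)
    szx = sum((z - mz) * (x - mx) for z, x in points)
    szz = sum((z - mz) ** 2 for z in zs)
    sxx = sum((x - mx) ** 2 for x in xs)
    return szx / (szz * sxx) ** 0.5

random.seed(3)
# Both samples are made up; neither is the DJIA.
normal_sample = [random.gauss(0, 0.01) for _ in range(5000)]
heavy_sample = [random.gauss(0, 0.03 if random.random() < 0.05 else 0.004)
                for _ in range(5000)]

r_normal = straightness(normal_quantile_points(normal_sample))
r_heavy = straightness(normal_quantile_points(heavy_sample))
print(f"normal sample:       r = {r_normal:.4f}")
print(f"heavy-tailed sample: r = {r_heavy:.4f}")
```

The correlation is only a summary. The plot itself shows where the departure from a line happens, which for heavy-tailed data is at both ends.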

I’m also intrigued by other pedagogical opportunities here (provided students care a whit about the DJIA). For example, take that low point back in 1987: what does that value of –0.111 actually mean?
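One way students might start on that question is to undo the log. The sketch below assumes nothing about which base was used and simply tries both common ones; `ratio_from_log_diff` is a hypothetical helper name. A difference in the log of d corresponds to a day-to-day ratio of base^d.

```python
import math

def ratio_from_log_diff(log_diff, base):
    """Turn a difference in the log back into the ratio new / old."""
    return base ** log_diff

log_diff = -0.111  # the value read off the histogram for October 19, 1987

for name, base in (("base 10", 10.0), ("base e", math.e)):
    ratio = ratio_from_log_diff(log_diff, base)
    print(f"{name}: ratio {ratio:.3f}, a drop of {100 * (1 - ratio):.1f} percent")
```

The two bases give quite different percentage drops, so comparing each with the widely reported one-day fall of about 22.6 percent on that date is a reasonable way for students to work out which log they are looking at.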

Author: Tim Erickson

Math-science ed freelancer and sometime math and science teacher. Currently working on various projects.

One thought on “Outliers in the NYT: Reflections on normality”

  1. Hi Tim. I will be following your blog – this is my first year teaching a non-AP stats class. Thanks for mining the raw data for us.
