The Data and Story Library, originally hosted at Carnegie Mellon, was a great resource for data for many years. But it was unsupported, and was getting a bit long in the tooth. The good people at Data Desk have refurbished it and made it available again.
The site includes scores of data sets organized by content topic (e.g., sports, the environment) and by statistical technique (e.g., linear regression, ANOVA). It also includes famous data sets such as Hubble’s data on the radial velocity of distant galaxies.
One small hitch for Fathom users:
In the old days of DASL, you would simply drag the URL mini-icon from the browser’s address field into the Fathom document and amaze your friends with how Fathom parsed the page and converted the table of data on the web page into a table in Fathom. Ah, progress! The snazzy new and more sophisticated format for DASL puts the data inside a scrollable field — and as a result, the drag gesture no longer works in DASL.
Fear not, though: @gasstationwithoutpumps (comment below) realized you could drag the download button directly into Fathom. Here is a picture of a button on a typical DASL “datafile” page. Just drag it over your Fathom document and drop:
In addition, here are two workarounds:
Place your cursor in that scrollable box. Select All. Copy.
Switch to Fathom. Create a new, empty collection by dragging the collection icon off the shelf.
With that empty collection selected, Paste. Done!
Use their Download button to download the .txt file.
In the last three posts we’ve discussed clumpiness. Last time we studied people walking down a concourse at the big Houston airport, IAH, and found that they were clumped. We used the gaps in time between these people as our variable. Now, as we did two posts ago with stars, we’ll look at the same data, but by putting them in bins. To remind you, the raw data:
In the last two posts, we talked about clumpiness in two-dimensional “star fields.”
In the first, we discussed the problem in general and used a measure of clumpiness created by taking the mean of the distances from the stars to their nearest neighbors. The smaller this number, the clumpier the field.
In the second, we divided the field up into bins (“cells”) and found the variance of the counts in the bins. The larger this number, the clumpier the field.
Both of these schemes worked, but the second seemed to work a little better, at least the way we had it set up.
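Here is a minimal sketch of the two measures in Python, for readers who want to try them. It is not the code used in the earlier posts; the unit-square field, the 5-by-5 grid, the function names, and the way the clumpy field is generated are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def mean_nearest_neighbor_distance(points):
    """Mean distance from each star to its nearest neighbor (smaller = clumpier)."""
    nearest = []
    for i, (x1, y1) in enumerate(points):
        d = min(math.hypot(x1 - x2, y1 - y2)
                for j, (x2, y2) in enumerate(points) if j != i)
        nearest.append(d)
    return mean(nearest)


def bin_count_variance(points, bins_per_side):
    """Variance of the star counts in a square grid of cells on the unit square
    (larger = clumpier). Population variance is an arbitrary choice here."""
    counts = [0] * (bins_per_side * bins_per_side)
    for x, y in points:
        # Clamp so a coordinate of exactly 1.0 lands in the last cell.
        col = min(int(x * bins_per_side), bins_per_side - 1)
        row = min(int(y * bins_per_side), bins_per_side - 1)
        counts[row * bins_per_side + col] += 1
    return pvariance(counts)


def clamp(v):
    return min(max(v, 0.0), 1.0)


rng = random.Random(1)

# 200 stars placed uniformly at random.
uniform_field = [(rng.random(), rng.random()) for _ in range(200)]

# Hypothetical clumpy field: 10 random centers, 20 stars tightly around each.
centers = [(rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)) for _ in range(10)]
clumpy_field = [(clamp(rng.gauss(cx, 0.02)), clamp(rng.gauss(cy, 0.02)))
                for cx, cy in centers for _ in range(20)]

for name, field in [("uniform", uniform_field), ("clumpy", clumpy_field)]:
    print(name,
          "mean NN distance:", round(mean_nearest_neighbor_distance(field), 4),
          "bin-count variance:", round(bin_count_variance(field, 5), 2))
```

The clumpy field should show the smaller mean nearest-neighbor distance and the larger bin-count variance, matching the two rules of thumb above.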
We also saw that this was pretty complicated, and we didn’t even touch the details of how to compute these numbers. So this time we’ll look at a version of the same problem that’s easier to wrap our heads around, by reducing its dimension from 2 to 1. This is often a good strategy for making things more understandable.
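To see why one dimension is easier, here is a rough sketch of how both measures translate to points on a line. This is a direct translation made up for illustration, not necessarily the measure this series ends up using; the positions, bin width, and function names are hypothetical.

```python
from statistics import mean, pvariance


def gaps(positions):
    """Gaps between consecutive points, after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def mean_nearest_neighbor_gap(positions):
    """1D version of the first measure: each point's nearest neighbor is
    across the smaller of its two adjacent gaps. Needs at least 2 points."""
    g = gaps(positions)
    nearest = [g[0]] + [min(a, b) for a, b in zip(g, g[1:])] + [g[-1]]
    return mean(nearest)


def bin_counts(positions, start, width, n_bins):
    """1D version of the second measure's cells: equal-width bins."""
    counts = [0] * n_bins
    for p in positions:
        k = int((p - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical positions on a line from 0 to 60 (made-up numbers).
evenly_spread = [2, 8, 14, 20, 26, 32, 38, 44, 50, 56]
clumped = [3, 4, 5, 6, 30, 31, 32, 33, 55, 56]

for name, data in [("evenly spread", evenly_spread), ("clumped", clumped)]:
    counts = bin_counts(data, start=0, width=10, n_bins=6)
    print(name, "| mean nearest gap:", mean_nearest_neighbor_gap(data),
          "| bin counts:", counts,
          "| variance:", round(pvariance(counts), 2))
```

With these made-up numbers the clumped set has the smaller mean nearest gap (1 versus 6) and the larger bin-count variance (about 3.22 versus 0.22). Sorting does all the neighbor-finding work, which is a big part of what makes the 1D case simpler.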
Where do we see one-dimensional clumpiness? Here’s an example:
One day, a few years ago, I had some time to kill at George Bush Intercontinental, IAH, the big Houston airport. If you’ve been to big airports, you know that the geometry of how to fit airplanes next to buildings often creates vast, sprawling concourses. In one part of IAH (I think in Terminal C) there’s a long, wide corridor connecting the rest of the airport to a hub with a slew of gates. But this corridor, many yards long, had no gates, no restaurants, no shoe-shine stands, no rest rooms. It was just a corridor. But it did have seats along the side, so I sat down to rest and people-watch.
We’re careening towards the end of the semester in calculus, and I know I’m mostly posting about stats, but this just happened in calc and it applies everywhere.
We’ve been doing related rate problems, and had one of those classic calculus-book problems that involves a cone. Sand is being added to a pile, and we’re given that the radius of the pile is increasing at 3 inches per minute. The current radius is 3 feet; the height is 4/3 the radius; at what rate is sand being added to the pile?
Never mind that no pile of sand is shaped like that—on Earth, anyway. I gave them a sheet of questions about the pile to introduce the angle of repose, etc. I think it’s interesting and useful to be explicitly critical of problems and use that to provoke additional calculation and figuring stuff out. But I digress.
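For reference, here is a quick sketch of the sand-pile calculation, assuming the pile is a right circular cone with height 4/3 of the radius, as the problem states. The variable names are made up, and working in feet is a choice made here. Since V = (1/3)πr²h and h = (4/3)r, we get V = (4/9)πr³, so dV/dt = (4/3)πr² · dr/dt.

```python
import math

HEIGHT_TO_RADIUS = 4 / 3  # given in the problem: h = (4/3) r


def cone_volume(r):
    """Volume of a cone whose height is HEIGHT_TO_RADIUS times its radius."""
    h = HEIGHT_TO_RADIUS * r
    return math.pi * r**2 * h / 3


def volume_rate(r, dr_dt):
    """dV/dt by the chain rule: V = (pi/3) k r^3, so dV/dr = pi k r^2."""
    return math.pi * HEIGHT_TO_RADIUS * r**2 * dr_dt


r_ft = 3.0                 # current radius, in feet
dr_dt_ft_per_min = 3 / 12  # 3 inches per minute, converted to feet per minute

rate = volume_rate(r_ft, dr_dt_ft_per_min)

# Sanity check: central finite difference of V(r(t)) over a tiny time step.
dt = 1e-6
numeric = (cone_volume(r_ft + dr_dt_ft_per_min * dt)
           - cone_volume(r_ft - dr_dt_ft_per_min * dt)) / (2 * dt)

print("dV/dt (chain rule):       ", rate, "cubic feet per minute")
print("dV/dt (finite difference):", numeric, "cubic feet per minute")
```

The chain-rule answer is 3π, or about 9.4 cubic feet per minute. The unit conversion (3 inches per minute against a radius of 3 feet) is the easy place to slip.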
Okay: one class down, 27 to go. The big problem right now is scheduling “lab” time, an extra hour a week that will make up the rest of the time we need to get through the material and learn the stuff that’s not in the ISCAM text, such as EDA and more probability.
I do not yet have a sense of how fast we can get through some of the investigations; I have hopes that once we get the hang of it, some can be slower and more thoughtful, while others can be more practice- and application-y.
I did start with good old Aunt Belinda, for comfort’s sake. It’s odd; I may go more slowly—too slowly—when I’m more familiar with the approach.
It’s Sunday. On Thursday, Math 102—Statistics and Probability—has its first meeting at Mills College, and I am allegedly in charge. This is a one-semester course at the college level, with calculus required, in contrast to the year-long, high-school, non-AP classes I taught a few years ago.
So we will have to move pretty fast, but the students have more experience, which I hope will mostly be a good thing.
I’ve just come back from a few days at Cal Poly, watching Beth Chance and Allan Rossman actually teaching their courses, to see what the masters look like in action. It was inspiring and daunting. One thing Beth said that made me grimace was how important it was to take a few minutes to reflect on what worked. So here I am, gonna try again. I have hopes but make no promises, as this semester will be packed: I’m also teaching Calculus I and Multivariable, two more courses I’ve never taught before. I took them in college, and did well, though; OTOH, it’s been a long time since Green’s Theorem: my 40th reunion is this spring.
I will of course be using a simulation-based approach to inference. ISCAM starts that way but quickly (I think) brings in Normal-based inference and t procedures. I’m re-ordering some of their investigations to bring the Normal in later.
Students get Fathom for free, still, so we’ll be using that; I’ll write Fathom-based instructions to replace the ones ISCAM uses for R. It will mostly be fine; I think I saw one thing in the R code that I didn’t know how to do in Fathom.
At the same time, Fathom has trouble right now: under Mavericks, data import from Census or the Web is broken. That was so great in the past, but now many of my handouts from before will no longer work. Arrgh.
Simulation-based inference is a big enough deal now that some of the Big Dogs of the movement have a blog.
I hope to get a link to have my students do the CAOS test so we can compare. It will also give me a nice pre-assessment so I have a clue what they know about simple stuff.
Last time, we saw how the length of a hanging slinky is quadratic in the number of links, namely,
where M is the mass of the hanging part of the slinky, g is the acceleration of gravity, and is the “stretchiness” of the material (related to the spring constant k—but see the previous post for details).
And this almost perfectly fit the data, except when we looked closely and found that the fit was better if we slid the parabola to the right a little bit. Here are the two graphs, with residual plots: