There’s a great activity at the beginning of Workshop Statistics where kids write a couple of sentences about why they’re taking the course, and then construct the distribution of word lengths. The main point is to ask, “are all word lengths the same?” Answer: no, duh. Right: they vary. It’s not that an individual word changes its length, but that the quantity “word length” varies from word to word. So it’s a variable, in a way that’s a little different from the variables they’re used to from algebra.
But what does the distribution look like? Rather than look it up, I found Hamlet’s “to be or not to be” soliloquy online, pasted it into my favorite text processor (TextMate) and did a bunch of global substitutions so that every word was on its own line. (I also stripped out hyphens and apostrophes and other punctuation, which may not always be appropriate, but never mind. But I think of ’tis as a three-, not a four-letter word.) Then a quick dump into a Fathom collection, and a new attribute (or variable) with a formula like stringLength(WORDS) and you’re all set. This process takes enough fluency that it’s an inappropriate activity for the kids in my class, at least, but the results are interesting enough to share, as in the illustration at right.
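For readers without TextMate and Fathom, the same steps (strip punctuation, put each word on its own line, compute its length) can be sketched in a few lines of Python. This is only a rough equivalent, not the process I actually used; the function names `clean_words` and `length_distribution` are made up for illustration, and the choice to delete hyphens and apostrophes outright mirrors the decision described above.

```python
import re
from collections import Counter

def clean_words(text):
    """Split text into words, deleting apostrophes and hyphens first.

    Deleting (rather than splitting on) these characters means
    'tis counts as three letters and self-government as one long word.
    """
    text = re.sub(r"['’‘\-]", "", text)
    # Any run of letters is a word; everything else acts as a separator.
    return re.findall(r"[^\W\d_]+", text)

def length_distribution(words):
    """Count how many words there are of each length."""
    return Counter(len(w) for w in words)

sample = ("To be, or not to be, that is the question: "
          "Whether 'tis nobler in the mind to suffer")
words = clean_words(sample)
dist = length_distribution(words)

# A crude text histogram: one '#' per word of that length.
for length in sorted(dist):
    print(f"{length:2d} {'#' * dist[length]}")
```

Changing the first regular expression is where the "decisions you have to make" live: replace the deleted characters with a space instead, and hyphenated words split in two.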
I mean, look at the distribution! I knew at the outset that twos would be heavy. Think about it: TO BE OR TO BE IS, just in the first sentence! But all those threes and fours. Wow. It does make sense, but raises the question: what do other word-length distributions look like? Again, I know this must be done already, but it’s fun to do it yourself and see what you find, and to find out what decisions you have to make, such as whether to remove hyphens. Anyway, I happen to use the first paragraph of Don Quixote for another activity, so I have it around (“Desocupado lector,…”). After the same process, we have a distribution in Spanish (below):
Notice the lack of the 2-3-4 hump, but that, as in Hamlet, 2-letter words make up about 1/5 of the total. (I have scaled all graphs the same and used a density axis; that’s a whole other story.) We could clearly push on directly to more computation and hypothesis testing, but it’s too early in the year. I’m happy to gaze at the histograms.
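Comparing texts of different sizes only works if the bars show proportions rather than raw counts, which is roughly what the density axis does for bins one letter wide. A minimal sketch of that rescaling follows; `length_proportions` is a hypothetical helper, and the two word lists are toy examples, not the texts graphed here.

```python
from collections import Counter

def length_proportions(words):
    """Return {word length: fraction of all words with that length}."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

# Toy word lists, for illustration only.
english = "to be or not to be that is the question".split()
spanish = "en un lugar de la mancha de cuyo nombre no quiero acordarme".split()

for name, words in [("english", english), ("spanish", spanish)]:
    props = length_proportions(words)
    print(name, {k: round(v, 2) for k, v in props.items()})
```

Because each set of proportions sums to 1, a 10-word sample and a 10,000-word act can share the same vertical scale.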
Two more questions immediately arise for me:
- Is the soliloquy typical of a larger part of Hamlet?
- Is Shakespeare distinguishable from other authors just from this distribution?
So below we have two more: the entire third act of Hamlet, and then Bush’s second inaugural address.
I don’t know what you were expecting, but me, I thought Bush might have a substantially different distribution. I naïvely thought that we might see more shorter words, but on reflection, this makes sense: Shakespeare might be pithier; Bush’s speechwriters may be higher-falutin’ than I usually imagine. But what’s up with all those seven-letter words? So now, an interesting payoff from technology and doing it oneself: in Fathom, I can select that one bar, copy, and paste the words into a new collection—and look at the list of the 221 seven-letter words in his speech. Sorting that list, you can see which ones were used multiple times. The five most popular? Country, because, America, liberty, and 24 times, freedom.
The two 14s, by the way: selfgovernment. (I stripped the hyphens.)
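Selecting one bar and sorting its words has a simple counterpart outside Fathom: filter the word list by length and tally the repeats. The sketch below assumes the words are already cleaned and in a list; the helper names are invented, and the sample sentence is made up rather than taken from the speech.

```python
from collections import Counter

def words_of_length(words, n):
    """All words with exactly n letters, lowercased so repeats match."""
    return [w.lower() for w in words if len(w) == n]

def most_common_of_length(words, n, k=5):
    """The k most frequent n-letter words, with their counts."""
    return Counter(words_of_length(words, n)).most_common(k)

# Made-up sample, not the actual address.
sample = ("Freedom and liberty because freedom is liberty "
          "and America loves freedom").split()

print(most_common_of_length(sample, 7, k=3))
```

Calling `words_of_length(words, 14)` on the full cleaned text is the analogous way to find out what the longest words are.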
Want to play? I have posted the cleaned-up text files and stuffed them into a zip. Safari transparently downloads and unstuffs this into your Downloads folder, showing you an unhelpful blank window. You may need to do something else, e.g., right-click the link to get a menu to download the linked file (or however your browser says that).