(This is part one. Link to part two.)
In the Data Science Games project, we have recently been exploring decision trees. It’s been great fun, and it’s time to post about it so you dear readers (all three or so of you) can play as well. There is even a working online not-quite-game you can play, and its URL will probably endure even as the software gets upgraded, so in a year it might even still work.
Here’s the genesis of all this: my German colleague Laura Martignon has been doing research on trees and learning, related to work by Gerd Gigerenzer at the Harding Center for Risk Literacy. A typical context is that of a doctor making a diagnosis. The doctor asks a series of questions; each question gets a binary, yes-no answer, which leads either to a diagnosis or a further question. The diagnosis could be either positive (the doc thinks you have the disease) or negative (the doc thinks you don’t).
The risk comes in because the doctor might be wrong. The diagnosis could be a false positive or a false negative. Furthermore, these two forms of failure are generally not equivalent.
Anyway, you can represent the sequence of questions as a decision tree, a kind of flowchart to follow as you diagnose a patient. And it’s a special kind of tree: all branchings are binary—there are always two choices—and all of the ends—the leaves, the “terminal nodes”—are one of two types: positive or negative.
The task is to design the tree. There are fancy ways (such as CART and Random Forest) to do this automatically using machine learning techniques. These techniques use a “training set”—a collection of cases where you know the correct diagnosis—to produce the tree according to some optimization criteria (such as how bad false positives and false negatives are relative to one another). So it’s a data science thing.
But in data science education, a question arises: what if you don’t really understand what a tree is? How can you learn?
That’s where our game comes in. It lets you build trees by hand, starting with simple situations. Your trees will not in general be optimal, but that’s not the point. You get to mess around with the tree and see how well it works on the training set, using whatever criteria you like to judge the tree. Then, in the game, you can let the tree diagnose a fresh set of cases and see how it does.
That’s enough for now. Your job is to play around with the tool. It will look like this to start:
The first few scenarios are designed so that it’s possible to make perfect diagnoses. No false positives, no false negatives. So it’s all about logic, and not about risk or statistics. But even that much is really interesting. As you mess around, think about the representation, and how amazingly hard it can be to think about what’s going on.
There are instructions on the left in the tan-colored “tile” labeled ArborWorkshop. Start with those. There is also a help panel in the tree tile on the right. It may not be up to date. All of the software is under development.
The first disease scenario, ague, is very simple. The next one, botulosis, is almost as simple, and worth reflecting on. That will happen soon, I hope after you have tried it.
Note: if you are unfamiliar with this platform, CODAP, go to the link, then to the “hamburger” menu. Upper left. Choose New. Then Open Document or Browse Examples. Then Getting Started with CODAP. That should be enough for now.