Thinking about yesterday’s post, I was struck with an idea that may be obvious to many readers, and has doubtless been well-explored, but it was new to me (or I had forgotten it) so here I go, writing to help me think and remember:
The post touched on the notion that communication is an important part of data science, and that simplicity aids in communication. Furthermore, simplification is part of modelmaking.
That is, we look at unruly data with a purpose: to understand some phenomenon or to answer a question. And often, the next step is to communicate that understanding or answer to a client, be it the person who is paying us or just ourselves. “Communicating the understanding” means, essentially, encapsulating what we have found out so that we don’t have to go through the entire process again.
So we might boil the data down and make a really cool, elegant visualization. We hold onto that graphic, and carry it with us mentally in order to understand the underlying phenomenon, for example, that graph of mean height by sex and age in order to have an internal idea—a model—for sex differences in human growth.
But every model leaves something out. In this case, we don’t see the spread in heights at each age, and we don’t see the overlap between females and males. So we could go further, and include more data in the graph, but eventually we would get a graph that was so unwieldy that we couldn’t use it to maintain that same ease of understanding. It would require more study every time we needed it. Of course, the appropriate level of detail depends on the context, the stakes, and the audience.
So there’s a tradeoff. As we make our analysis more complex, it becomes more faithful to the original data and to the world, but it also becomes harder to understand.
Which suggests this graphic: