Hindsight, it is said, has 20/20 vision. We seem to be able to predict the past flawlessly – or can we? The answer, surprisingly, is “no”.
“Creeping Determinism” is a phrase from 1970s psychology. It describes the tendency to believe that something was predictable, but only after it has happened. We look back and say – “Ah – there it is, if these things happen, then this thing will also happen. I knew it all along”. This is more insidious than it might appear – it can even lead to death. Take, for instance, the studies of estrogen on women’s health – the treatment is now reported to carry more complications than those studies suggested, because of selection bias, which is a feature of Creeping Determinism. This type of error happens when we focus on the cases that had a particular outcome, and ignore the many other times the same ingredients didn’t produce that effect.
For instance – let’s say you’re out on your evening run. You trip, for the second time, and you just know it’s the brand of shoes you’re wearing. After all, you wore those shoes tonight, and sure enough, you tripped, just like you did last week when you were wearing them. Your brain is using an “availability heuristic” – if something is remembered, it must be important – along with a flawed notion of cause and effect. What you’re *not* remembering is that you’ve worn these same shoes dozens of times and *not* tripped. That’s a simple example with just one factor, but in working with data, things are never that simple.
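To make the shoe example concrete, here is a minimal Python sketch. The run log and its counts are invented purely for illustration (nothing here is real data), and `trip_rate` is a hypothetical helper name. It compares what memory hands you (only the trips) with what the full log says.

```python
# Hypothetical run log of (wore_suspect_shoes, tripped) pairs.
# The counts are made up for illustration only.
runs = (
    [(True, True)] * 2       # the two trips you remember
    + [(True, False)] * 38   # same shoes, no trip (the runs you forget)
    + [(False, True)] * 1    # a trip in other shoes
    + [(False, False)] * 19  # other shoes, no trip
)

def trip_rate(log, wearing_shoes):
    """Fraction of runs that ended in a trip, for one shoe condition."""
    outcomes = [tripped for shoes, tripped in log if shoes == wearing_shoes]
    return sum(outcomes) / len(outcomes)

# What memory offers: only the runs where you tripped.
remembered = [shoes for shoes, tripped in runs if tripped]
share_in_shoes = sum(remembered) / len(remembered)

# What the whole log says: trip rate with and without the shoes.
rate_with = trip_rate(runs, True)
rate_without = trip_rate(runs, False)

print(f"Share of remembered trips in the suspect shoes: {share_in_shoes:.0%}")
print(f"Trip rate wearing the shoes: {rate_with:.1%}")
print(f"Trip rate in other shoes:    {rate_without:.1%}")
```

In this made-up log, most of the remembered trips involve the suspect shoes, yet the trip rate is the same whether you wear them or not. The shoes only look guilty because you wear them most often.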
Data Science to the Rescue
Since we’ve known about this problem for a long time, surely we’ve solved it, right? Well, no. Politics is a prime example. Politicians manipulate the public all the time by taking credit for things using cause and effect, or blaming the other side using the same causes and effects – all the while relying on people’s blind spot of Creeping Determinism. But Data Science can help in this area, although it can’t completely cure the problem.
Use Data Science Tools
A salient feature of Data Science is that it doesn’t rely on a single tool. You’ll use R, Machine Learning, Visualization technologies and more to process multiple sources of data. HDInsight (Hadoop) and the Azure Data Lake, along with High Performance Computing, can work with very large data sets, which allows you to bring in lots of negative cases. Previously, we relied on only the main published studies for a hypothesis; now we can include more data sources and more data.
Follow a good Data Science Process
A Data Scientist is a natural skeptic. Theirs is perhaps one of the few roles in technology that tries to convince you *not* to believe its predictions.
If you have a group of columns or attributes of data (features) that you’re using to predict an outcome (the label), it’s vital to ensure that you’re using not just the times this event has occurred, but as much of the data as you can find, perhaps especially the times it did *not* have the outcome you’re looking for. I’m not talking only about the data itself, but about where you got the data from to begin with. There’s a tendency to publish only the times something worked, not the times it didn’t.
Assume you test a group of people, having them eat a banana every day. Then you test to see whether they get over a flu faster than people who do not eat a banana every day. Two years later, three people from the study do in fact recover faster than those who did not eat a banana every day. Success! We publish the results of the new miracle “banana cure”. What we do *not* publish is a detailed numeric study of the thousands of people for whom bananas had no effect. Imagine slogging through that report! And so the positive case is over-reported. Decades later, the earth is devoid of bananas because conventional wisdom is that eating them cures the flu.
This is not a trivial example – this type of thing happens all the time, including in that estrogen study I mentioned. (Full disclosure: I’m not a doctor, I’m not telling you to start or stop with any medications, and even this dispute is in dispute. Always check with your physician before taking medical advice from a guy with a blog.)
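Here is a rough Python sketch of how the “banana cure” can appear out of nothing. Everything in it is an assumption made for illustration: the group sizes, the recovery-time distribution, and the `simulate_recovery_days` helper are all hypothetical, and the data is simulated, not taken from any study. Both groups are drawn from the same distribution, so bananas have no real effect by construction.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

def simulate_recovery_days(n):
    """Simulated flu recovery times in days (an assumed, made-up distribution)."""
    return [random.gauss(7.0, 1.5) for _ in range(n)]

# Both groups come from the SAME distribution: no real banana effect exists.
banana_group = simulate_recovery_days(1000)
control_group = simulate_recovery_days(1000)
control_mean = statistics.mean(control_group)

# The "published" result: only the three fastest-recovering banana eaters.
cherry_picked = sorted(banana_group)[:3]
published_gap = control_mean - statistics.mean(cherry_picked)

# The full result: every banana eater, including all the dull cases.
full_gap = control_mean - statistics.mean(banana_group)

print(f"Days saved, judging by the three best cases: {published_gap:.2f}")
print(f"Days saved, judging by everyone in the study: {full_gap:.2f}")
```

The three hand-picked cases show a large apparent benefit, while the gap across all participants sits close to zero. The difference comes entirely from which rows were kept, which is why the negative cases belong in your data sources.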
One approach to Data Science I teach looks like this:
- Determine the actual problems to solve
- Identify and vet data sources
- Define the data path
- Create Cleansing and Homogenizing Processes
- Create a Feature Selection process (where needed)
- Create Computation Processes (Predictive or Classification)
- Create Output and Presentation Instruments
So, can Data Science help with Hindsight bias, or any other bias? Yes, if it’s used properly – and if you consider all the facts. Remember: Statistics don’t kill people; people using statistics kill people.