Occam’s Razor and the Data Science Project

The Cortana Analytics suite from Microsoft is not a single platform but a group of related products and features. Why so many? Couldn’t someone just use Microsoft R Server, or Azure ML, or Hadoop to create a solution? Isn’t the simplest solution always the best? Well, yes, but only inasmuch as it is as simple as it should be.

In many data projects, it’s common to use a single process to get the answers you’re looking for. For instance, in a reporting-based application, you have a data source (or perhaps more than one) and you run a query over that data. The query might have multiple components, such as aggregations, combinations and filters, but in essence a single technology, or type of technology, produces the answer from the data. The simplest answer is to apply the query language to the data. And simple is best, right?

In a Data Science project, more processes are involved. You start with not one but several data sources, and you’ll use that data in more than one process. This is the primary reason you have an Extract, Load and Transform (ELT) component rather than an Extract, Transform and Load (ETL) component: the data is loaded first, and each downstream process applies its own transformation. Within a data flow in Data Science, it’s common to use different algorithms, processes and even tools to arrive at a given solution.

Let’s look at an example. Consider a company that has a loyalty program of some sort – perhaps they offer a discounted gym membership to customers who buy a certain amount of subscription time for a fitness-tracking program on a device they sell. Assume that an analysis shows they are losing money by offering the gym discounts. Should they cancel the benefit? According to a single analysis process, the answer is yes. But perhaps we should dig a little deeper…

Drawing on multiple sources of data, not only about the customers but about their extended buying practices, we use a classification method to learn more about their habits – with so many variables, perhaps a multi-class decision forest – and feed those results into yet another process to find the sequence of what the customers do next over time. A sequential pattern or basket analysis could yield this result. Once we know more about those habits, we can use a clustering algorithm, such as K-Means, to group these customers by similar attributes. In all, we’ve used the Azure Platform, Storage, Hadoop, Azure ML with an R script, and Power BI to arrive at and display our results.
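To make the final clustering step a little more concrete, here is a minimal sketch of K-Means in plain Python. It is a stand-in for illustration only, not the Azure ML module or the R script mentioned above, and the customer figures (gym visits per month, clothing spend per month) are invented for the example. The starting centroids are simply the first k points so that the result is repeatable.

```python
import math

def kmeans(points, k, max_iterations=20):
    """Plain K-Means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    # Assumption for repeatability: start from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clusters have settled
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical customers: (gym visits per month, clothing spend per month)
customers = [(1, 10.0), (12, 95.0), (2, 15.0), (11, 110.0), (0, 5.0), (14, 120.0)]
labels, centers = kmeans(customers, k=2)
print(labels)
print(centers)
```

With data this cleanly separated, the occasional gym visitors and the frequent visitors who also buy clothing end up in different groups. Real customer data would need feature scaling and a more careful choice of k and starting points.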

From all this analysis, we find that the customers taking advantage of the “money losing” benefit actually purchase more of the company’s associated clothing line – something not originally factored in. The profit on the clothing is high enough to more than offset the cost of the benefit – so the advice is to continue the program, and in fact to expand it with in-gym promotions of the clothing line. Far from viewing the benefit as a loss leader, we’ve turned it into a revenue opportunity.
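The reasoning here comes down to simple arithmetic, sketched below. The function name and every number are hypothetical and chosen only to show how the answer flips once clothing profit is included; they are not figures from a real analysis.

```python
def net_value_per_customer(gym_discount_cost, subscription_margin, clothing_margin=0.0):
    """Net value of one loyalty-program customer.
    All inputs are hypothetical per-customer amounts in the same currency."""
    return subscription_margin + clothing_margin - gym_discount_cost

# Single-process view: only the subscription margin is weighed against the discount.
narrow_view = net_value_per_customer(gym_discount_cost=30.0, subscription_margin=20.0)

# Broader view: the same customer, with clothing purchases factored in.
broad_view = net_value_per_customer(
    gym_discount_cost=30.0, subscription_margin=20.0, clothing_margin=25.0
)

print(narrow_view)  # negative: the benefit looks like a loss
print(broad_view)   # positive: the clothing margin more than offsets it
```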

The point of this (mostly) fictional exercise is to show that multiple techniques, algorithms and even tools are often required to complete a given investigation. It’s true that the simplest process that arrives at the correct answer is usually best – but things should be as simple as possible, and no simpler.

You can learn more about this process by starting here: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-create-experiment/ 
