Processes are great things. They give you a place to start, a roadmap, and a way to explain to your stakeholders (like customers) what you’re going to do and the order you’ll do it in.
In addition, a process compresses a lot of information into a short form, so that you can “peg” each step as you work through it. Then you can decompress that information for each step, assign it to the right people and teams, and parallelize work where possible. So it’s important to think through a process, create and edit it, test it, and adjust it based on reality. It doesn’t mean you *have* to follow it – but it gives you a defined way to start. Dwight Eisenhower once said “I have always found that plans are useless but planning is indispensable.”
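To make the "assign and parallelize" idea concrete, here is a minimal sketch in Python. The step names, owners, and the `Step` and `parallel_waves` helpers are all hypothetical and made up for illustration; they are not part of any published process. The sketch only shows how a written-down process can be "decompressed" into steps with owners and dependencies, and how the steps that can run side by side fall out of that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step in a process (hypothetical structure, for illustration only)."""
    name: str
    owner: str
    depends_on: tuple = ()


def parallel_waves(steps):
    """Group steps into waves; steps in the same wave can run in parallel."""
    remaining = {s.name: s for s in steps}
    done = set()
    waves = []
    while remaining:
        # A step is ready once everything it depends on has finished.
        ready = sorted(
            name for name, step in remaining.items()
            if all(dep in done for dep in step.depends_on)
        )
        if not ready:
            raise ValueError("circular or missing dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Assumed example steps and owners; replace them with your own process.
steps = [
    Step("define business question", "project lead"),
    Step("acquire data", "data engineer", ("define business question",)),
    Step("wrangle data", "data engineer", ("acquire data",)),
    Step("build visualizations", "analyst", ("wrangle data",)),
    Step("train model", "data scientist", ("wrangle data",)),
    Step("review with stakeholders", "project lead",
         ("build visualizations", "train model")),
]

waves = parallel_waves(steps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this made-up example, the visualization and modeling steps land in the same wave, so two different people can work on them at the same time once the data wrangling is done.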
In Data Science, it’s often more fun and exciting to focus on the technologies, the algorithms, and the visualizations in the project. But you should actually start by focusing on the process you’ll follow. Platform should always follow Process.
For years, the primary process a Data Scientist would follow was the Cross-Industry Standard Process for Data Mining, or CRISP-DM (http://www.sv-europe.com/crisp-dm-methodology/). It’s a great process involving many phases you’ll recognize from Business Intelligence frameworks. But there are a couple of issues with it.
First, it’s kind of old. It was created in the 1990s, and technology and other factors have changed since then. It hasn’t been kept up to date to reflect those changes. It also assumes that every project will have a Machine Learning or at least predictive component – not always necessary in Advanced Analytics.
But its largest drawback is that CRISP-DM doesn’t really focus on the team aspect of working on Advanced Analytics projects. It can be used within a team, but it is really geared toward a single individual. In an Advanced Analytics project, there are lots of things – such as Data Wrangling, visualizations, and other steps – that can be done by a team of folks, not all of whom are 6-year PhDs in Machine Learning.
So Microsoft invented the Team Data Science Process, or TDSP (https://docs.microsoft.com/en-us/azure/machine-learning/data-science-process-overview). It handles the same kind of work as CRISP-DM, but adds in other phases and fleshes out the team aspect of the process. You should check out the link above for the full description (and read the CRISP-DM one too – tons of really useful info in there) and then check out this simple graphic I’ve created for the process. This is one of the images I use when I teach Advanced Analytics classes.
So check out the latest Data Science Virtual Machine (DSVM) on Microsoft Azure, open the TDSP, and work through the labs on the DSVM. It’s a wealth of information that you can use to further your experience in Data Science. There are also more resources here: https://channel9.msdn.com/Shows/Cloud+Cover/Episode-227-Team-Data-Science-Process