The Keys to Effective Data Science Projects – Part 7: Create and Train the Model

We’re in part seven of our series on the Keys to Effective Data Science Projects. This is the section that most people think of when they think of “Data Science”. It’s where we take the question and the source data, which has been turned into the proper Features (and potentially Labels), and select an algorithm or two to create a Model.

Let’s hold there for a moment – just what is a Machine Learning Model anyway? It’s actually not a trivial question. It’s made more difficult by conflating the terms Algorithm and Model in Data Science discussions. Ali-Kazim Zaidi, a Data Scientist here on my team, defines it this way:

“Models form our hypothesis set of what generates or approximates a true target function / data generating system. All models are wrong, but some can be useful for inference. Models are what you define through your inductive bias of a data generating system. Algorithms are what you use to fit parameters or values to that model so that it resembles data you’ve observed. What you get in the end is a parameterization of a function that you use to do inference about an underlying system.”
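To make that distinction concrete, here is a minimal sketch in plain Python. It is my own toy illustration, not something from Ali's definition: the Model is the assumption that y is roughly w * x + b, and two different Algorithms (closed-form least squares and gradient descent) fit parameters to that same Model. The data points, function names, learning rate, and step count are all made up for the example.

```python
from statistics import mean


def linear_model(x, w, b):
    """The Model: our hypothesis that y is roughly w * x + b."""
    return w * x + b


def fit_least_squares(xs, ys):
    """Algorithm 1: closed-form ordinary least squares for one feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / s_xx
    b = y_bar - w * x_bar
    return w, b


def fit_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Algorithm 2: gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [linear_model(x, w, b) - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up observations, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_ols, b_ols = fit_least_squares(xs, ys)
w_gd, b_gd = fit_gradient_descent(xs, ys)

print(f"least squares:    w={w_ols:.4f}, b={b_ols:.4f}")
print(f"gradient descent: w={w_gd:.4f}, b={b_gd:.4f}")
```

Both algorithms land on nearly the same parameters. What you would operationalize is the parameterized function (the fitted w and b plugged into linear_model), not the fitting procedure.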

So at the end of the day, the Models are what we build and operationalize. So what are the important things to remember about Modeling that will help you with a successful project?

The first thing is to realize that Modeling is experimental. You don’t simply select some data, run it through an algorithm, and then get a definite answer. You need to run lots of experiments, changing the features, changing the parameters of the algorithms you choose, and perhaps even choosing different algorithms along the way. All the while, make sure you treat it as a true scientific test (Data Science, y’all) by changing only one thing at a time. As an aside, this is where the new Azure Machine Learning Services kind of shines – you can track the runs of your experiments and go back to one that was working well. But I digress.
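Here is a rough sketch of what “one thing at a time” can look like, in plain Python. To be clear, this is not the Azure Machine Learning API; the Config class, the train_and_score function, and the data are all hypothetical stand-ins. Each run starts from the same baseline settings, changes exactly one of them, and gets logged so you can go back to whichever one worked best.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical experiment settings; every run varies one of these."""
    learning_rate: float = 0.01
    steps: int = 100
    use_intercept: bool = True


def train_and_score(cfg, train, test):
    """Fit y = w * x + b by gradient descent, return MSE on held-out data."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(cfg.steps):
        grad_w = 2 * sum((w * x + b - y) * x for x, y in train) / n
        grad_b = 2 * sum((w * x + b - y) for x, y in train) / n
        w -= cfg.learning_rate * grad_w
        if cfg.use_intercept:
            b -= cfg.learning_rate * grad_b
    return sum((w * x + b - y) ** 2 for x, y in test) / len(test)


# Made-up data: train on four points, hold two back for testing.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
test = [(5.0, 10.1), (6.0, 11.9)]

baseline = Config()
# Each variant differs from the baseline in exactly one setting.
variants = {
    "baseline": baseline,
    "more_steps": replace(baseline, steps=1000),
    "higher_lr": replace(baseline, learning_rate=0.05),
    "no_intercept": replace(baseline, use_intercept=False),
}

# The run log: enough to reproduce any run and compare them later.
runs = [
    {"name": name, "config": cfg, "test_mse": train_and_score(cfg, train, test)}
    for name, cfg in variants.items()
]

for run in runs:
    print(f"{run['name']:<14} test_mse={run['test_mse']:.4f}  {run['config']}")

best = min(runs, key=lambda r: r["test_mse"])
print("best run:", best["name"])
```

Because each variant differs from the baseline in a single setting, any change in the score can be attributed to that setting. A real experiment tracker does the same bookkeeping, just with more detail and persistence.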

The next thing that is important to note for a successful project is to ensure everyone on the team, and especially the stakeholders of the project, understands that all ML is in essence a guess. A calculated, really good (hopefully) guess, but a guess. Most people who deal with IT projects are used to VERY deterministic outcomes – something either is or it isn’t. But Machine Learning uses statistics, and statistics deal with probabilities, so at best you’re getting a very good guess. But a guess. I have seen a few times where an incredulous manager comes into a meeting with a fistful of charts saying “but the smart-phone thingie TOLD me this would work!” No, the smart-phone thingie told you it MIGHT work. Probably. Mostly.
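If it helps to show stakeholders what “a very good guess” means, here is a small simulation sketch in plain Python. The 0.9 probability, the trial count, and the function name are assumptions I picked for illustration; no real model is involved. It pretends a model says “90% likely to work” for many similar cases and counts how often that still turns out wrong.

```python
import random


def simulate_outcomes(probability, trials, seed=0):
    """Count how many of `trials` cases succeed when each one
    independently succeeds with the given probability."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.random() < probability for _ in range(trials))


predicted_probability = 0.9  # hypothetical model output: "90% likely to work"
trials = 1000                # hypothetical number of similar cases

successes = simulate_outcomes(predicted_probability, trials)
failures = trials - successes

print(f"Model said {predicted_probability:.0%} likely to work.")
print(f"Out of {trials} cases: {successes} worked, {failures} did not.")
```

Even with a well-calibrated 90% prediction, roughly one case in ten goes the other way. That is the “MIGHT” in “it MIGHT work”.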

This is where DevOps practices can really save the day. Communication between all the teams would have helped to avoid this misunderstanding. But that’s an article for another time.

In our next installment, I’ll give you another key. See you then.

