In this series on the “Keys to Effective Data Science Projects”, we’ve seen a process we can use, we’ve determined what we want to know, and we’ve ingested the data. In the last step we explored the data, in a different way than we might be used to when working in a database project.
Recall that in the ingestion phase we didn’t follow the traditional “ETL” process of Extract, Transform, and Load on the source data. Instead, we brought in all of the data (at least all we could think of for the initial exploration), and now that we’ve examined it, we’re ready to do some transforms.
Note that we’ve pushed out the transformations until this point for three really important reasons:
- We’re not sure we have all of the features (and possibly labels) that we need
- We might use more than one technology in our solution
- We’re not sure of the transformations we need – per technology
Let’s start with the first one. Whenever we work with predictive or categorization solutions, we involve two types of data: features, and sometimes labels. Both of these are simply columns of data.
But that’s an important point – Data Scientists don’t think in terms of columns and rows. We think in terms of features of something (like hair color, income, number of fish in a lake, or some other number or attribute) that might predict an outcome, and what we want to predict – which we call the label. So the combination of hair color, education, job title and other features might help us predict a label showing a certain income.
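To make the feature and label idea concrete, here is a minimal sketch in plain Python. The observations, the column names, and the `split_features_and_label` helper are all made up for illustration; a real project would more likely use a data frame library, but the idea is the same.

```python
# Hypothetical observations: each dict is one observation (a "row").
observations = [
    {"hair_color": "brown", "education": "masters", "job_title": "analyst", "income": 72000},
    {"hair_color": "black", "education": "bachelors", "job_title": "clerk", "income": 54000},
    {"hair_color": "red", "education": "doctorate", "job_title": "engineer", "income": 91000},
]

# The column we want to predict (an assumption for this example).
LABEL = "income"


def split_features_and_label(rows, label):
    """Separate each observation into its features and its label value."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


features, labels = split_features_and_label(observations, LABEL)
print(features[0])  # the features of the first observation, without the label
print(labels)       # the values we would try to predict
```

Note that the source list is left untouched; the features and labels are new objects derived from it.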
A special note here – sometimes we don’t predict an outcome, we simply group data into smaller groups using some number of features. For instance, we might cluster the data on hair color, education and job title. In this case, we aren’t predicting anything, and so we don’t need the label. And to make it even more interesting, it’s sometimes useful to cluster the data to find out a label we want to predict!
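As a rough illustration of clustering without a label, the sketch below runs a very small k-means in plain Python. K-means is just one clustering technique chosen for the example, and the points (imagine years of education and a job level), the starting centroids, and the `kmeans` helper are all assumptions rather than anything from a real data set.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    assignments = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        assignments = []
        for p in points:
            # Index of the closest centroid by Euclidean distance.
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
            assignments.append(idx)
        # New centroid = mean of its members; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return assignments, centroids


# Hypothetical numeric features only: (years_of_education, job_level). No label column.
points = [(12, 1), (14, 2), (13, 1), (18, 9), (20, 10), (19, 8)]

# Start from two of the points as initial centroids (an arbitrary choice).
assignments, centroids = kmeans(points, [points[0], points[3]])
print(assignments)
```

The cluster number assigned to each observation could then be treated as a new label to predict, which is the "cluster first, then predict" idea mentioned above.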
For a Data Scientist, a row is a group of observations we make about an event or a collection. Those fundamental ways of thinking bring us from a database focus to a Data Science focus.
Machine Learning is a tool that we use in Data Science to get at those predictions or clusters. Machine Learning is a set of algorithms we apply to data, and these algorithms often involve a lot of statistical formulas and linear algebra. And that is where the transformation comes in. The data has to be in the right format, shape and layout for the formulas to work – and each ML algorithm might require a different set of data, or type of data, or layout of data. So we push the transformation step off until we absolutely have to.
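Here is a small sketch of what "the right format" can mean in practice. Two common transformations are shown: one-hot encoding for a text category and min-max scaling for a number. Which transformation a given algorithm needs varies, so treat the column names, the values, and the two helper functions as assumptions for illustration only.

```python
# Raw columns (hypothetical values), kept exactly as they were ingested.
raw_hair_color = ["brown", "black", "brown", "red"]
raw_income = [40000, 60000, 80000, 100000]


def one_hot(values):
    """Turn a categorical column into 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def min_max_scale(values):
    """Rescale a numeric column to the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Each transformation produces a new column; the raw data is not modified.
hair_encoded, hair_categories = one_hot(raw_hair_color)
income_scaled = min_max_scale(raw_income)

print(hair_categories)
print(hair_encoded)
print(income_scaled)
```

Because both functions return new lists, the raw columns stay available for a different transformation later, for example if a second algorithm needs the data in another layout.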
And we might use more than one tool to get at the answer. We might use R, for instance, for a clustering function, and then the result might be passed off to an API that uses Python. To “train” each one of these might require yet more changes to the data. So you can see that we need to keep the source data around in its original format longer than we would in a Business Intelligence solution.
You can find out more about another part of this process in a topic called “Feature Engineering”. And I’ll explain the next Key to your Data Science project in the next article.