We’re in a series on the “Keys to Effective Data Science Projects” – and we’ve completed the hardest part – identifying the problem(s) we want to solve. Note that each problem gets its own project – you’re not going to predict when a hurricane will hit and how much money you’ll get in your bonus this year using the same data and algorithms. Unless maybe you’re a pilot. Or an Airbnb host. Anyway…
Now we’re at the section where we need to bring in some data to work with. Note that this is the initial data set – we bring in the data we think will help solve the problem, then examine it, and then do some Feature Engineering (more on that later). Once we do the analysis and the feature engineering, we’ll almost certainly have to go back and get more data, or different data, or both.
That’s a critical point. Yes, the data science team will identify what data they think will train the algorithm, but you don’t know if that will really work until you try it – hence the reason the process is called an experiment. After the data evaluation and initial experimentation phases, you’ll know more about what you need.
This is different from most data projects, where once you’ve identified a data source you move in a linear fashion to the next step in the project. In a Data Science project, it’s almost certain you’ll need to update the data sources when you’re roughly a third of the way through the project. Everyone needs to be ready for that – management will almost certainly balk at having to budget time and money for a repeat effort. But that’s how this works.
OK – now on to the actual process for identifying the data you need, and bringing it in. For that, let’s cut over to some documentation to learn more:
- Options for loading data – https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-ingest-data
- An example of importing training data into Azure Machine Learning Studio from various data sources – https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-import-data
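Whatever source you land on, the first pass usually looks the same: pull the data into a structure you can inspect, then run a few sanity checks before any feature engineering. Here’s a minimal sketch in Python using pandas – the column names and values are hypothetical stand-ins for whatever your source system exports; substitute a local file, a database query, or a blob URL:

```python
# A minimal sketch of pulling an initial data set into a DataFrame.
# The CSV content below is a hypothetical stand-in for an export from
# your source system -- swap in your own file path or URL.
import io
import pandas as pd

raw_csv = io.StringIO(
    "region,units_sold,revenue\n"
    "east,120,2400.50\n"
    "west,95,1890.00\n"
)

df = pd.read_csv(raw_csv)

# First sanity checks before any feature engineering:
print(df.shape)          # how many rows and columns loaded?
print(df.dtypes)         # did numeric columns parse as numeric?
print(df.isna().sum())   # any missing values to deal with?
```

If those checks surface surprises – wrong types, missing values, fewer rows than expected – that’s the experiment telling you early that you may need more or different data, which is exactly the loop described above.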
In the next installment, I’ll cover the process of evaluating your initial data set. We’ll use a little R, a little Python, maybe some Azure ML, and Excel might even make a brief appearance. You never know – I might even sprinkle in a little RegEx, along with some sed, awk, and grep for good measure.