There are two defining differences between Data Science (or Data Mining, for that matter) and other types of data analysis: the first is how far back you push the data analysis, and the second is the multiple processes and tools you’ll use within the analysis. In this post I’ll explain the first difference.
In a typical data analysis project, you select your data sources, and then begin the process of loading the data to the processing step, performing any transforms along the way to homogenize the data and enable it to work with a specific technology.
Leaving aside the processing step (I’ll cover that in another post), data selection in a Data Science or Data Mining project concerns itself first with the source data in depth. It’s not enough to know the shape of the data (depth, width, sparsity, ragged-ness); it’s also important to consider what the data means. It’s not enough to know that the number shown represents a measurement taken at a certain point in time – it’s important to know how that measurement was taken, why, and what it means. These characteristics, along with others, are called the Unit of Analysis.
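To make the "shape" half of that concrete, here is a minimal Python sketch. The records are made up, and I'm assuming depth means the number of rows, width the number of columns, sparsity the fraction of empty cells, and ragged-ness rows of differing length. Notice what it can't tell you: anything about what the numbers mean.

```python
# Hypothetical records for illustration; None marks a missing value.
rows = [
    [21.5, 0.8, None],
    [22.1, None, None],
    [20.9, 0.7],  # shorter than the others, so the data is ragged
]

def describe_shape(rows):
    """Summarize the shape of row-oriented data (assumed definitions)."""
    lengths = [len(r) for r in rows]
    width = max(lengths, default=0)
    cells = len(rows) * width
    filled = sum(1 for r in rows for v in r if v is not None)
    return {
        "depth": len(rows),                                  # number of rows
        "width": width,                                      # widest row
        "sparsity": 1 - filled / cells if cells else 0.0,    # share of empty cells
        "ragged": len(set(lengths)) > 1,                     # rows differ in length
    }

shape = describe_shape(rows)
print(shape)
```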
Let’s take a look at a couple of examples. Assume you have a manufacturing machine, and it’s important to know the temperature of the machine as it works. You might need to know this to create a predictive maintenance model – perhaps the temperature is an important feature because it is highly predictive of when the machine will fail. If the machine generates a temperature measurement, it’s important to know whether that measurement is taken continuously or sampled at regular intervals (which is more common). Why does this matter? Well, if the temperature is sampled every five minutes, and you’re reading it every 30 seconds, you’ll get an incorrect analysis: ten consecutive readings are copies of the same sample, not ten separate measurements. This is because the Unit of Assignment and the Level of Aggregation are impacted by how independent the data samples are. That independence between data points is crucial for most statistical formulae to work properly, and ignoring it accounts for many of the errors in interpretation you’ll find in popular media.
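Here is a rough Python sketch of the temperature example. The sensor values and the helper names (`poll`, `independent_readings`) are invented for illustration; only the five-minute and 30-second intervals come from the scenario above. It polls a sensor that only refreshes every five minutes, then compares the standard error you would compute if you treated every polled row as an independent measurement with the one you get from the real samples.

```python
from math import sqrt
from statistics import stdev

SENSOR_INTERVAL_S = 300  # the sensor takes a new sample every five minutes
POLL_INTERVAL_S = 30     # we read the sensor every 30 seconds

# Hypothetical sensor samples, one per five-minute interval (degrees C).
sensor_samples = [70.0, 71.5, 90.0, 72.0]

def poll(samples, sensor_interval, poll_interval):
    """Simulate polling: each poll returns the most recent sensor sample."""
    total_seconds = len(samples) * sensor_interval
    return [
        (t, samples[t // sensor_interval])
        for t in range(0, total_seconds, poll_interval)
    ]

def independent_readings(polled, sensor_interval):
    """Keep one reading per sensor interval (the first poll in each)."""
    first_seen = {}
    for t, value in polled:
        first_seen.setdefault(t // sensor_interval, value)
    return [first_seen[k] for k in sorted(first_seen)]

def standard_error(values):
    """Standard error of the mean, which assumes independent values."""
    return stdev(values) / sqrt(len(values))

polled = poll(sensor_samples, SENSOR_INTERVAL_S, POLL_INTERVAL_S)
readings = independent_readings(polled, SENSOR_INTERVAL_S)

naive_se = standard_error([value for _, value in polled])
honest_se = standard_error(readings)

print(len(polled), len(readings))  # 40 rows collected, 4 real measurements
print(naive_se, honest_se)         # the naive figure is misleadingly small
```

The 40 polled rows carry no more information than the 4 samples behind them, but the naive calculation treats them as 40 independent observations and reports far more certainty than the data supports.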
Another example is when you’re measuring human behavior – perhaps a workplace efficiency study, or how well a campaign or test relates to improved or desired results. In this case, the Unit of Analysis has other factors to consider, such as the level you’re talking about – the office, the team, the company, etc. This is called the Unit of Generalization in statistics, and it also carries meaning.
The point of all this is: You’ll have to go further into your source data when you’re asked to do a Data Science type of analysis. It’s not enough to collect the base data – you’re going to spend some time understanding exactly what that data means.