The Importance of Unit of Analysis

There are two defining differences between Data Science (or Data Mining, for that matter) and other types of data analysis: the first is how far back you push the data analysis, and the second is the range of processes and tools you’ll use within the analysis. In this post I’ll explain the first difference.

In a typical data analysis project, you select your data sources, and then begin the process of loading the data to the processing step, performing any transforms along the way to homogenize the data and enable it to work with a specific technology.

Leaving aside the processing step (I’ll cover that in another post), data selection in a Data Science or Data Mining project concerns itself first with the source data, in depth. It’s not enough to know the shape of the data (depth, width, sparsity, ragged-ness); it’s also important to consider what the data means. It’s not enough to know that a number represents a measurement taken at a certain point in time; it’s important to know how that measurement was taken, why, and what it means. These characteristics, along with others, are called the Unit of Analysis.

Let’s take a look at a couple of examples. Assume you have a manufacturing machine, and it’s important to know the temperature of the machine as it works. You might need to know this to create a predictive maintenance model; perhaps the temperature is an important feature measurement because it is highly predictive of when the machine will fail. If the machine generates a temperature measurement, it’s important to know whether that measurement is taken constantly or whether it is sampled at regular intervals (which is more common). Why does this matter? Well, if the temperature is sampled every five minutes and you’re taking measurements every 30 seconds, you’ll get an incorrect analysis: ten consecutive readings repeat the same sample, so they look like ten observations but carry the information of one. This is because the Unit of Assignment and the Level of Aggregation are impacted by how independent the data samples are. That independence between data points is crucial for most statistical formulae to work properly, and its absence accounts for many of the errors in interpretation you’ll find in popular media.
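To make the sampling mismatch concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the temperature values are made up, and the names `sensor_value`, `poll`, and `independent_readings` are hypothetical helpers, not part of any real sensor API. It only shows how polling every 30 seconds against a sensor that refreshes every five minutes inflates the apparent number of observations.

```python
# Illustrative intervals taken from the example in the text.
SENSOR_INTERVAL_S = 300  # the sensor takes a new sample every five minutes
POLL_INTERVAL_S = 30     # the logger reads the sensor every 30 seconds


def sensor_value(t_seconds, true_temps):
    """Return the most recent value the sensor sampled at or before t_seconds."""
    return true_temps[t_seconds // SENSOR_INTERVAL_S]


def poll(duration_s, true_temps):
    """Read the sensor every POLL_INTERVAL_S seconds for duration_s seconds."""
    return [
        sensor_value(t, true_temps)
        for t in range(0, duration_s, POLL_INTERVAL_S)
    ]


def independent_readings(polled):
    """Keep one row per real sensor sample by skipping the repeated polls."""
    step = SENSOR_INTERVAL_S // POLL_INTERVAL_S
    return polled[::step]


# Made-up temperatures, one per five-minute sensor sample.
true_temps = [70.0, 71.5, 75.0, 74.0]

polled = poll(20 * 60, true_temps)       # 20 minutes of polling
distinct = independent_readings(polled)  # one row per real sample

print(len(polled), "rows logged,", len(distinct), "actual measurements")
```

Feeding all of the logged rows into a model or a significance test would treat repeated copies of the same reading as independent evidence. Knowing how the measurement was taken tells you to collapse them first.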

Another example is when you’re measuring human behavior, such as in a workplace efficiency study, or in measuring how well a campaign or test relates to improved or desired results. In this case, the Unit of Analysis has other factors to consider, such as the level you’re talking about: the office, the team, the company, and so on. This is called the Unit of Generalization in statistics, and it also carries meaning.
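As a rough sketch of why the level matters, the Python below aggregates the same made-up records at two different levels. The records, the field layout, and the `aggregate` helper are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (office, team, person, tasks_completed)
records = [
    ("north", "a", "p1", 10),
    ("north", "a", "p2", 14),
    ("north", "b", "p3", 30),
    ("south", "c", "p4", 8),
    ("south", "c", "p5", 12),
]


def aggregate(rows, level):
    """Average tasks_completed per group at the chosen level."""
    key_index = {"office": 0, "team": 1, "person": 2}[level]
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[3])
    return {key: mean(values) for key, values in groups.items()}


by_team = aggregate(records, "team")
by_office = aggregate(records, "office")

print("team level:  ", by_team)
print("office level:", by_office)
```

The five person-level rows become three observations at the team level and two at the office level, and the averages tell different stories at each level. Whatever you conclude applies to the level you aggregated to, not automatically to the levels above or below it.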


The point of all this is: you’ll have to go further into your source data when you’re asked to do a Data Science type of analysis. It’s not enough to collect the base data; you’re going to spend some time understanding exactly what that data means.
