(Complete Table of Contents here: http://aka.ms/backyarddatascience)
Species: Population, Sample, Count, Spread, Mean, Median, Mode, Basic Visualizations
The most important part of data analysis is a thorough understanding of the data we’re looking at. Once we’ve verified what the source of the data actually means (another entry in the Data Science Notebook entirely) and that we can trust it, we need to do some simple visualizations and calculations to see what it means.
I find that even basic descriptions are very useful. These species of statistics are often called “Exploratory”, since they are a way of just looking at what you have. And at the end of this Data Science Field Notebook entry, we’ll see how deceptive these “simple” things can really be.
First, is this all of the possible data or just a part of it? The formulas we’ll use to describe the data, and eventually to make predictions, depend on that answer. Statistics puts data into two types:
- Population – All of the data about a thing
- Sample – Some of the data about a thing
For many things, we can have all of the data there is. For instance, let’s say we have a group of people in a room. If we want to know information about just those people, this is the entire group we need – everyone is right there in the room. That’s a population.
But in some cases we can’t get all of the data. Suppose we want to figure out what “most” people are like. Or maybe just part of what they are like – such as their age. We can’t measure everyone on the planet – it just isn’t practical. Not only that, the data changes as we measure it. By the time we measure someone and move on to the next billion people, the first person is older.
It turns out you can make a lot of guesses about the population (all of the people) from a smaller group of them (some of the people).
(And it turns out you can only fool some of the people all of the time)
There are, however, two problems with using just a subset of data to make assumptions about all of it. The first is that you need a fairly large group of people to make the guess, at least if the people are in fact different. Sometimes they aren’t (more on that later).
The second problem is that the group (sample) you select needs to resemble, at least on some level, all of the people (the population). We’ll deal with those problems in another Notebook entry. For now, we’ll think about a sample that closely resembles the population, and one that is large enough to matter.
NOTE: In another Notebook entry I’ll explain ways that you can test the sample to see how much you can trust it to represent the population. And we’ll deal with that size issue. It turns out size does matter.
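To make the population-versus-sample idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the "population" is 10,000 made-up ages generated at random (not real measurements), the sample size of 500 is arbitrary, and the names `population` and `sample` are just labels chosen here.

```python
import random

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: 10,000 invented ages between 0 and 90.
population = [random.randint(0, 90) for _ in range(10_000)]

# A simple random sample of 500 of those people, drawn without replacement.
sample = random.sample(population, 500)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"Population mean (N={len(population)}): {population_mean:.2f}")
print(f"Sample mean     (n={len(sample)}): {sample_mean:.2f}")
```

With a sample that is reasonably large and picked at random, the two printed means should land fairly close together, which is the kind of "guess" described above.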
The simplest thing you can do with data is count it. How many do you have?
The second simplest thing you can do is measure the spread of the data – although even this starts becoming interesting.
For instance, in measuring age, we might have 25 people in the room, and the youngest person might be 5, and the oldest 70. It’s actually important to know these numbers – they help us when we start talking about the sample representing the population, or in the case where we have the whole population, what some of these numbers might mean.
As an example – let’s say we want to describe the people in a college class. It’s not odd to have 25 people in the class. It is odd to have a five-year-old and a seventy-year-old in that class! Something doesn’t make sense there, so we would need to look at our data more closely before we base anything on it.
With those basic numbers out of the way, the next thing to do is to see how the data “centers” itself. There are three basic statistics we’ll use to look at that, and then we’ll take a look at why those are problematic.
Let’s get some numbers for the ages of folks in our room:
23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19, 21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23
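With the ages written down, the count and spread described earlier can be checked in a few lines. This is a small Python sketch rather than part of the original entry; the variable names are just illustrative.

```python
# Ages of the people in the room, copied from the list above.
ages = [23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19,
        21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23]

count = len(ages)               # how many people we have
youngest = min(ages)            # smallest value
oldest = max(ages)              # largest value
age_range = oldest - youngest   # a simple measure of spread

print(count, youngest, oldest, age_range)  # 25 5 70 65
```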
One of the first formulas to learn is the “Average” or “Mean”. There’s one for the population, and one for the sample (yes, it matters):
Sample Mean: x̄ = ( Σ xi ) / n
Population Mean: μ = ( Σ Xi ) / N
Wow – that looks complicated, but it really isn’t. Statistics uses a lot of symbols, and they just need a little teasing out to understand. Anytime we come across a new symbol I’ll treat that as a species we need to learn about. For the Population Mean, here is what the symbols, uhm, mean:
- The μ symbol is simply a placeholder for the whole formula to the right. It will be used in other formulas as we move through statistics, so a “simple” formula for a statistical calculation can explode into several lines after you decompress it. Remember, the fact that it’s a population matters – don’t let that mix you up later.
- Next, we have the Σ symbol. That’s just a sum, or addition.
- The large X (meaning it’s a population variable) stands for each number (like 23, 18, and so on)
- The i after the X is an index that labels each value in turn. Together with the Σ, it just means “keep going with the X’s till you run out”.
- The N symbol (capitalized, watch that) means the count of items.
So, that whole thing boils down to “Add up everything and divide it by the number of things” (seems like they could just say that next time). And in our case, it looks like this:
556/25 = 22.24
So that’s our Average, or Mean. We could say “the average age in this college class is around 22” and we’d be right.
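The same arithmetic can be written as a short Python sketch. It treats the room as the whole population, as the text does, and the variable names are only illustrative.

```python
# Ages of the people in the room, copied from the list above.
ages = [23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19,
        21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23]

total = sum(ages)            # the Σ part: add up every value
n_items = len(ages)          # the N part: count of items
mean_age = total / n_items   # "add up everything and divide by the number of things"

print(f"{total}/{n_items} = {mean_age}")  # 556/25 = 22.24
```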
But there are a couple of other measurements that are handy to look at when you first get a set of data. These are simpler formulas.
If we line up all the numbers from smallest to largest, we can take the middle one and find out where it lies:
5, 16, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 22, 23, 23, 23, 23, 24, 24, 25, 70
In this case, the middle value (the 13th of our 25 numbers) is 21 – still pretty close to the average. This measure is called the Median.
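Here is a minimal Python sketch of that "line them up and take the middle one" step. Picking a single middle value only works when the count is odd, as it is here. For an even count the usual convention is to average the two middle values, which the standard library's `statistics.median` handles.

```python
import statistics

# Ages of the people in the room, copied from the list above.
ages = [23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19,
        21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23]

ordered = sorted(ages)              # smallest to largest
middle_index = len(ordered) // 2    # index 12, i.e. the 13th value (indexes start at 0)
median_age = ordered[middle_index]

# Cross-check against the standard library.
library_median = statistics.median(ages)

print(median_age, library_median)  # 21 21
```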
The next handy measurement is the number that occurs most often – this is called the Mode. In our data set, this is 23, which appears four times – once again, pretty close to the average.
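Counting how often each age appears is easy to sketch in Python with `collections.Counter`. The variable names are only illustrative.

```python
from collections import Counter

# Ages of the people in the room, copied from the list above.
ages = [23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19,
        21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23]

age_counts = Counter(ages)                           # maps each age to its number of occurrences
mode_age, occurrences = age_counts.most_common(1)[0]

print(mode_age, occurrences)  # 23 4
```

A data set can have more than one value tied for most frequent. This sketch reports only one of them, which is fine here because 23 is the only age that appears four times.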
So in this case, we know a lot about our data. But we need to look at it graphically to understand it a little better. Since we’re looking at how the data centers itself, a good set of visualizations to use is line charts and scatter plots. A line chart places each data point along the horizontal (x) axis, sets its height on the vertical (y) axis by its value, and draws a line connecting the points. Our chart, ordered from youngest to oldest in the room, looks like this:
So far that works for what we want to show. Most people lie in the 20-25 year old range. Let’s do a scatter plot of that as well – it’s the same concept, it’s just that the points aren’t connected:
This shows us the same thing. However, take a look at that! We can more clearly see there are two data points that are outside the biggest group. These are called “Outliers”, and we’re going to focus on those in another Field Notebook entry.
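A rough, text-only version of the scatter plot can be sketched in Python without any charting library. This is only a stand-in under stated assumptions: the helper `text_scatter` and its `width` parameter are hypothetical names invented for this sketch, and the plot is turned on its side, with one row per person (youngest to oldest) and the star placed further right for higher ages.

```python
# Ages of the people in the room, copied from the list above.
ages = [23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19,
        21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23]


def text_scatter(values, width=50):
    """Return one text row per value, with a '*' positioned by the value's size."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero if every value is the same
    rows = []
    for position, value in enumerate(sorted(values), start=1):
        column = round((value - lo) / span * width)
        rows.append(f"{position:>2} | " + " " * column + f"* {value}")
    return rows


rows = text_scatter(ages)
for row in rows:
    print(row)
```

In the printed output, most stars stack up in a narrow band, while the first row (age 5) and the last row (age 70) sit well away from it. Those are the two outliers.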
These are great ways to look at data. I use these “species” of statistics all the time. But they aren’t to be trusted, at least not all by themselves….
Take a look at these numbers:
We’ll move on to working with numbers like these in R soon, and learn a little more about how interesting simple things can be.