(Complete Table of Contents here: http://aka.ms/backyarddatascience)
In a previous Notebook entry, I showed you where you can learn Statistics. It’s one of the base skills you need to know if you’re going to work with Data Science.
But many times students know the process of using a statistical formula (more accurately called a “test”) and not necessarily when they should use that formula – or formulas. In this Notebook entry, I’ll explain some basics about when to use a particular test, and then point you to further resources about using them. In a future Notebook entry we’ll put all this together in R and Python to see examples of answering a specific question. Of course, this isn’t an exhaustive list, but will help you get started.
There are three main test areas for Statistics: Centers of data, Groupings of data, and Relationships between data. How do we decide which test to use? You can often work that out with just a few questions:
- What are you trying to find out?
- How many data do you have?
- What kind of data is it?
What are you trying to find out?
There are a few main kinds of question you can answer with statistical tests. These questions are usually stated as “Hypotheses” or educated “guesses”. If you have a guess about a single group (more people will like my blog, fewer people will buy that thing, and so on) you can use a test for a proportion, a test for a mean (which uses the average and standard deviation), or a test for the difference between two paired means. I’ll point you to more on each of these in a moment.
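To make the proportion test concrete, here is a minimal sketch in plain Python (standard library only). The function name and the survey numbers are made up for illustration, and it assumes a two-sided test with a sample large enough for the normal approximation to be reasonable.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of H0: the true proportion equals p0."""
    p_hat = successes / n
    # The standard error is computed under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 100 readers say they like the blog; H0 is 50%.
z, p_value = one_proportion_z_test(successes=60, n=100, p0=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (commonly below 0.05) suggests the observed proportion is unlikely if the guess in H0 were true.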
Perhaps you want to compare two groups to see whether they really differ. For that answer you can use tests involving the difference of two proportions, or the difference of two independent sample means. If you want to see how well a set of observed counts matches the distribution you expected, you can use a chi-square goodness of fit test.
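As a rough illustration of comparing two proportions, the sketch below uses the pooled two-proportion z-test. The function name and the counts are hypothetical, and it assumes two independent samples that are each large enough for the normal approximation.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: both groups share the same proportion."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 the best estimate of the common proportion pools both samples.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 45 of 100 buyers in group 1, 30 of 100 in group 2.
z, p_value = two_proportion_z_test(x1=45, n1=100, x2=30, n2=100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```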
You might also want to know how the data are related – if at all. Here you have two choices, and the choice depends on the type of data you’re working with. For now, know that if the data are Categorical, use the chi-square test for independence, and if they are Quantitative, use regression analysis. I’ll explain what those mean in a moment.
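Here is a small sketch of what the chi-square test for independence actually computes. It only produces the test statistic and the degrees of freedom; turning that into a p-value needs the chi-square distribution, which I’m leaving to a statistics package. The function name and the table of counts are invented for the example.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are two regions, columns are bought / did not buy.
table = [[30, 20], [20, 30]]
statistic, dof = chi_square_independence(table)

# 3.841 is the usual 5% critical value for 1 degree of freedom.
CRITICAL_5PCT_DOF1 = 3.841
print(f"chi-square = {statistic:.2f}, dof = {dof}")
print("evidence of a relationship:", statistic > CRITICAL_5PCT_DOF1)
```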
How many data do you have?
Wait – how “many”? Don’t I mean how “much”? Nope. In statistics, more data is almost always better, so it’s really a question of how many groups of data you are working with when you choose a formula. If you have one big set of data (don’t think about SQL Tables here, think more about a View), then you have one “sample”. And within that sample, you need to decide if you’re looking at one variable (or Feature), or more than one.
If you have a single variable you are testing, you’ll use a test for a proportion or a test for a mean (the average). For more information about how to perform these tests and what they will show, check out this page: http://www.stattrek.com/hypothesis-test/proportion.aspx?Tutorial=AP.
If you have two variables within a single population, you can use regression analysis, the differences between two paired means, or tests for independence.
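For the paired-means case, the idea is to test the differences within each pair rather than the two columns separately. The sketch below computes only the t statistic and degrees of freedom with the standard library; the before/after numbers and the function name are made up. If you have SciPy installed, I believe `scipy.stats.ttest_rel` gives you the p-value as well.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for H0: the mean paired difference is 0."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1


# Hypothetical measurements on the same five subjects, before and after a change.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]
t, dof = paired_t_statistic(before, after)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```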
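If you have two populations, you may want to compare and/or contrast them. In this case, you can use a Chi-Square Test for Homogeneity, which shows whether the populations are spread across the categories in the same way. Its close relative, the Chi-Square Test for Independence, uses the same calculation but asks whether two categorical variables measured within a single population are related.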
If you follow the links in the tests I mentioned, you’ll notice that the web page there specifies the type of variable you’re dealing with – that’s what I’ll cover next.
What kind of data is it?
You can collect data on anything, but how you collect it and what you collect determines the type of tests you can run. There are two big types: Categorical data (also called “Nominal”) and Quantitative (also called “Interval and Ratio” data). There’s another type called “Ordinal”, but I’ll deal with that in another Notebook entry.
Categorical data is what it sounds like – is it tall, short, working, not working, green, orange, in the box, not in the box, of a certain species, and so on. You’ll see this data as counts, or percentages. You can test this kind of data using a Hypothesis Test such as a Difference Between Proportions. This is useful when you want to show how different two groups (populations) are, such as, “Who usually buys the opposite of our product?” Other tests here include the chi-square tests for goodness of fit, homogeneity, and independence. Tests for the differences between means and for the regression slope belong with Quantitative data, which comes next.
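As a quick sketch of what “counts, or percentages” looks like in practice, here is one way to summarize a Categorical variable with the standard library. The observations are invented for illustration.

```python
from collections import Counter

# Hypothetical observations of a single Categorical variable.
observations = ["working", "not working", "working", "working", "not working"]

counts = Counter(observations)
total = sum(counts.values())
# Proportion of observations that fall in each category.
proportions = {category: count / total for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category}: {count} ({proportions[category]:.0%})")
```

These counts and proportions are the raw material that the proportion and chi-square tests work from.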
Quantitative data is numeric. You can perform counts, sums and other aggregation calculations, but most often you’ll use the average. You’ll then be able to use various comparison tests for those averages, such as a test for a mean or a test for the difference between the means of independent or paired samples, and also regression. Regression helps you predict the value of one variable from the value of another.
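To show the prediction idea behind regression, here is a minimal least-squares sketch for a single predictor, again using only the standard library. The function name and the data points are hypothetical, and a real analysis would also check how well the line fits before trusting any prediction.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical Quantitative data: x could be weeks, y could be sales.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict y for a new x value.
predicted = intercept + slope * 6
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction at x=6: {predicted:.2f}")
```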
Of course this Notebook entry isn’t exhaustive, and glosses over some things that can be really important. It’s only meant to get you started. You can find out more in the links below – and keep studying those statistics courses I pointed out!
- There is a great chart showing a lot of these – and more – statistical formulas and tests you can use: http://www.datasciencecentral.com/profiles/blogs/how-to-choose-a-statistical-model
- If you’re further along in your statistical learning, this is a good video to see another angle dealing with when to use a test based on whether you’re going after a prediction (inferential) or trying to group (descriptive) your data: https://www.youtube.com/watch?v=HpyRybBEDQ0