(Complete Table of Contents here: http://aka.ms/backyarddatascience)

# What, Why, How

In a previous Notebook entry, I showed you where you can learn Statistics. It’s one of the base skills you need to know if you’re going to work with Data Science.

But many times students know the *process* of using a statistical formula (more accurately called a “test”) and not necessarily *when* they should use that formula – or formulas. In this Notebook entry, I’ll explain some basics about when to use a particular test, and then point you to further resources about using them. In a future Notebook entry we’ll put all this together in R and Python to see examples of answering a specific question. Of course, this isn’t an exhaustive list, but it will help you get started.

There are three main test areas for Statistics: **Centers of data**, **Groupings of data**, and **Relationships between data**. How do we decide which test to use? You can often find that with just a few questions:

- What are you trying to find out?
- How many data do you have?
- What kind of data is it?

# What are you trying to find out?

There are two or three main areas within data where you can use statistical tests to answer a question you have. These questions are often stated as “Hypotheses”, or “guesses”. If you have a guess (*more people will like my blog, fewer people will buy that thing, and so on*) you can use the *tests for proportion* (more on that in a moment, with a link and everything), check the *average* and *standard deviation*, and check the *difference of two paired means* (again, more on that in a moment).
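To make the “guess” idea concrete, here’s a minimal sketch of a one-sample test for a proportion, using only Python’s standard library. The function name and the numbers are made up for illustration, and it assumes the sample is large enough for the normal approximation to be reasonable.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_ztest(successes, n, p0):
    """Return (z, two-sided p-value) for the guess that the true proportion is p0."""
    p_hat = successes / n                  # proportion observed in the sample
    std_error = sqrt(p0 * (1 - p0) / n)    # standard error if the guess were true
    z = (p_hat - p0) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 of 100 readers liked the blog; the guess to test is 50%.
z, p_value = one_proportion_ztest(successes=60, n=100, p0=0.5)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed proportion would be unusual if the guess were true. It doesn’t tell you how big or how important the difference is.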

Perhaps you want to compare two statistics – you want to see how much two groups differ. For that answer you can use tests involving the *difference of two proportions*, or the *difference of two independent sample means*. If you want to see how well a sample matches the distribution you expect in a population, you can use a *chi-square goodness of fit test*.
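As a rough illustration of comparing two proportions, the sketch below runs a pooled two-proportion z-test with the standard library. The function name and the counts are hypothetical, and it assumes two independent samples that are large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for the guess that both proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)         # shared proportion if the groups don't differ
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 45 of 100 buyers in one group, 30 of 100 in another.
z, p_value = two_proportion_ztest(45, 100, 30, 100)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```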

You might also want to know how the data is *related* – if at all. Here you have two choices – but the choice depends on the type of data you’re working with – more on that in a moment. For now, know that if the data are *Categorical*, use the *chi-square test for independence*, and if they are *Quantitative*, use *regression analysis*. I’ll explain what those mean in a moment.
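For the Categorical case, here’s a minimal sketch of the arithmetic behind a chi-square test for independence, worked by hand on a small table of counts. The function name and the table are made up. In practice you would probably lean on a library routine such as SciPy’s `chi2_contingency`, which also gives you the p-value.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we'd expect in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are two customer groups, columns are "bought" / "didn't buy".
observed = [[30, 20],
            [20, 30]]
statistic, df = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f} with {df} degree(s) of freedom")
```

With 1 degree of freedom, the usual 5% cutoff from a chi-square table is about 3.841, so a statistic above that hints that the two variables are related.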

# How many data do you have?

Wait – how “*many*”? Don’t I mean how “*much*”? Nope. In statistics, *more* data is almost always better, so when you choose a formula it’s really a question of how many groups of data you are working with. If you have one big set of data (*don’t think about SQL Tables here, think more about a View*), then you have one “sample”. And within that sample, you need to decide if you’re looking at one variable (or *Feature*), or more than one.

If you have a single variable you are testing, you’ll use *proportion tests* and *average calculations*. For more information about how to perform these tests and what they will show, check out this page: http://www.stattrek.com/hypothesis-test/proportion.aspx?Tutorial=AP.

If you have two variables within a single population, you can use *regression analysis*, the *differences between two paired means*, or *tests for independence*.
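Here’s a small sketch of the *difference between two paired means*: the same subjects are measured twice, and the test looks at the average of the per-subject differences. The function name and the measurements are hypothetical, and the sketch stops at the t statistic because the standard library has no t distribution. A routine such as SciPy’s `ttest_rel` would also report the p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for paired measurements on the same subjects."""
    diffs = [b - a for a, b in zip(first, second)]   # one difference per subject
    n = len(diffs)
    std_error = stdev(diffs) / sqrt(n)               # uses the sample standard deviation
    return mean(diffs) / std_error, n - 1


# Hypothetical data: the same five customers, measured before and after a change.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 16]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

With 4 degrees of freedom, the usual two-sided 5% cutoff from a t table is about 2.776, so a larger t (in absolute value) suggests the average really did change.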

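If you have two populations, you may want to compare and/or contrast them. In this case, you can use a Chi-Square Test for Homogeneity, which shows whether the populations share the same distribution of a categorical variable. The closely related Chi-Square Test for Independence uses the same arithmetic, but asks whether two categorical variables within a single population are related.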

If you follow the links in the tests I mentioned, you’ll notice that the web page there specifies the *type of variable* you’re dealing with – that’s what I’ll cover next.

# What kind of data is it?

You can collect data on anything, but how you collect it and what you collect determines the type of tests you can run. There are two big types: *Categorical* data (also called “*Nominal*”) and *Quantitative* (also called “*Interval and Ratio*” data). There’s another type called “*Ordinal*”, but I’ll deal with that in another Notebook entry.

*Categorical* data is what it sounds like – is it tall, short, working, not working, green, orange, in the box, not in the box, of a certain species, and so on. You’ll see this data as counts, or percentages. You can test for this kind of data using a Hypothesis Test such as a Difference Between Proportions. This is useful when you want to show how different two groups (populations) are, such as, “Who usually buys the opposite of our product?” Other tests here include the *differences between the means*, *independence tests* (like *chi-squared*) and the *regression slope*.
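As a quick illustration of how Categorical data turns into counts and percentages, here’s a short sketch using Python’s `collections.Counter`. The observations are made up.

```python
from collections import Counter

# Hypothetical observations of a single categorical variable.
observations = [
    "working", "not working", "working", "working",
    "not working", "working", "working", "working",
]

counts = Counter(observations)                 # how many times each category appears
total = sum(counts.values())
percentages = {category: 100 * count / total for category, count in counts.items()}

for category, count in counts.items():
    print(f"{category}: {count} ({percentages[category]:.1f}%)")
```

These counts and percentages are the raw material for the proportion and chi-square tests.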

*Quantitative* data is numeric. You can perform *counts*, *sums* and other aggregation calculations, but most often you’ll use the *average*. You’ll then be able to use various comparison tests for those averages such as a *test for a mean*, the *difference between the means of independent or paired samples*, and also *regression*. Regression helps in predicting values that follow.
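To show what “predicting values that follow” can look like, here’s a minimal least-squares sketch of simple linear regression, written out by hand. The function name and the data points are made up, and it assumes a straight line is a sensible model for the data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: five periods and the value measured in each.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

slope, intercept = fit_line(xs, ys)
next_value = intercept + slope * 6    # prediction for the period that follows
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, prediction = {next_value:.2f}")
```

The slope is the piece a regression test examines: if it’s indistinguishable from zero, the line isn’t telling you much about how the two variables relate.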

# Resources

Of course this Notebook entry isn’t exhaustive, and glosses over some things that can be really important. It’s only meant to get you started. You can find out more in the links below – and keep studying those statistics courses I pointed out!

- There is a great chart showing a lot of these – and more – statistical formulas and tests you can use: http://www.datasciencecentral.com/profiles/blogs/how-to-choose-a-statistical-model
- If you’re further along in your statistical learning, this is a good video to see another angle dealing with when to use a test based on whether you’re going after a prediction (inferential) or trying to group (descriptive) your data: https://www.youtube.com/watch?v=HpyRybBEDQ0

There was a software package that asked questions about the data, then returned a list of appropriate stats to use. I cannot remember the name of it, however! Any help here?
