Learning Statistics




Catalog

This is a “Catalog” entry, which is used in Scientific Notebooks to list out a species name and data about that species. In this Notebook, I’ll use that to show a list entry of things you need to know. This entry is about Statistics, and where you can go to learn it.

I’ll start with the basics of statistics, assuming that you’re new to the topic. As the resources progress, they rely on the topics before them. Of course, you can always attend formal education or other “in-seat” training – the references below assume that you’re going to follow a path of learning on your own.

These resources require real work to master – they are not quick tutorials, they have exercises, and take some time to complete. Make sure you can commit to them before you start them.

Final note – these represent resources I have used or know about, and I’ve only listed the ones that I think have the quickest path to learning. It is by no means a complete list (or the Population X), so if you are aware of one you like or have used, please post a comment here and I’ll approve it. Explain why you like that resource.

  • Khan Academy – Probability and Statistics (Web course): An introduction to statistics if you are new to the topic. https://www.khanacademy.org/math/probability
  • Udacity – Intro to Descriptive Statistics: Mathematics for Understanding Data (Web course): Another beginner course, this one on Descriptive Statistics. https://www.udacity.com/course/intro-to-descriptive-statistics–ud827
  • Udacity – Intro to Inferential Statistics: Making Predictions from Data (Web course): Take the Descriptive Statistics course first, then this one. https://www.udacity.com/course/intro-to-inferential-statistics–ud201
  • Udacity – Statistics: The Science of Decisions (Web course): Take the previous two courses, then this one. https://www.udacity.com/course/statistics–st095
  • Carnegie Mellon University Open Learning Initiative – Probability and Statistics (Web course): Requires Excel or R; uses Flash and Java. Contains multiple modules, can be used in place of the Udacity courses, and could be faster for some people. http://oli.cmu.edu/courses/free-open/statistics-course-details/
  • Statistics in Plain English, Third Edition (Book): Simple but not too basic. http://www.amazon.com/Statistics-Plain-English-Third-Timothy/dp/041587291X/ref=sr_1_1?s=books&ie=UTF8&qid=1447849339&sr=1-1&keywords=learn+statistics
  • Statistics for People Who (Think They) Hate Statistics (Book): If you’re a little intimidated by statistics, this is a good intro. You will still need to follow up with other courses to supplement. http://www.amazon.com/Statistics-People-Think-Salkind-Without/dp/1452277710/ref=sr_1_2?s=books&ie=UTF8&qid=1447849339&sr=1-2&keywords=learn+statistics
  • Even You Can Learn Statistics and Analytics: An Easy to Understand Guide to Statistics and Analytics, 3rd Edition (Book): Another approach to quickly learning statistics; again, the basics will need to be followed up on. http://www.amazon.com/Even-You-Learn-Statistics-Analytics/dp/0133382664/ref=sr_1_6?s=books&ie=UTF8&qid=1447849339&sr=1-6&keywords=learn+statistics
  • Introductory Statistics with R (Statistics and Computing) (Book): A good book on using R to learn statistics. http://www.amazon.com/Introductory-Statistics-R-Computing/dp/0387790535/ref=sr_1_11?s=books&ie=UTF8&qid=1447849339&sr=1-11&keywords=learn+statistics
  • Introductory Statistics by Perdisco (YouTube series): Actually chapter summaries of their books, but very well done. Click their channel logo to see the whole series. https://www.youtube.com/watch?v=qKPhtBIsyIY
  • Stat Trek (Web series): One of my favorite sites to learn or teach statistics with. Simple, quick, straightforward. Not a lot of explanation, but very well done. http://stattrek.com/tutorials/free-online-courses.aspx

Learn First or Use First?




A good friend of mine asked a question the other day that I’ve been asked before – and I have a rule that if I’m asked the same question from multiple people, I blog it.

The question was this: “Should I learn statistics first and then focus on R or learn statistics along the way?” This follows a pattern of “Should I learn the foundation concepts of some technology or jump in and use the tools, picking up the concepts as I go?”

The answer is – yes.


If you really want to master a concept or skill, you need to understand the foundation concepts first. Golfers will tell you this. Many people that learn to golf simply start playing, and then find themselves frustrated that they are not doing well. They then seek out a professional trainer, who has to help them “un-learn” all the bad habits they’ve developed. A better approach in golf is to learn the proper basic techniques and strategies, and then move on to putting them together to form a good game.

This holds true for Data Science as well. Ideally, you’ll give full time and devotion to the topics you need to learn. By getting a strong foundation in math, statistics, and various data analytics topics, you’ll be a better Data Scientist.

There are a few issues with this approach. The first is time. One could make the argument that it would take several years to learn, much less master, the levels I’ve described, even for the basics.

Another issue is establishing a good learning path. A great deal of thought is required to develop a tailored course of study for each student that lays out the right mix of “this first, with this, then that” syllabus of learning.

So do you just start golfing, er, Data Science-ing? I think you can. That’s what this series of notebooks is about. We’re on a path of using a mix of basic foundational concepts (you are following the stats posts I gave you, aren’t you?) and mixing in tools that you need to know.

So jump in. Learn some R. Try some Python. Open Azure ML Studio. It’s not a perfect way of learning the topic, but it’s certainly achievable, and you’ll have the added incentive of actually seeing progress.

A word of caution: I do NOT advocate that you open R, run a few stats, and announce to your company that you are now the resident Data Scientist. Part of being a professional is knowing your limits. Always have someone with experience check your work, and leave the heavy lifting to the professionals, especially the important things your company needs to make decisions on.

(But follow them around like a kid brother, asking questions and making a general nuisance of yourself. That’s one of my favorite ways to learn.)

Now go get on that stats homework. And use R to do it. You’ll learn two things at once.

Python for the Data Scientist



What, Why, How

In a previous notebook I introduced the R programming language and environment. While R is very powerful, widely used, and richly extensible, another language called “Python” is also popular with Data Scientists. Yes, you can do amazing things in R – in fact, part of R is written in R (think about that for a moment) – and the number of packages you can get for it is simply huge.

But Python has some distinct differences that make it attractive for working in data analytics. It scales well, is fairly easy to learn and use, has an extensible framework, has support for almost every platform around, and you can use it to write extensive programs that work with almost any other system and platform.

It used to be that if you wanted to write scalable programs that worked in a complete system, you used Python, and if you wanted to specialize in statistics you used R. But that’s changed. R has grown to encompass more functionality, and with the RRE products it scales well beyond installed memory. On the other hand, Python has added so many libraries dealing with math, statistics, science, and data that it is starting to rival R in usefulness for a Data Scientist.

So in short, if you’re dealing primarily with statistics and data, R is a great language to learn and use. If you want to add in more functionality dealing with systems not specifically involved in statistics and data, Python is great to learn and use. Actually, if you’re serious about Data Science, you should learn both.

Installing the tools

In this notebook entry I’ll show you a couple of tools you should install to use Python, assuming that you’re on Windows. For Linux or Mac, the process is similar but the tools are different, so I’ll cover those in another post.

Begin by installing Python itself. You can find it here: https://www.python.org/downloads/.

Right away you face a choice – Version 3 or Version 2? And why is that even a choice?

Well, Python is a victim of its own success. It does so many things, and so many things well, that it was adopted by many organizations in a big way. Like R, it has packages (called “Modules”) that allow it to be extended significantly. So many Modules were written for the 2.x version of Python that it is taking a long time to convert them to 3.x. In some cases, they simply won’t be ported at all. Since organizations may depend on these Modules, the earlier version is still around.

I use the 3.x version. Most of the functions a Data Scientist needs are ported, and new ones are being developed for 3.x, not 2.x.

The Module list is huge (here are a few: https://wiki.python.org/moin/UsefulModules ), and if you want to ensure the one you want to work with is supported in version 3, check this link: http://py3readiness.org/.

Still confused? Read more here: https://wiki.python.org/moin/Python2orPython3

Python comes with an editor, called IDLE. I actually like using a full Integrated Development Environment, so I use Visual Studio 2015 Community Edition. It’s free, robust, and if you select “Custom” during the installation, you can allow add-ins to the product. I did that during my installation and selected “Python Tools”. Now I can work with Python code in Visual Studio.

If you want to install Visual Studio Community Edition, download it here: https://www.visualstudio.com/en-us/products/visual-studio-community-vs.aspx#. 

Learn more about working with Visual Studio here: https://msdn.microsoft.com/en-us/library/dn762121%28v=vs.140%29.aspx?f=255&MSPPError=-2147217396

Getting started with the language

With Python installed, you have a wealth of resources you can use to learn the language. My favorite is “Learn Python the Hard Way”, located here: http://learnpythonthehardway.org/.

You can find the official documentation here: https://wiki.python.org/moin/BeginnersGuide/Overview

And once you’re familiar with Python, check out Data Science and Python: http://www.kdnuggets.com/2014/01/tutorial-data-science-python.html

There are literally dozens of other resources you can use to learn Python. Another one I like is at Codecademy: https://www.codecademy.com/. Nothing to install, and it’s free! I’ve completed this course and it’s quite good.

Python Notebooks

No, not the notebook you’re reading now – this is something else entirely. It’s a way of working with and integrating Python code directly in a web document. The basic concepts are here: https://ipython.org/notebook.html

And something you’ll see used quite frequently among Data Scientists is an implementation of IPython called Jupyter. You can sign up and use it here: http://jupyter.org/

And just like that, you’re on your way.





The SQL Server community is amazing. I’ve been in technology for 30 years, and it’s the most uniquely connected set (pun intended) of professionals I’ve seen. There’s a “SQL Family”, and that’s not hyperbole. We actually care about each other. I’ve stayed with some of you, and some of you have stayed with me (stop by any time for coffee).

One of our family recently had a life-changing event. Mike Walsh (https://www.linkedin.com/in/mikepwalsh) was told he has Multiple Sclerosis. That’s a tough thing to hear, and it would be normal to back out of any commitments to take some time and deal with it. I don’t know what I would have done if I got that news.

But Mike is tougher than me. He’s stayed upbeat, funny, and fully engaged. He was at the PASS Summit event, working his job, presenting, and encouraging others. He’s a wonder.

So I’m inspired by Mike. So much so, that I’m going to do something, and I want you to do something too.

At Christmas (or whatever tradition or season you celebrate), we give gifts to each other. This year I want you to *not* get a gift from those you love. Instead, ask people to donate in Mike’s name here (http://www.nationalmssociety.org/Donate) as their gift to you this year. And let Mike know you did that by tweeting or facebooking with the hashtag #sqlcares. Something good can come out of bad news. Mike knows this, and you should too.

Mike isn’t alone. Yanni Robel is another of my heroes with an amazing story. We’re surrounded by people whose character is stronger than we can imagine.

So give away a gift you might get this year. It will do you good, teach your children about true courage like Mike’s, and help the fight against a terrible enemy that’s messing with our #sqlfamily. Let MS know that we don’t tolerate that behavior, and let Mike know you care.


Reducing (although sadly not eliminating) bias in sample gathering

What, Why, How

To obtain the data a Data Scientist needs for analysis, there are two options: you can get all the data (called a population, or “X”) or a subset of the data (called a sample, or “x”). Most of the time, the data you need is too large to collect in full – especially when one of those data points is time.

So it seems a simple matter to look at a large group of data, pull some out at random, and use that to estimate what the rest of the data will look like. Ah, but therein lies the rub.

You see, humans aren’t really random, and computers are definitely not random. It might seem that things are randomly selected, but if you’re not extremely careful, a pattern emerges – statisticians call these patterns “biases”. The word bias comes to us through Latin and French, and essentially means “to slope or angle”. If we want to trust the base data we use for an estimation or assumption, we have to eliminate as many of these biases as we can. Although there are several kinds to examine in our sample gathering, I’ll bunch them up into two major groups: Bias in Design and Bias in Collection.
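Here’s a quick R sketch of the “computers aren’t random” part. The numbers are made up for illustration – the point is that seeding the generator with the same value returns the identical “random” sample every time:

population <- 1:1000     # a made-up population of ID numbers

set.seed(42)             # fix the starting point of the random number generator
sample(population, 5)    # draw 5 "random" values

set.seed(42)             # same seed again...
sample(population, 5)    # ...and exactly the same 5 values come back

That repeatability is handy for sharing your work, but it’s also a reminder that the machine is only simulating randomness.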

Bias in Design

In these types of bias, the researcher isn’t paying enough attention to how they design the study or data gathering.

The first and most common error is Selection Bias. This is where you pick the wrong group of data to begin with.

For example, if you’re testing to see the most popular kinds of food in the UK, creating a poll on the types of Grits they eat is a bad design. People in the UK don’t often eat Grits (although they do eat Polenta, which is essentially the same thing, but don’t tell them that). Or perhaps your test is designed to be administered in another country, so of course that wouldn’t tell you much about the UK. How do you stop this? Think it through – and involve as many people as you can in the design.

Also included in this area are population parameter biases, sensitivity biases and specificity biases. If you focus on the Selection bias, you’ll correct most of your errors.

Another primary bias is more of an error – it’s not collecting enough parameters. If you only ask what type of meat the people in the UK like, you’ll miss the people that only eat vegetables and so on.

Bias in Collection

These types of biases happen when you’re getting the data, or running an experiment or test. The most common biases I see here are designing a study that is too small (perhaps because you don’t have a lot of money), interference from the observer (like asking leading questions), and reading meaning into data that wasn’t actually collected.

By far the most common error is selecting too few samples from a population. More data is (almost) always a good thing, especially if the data is likely to have a high degree of variability. For instance, if you’re collecting data from a web log on a server, collecting at one-day intervals is far less useful than collecting each hour or minute. (Although there are some tricks around that – here’s an interesting example: http://www.eetimes.com/document.asp?doc_id=1275354 )

You can avoid these types of errors by evaluating multiple collections to see if they show too much variability with each other, using statistical tests that will help you ferret out bias. I’ll cover that in another notebook entry.
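As a preview of those tests, here’s a small R sketch with made-up numbers – two collections that supposedly come from the same source, checked with two of R’s built-in tests:

# Two made-up collections of response times from the same server
a <- c(102, 98, 110, 95, 101, 99, 104, 97)
b <- c(140, 60, 180, 30, 120, 90, 200, 45)

var.test(a, b)   # F-test: is the variability of the two collections comparable?
t.test(a, b)     # Welch t-test: do the means differ more than chance would allow?

If either test flags a difference, that’s a signal to go back and look at how the two collections were gathered before you trust them.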

The Data Scientist’s Computer




Everyone uses a computer for lots of things, from e-mail to chat, from gaming to office work. And yet, there are some specific needs a Data Scientist has for their primary system.

While I don’t recommend a specific brand or model (these things change too quickly to make this notebook entry useful for any length of time), there are four things to think about inside a PC or laptop:

  1. CPU
  2. Disk (or I/O)
  3. Memory
  4. Network

And three things to think about outside a PC or laptop:

  1. Screen
  2. Input (Keyboards, mice, etc.)
  3. Output (Ports)

I’ll cover a longer explanation in another tutorial, but if you want the “Bottom Line, Up Front” (BLUF), here are the things you need to know:

If you use R

  • Since R loads data into a dataframe in memory, you need lots of RAM. The more the better, and the faster the better. Spend money on this component. (See the sketch just after this list for a quick way to check how much memory your data takes.)
  • More, and fast, storage is also important if you plan to store the input and output data on your own system.
  • The CPU is less important, but of course the faster your budget allows is better.
  • The keyboard is important, a gaming mouse is not. Screen is not as important in R if you are not taking the data to the next step and visualizing it.
  • Fast networking is only important if you are importing or exporting the data over the wire.
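Here’s a rough way to see why the RAM advice matters – a small sketch (exact sizes will vary a little by system) that checks how much memory a data frame actually occupies:

# A million rows and two numeric columns...
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
print(object.size(df), units = "MB")   # ...already takes roughly 15 MB
gc()                                   # reports (and triggers) garbage collection

Scale that up to real-world row and column counts and you can see how quickly a data frame eats the memory on a modest machine.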

If you use Python

  • The CPU is important, and since you can parallelize the processing, more cores are better. Spend more money on this component.
  • Memory is important, but skew the budget towards CPU.
  • Faster and more storage is important if you plan to ingest and output the data locally.
  • The keyboard is important, and the mouse is important.
  • The screen is not as important in Python if you are not taking the data to the next step and visualizing it.
  • Fast networking is only important if you are importing or exporting the data over the wire.


If you use Hadoop, Virtual Machines, or Machine Learning

Here, everything is important. Since all of these systems use distributed processing, all four internal components should be as plentiful, as fast, and as new as your budget allows. Screen size, or multiple screens, becomes important so that you can see all of the panels these systems display. A good keyboard and mouse are essential, since you’ll navigate quickly among lots of interfaces. And ports come into play here as well – you’ll often need to connect to external storage or even run a Virtual Machine on external storage. You’ll also need to think about taking data in from the “Internet of Things”, so you may need more than one networking or other interface to stream data in or out.

If you’re developing against a larger or distributed system

In my case, I only focus on two things: Lots of screens (and big ones), and a really nice keyboard and trackball.

My production environment, and in some cases my development environment, is Microsoft Azure (although other cloud platforms exist, as I understand it), so I have tens of thousands of cores at my command at any time. My process is to design and create the systems locally, and then I deploy that to a distributed system that grows and shrinks with demand. In some cases (like AzureML), I can develop online, so anything with a good screen and keyboard is all I need.

However, I use a tower system so that I have a dedicated graphics card to push three or more monitors. I use two monitors for development, and the third as my monitor for presenting. That third monitor is relatively small and cheap, so that the windows I present are large and readable for my students. I use an HD camera for recording.

I also have a gaming keyboard and a Logitech “marble” thumb-ball track system.

I don’t watch TV, so I put all my money into fiber Internet access for presenting and teaching.

Speaking of teaching, I triple-boot my system to Windows 7 (thanks a lot, WebEx for requiring that), Windows 10, and Ubuntu Linux, depending on what I am teaching. I can’t use a VM for the multiple OS’s since I need to present and develop on the OS I’m teaching, so I need the ports direct for the HD camera and so on. You might have a similar need if you are presenting a great deal or doing visualizations.

For travel – I use a Microsoft Surface Pro 3. I like having a tablet to read on the plane, and I like that it’s a full computer for presentations and work. I can still remote-desktop or SSH to my Azure systems from the Surface.

Those are the general guidelines. Your mileage may vary – and if you want to really go deep into the tech, there are plenty of resources out there to check out.

Statistics: Working with R and Revolution Analytics Software



What, Why, How

One of the most distinctive features of Data Science, as opposed to working with databases, Business Intelligence, or other data professions, is its heavy use of statistical methods. From the earliest days of computing science, programs and algorithms were created to deal with the large amounts of calculation required in statistics.

One of those implementations was the “S” programming language, invented in the mid-1970s. Based on those concepts, the “R” environment was created by Ross Ihaka and Robert Gentleman in New Zealand under the GNU license. Interestingly, it’s written in C, Fortran, and R itself. It’s one of the premier languages and environments you can use in Data Science. It has amazing breadth, and it can be extended through the use of “packages” – there are SO many packages out there that your first task in using R, it seems, is learning what has already been written so you can leverage it.

In future notebook entries we’ll explore working with R, but for now, we need to install it. That really isn’t difficult, but it does bring up something we need to deal with first. While the R environment is truly amazing, it has some limitations. Its most glaring issue is that the data you want to work with is loaded into memory as a frame, which of course limits the amount of data you can process for a given task. It’s also not terribly well suited for parallelism – many things are handled as in-line tasks. And if you use a package in your script, you have to ensure that everyone who runs the script has that package installed, at the right version.

Enter Revolution Analytics – a company that changed R to include more features and capabilities to correct these issues, along with a few others. They have a great name in the industry, bright people, and great products – so Microsoft bought them. That means the “RRE” engine they created is going to start popping up in all sorts of places, like SQL Server 2016, Azure Machine Learning, and many others. But the “stand-alone” RRE products are still available, and at the current version. So that’s what we’ll install.

Installing R

RRE builds on the R engine, so we’ll need that first. However, the installation for RRE has a dependency on the version of R we install, and as of this writing that’s 3.2.2 for Revolution R Open (RRO) and 3.1.3 for Revolution R Enterprise (RRE). More on those choices in a moment.

So we’ll start with R – you can find that here: https://cran.rstudio.com/bin/windows/base/R-3.2.2-win.exe

Once the download completes, select your language and “Next” from the Welcome panel.


Select “Next” after you read the Information panel, then “Next” again at the Select Destination Location panel.

At the Select Components panel, select “Next” again, and for all subsequent panels unless you want to change the defaults – although you don’t need to change anything for RRO or RRE.

Installing Revolution R (Enterprise or Open)

You have two choices for the stand-alone version of Revolution R: Open and Enterprise. The differences between the two are summed up at the bottom of this page: http://www.revolutionanalytics.com/get-revolution-r

For this exercise, we’ll install Revolution R Open (RRO), although in production you may want Enterprise, which addresses more of the limitations of R and provides interfaces to Hadoop and other big-data systems. We’ll install on Windows, but Ubuntu, RedHat, and SuSE Linux are also supported. We start here:



Since we’re running Windows during this installation, we select the link next to that platform.

Once that downloads, we start the installation:


For this installation, we can take all the defaults. I did add the icon to the quick launch area, since I plan to be in R quite a bit.  After we install the main RRO package, we’ll want the math libraries. You can see that at the download site, just to the right of the installation for the Windows package we just launched. Click that “MKL” link, and once again, take all the defaults.


 Exploring the Tools


We have two new folders on the Windows Start Menu. One is for R, and the other is for RRO.

Inside the RRO folder there’s an icon for the RRO GUI:


Opening that brings us to the R Console, with RRO loaded up and ready. We can now run some simple commands, like these:

X <- 1:10   # assign the vector 1, 2, ..., 10 to X
X           # print it: [1] 1 2 3 4 5 6 7 8 9 10


We’ll get into what all this means in future posts – but that doesn’t stop you from taking a free class ahead of time: https://www.datacamp.com/courses/big-data-revolution-r-enterprise-tutorial


On being skeptical of the old, women, and minorities



Journal

The definition for “skeptical” is “Not easily convinced, having doubts or reservations” (http://www.oxforddictionaries.com/us/definition/american_english/skeptical). As a Data Scientist, you’ll want to keep a healthy dose of skepticism in two areas:

  1. The source and meaning of data
  2. The conclusions someone draws from that data

In this notebook entry, I’ll focus on the second concern.

Answer a fool, don’t answer a fool

I saw an article last week with the provocative title “Why can’t Microsoft find analytic talent?” (http://www.analyticbridge.com/profiles/blogs/why-can-t-microsoft-find-analytic-talent). Being in the Data Science group at Microsoft, and wanting to know more, I read the article.

The article opines that Microsoft wants to bring in H1B talent because it can’t find good workers here in the U.S. The author explains further that Microsoft does not pay well, is too large a company, and is hampered by hiring “the old, women, overweight people, and slow speakers” (my assumption here is that “slow speakers” means non-Americans). He sees these things as a weakness.

His data source is his apparent observations of some folks he thinks work for Microsoft who “show up at his favorite restaurant”, so I’m assuming these are the women, overweight, old, and minority workers he is basing his conclusions on.

In the Jewish “Book of the Wisdom of Solomon” (http://jewishencyclopedia.com/articles/14951-wisdom-of-solomon-book-of-the), there is an enigmatic pair of proverbs, positioned next to each other. One says “Respond to a fool or he will think he is right and continue in his error”, and the other says “Don’t respond to a fool, because essentially you’re lending him credence and wasting your time.” Seems counter-intuitive, no? I’m told by my more learned friends that this means there are times you do, and times you don’t, answer someone you think is wrong.

I couldn’t help but be reminded of these proverbs when I read the article, since I disagree with almost everything in it (he did spell Microsoft correctly), but I felt that responding directly would only lend it credence.

So instead, this becomes a wonderful learning opportunity for how we should treat conclusions and data.

I’ve long taught my daughter that whenever she is told something, she needs to ask three questions before she believes it: Who is telling me this, Why are they telling me this, and What are they telling me. Let’s use this method to break down a conclusion – we’ll use the aforementioned article as an example.

Who is telling me this?

We start with the source of the interpretation. After all, if I don’t trust the person or program giving me the answer, I don’t need to go any further with the conclusion.

In the case of an algorithm or formula or even program, it’s a simple enough matter to test the process to ensure that you trust it. In the case of a person, that is harder to do.

The author of this article is in fact a highly educated person working in the field of Data Science. According to his LinkedIn page (http://www.linkedin.com/in/vincentg) he has been working in the statistical and computational area since 1995. So it seems his background tends to indicate a high level of scientific rigor.

However, I don’t see a background of working at Microsoft, of doing studies on whether the old, women, or minorities make good data scientists, or any other authoritative information that would qualify this person to make these claims. So I have to wonder what qualifies him to draw this conclusion – he certainly seems to have the education to perform a study on the data and back up the claims.

The takeaway for us as we learn is to thoroughly research the person, algorithm, system or process that provides us with a conclusion before we accept it. This is standard scientific method. (http://www-personal.umd.umich.edu/~delittle/Encyclopedia%20entries/Verification.htm)

Why are they telling me this?

After verifying that you trust the person or system making the conclusion, the next task is to find out why that conclusion is presented. In the case of a formula or computer program, that is quite easy to determine: because you ran the data through the system.

In the case of a human or group of humans, motivation and bias come into play. If a salesperson tells me that this car “is the greenest diesel you can buy”, it’s important to know that she needs to sell that car by today to make her quota. Would her statement be correct if that were not the case? More research is needed.

The takeaway in this step is to see if the person or system has bias – and if you can account for that. (https://www.psychologytoday.com/topics/bias/essentials)

What are they telling me?

The article we’re using as an object lesson seems to have two points:

  1. Microsoft’s claims around H1B workers are incorrect
  2. A “good” Data Scientist does not work at Microsoft, for reasons of hiring and wage

The first point seems to get lost – there are no data, examples, or citations to support it. Without that source data, the claims are impossible to evaluate, so we will not pursue them here.

The second point is the more interesting one. We need to define terms – what is a “good” data scientist? Is that a number of whitepapers published, systems developed, implementations of effective data science systems, etc.? And is hiring older workers, women, and minorities in fact a problem for an organization, or a strength?

Biology, society, and even a financial portfolio need diversity to survive – I’ll leave the study of business and scientific diversity as a topic for you to learn more about. http://www.workforcediversitynetwork.com/res_articles_DiversityMetricsMeasurementEvaluation.aspx

Let me be clear here that at Microsoft we are not “forced” to hire diversely – we do that on purpose! Especially and including the data science area: http://blogs.msdn.com/b/msr_er/archive/2015/03/24/diversity-in-data-science_3a00_-microsoft-research_1920_s-summer-school-aims-high-.aspx

And yes, we’re hiring – come one, come all. I don’t even care if you’re old, like me. :) https://careers.microsoft.com/

There are a few folks you can look up, in history and more recently, who might be good resources – folks I admire in science, and who oddly enough happen to be women, old, and minorities. Even that, I admit, is a small data sample, but it only takes a few counterexamples to make a hypothesis untenable.

Learning from everything

This post isn’t a rebuttal to the article in question. That would take more effort than it is worth – the more important thing to take away is the process you can follow to evaluate the conclusions drawn from a set of data. You should do that on your own, for every important decision.

In future notebook entries, we’ll take a look at the first question in data skepticism – the source of the data.

Descriptive Statistics – Initial Evaluation of the Data



Species: Population, Sample, Count, Spread, Mean, Median, Mode, Basic Visualizations

The most important part of data analysis is a thorough understanding of the data we’re looking at. Once we’ve verified what the source of the data actually means (another entry in the Data Science Notebook entirely) and that we can trust it, we need to do some simple visualizations and calculations to see what it means.

I find that using even basic descriptions is very useful. These species of statistics are also often called “Exploratory“, since it’s a method of just looking at what you have. And at the end of this Data Science Field Notebook entry, we’ll see how deceptive these “simple” things can really be.

Population or Sample?

First, is this all of the possible data, or just a part of it? The formulas we’ll use to describe the data, and eventually make predictions with, depend on that answer. Statistics puts data into two types:

  • Population – All of the data about a thing
  • Sample – Some of the data about a thing

For many things, we can have all of the data there is. For instance, let’s say we have a group of people in a room. If we want to know information about just those people, this is the entire group we need – everyone is right there in the room. That’s a population.

But in some cases we can’t get all of the data. Suppose we want to figure out what “most” people are like. Or maybe just part of what they are like – such as their age. We can’t measure everyone on the planet – it just isn’t practical. Not only that, the data changes as we measure it. By the time we measure someone and move on to the next billion people, the first person is older.

It turns out you can make a lot of guesses about the population (all of the people) from a smaller group of them (some of the people).

(And it turns out you can only fool some of the people all of the time)
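Here’s the idea in a minimal R sketch, with a completely made-up population:

set.seed(7)
population <- round(rnorm(1e6, mean = 38, sd = 12))  # a million made-up ages
s <- sample(population, 1000)                        # measure only 1,000 of them

mean(population)   # the true population mean
mean(s)            # the sample mean lands surprisingly close

One thousand measurements standing in for a million – that’s the power (and the danger) of sampling.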

There are, however, two problems with using just a subset of data to make assumptions about all of it. The first is that you need a fairly large group of people to make the guess (if in fact people are different. Sometimes they aren’t. More on that later).

The second problem is that the group (sample) you select needs to resemble, at least on some level, all of the people (the population). We’ll deal with those problems in another Notebook entry. For now, we’ll think about a sample that closely resembles the population, and one that is large enough to matter.

NOTE: In another Notebook entry I’ll explain ways that you can test the sample to see how much you can trust it to represent the population. And we’ll deal with that size issue. It turns out size does matter.

Count and Spread

The simplest thing you can do with data is count it. How many do you have?

The second simplest thing you can do is measure the spread of the data – although even this starts becoming interesting.

For instance, in measuring age, we might have 25 people in the room, and the youngest person might be 5, and the oldest 70. It’s actually important to know these numbers – they help us when we start talking about the sample representing the population, or in the case where we have the whole population, what some of these numbers might mean.

As an example – let’s say we want to describe the people in a college class. It’s not odd to have 25 people in the class. It is odd to have a five-year-old and a seventy-year-old in that class! Something doesn’t make sense there, so we would need to look at our data more closely before we base anything on it.
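If the data is already in R, these first looks are one-liners. Here’s a quick sketch using the classroom ages we’re about to meet:

ages <- c(23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21,
          19, 21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23)
length(ages)   # 25 - the count
range(ages)    # 5 70 - the spread, youngest to oldest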

Mean, Median, Mode

With those basic numbers out of the way, the next thing to do is to see how the data “centers” itself. There are three basic statistics we’ll use to look at that, and then we’ll take a look at why those are problematic.

Let’s get some numbers for the ages of folks in our room:
23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21, 19, 21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23

One of the first formulas to learn is the “Average” or “Mean”. There’s one for the population, and one for the sample (yes, it matters):

Sample Mean: x̄ = ( Σ xᵢ ) / n
Population Mean: μ = ( Σ Xᵢ ) / N

Wow – that looks complicated, but it really isn’t. Statistics uses a lot of symbols, and they just need a little teasing out to understand. Anytime we come across a new symbol I’ll treat it as a species we need to learn about. For the Population Mean, here’s what the symbols, uhm, mean:

  • The μ symbol is simply a placeholder for the whole formula to the right. It will be used in other formulas as we move through statistics, so a “simple” formula for a statistical calculation can explode into several lines after you decompress it. Remember, the fact that it’s a population matters – don’t let that mix you up later.
  • Next, we have the Σ symbol. That’s just a sum, or addition.
  • The large X (meaning it’s a population variable) stands for each number (like 23, 18, and so on)
  • The i subscript just means “keep going with the X’s till you run out”.
  • The N symbol (capitalized, watch that) means the count of items.

So, that whole thing boils down to “Add up everything and divide it by the number of things” (seems like they could just say that next time). And in our case, looks like this:
556/25 = 22.24

So that’s our Average, or Mean. We could say “the average age in this college class is around 22” and we’d be right.

But there are a couple of other measurements that are handy to look at when you first get a set of data. These are simpler formulas.

If we line up all the numbers from smallest to largest, we can take the middle one and find out where it lies:
5, 16, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 22, 23, 23, 23, 23, 24, 24, 25, 70

In this case, the middle value is 21 – still pretty close to the average. This measure is called the Median.

The next handy measurement is the number that occurs most often – this is called the Mode. In our data set, this is 23 – once again, pretty close to the average.
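Here’s the same arithmetic in R. One caution: base R’s mode() function reports a variable’s storage type, not the statistical mode, so we tabulate the values ourselves:

ages <- c(23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21,
          19, 21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23)
mean(ages)     # 22.24 - the Mean
median(ages)   # 21    - the Median
as.numeric(names(which.max(table(ages))))   # 23 - the Mode (most frequent value)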

Basic Visualizations

So in this case, we know a lot about our data. But we need to look at it graphically to understand it a little better. Since we’re looking at how the data is averaging, a good set of visualizations to use are line charts and scatter plots. A line chart simply takes the data and draws a line from each data point going across (x-axis) and how far up from the bottom it goes (y-axis). Our chart, ordered from youngest to oldest in the room, looks like this:


So far that works for what we want to show. Most people lie in the 20-25 year old range. Let’s do a scatter plot of that as well – it’s the same concept, it’s just that the points aren’t connected:


This shows us the same thing, however – take a look at that! We can more clearly see there are two data points that are outside the biggest group. These are called “Outliers“, and we’re going to focus on those in another Field Notebook entry.
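Both of those charts are one-liners in base R:

ages <- c(23, 18, 16, 18, 25, 23, 22, 22, 21, 5, 70, 21,
          19, 21, 22, 24, 24, 23, 19, 18, 19, 20, 20, 20, 23)
plot(sort(ages), type = "l", ylab = "Age")   # line chart, youngest to oldest
plot(sort(ages), ylab = "Age")               # scatter plot of the same points

The two outliers – the 5-year-old and the 70-year-old – pop out of the scatter plot immediately.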

And Now, For Something Completely Different

These are great ways to look at data. I use these “species” of statistics all the time. But they aren’t to be trusted, at least not all by themselves….

Take a look at these numbers:

Calculating the Mean and Median for each set of data (each x and y pair) gives this:

Looks almost the same for all of them, yes? So they are essentially the same? Uhm, no. Here are their scatter plots:

Pretty different after all! This is a series of numbers called Anscombe’s quartet, and it shows that you can’t always judge a class by its numbers. And that graphs are important.
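You don’t have to take my word for it – R ships with this data set built in, as anscombe, so a short sketch verifies both the matching numbers and the very different shapes:

data(anscombe)            # built-in: columns x1..x4 and y1..y4
sapply(anscombe, mean)    # the x means are all 9, the y means all about 7.5
sapply(anscombe, var)     # the variances match too

par(mfrow = c(2, 2))      # a 2x2 grid, one panel per x/y pair
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), main = paste("Set", i))
}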

We’ll move on to working with numbers like these in R soon, and learn a little more about how interesting simple things can be.

The (Amateur) Data Science Body of Knowledge



Last updated: 09/16/2015


With a format for learning to be an Amateur Data Scientist established and a firm understanding of how I learn, it’s time to focus on what to learn.

There is no shortage of Internet posts, magazine articles, or college syllabi describing what a “Data Scientist” should know. I originally thought the term was still up for debate – but there are “real” Data Scientists that have formal degrees and years of experience with that official title (my team is full of them, save yours truly). But in my case, I’m building this knowledge outside of a formal degree. Since I have to start somewhere, I’ll extrapolate from these other references to include the knowledge path I need to follow. Feel free to modify to your liking.

NOTE: There’s an absolutely wonderful visual representation of what a Data Scientist should know that you can find here:  http://nirvacana.com/thoughts/becoming-a-data-scientist/ by Swami Chandrasekaran, and I would encourage you to look over his work. What I show here is independent of that grouping, but similar. Of course, he’s using several tools from IBM and I’m using the ones at Microsoft. Pick your stack and learn it well. Want to use Open Source only? Knock yourself out.

ALSO NOTE: I have never liked a “tools approach” to learning. Yes, you’ll need to learn several tools, and yes, I often use a tool to learn a thing (like using R to learn Statistics), but I focus on the concepts, not just how the work is done. First learn why you do something, and then how – shawarma after. So learning concepts first and then choosing a tool is the route I’ll follow here.

Or, you can simply follow a complete course online – there are several really good ones, among many, many others.

Foundation and Empanadas

Before you even get started learning all the new cool toys, you need a foundation (Asimov is my hero). Again, there are a lot of schools of thought on this, and certainly more knowledge is helpful. That being said, here is the barest of minimums for this discipline:


Statistics

Statistics is the life-blood of the Data Scientist. Whether it’s something you already know well or something you have to learn, it’s required. You can start with simple courses and move up, but Classification and Prediction are the two areas you want to focus on. I use a combination of books, online courses, and even The Manga Guide to Statistics: http://www.amazon.com/Manga-Guide-Statistics-Shin-Takahashi/dp/1593271891 (I have no sense of shame). A quick web search will show you many free courses on statistics, at almost any level from beginner to advanced.

Linear Algebra

Along with statistics, you’ll need at least a high-school understanding of Linear Algebra. Many Machine Learning models assume you have this knowledge. Once again, there are lots of Public Library books, online courses, and more.


Logic

Understanding formal Logic is important for the Data Scientist. Focus on Predicate, Mathematical, and Computational logic as a minimum: http://www.logicmatters.net/tyl/.


Programming

Believe it or not, you don’t even need a computer to learn to program – although normally grabbing a language (Python or R are best for the Data Scientist) is a good way to use a tool to learn the topics. You don’t need a full Computer Science (CS) degree, but if you do follow a syllabus for one you’ll also pick up algorithms and other skills that will be helpful: http://spin.atomicobject.com/2015/05/15/obtaining-thorough-cs-background-online/

Data is Plural

Another base skill you need is working with data. This might sound obvious, but most of the time the obvious things, aren’t. It’s important to learn more about at least these topics:

  • Data types
  • Data sources
  • Data Interpretation
  • Data Ingress
  • Transforms and rollups

(You’ll get pointers to where you can learn more in the sections that follow)

Getting down to Business

Along the same lines as the foreknowledge you need for starting with the Data Science tools, you need to know a great deal about various industries that use data analytics. While every business or organization benefits from a good application of Data Science, some use data analytics in a larger, deeper way. It’s good to immerse yourself in some of the deeper knowledge about:

  • Healthcare
  • Physical sciences
  • Manufacturing
  • Government systems
  • Marketing

That is by no means an exhaustive list – far from it – but learn about how these types of organizations rely on data. In my career I’ve worked in all of these, and many more.

The key is that you can learn where you are, right now. Get involved in how your organization works, and how they do business – not just IT. Find out the hard problems, and join the teams that are solving them. Be in the moment of your current role, and work with any executive that will give you time. Also, couldn’t hurt to read the Portable MBA: http://www.powells.com/biblio/9780471119845.

Tool Time

OK, at some point, you get to play with the new toys. While you can start with tools, it’s a bad idea. Start with the foundations, then pick the right toolset for solving the problems. You’ll get less attached to the tools that way and more attached to success.


Excel

Don’t make that face. You need to know Excel. Not just how to create a workbook, but learn to milk it, make it dance, make it walk on all four legs. Start in Excel, get pushed out. The reasons for using this tool in data science are that it has much of what you need already, your users know it, and it will help you explore the problem and visualize it in new ways. I use this gem of a resource: https://www.microsoft.com/learning/en-us/book.aspx?ID=17313

SQL Server

SQL Server just keeps adding more things to the box to work with structured data. It’s fast, handles large datasets, is well used, and it plays well with everything else. Learn more here: http://www.microsoftvirtualacademy.com/product-training/sql-server and here: https://www.microsoft.com/learning/en-us/sql-training.aspx

Analysis Services

Built into the SQL Server product is the Business Intelligence suite called Analysis Services. This will help you explore historical data and do data mining: https://www.microsoftvirtualacademy.com/en-us/training-courses/designing-bi-solutions-with-microsoft-sql-server-8453?l=rcxlWtWz_3404984382

Power BI

You’ll want to publish your data so that users can consume it, using data visualization with a tool both of you understand. In addition to Excel, Power BI is that tool: https://www.microsoftvirtualacademy.com/en-us/training-courses/faster-insights-to-data-with-power-bi-jump-start-8291?l=FGLOu8Xy_2504984382

R (and/or Python)

You’ll need a way to handle statistical programming. Two of the largest ways of doing that are using R or Python. As of this writing, either is useful. There are thousands of resources on these topics, and a quick web search will turn up good starting points.


Hadoop

Hadoop is an ecosystem, not just a processing system. It’s used in many data processing systems, including many we use at Microsoft. Microsoft’s release is called HDInsight, and you can learn more about it here: http://azure.microsoft.com/en-us/documentation/services/hdinsight/

Machine Learning

Machine learning is the way that you can take sets of data and extrapolate reusable formulas for prediction and classification. I use AzureML for that, and you can learn more about that here. You get a free account and learning environment: http://azure.microsoft.com/en-us/documentation/services/machine-learning/

Streaming Analytics

Sometimes you need to act on the data as it arrives, especially in manufacturing and healthcare. You can use Storm or Azure’s Stream Analytics (or both) for that: http://azure.microsoft.com/en-us/documentation/services/stream-analytics/

Cortana Analytics

Microsoft is now bundling up this entire group of technologies in what we are calling “Cortana Analytics”. You can learn more about that here: http://www.microsoft.com/en-us/server-cloud/cortana-analytics-suite/what-is-cortana-analytics.aspx


Here are a few other views on what a Data Scientist should know:

