(Complete Table of Contents here: http://aka.ms/backyarddatascience)
With a format for learning to be an Amateur Data Scientist established and a firm understanding of how you learn, it’s time to focus on what to learn.
There is no shortage of Internet posts, magazine articles, or college syllabi describing what a “Data Scientist” should know. I originally thought the term was still up for debate – but there are “real” Data Scientists that have formal degrees and years of experience with that official title (my team is full of them, save yours truly). But in my case, I’m building this knowledge outside of a formal degree. Since I have to start somewhere, I’ll extrapolate from these other references to lay out the knowledge path I need to follow. Feel free to modify to your liking.
NOTE: There’s an absolutely wonderful visual representation of what a Data Scientist should know that you can find here: http://nirvacana.com/thoughts/becoming-a-data-scientist/ by Swami Chandrasekaran, and I would encourage you to look over his work. What I show here is independent of that grouping, but similar. Of course, he’s using several tools from IBM and I’m using the ones at Microsoft. Pick your stack and learn it well. Want to use Open Source only? Knock yourself out.
ALSO NOTE: I have never liked a “tools approach” to learning. Yes, you’ll need to learn several tools and yes, I often use a tool to learn a thing (like using R to learn Statistics) but I focus on the concepts, not just how the work is done. First learn why you do something, and then shawarma after. So learning concepts first and then choosing a tool is the route I’ll follow here.
Or, you can simply follow a complete course, online. There are several really good ones:
- Coursera: https://www.coursera.org/course/datasci
- Udacity: https://www.class-central.com/mooc/1480/udacity-intro-to-data-science
- EdX: https://www.edx.org/course/data-science-machine-learning-essentials-microsoft-dat203x
- Caltech: http://work.caltech.edu/telecourse
- The Open Source Society: https://github.com/open-source-society/data-science
Among many, many others.
Before you even get started learning all the new cool toys, you need a foundation (Asimov is my hero). Again, there are a lot of schools of thought on this, and certainly more knowledge is helpful. That being said, here is the barest of minimums for this discipline:
Statistics is the life-blood of the Data Scientist. Whether it’s something you already know well or something you have to learn, it’s required. You can start with simple courses and move up, but Classification and Prediction are the two areas you want to focus on. I use a combination of books, online courses (The Khan Academy is one of my favorites), and even The Manga Guide to Statistics: http://www.amazon.com/Manga-Guide-Statistics-Shin-Takahashi/dp/1593271891 (I have no sense of shame). A quick web search will show you many courses for free on statistics, at almost any level from beginner to advanced.
Along with statistics, you’ll need to have at least a High-School understanding of Linear Algebra. Many Machine Learning models assume you have this knowledge. Once again, there are lots of Public Library books, online courses, and more.
Understanding formal Logic is important for the Data Scientist. Focus on Predicate, Mathematical, and Computational Logic at a minimum: http://www.logicmatters.net/tyl/.
You will need to present your findings at some point, and even to explore them you need to understand how to represent data in a graphical fashion. I use The Wall Street Journal Guide to Information Graphics, but there are many other books and sites dedicated to visualizing information.
Believe it or not, you don’t even have to use a computer to learn to program – although normally grabbing a language (Python or R are best for the Data Scientist) is a good way to use a tool to learn the topics. You don’t need a full Computer Science (CS) degree, but if you do follow a syllabus for one you’ll also get algorithms and other skills that will be helpful: http://spin.atomicobject.com/2015/05/15/obtaining-thorough-cs-background-online/
Another base skill you need is working with data. This might sound obvious, but most of the time the obvious things aren’t. It’s important to learn more about at least these topics:
- Data types
- Data sources
- Data Interpretation
- Data Ingress
- Transforms and rollups
(You’ll get pointers to where you can learn more in the sections that follow)
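To make a couple of those topics concrete, here is a minimal sketch of a transform and a rollup in plain Python. The records, field names, and the helper functions `to_float` and `rollup_sales` are all made up for illustration; real data would come from whatever source you are ingesting.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a text export:
# every value is still a string (a data-type problem to solve first).
raw_rows = [
    {"region": "East", "month": "2015-01", "sales": "100.50"},
    {"region": "East", "month": "2015-02", "sales": "200.25"},
    {"region": "West", "month": "2015-01", "sales": "50.00"},
    {"region": "West", "month": "2015-02", "sales": "n/a"},  # dirty value
]


def to_float(text):
    """Transform: convert a string to a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def rollup_sales(rows):
    """Rollup: total the numeric sales per region, skipping unparseable rows."""
    totals = defaultdict(float)
    for row in rows:
        amount = to_float(row["sales"])
        if amount is not None:
            totals[row["region"]] += amount
    return dict(totals)


totals = rollup_sales(raw_rows)
print(totals)
```

Note that the dirty row is silently skipped here. Whether to skip, flag, or repair such values is a data interpretation decision, and it is worth making on purpose.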
Along the same lines as the foreknowledge you need for starting with the Data Science tools, you need to know a great deal about various industries that use data analytics. While every business or organization benefits from a good application of Data Science, some use data analytics in a larger, deeper way. It’s good to immerse yourself in some of the deeper knowledge about:
- Physical sciences
- Government systems
That is by no means an exhaustive list – far from it – but learn about how these types of organizations rely on data. In my career I’ve worked in all of these, and many more.
The key is that you can learn where you are, right now. Get involved in how your organization works, and how they do business – not just IT. Find out the hard problems, and join the teams that are solving them. Be in the moment of your current role, and work with any executive that will give you time. Also, couldn’t hurt to read the Portable MBA: http://www.powells.com/biblio/9780471119845.
OK, at some point, you get to play with the new toys. While you can start with tools, it’s a bad idea. Start with the foundations, then pick the right toolset for solving the problems. You’ll get less attached to the tools that way and more attached to success.
Don’t make that face. You need to know Excel. Not just how to create a workbook, but learn to milk it, make it dance, make it walk on all four legs. Start in Excel, get pushed out. The reasons for using this tool in data science are that it has much of what you need already, your users know it, and it will help you explore the problem and visualize it in new ways. I use this gem of a resource: https://www.microsoft.com/learning/en-us/book.aspx?ID=17313
SQL Server just keeps adding more things to the box to work with structured data. It’s fast, handles large datasets, is well used, and it plays well with everything else. Learn more here: http://www.microsoftvirtualacademy.com/product-training/sql-server and here: https://www.microsoft.com/learning/en-us/sql-training.aspx
Built into the SQL Server product is the Business Intelligence suite called Analysis Services. This will help you explore historical data and do data mining: https://www.microsoftvirtualacademy.com/en-us/training-courses/designing-bi-solutions-with-microsoft-sql-server-8453?l=rcxlWtWz_3404984382
You’ll want to publish your data so that users can consume it using data visualization with a tool both of you understand. In addition to Excel, Power BI is that: https://www.microsoftvirtualacademy.com/en-us/training-courses/faster-insights-to-data-with-power-bi-jump-start-8291?l=FGLOu8Xy_2504984382
R (and/or Python)
You’ll need a way to handle statistical programming. Two of the most popular choices for that are R and Python. As of this writing, either is useful. While there are thousands of resources on these topics, you can start here:
Hadoop is an ecosystem, not just a processing system. It’s used in many data processing systems, including many we use at Microsoft. Microsoft’s release is called HDInsight, and you can learn more about it here: http://azure.microsoft.com/en-us/documentation/services/hdinsight/
Machine learning is the way that you can take sets of data and extrapolate reusable formulas for prediction and classification. I use AzureML for that, and you get a free account and learning environment. You can learn more about it here: http://azure.microsoft.com/en-us/documentation/services/machine-learning/
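As a rough illustration of what “extrapolating a reusable formula” means, here is a minimal sketch in plain Python that fits a straight line to a handful of points with ordinary least squares and then reuses it to predict a new value. This is not AzureML and not how any particular product does it; the data and the function names `fit_line` and `predict` are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned formula to a value the model has not seen."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data that lies on the line y = 2x + 1.
hours = [1.0, 2.0, 3.0, 4.0]
output = [3.0, 5.0, 7.0, 9.0]

model = fit_line(hours, output)
print(model, predict(model, 5.0))
```

The “model” is just the pair of numbers the fit returns. Real tools fit far richer formulas, but the pattern of learning from known data and then applying the result to new data is the same.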
Sometimes you need to act on the data as it arrives, especially in manufacturing and healthcare. You can use Storm or Azure’s Stream Analytics (or both) for that: http://azure.microsoft.com/en-us/documentation/services/stream-analytics/
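To show the idea of acting on data as it arrives, here is a small sketch in plain Python that watches a stream of readings and raises an alert whenever the average of the most recent few crosses a threshold. It does not use Storm or Stream Analytics; the sensor values, the window size, the threshold, and the function name `rolling_alerts` are all assumptions made for illustration.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=100.0):
    """Yield (index, average) whenever the mean of the last `window`
    readings exceeds `threshold`. Readings are consumed one at a time."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold:
                yield i, avg


# Made-up sensor readings standing in for a live feed.
sensor = [90.0, 95.0, 100.0, 110.0, 120.0, 80.0]

alerts = list(rolling_alerts(sensor))
for index, avg in alerts:
    print(f"reading {index}: rolling average {avg:.2f} is over the limit")
```

Because the function is a generator, it never needs the whole dataset in memory, which is the essential difference between stream processing and the batch-style rollups you would run over stored data.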
Microsoft is now bundling up this entire group of technologies in what we are calling “Cortana Analytics”. You can learn more about that here: http://www.microsoft.com/en-us/server-cloud/cortana-analytics-suite/what-is-cortana-analytics.aspx
Here are a few other views on what a Data Scientist should know:
- Job interview questions for data scientists: http://www.datasciencecentral.com/profiles/blogs/66-job-interview-questions-for-data-scientists?goback=.mid_I207394501*45_*1
- So you Want to be a Data Scientist? http://www.jeffheaton.com/2014/02/so-you-want-to-be-a-data-scientist/
- 9 Must-Have Skills To Land Top Big Data Jobs in 2015: http://allabttech.com/2015/07/02/9-must-have-skills-to-land-top-big-data-jobs-in-2015/
- My data science journey: http://www.datasciencecentral.com/profiles/blogs/my-data-science-journey
- More free data science books: http://www.learndatasci.com/free-books/
- More than 100 data science, analytics, big data, visualization books: http://www.datasciencecentral.com/profiles/blogs/more-than-100-data-science-analytics-big-data-visualization-books