At the beginning of every data project is the data. While we spend a great deal of time figuring out how to move it, store it, compute on it and evaluate it, the most important step is often given short shrift: sourcing the data properly. And that involves two things: finding authoritative data and knowing what it really means.
Let’s say you need to find accurate data for international trade in the United States so that you can compare that with growth projections in your company. Before you locate and use that data, you need to ask yourself two important questions:
- Who do I trust to show me these numbers?
- How do I let others know about this data?
That second question might have given you pause, but it’s more important than you think. In data science, you will almost always work within a team. And your conclusions rest on the bedrock of this data, so you need to document where you got it and why you’re using it.
By the way, did you find that data yet? Did you use this one (http://www.census.gov/foreign-trade/statistics/historical/index.html) or this one (http://bea.gov/newsreleases/international/trade/tradnewsrelease.htm) or this one (https://research.stlouisfed.org/fred2/release?rid=51)? Or something else? Do they match? Do they use the same units of measurement, both for time and for the numeric values? Note that they are *all* from the same source: the government. In this case, that’s probably a good source to trust. Or is it?
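One way to make those questions concrete is to write down, for each candidate source, who publishes it, what units it uses, and why you chose it, before comparing any numbers. What follows is only a minimal sketch in Python. The `SourceRecord` class, the `mismatches` helper, and especially the unit and frequency values are placeholders I made up for illustration; check each site's own documentation for the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """A hypothetical note about where a dataset came from."""
    publisher: str
    url: str
    value_unit: str  # unit of the numeric values (placeholder below)
    time_unit: str   # reporting frequency (placeholder below)
    reason: str      # why the team decided to use this source


def mismatches(a: SourceRecord, b: SourceRecord) -> list:
    """Return the unit fields on which two sources disagree."""
    return [f for f in ("value_unit", "time_unit")
            if getattr(a, f) != getattr(b, f)]


# The unit and frequency values here are assumptions, not facts about the sites.
census = SourceRecord(
    publisher="U.S. Census Bureau",
    url="http://www.census.gov/foreign-trade/statistics/historical/index.html",
    value_unit="millions of USD",
    time_unit="annual",
    reason="long historical series",
)
fred = SourceRecord(
    publisher="Federal Reserve Bank of St. Louis",
    url="https://research.stlouisfed.org/fred2/release?rid=51",
    value_unit="billions of USD",
    time_unit="monthly",
    reason="easy to download",
)

print(mismatches(census, fred))
```

If the list that comes back is not empty, the two sources need converting to a common unit before their numbers can be compared at all.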
The idea of Master Data Management, or a Data Dictionary, has been around since before electronic computing. It is certainly not a new issue, but as we build more and more complicated analyses, the base data are the most important thing to evaluate. I’ve used multiple data catalog systems, and almost every one of them has failed. There are three primary reasons for this:
- It works within a single kind of system, tracking only certain data
- Not everyone knows about it, especially the millions of spreadsheet users who need it
- Only certain people can maintain it
Lately I’ve been using something I like a lot – the Azure Data Catalog (ADC). It overcomes these limitations, because it’s dead-simple to use, can point to any kind of data, and crowdsources the locations and descriptions of the data to everyone who uses it.
But it has one primary feature that I really like: it segments data by “Expert”. If you think about it, the expert is often the primary factor you use to find data. If you want your savings account balance, you go to your bank; if you want to know what time it is, you look at an atomic clock; if you want to know your company holidays, you look at your company portal. If you can’t find something in the fridge, ask your wife. (Hint: it’s right behind the olives.) You’re tracking Master Data Management in your head already, and when a new person starts at work, they ask you where that list is. You’re now the expert.
And so that answers the question. You trust the data because of who told you the data. Source is everything.
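The expert-first lookup described above can be sketched in a few lines of Python. To be clear, this is not the Azure Data Catalog API. The `CatalogEntry` and `Catalog` classes, the `find_by_expert` method, and the sample entries are all hypothetical, and only show what "segmenting by expert" might look like in a home-grown catalog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """A hypothetical catalog record: what the data is, where it lives, who vouches for it."""
    name: str
    location: str
    expert: str
    description: str = ""


class Catalog:
    """A toy catalog that anyone can add to, indexed by expert."""

    def __init__(self):
        self._by_expert = defaultdict(list)

    def register(self, entry: CatalogEntry) -> None:
        # Anyone can register an entry; this is the "crowdsourced" part.
        self._by_expert[entry.expert].append(entry)

    def find_by_expert(self, expert: str) -> list:
        # Return the names of all datasets a given expert vouches for.
        return [e.name for e in self._by_expert.get(expert, [])]


catalog = Catalog()
catalog.register(CatalogEntry("savings account balance", "online banking", "your bank"))
catalog.register(CatalogEntry("company holidays", "intranet page", "company portal"))
catalog.register(CatalogEntry("current time", "reference clock", "atomic clock"))

print(catalog.find_by_expert("your bank"))
```

The design choice worth noticing is that the expert, not the storage system, is the lookup key, so the same catalog can point at a database, a web page, or a spreadsheet.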
Sure, you could use a spreadsheet to do your Master Data Management. And of course there are other systems out there to track metadata about your data sources. Whatever you use, make sure you keep those three limitations in mind. Keep it simple, annotate for experts, and make sure everyone can use it. If you’re not using anything, I recommend you check out this video on the ADC. It’s quite useful.