The beginning of data science is data. Data are things that you know about, well, other things, so it makes sense to ensure you have a firm grasp on handling that data.

Note: I know this seems really basic, but stick with me – it gets deep quick, and it’s essential to understand this well. I’ve (recently) had to go back and qualify some statements a client made when we started down an analysis path, only to learn that one of these fundamental concepts was misunderstood, and it changed the whole project!

First, a couple of terms. Data is defined by the Oxford English Dictionary as anything that qualifies as a Noun. According to my favorite source of education, that means a “Person, Place or Thing”. The next term we care about is Metadata, which is defined as data about data. So let’s take a quick look at that in action:

Data: Buck Woody

Metadata: Name, Letters (number=10), English, Noun, Sentence, ASCII (Binary Conversion: 01000010011101010110001101101011 0101011101101111011011110110010001111001)…., ad infinitum

So you can see that the metadata about a datum is actually larger than the datum itself. (I’ve often wondered if there really is data at all, but rather just a group of metadata describing the datum which actually forms the datum, but I digress…)
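Here is a minimal Python sketch of that idea: it derives a few pieces of metadata from the single datum above. The `describe` helper and the keys it returns are my own illustrative choices, not a standard API.

```python
# Derive a few pieces of metadata from a single datum.
datum = "Buck Woody"


def describe(value: str) -> dict:
    """Return illustrative metadata about a string datum (hypothetical helper)."""
    return {
        "python_type": type(value).__name__,
        "characters": len(value),                      # counts the space too
        "letters": sum(ch.isalpha() for ch in value),  # letters only
        "is_ascii": value.isascii(),
        # One 8-bit group per character, separated by spaces for readability.
        "binary": " ".join(format(b, "08b") for b in value.encode("ascii")),
    }


metadata = describe(datum)
for key, val in metadata.items():
    print(f"{key}: {val}")
```

Even this short list takes more room to write down than the ten characters it describes, and you could keep adding to it for as long as you like.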

There are generally two types of data: qualitative (information about something) and quantitative (numerical information that can be calculated). So “Buck” is qualitative (my name) and “42” was my age (at one time anyway), which is quantitative – the number of years I’ve been taking up space on the planet. These distinctions are VERY important to the Data Scientist, since we’ve developed lots of methods to handle and show each of these kinds of information. You’ll use various quantitative techniques in R and Python, and other methods for qualitative data. You’ll also need to fundamentally understand these differences when you embark on learning Machine Learning techniques.

By the way – you can turn a qualitative datum into a quantitative one. We do this all the time – “On a scale from one to ten, how handsome is Buck Woody?” (don’t answer that). This is an important skill, and one I’ll cover in another Notebook entry.

So, does this device measure Continuous or Discrete data? Are you sure?

There are a couple of other data basics you need to understand. Does each datum take one specific value from a fixed set of segments? We call this a categorical datum – things like Male/Female, Night/Day, 1/0, Red/Green/Blue. Note that categorical data can be numeric (quantitative) or not (qualitative).
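As a rough sketch of what “categorical” means in practice, the Python below models Red/Green/Blue as a fixed set of allowed values and tallies some made-up observations. The `Color` enum, the sample lists and the `is_valid_category` helper are all assumptions for illustration.

```python
from collections import Counter
from enum import Enum


class Color(Enum):
    """The complete set of allowed categories (illustrative)."""
    RED = "Red"
    GREEN = "Green"
    BLUE = "Blue"


def is_valid_category(label: str) -> bool:
    """True only if the label is one of the fixed categories."""
    return label in {member.value for member in Color}


# Made-up observations, one qualitative and one numeric (1/0) category.
observed_colors = ["Red", "Blue", "Red", "Green", "Red"]
observed_flags = [1, 0, 1, 1]

# Counting is a sensible operation on categories of either kind.
color_counts = Counter(Color(label) for label in observed_colors)
flag_counts = Counter(observed_flags)

print(color_counts.most_common(1))
print(flag_counts)
print(is_valid_category("Purple"))
```

Note that the 1/0 flags are numbers, yet here they behave just like the color names: what you do with them is count how often each value occurs.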

The next distinction is whether the data is Discrete or Continuous. Continuous is probably easier to understand – it’s data that has a constant progression (up or down). Think about the temperature – it can be 0.000000001 Celsius, or 0.00001, or 0.01, or 0, or 1, depending on what level of precision you need for measuring it. The value ranges over a scale. In point of fact, there are infinitely many such values – and infinity even has its own branch of math to deal with it.

Discrete data has gaps. Read that again. That means there are not 1.0000001 bananas on my desk, just 1. It’s a discrete thing. This gets a little trickier than you might imagine, especially when we start thinking about that conversion from qualitative to quantitative I mentioned earlier (which is actually the error made in the project I talked about at the start).
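One way to see the gaps is to ask whether the value halfway between two valid values is itself valid. The sketch below does that for temperatures and banana counts; `midpoint` and `is_valid_count` are hypothetical helpers, and since floating-point numbers have finite precision this is only an illustration of the idea.

```python
def midpoint(low: float, high: float) -> float:
    """Value halfway between two measurements."""
    return (low + high) / 2


def is_valid_count(value: float) -> bool:
    """A count of whole objects must be a non-negative whole number."""
    return value >= 0 and float(value).is_integer()


# Continuous: between any two temperatures there is another temperature.
temp_between = midpoint(0.0, 1.0)
temp_closer = midpoint(0.0, temp_between)

# Discrete: the value between 1 and 2 bananas is not a valid count.
bananas_between = midpoint(1, 2)

print(temp_between, temp_closer)
print(bananas_between, is_valid_count(bananas_between))
print(is_valid_count(1), is_valid_count(2))
```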

Let’s say you’ve lined up me and several Hollywood actors. I’ve asked you to “Put us in groups of attractiveness”. You’ve done that, arranging us into “Scary”, “Guy Next Door” and “Wow”. Are these groups equally spaced? How much better is “Wow” than “Scary”? Is there a “Partly-Wow” we should have used? “Amazingly-Scary”? The point here is that a fundamental error I see quite often is using a Classification technique in a numeric comparison. You’ll see this all the time in rankings – rate your teacher from 1-10, or this meal from 1-7. You can’t treat discrete data like continuous data, especially when you are choosing an algorithm to work with.
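To make that concrete, here is a small Python sketch with made-up ratings. Both number codings keep the groups in the same order, but they disagree about how far apart the groups are. The groups, the codings and the `summarize` helper are assumptions for illustration only.

```python
from statistics import mean, median

# Two codings that both preserve Scary < Guy Next Door < Wow.
equal_steps = {"Scary": 1, "Guy Next Door": 2, "Wow": 3}
uneven_steps = {"Scary": 1, "Guy Next Door": 2, "Wow": 10}

# Made-up ratings for two groups of people.
group_a = ["Guy Next Door", "Guy Next Door", "Guy Next Door"]
group_b = ["Scary", "Scary", "Wow"]


def summarize(labels, coding):
    """Return (mean, median) of the labels under a given numeric coding."""
    scores = [coding[label] for label in labels]
    return mean(scores), median(scores)


results = {}
for name, coding in [("equal", equal_steps), ("uneven", uneven_steps)]:
    mean_a, median_a = summarize(group_a, coding)
    mean_b, median_b = summarize(group_b, coding)
    results[name] = {
        "a_wins_on_mean": mean_a > mean_b,
        "a_wins_on_median": median_a > median_b,
    }
    print(name, results[name])
```

With these numbers, which group “wins” on the mean depends entirely on the coding I picked, while the median, which only uses the order, gives the same answer both times. That is the trap in averaging a 1-10 rating as if the steps were equal.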

I’ll refer back to this post as I cover algorithms in the future. For now, your homework is to use your newfound skill and start looking at the world using these terms – where do they work, and do they break down anywhere? Why?


4 thoughts on “Databas(ics)”

  1. Maybe the jump from data to metadata was too fast. Sure, “Buck Woody” is data, even if it’s a single chunk of data. Metadata are perhaps too simplistically regarded as data about data, though they typically try to define a dataset or data of the same type, and then, when compared with the dataset, the size of the metadata is quite small. Metadata are context specific, and therefore there can be multiple characteristics that can be defined, some of them irrelevant in some contexts.
    Regarding your example of metadata – letters can hardly be considered metadata, and the same goes for sentence, ASCII or even noun. Normally one stays, for the definition of metadata, in the realm of databases and refers to characteristics like data type (e.g. date, alphanumeric, numeric, interval, binary, etc.), minimal/maximal length, format, precision, etc.

    Number = 10 is more a statistic about your data, while the binary representation is a conversion.

    Turning qualitative into quantitative data is more complex than that, and the quantitative characteristics need to represent to some degree the qualitative ones. Typically that’s achieved by substituting a qualitative scale with a quantitative one – e.g. substituting bad, ok, good with 1, 2, 3, though here it would be helpful if the proportion between the different points on a scale could be kept during conversion. Substitution can also be done with categorical values that can’t be ordered on a scale, though then they’re just “IDs” or “references” for the real thing.

    From my point of view, I don’t see your examples as representative, especially for someone who’s trying to understand the topic. Otherwise, good effort in the attempt to shed some light on the topic!

