String’s over here, yo

The first step of deep analysis using the Team Data Science Process is to find the right question. From there, we need to determine where the data we need lives, or even whether we have it. If we don’t have it, we need to get it – a topic I’ll cover in another post.

The next step is simply to examine the data we’ve gathered. There are quite a few tools we can use to do that, depending on what we are looking for.

Let’s assume that we’ve gathered our data and found out what it has to say. And so begins the “Data Wrangling” part of Data Science.

After we obtain the data, we’ll want to warehouse it. I want to make sure that I’m clear here – I’m not talking about using a Data Warehouse, which refers to a specific set of technologies, although that’s a valid thing to do as well. When I say “warehouse the data”, I mean that we want to store it based on a point in time in a location we control. The data will now have a historical permanence (even if we add to it) that we can use for analysis that involves a time requirement, such as time series, point-in-time numerics, and so on. Remember, in Data Science we Extract, Load and then Transform (ELT) rather than using an Extract Transform and Load (ETL) process. We transform at the latest possible step, using the technology that step needs. That way the base data is left pristine to transform with another tool in a separate step.

At some point we *do* want to transform the data. There are many, many options to do this, and most of them are highly dependent on the language or technology you are using. One of the most “atomic” or basic ways to transform data is using Regular Expressions (or regex for short).

Regular expressions are a way to find data within a stream – the word stream here meaning data coming in from some source, such as the keyboard or, more commonly, a file. Regular Expressions have been around for a very long time in computing science. Since regex is such a mature technology, has lots of tutorials, and is used in so many other technologies, it’s a good place for us to start our discussion of Data Wrangling in Data Science projects. Once you locate the strings you’re looking for, you have multiple options for changing them.

This article isn’t an entire tutorial on Regular Expressions. There are lots of those already (check the References section at the end of this article, and this one is amazing: https://github.com/zeeshanu/learn-regex). Instead, I want to focus on what regex is, and when you might use it. We’ll hit only the important concepts here. Becoming really good with regex takes time, but it is certainly worth the effort.

Think about the “Find and Replace” feature in most text editors you’ve worked with. You want to find “Amazing Guy”, and replace it with “Buck Woody” for instance – or perhaps you’ve even gone the extra step of learning how to find uppercase letters and replace them with lower case letters. Regular Expressions involves the first part of that process (the Find) with an extremely powerful syntax. You simply type in a string that you want to find, along with some instructions on how to find it.
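To make the Find-and-Replace idea concrete, here is a minimal sketch using Python's built-in re module. Python is my choice for illustration only, and the sample sentence is made up; any regex-capable tool would do the same job.

```python
import re

# Made-up sample text for illustration.
text = "Amazing Guy wrote this post. Thanks, Amazing Guy!"

# Find: does the pattern occur anywhere in the text?
found = re.search(r"Amazing Guy", text)
print(found is not None)

# Replace: the program using the regex (here, re.sub) swaps every match.
replaced = re.sub(r"Amazing Guy", "Buck Woody", text)
print(replaced)

# The extra step: find uppercase letters and replace them with lowercase.
lowered = re.sub(r"[A-Z]", lambda m: m.group(0).lower(), replaced)
print(lowered)
```

Notice that the regex only does the finding. The replacing is done by the function wrapped around it.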

That syntax, however, can be a little daunting to learn at first. Let’s take a look at the important concepts you need to understand about typing in a Regular Expression.

Installing Regular Expressions

OK, there *isn’t* an install. Regular Expressions are a concept that is implemented by other systems, not a system in and of itself. That’s actually part of the power of using them in Data Wrangling – they are used in many Data Science languages and tools.

Many editors implement regex as well, such as vi (or gvim, which is what I use), Visual Studio, and many others.

Regular Expressions locate patterns of characters. What happens to those results normally depends on the program that implements the regex. For instance, an editor that uses regex might highlight the words you’ve searched for, while a processing engine might locate the text and then alter it in some way. The command-shell utility “sed” can use a regular expression to quickly locate text within a file and replace it with other text. I use sed quite often, using the bash shell in git – and you can find utilities like sed, awk and grep for Windows in many locations. Or you could use the Windows 10 Linux subsystem. (Or just use Linux.)

NOTE: The Substitution feature in Regular Expressions does allow for some manipulation of the characters – but the real power of regex lies in locating characters, not altering them.

Regular Expressions Primary Concepts

As I mentioned, this isn’t a complete discussion of Regular Expressions – there are a lot of things for us to learn even after this brief introduction. These are some of the more important concepts you need to understand to start working with regex.

The Match

The “match” is the most important concept in Regular Expressions. You have three options: things that match what you are looking for, things that don’t, and things that are close to what you are looking for. This is the primary feature in Regular Expressions – the “engine” or program you’ll use regex in will then do something with what you match. Regex is simply a means to an end.

To start the matching process, you type in a series of characters you are looking for, using the text itself, substitutions (like finding Return and Newline characters), classes of characters (such as whitespace, decimal digits or word characters), anchors (where the string sits, such as the beginning or end of a line), and quantifiers (how many times a character or string occurs). You can also specify case settings and more.
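Here is a small sketch of how those pieces combine into one search string, again assuming Python's re module. The sample lines and the pattern are hypothetical, picked only to show literal text, a class, an anchor, a quantifier and a case setting working together.

```python
import re

# Hypothetical sample lines.
lines = ["Order 42 shipped", "order   7 pending", "Reorder 13 held"]

# ^        anchor: the match must start at the beginning of the string
# order    literal text
# \s+      class + quantifier: one or more whitespace characters
# \d{1,2}  class + quantifier: one or two decimal digits
# re.IGNORECASE is the case setting: "Order" and "order" both match
pattern = re.compile(r"^order\s+\d{1,2}", re.IGNORECASE)

matches = [line for line in lines if pattern.search(line)]
print(matches)
```

The third line is skipped because of the anchor: it contains "order", but not at the start.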

You can find a list of some of these characters here: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

Patterns

Patterns are the strings of characters you are looking for – or looking for the opposite of. Start there – with what you’re looking for, and then add in the conditions that you are interested in, such as “at the end of a line”, and add all that up. That makes your search string.

For certain characters, specifically the ones that the regex engine uses itself, you need to “escape” the character you want to find (that is, place a special character in front of it) so that it matches properly. There are a few characters that regex “reserves” for itself, such as brackets, braces, plus-signs and others ( [ ] { } + $ . * | ). These are called meta-characters. If you want to match one, you need to place a single back-slash in front of it, like this:

\[

If you want to search for the back-slash, you need to escape that as well:

\\
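As a quick sketch of escaping in practice, assuming Python's re module and a made-up file path: raw strings (the r"..." form) keep Python from interpreting the back-slashes before the regex engine sees them, and re.escape is a standard-library helper that escapes meta-characters for you.

```python
import re

# Made-up sample text; the raw string keeps both back-slashes as-is.
text = r"path C:\data\file[1].csv"

# \[ matches a literal opening bracket.
bracket = re.search(r"\[", text)
print(bracket.group(0))

# \\ matches one literal back-slash.
backslashes = re.findall(r"\\", text)
print(len(backslashes))

# re.escape adds the back-slashes for every meta-character in a string.
literal = re.escape("file[1].csv")
whole = re.search(literal, text)
print(whole.group(0))
```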

Interestingly, if you use a character that does *not* need to be escaped after the backslash, it takes on another meaning. For instance, \t matches a tab character. This is one search you’ll find yourself using a lot in data wrangling. In fact, if you know the ASCII Code for any character, even ones that don’t print, you can find them. Tab, for instance, is ASCII 0x09, so to search for it would be \x09. I also deal with Unicode from time to time, and regex works with that, too. Replace the x with a u and you have the same process – \u20AC finds the Euro character.
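Here is a minimal sketch of those three searches, assuming Python's re module (which accepts \t, \xhh and \uhhhh in patterns). The tab-separated row is invented for the example.

```python
import re

# Invented tab-separated row containing a Euro sign.
row = "id\tprice\t\u20ac25"

# A tab can be found by its escape name or by its hex code.
tabs_by_name = re.findall(r"\t", row)
tabs_by_hex = re.findall(r"\x09", row)
print(len(tabs_by_name), len(tabs_by_hex))

# \u20AC finds the Euro character.
euro = re.search(r"\u20AC", row)
print(euro.group(0))

# A common wrangling step: split a row into fields on the tab.
fields = re.split(r"\t", row)
print(fields)
```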

Ranges

The next important thing to know about regex is the sets of characters it can work with. You can look for any one of a group of characters with regex, and this is one of its most powerful features. Let’s say you want to find the word disc, but you’re not sure if the file is using disc or disk. You can match that pattern easily by including both the c and the k in the search string, between brackets:

dis[ck]

This works for ranges of characters as well. Assume you want to find any single digit from 1 to 9 in a string. You could write:

[1-9]

and any digit in that range will match. This also works for letters.
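A short sketch of both bracket patterns, assuming Python's re module and a made-up sentence:

```python
import re

# Made-up sample sentence.
text = "Insert disc 2, then disk 10, not dish 0."

# [ck] matches either a c or a k in that position, so "dish" is skipped.
spellings = re.findall(r"dis[ck]", text)
print(spellings)

# [1-9] matches one digit at a time, and 0 is outside the range.
digits = re.findall(r"[1-9]", text)
print(digits)

# Ranges work for letters too: [A-Z] matches any one uppercase letter.
capitals = re.findall(r"[A-Z]", text)
print(capitals)
```

Note that a bracket expression always matches a single character, so the "10" in the text yields only the 1.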

Quantifiers

To specify a number of times to match something, use the {} characters. For instance, to find any three digits in a row, you would write:

\d{3}

This says to match decimal digits only (the \d part), and exactly three of them in a row (the {3} part).
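To see \d{3} at work, here is a small sketch assuming Python's re module and an invented phone-style string. The \b word-boundary marker in the second pattern is an addition of mine, not something covered above.

```python
import re

# Invented sample text.
text = "Call 555-0142 or extension 77."

# \d{3} takes three digits wherever it finds them, including the
# first three digits of a longer run such as 0142.
three_digits = re.findall(r"\d{3}", text)
print(three_digits)

# Adding \b (a word boundary) on each side keeps only runs that are
# exactly three digits long.
exactly_three = re.findall(r"\b\d{3}\b", text)
print(exactly_three)
```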

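There are a couple of other common Quantifiers you’ll run into quite a bit in regex strings. Use the * character to match zero or more of the preceding item (such as \d* for a run of digits of any length, including none), and the ? character to match the preceding item zero or one time (such as colou?r to match both color and colour). A related meta-character is the period, which matches any single character (such as t.n to match tin, ton, and ten). There are other Quantifiers you’ll see in the tutorials below.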
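The exact behavior of these characters is easy to mix up, so here is a sketch of how Python's re module treats *, ? and the period. The sample strings are my own, picked to show the edge cases.

```python
import re

# * means zero or more of the preceding item.
star = re.findall(r"ab*", "a ab abbb")
print(star)

# ? means zero or one of the preceding item (it makes the item optional).
optional = re.findall(r"colou?r", "color colour")
print(optional)

# . matches any single character, so t.n covers tin, ton and ten.
any_char = re.findall(r"t.n", "tin ton ten tn")
print(any_char)

# By contrast, t?n only means "an optional t, then n".
optional_t = re.findall(r"t?n", "tin tn")
print(optional_t)
```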

Anchors

Another pivotal concept is the location of the characters you want to find, which is set with “anchors”. You might want to find strings that start at the beginning of a line, the end of a line, or starting at a certain number of characters inward. You can build expressions that find strings that follow or precede other strings.

Let’s say you want to find two digits at the beginning of a line. To specify that we want a digit, we’ll use the meta-character “\d”, and to specify two of them, we add {2}. That part we’ve seen before. The start anchor character is “^”, so we place it in front to pin the match to the beginning of the line:

^\d{2}
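Here is how that anchored pattern behaves in a sketch using Python's re module, with invented sample lines. One Python-specific detail: ^ means the start of the whole string unless you pass the re.MULTILINE flag, which makes it match at the start of every line.

```python
import re

# Invented sample lines.
lines = ["42 apples", "7 pears", "item 99", "2024 widgets"]
pattern = re.compile(r"^\d{2}")

# Keep each line that begins with two digits, plus the digits found.
starts = []
for line in lines:
    m = pattern.search(line)
    if m:
        starts.append((line, m.group(0)))
print(starts)

# With several lines in one string, re.MULTILINE anchors ^ to each line.
block = "10 a\nb 20\n30 c"
first_only = re.findall(r"^\d{2}", block)
every_line = re.findall(r"^\d{2}", block, re.MULTILINE)
print(first_only, every_line)
```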

There are a LOT of pattern combinations you can build. They look quite confusing when you see them complete, but you’ll get the hang of reading regular expressions as you learn to build them. There are some sites that will assist you in building regex patterns – check this one out for practice: http://www.regexr.com/ .

There are also many other concepts within Regular Expressions that you can use in your Data Wrangling tasks – check out the references below for more in-depth learning.

References
