Data Wrangling – ELT not ETL

dmIn Data Science, we most often use Extract, Load and Transform (ELT) as opposed to Extract, Transform and Load (ETL) that you see most often in Business Intelligence (BI). There are a couple of reasons for this.

First, in many BI solutions you have a few data sources that you are integrating into a historical data exploration tool. You need to combine these sources into database structures like cubes. You’re setting up pre-computed set of data for users to explore and get answers to questions, often using visualizations.

To create these combined database structures from disparate data sources, you need to ensure that the data is related through a key field of some sort, and that the data are fairly homogenous – numbers are numbers (not text), text is text (not binary) and that keys fields are the same to link everything up. So you Extract the data from the sources, Transform the data into the right types and shapes, and then Load it into the analysis structures. ETL.

But in Data Science, you’re not browsing data, but looking for an answer. You’ll still extract the data – usually with far more variety and from many more sources – and you’ll probably end up processing it with several tools, from R to Azure ML, or perhaps with Python. Or more likely you’ll use many technologies (yes, SQL as well) to get different parts of the answer, or to get multiple kinds of answers. Since each technology has different aspects (R has about 9 datatypes, SQL Server over 20 for instance) you want to wait a late as possible in each step to change the source data in any way. Every time you change the data you lose fidelity from the source. And base data is the lifeblood of statistical analysis – the base numbers are VERY important. They have to be representative,

scienceSo in Data Science, you Extract, then Load, and then only at the last moment do you Transform. The overall ELT process is often called “Data Wrangling”, and has interesting direct and side effects. In the posts that follow, I’ll cover this process in more detail. For now, focus on locating, documenting and understanding the sources of your data.

Advertisements

3 thoughts on “Data Wrangling – ELT not ETL

  1. Great post, Buck!

    I noticed a shift from ETL to ELT (and ELTL) when SQL Server 2012 was released. I attribute it to many factors, but the speed of the relational engine tops my list. In previous versions, SSIS could transform the data in memory faster than the engine. That less often the case since 2012.

    :{>

    Like

    1. I still favor BLT, but my doctor advises against!

      I concur with Andy on the shift, or in my case “drift” into ELT for pragmatic reasons.
      My client base cannot afford a true ETL process as one might run in SSIS. Although the applications are mostly BI in nature, incoming data is so widely disparate otherwise messy, that we need to run a “fast path” (Lambda or Kappa, or something in between) to get something to the client quickly. I think current nomenclature makes this a “Data Vault”, and an “immutable data store” per Martin Kleppmann.

      Depending on downstream analytics requirements, we can sometimes take a direct feed from the vault (Azure SQL DB with semi-normalized querys) and finish the job.

      If not, sometimes we “cheat” and do fast transforms via Access (easy interface) or SSMS (quicker).

      I don’t actually know what to call this, other than Data Science-y BI, with the human analyst running most of the analytics in PBI or Tableau.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.