DevOps for Data Science – Infrastructure as Code

In the previous post in this series on DevOps for Data Science, I explained that it’s often difficult to implement all of the DevOps practices and tools at one time. I introduced the concept of a “Maturity Model”: a list of things you can do, in order, that will set you on the path to implementing DevOps in Data Science. The first thing you can do in your projects is to implement Infrastructure as Code (IaC).

Right away I may have put a few Data Scientists off. It isn’t that they can’t set up a server or its components; it’s just that this isn’t normally their job. However, in a Data Science project you’re often working with technologies that the other team members aren’t as familiar with, or perhaps you’re working in a cloud environment. In either case, the software, the hardware configuration (virtual or otherwise), containers (if you use those, and yes, you should), Python environments, R libraries, and many other parameters affect the experiment. It’s essential that you’re able to capture all of that and store it in a source-control system so that you can re-create it for testing, deployment and the downstream phases.

That brings us to scripts. There are lots of ways to build a Virtual Server these days, both on and off premises. Containers are deployed with scripts. Python environments (using Anaconda) and R environments (using the packrat or checkpoint packages) can be defined with dependency files or in the code itself, and both have ways of pinning library and function versions. All of that can and should roll up into a repeatable set of steps so that you can re-deploy to that exact environment for accurate testing.
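As a small illustration of the "dependency file" idea, here is a minimal sketch in Python that records the interpreter version and the exact version of every installed package, so the result can be committed alongside the experiment. It uses only the standard library; the function names and the output file name `environment_snapshot.json` are my own placeholders, not part of Anaconda or any other tool, and a real project would more likely rely on the export features of its environment manager.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter version and the pinned version of each package."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is incomplete
            packages[name] = dist.version
    ordered = dict(sorted(packages.items(), key=lambda item: item[0].lower()))
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": ordered,
    }


def to_requirements(snapshot):
    """Render the snapshot as pinned 'name==version' lines."""
    return "\n".join(
        f"{name}=={version}" for name, version in snapshot["packages"].items()
    )


if __name__ == "__main__":
    snap = snapshot_environment()
    # Hypothetical file name; store it in source control next to your code.
    with open("environment_snapshot.json", "w", encoding="utf-8") as handle:
        json.dump(snap, handle, indent=2)
    print(to_requirements(snap))
```

Sorting the package names keeps the file stable from run to run, so a source-control diff shows only the libraries that actually changed.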

If you’re using Microsoft Azure, the Resource Manager is a way of gathering all of the pertinent resources for your solution under a single umbrella, the resource group. After you create your storage, servers, services, networks and whatever else you need, the resource group’s “Export template” option in the portal (labelled “Automation script” in older versions of the portal) lets you script everything out into a JSON file that you can edit, version, and save, and then redeploy with PowerShell, C#, Python or even the Azure Portal itself. Other cloud providers offer similar features to create artifacts from code. In any case, get whatever scripting artifacts re-create your environment at a given state into your project.
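To make the JSON file less abstract, here is a hedged sketch in Python that builds a very small Resource Manager template for a single storage account and writes it to disk so it can be versioned. The account name `examplestore`, the file name `azuredeploy.json`, the API version string and the helper functions are assumptions chosen for illustration; a template exported from your own resource group will be much larger and should be treated as the source of truth.

```python
import json


def build_template(storage_account_name):
    """Return a minimal ARM template (as a dict) for one storage account."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/"
                   "2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "location": {
                "type": "string",
                # Default to the region of the target resource group.
                "defaultValue": "[resourceGroup().location]",
            }
        },
        "resources": [
            {
                "type": "Microsoft.Storage/storageAccounts",
                "apiVersion": "2022-09-01",  # assumed; check the current version
                "name": storage_account_name,
                "location": "[parameters('location')]",
                "sku": {"name": "Standard_LRS"},
                "kind": "StorageV2",
            }
        ],
    }


def render(template):
    """Serialize with stable formatting so source-control diffs stay readable."""
    return json.dumps(template, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    # Hypothetical names; replace with your own before deploying.
    with open("azuredeploy.json", "w", encoding="utf-8") as handle:
        handle.write(render(build_template("examplestore")))
```

Once the file is in your repository, one way to redeploy it is with the Azure CLI, for example `az deployment group create --resource-group <your-group> --template-file azuredeploy.json`, where the resource group name is yours to supply.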

Even if you don’t develop a full DevOps mindset at your organization, this is a skill you can and should learn. And if you do, you’re on your way to a better structured and managed project.

See you in the next installment on the DevOps for Data Science series.

For Data Science, I find this progression works best, taking these one step at a time and building on the previous step. The entire series is here:

  1. Infrastructure as Code (IaC)
  2. Continuous Integration (CI) and Automated Testing
  3. Continuous Delivery (CD)
  4. Release Management (RM)
  5. Application Performance Monitoring
  6. Load Testing and Auto-Scale

In the articles that follow in this series, I’ll help you implement each of these in turn.

(If you’d like to implement DevOps, Microsoft has a site to assist. You can even get a free offering for Open-Source and other projects: https://azure.microsoft.com/en-us/pricing/details/devops/azure-devops-services/)
