Data Science attempts to derive meaning from data. There are a lot of techniques, processes and tools you can use to do that – I cover those in this blog site.
But Data Science is insecure – by default. And that’s a real problem.
In a solution involving a Relational Database Management System (RDBMS), you’ll see at least two very distinct focuses from the IT team. One is the development side, concerned with coding, business logic, performance, and the data itself. The other is the administration side, dealing with hardware, networking, configuration, other types of performance, and the safety and reliability of the data. The administration team also watches the system carefully to predict and manage growth. But both of these teams have concern for security drilled into them.
On the development side, the team might build in a single account to access all of the data (a proxy account) and then control the users inside the application. Or, they might use the Principals (accounts, people or certificates) and Securables (data, code, and other database constructs) features in the RDBMS to create a very fine-grained access and authentication mechanism.
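To make the proxy-account approach concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the in-memory SQLite database stands in for the RDBMS, and the names `APP_PERMISSIONS`, `open_proxy_connection` and `read_table` are hypothetical, not part of any real product. The application holds one connection for everyone and decides for itself who may read what.

```python
import sqlite3

# Hypothetical application-level rules: user name -> tables that user may read.
APP_PERMISSIONS = {
    "analyst": {"sales"},
    "auditor": {"sales", "salaries"},
}


def open_proxy_connection():
    """Stand-in for the single proxy account the application uses for all access."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.execute("CREATE TABLE salaries (employee TEXT, salary REAL)")
    conn.execute("INSERT INTO sales VALUES ('north', 100.0)")
    conn.execute("INSERT INTO salaries VALUES ('alice', 50000.0)")
    return conn


def read_table(conn, user, table):
    """Enforce the application's own rules before touching the database."""
    allowed = APP_PERMISSIONS.get(user, set())
    if table not in allowed:
        raise PermissionError(f"{user!r} may not read {table!r}")
    # Interpolating the table name is acceptable here only because it was
    # just checked against the allow-list above.
    return conn.execute(f"SELECT * FROM {table}").fetchall()


if __name__ == "__main__":
    conn = open_proxy_connection()
    print(read_table(conn, "analyst", "sales"))
    try:
        read_table(conn, "analyst", "salaries")
    except PermissionError as err:
        print("denied:", err)
```

Notice that the database never learns who the real user is, so every check lives in application code. With the Principals and Securables approach, the database engine enforces the permissions itself, typically through statements such as GRANT, and the rules hold no matter which tool connects.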
From the administration side, a primary concern is who has access to the objects in the database, how that is controlled, and what they can access. These considerations start at the server location itself, continue on to the hardware (even the drivers for the hardware), the encryption for data at rest and in transit, and even out to the backup media. Security permeates everything in the system from start to finish. (Or at least it should – if not, stop what you’re doing right this second and fix that!)
The world of the Data Scientist has been…different. In some environments, the Data Scientist works alone, on very specific solutions. To do their job correctly, they need ALL of the data they can possibly find – inside and outside the organization. Data is the lifeblood of Data Science. More is almost always better.
But the security of that data is often not given anywhere near the concern it gets in team-based solutions. I’ve witnessed this time and again, in the past and even today. The isolation and focus of the Data Scientist, along with the rest of the company not fully comprehending the processes the Data Scientist uses, lead to a lull in paranoia about the security of the data they work with. I’ve seen USB drives with private data, screens left unlocked, CSV files mailed using public “free” services, and many other practices that would get you fired in another IT team. There simply isn’t the desire to restrict the Data Scientist’s access to data – we want access to all of it. In fact, we can’t do our job without being inherently insecure – at least from the default position. But this has to change. As the Data Scientist moves into the team development process, and as the discipline starts to take center stage, it’s time to start taking security very seriously.
I’ve focused on two things in my 30+ year career in IT: Data, and Security. I firmly believe if you have those skills you’ll always be in demand (and doubly so if you combine them). I’ve earned my CISSP certification and worked in military, medical and government research organizations, dealing with extremely high levels of security. I have seen people lose their jobs, and some prosecuted, for improper data access. When you’re dealing with government secrets, nuclear environments, and most importantly human lives, it’s kind of a thing.
There is no “trivial” data. There are no “overly paranoid” processes. Your data, and mine, are out there. Regardless of what type of data professional you are, it’s up to each of us to protect the data. If you have access to data, you have a duty to protect it.
The easiest way to secure the data is to become aware of the path data takes from origin through processing to output. Access and Authentication are the two places you need to focus on.
There are lots of ways to secure the data in a Data Science process. It’s one of the reasons we put R in SQL Server – you get the protection of SQL Server and the full R environment at the same time. And of course there are other methods to use. It starts with being a little paranoid, and then educating yourself on what to do about it.
(Oh, just so you know: having worked extensively in the intelligence industry, I can tell you they really ARE out to get you.)