Since the first practical camera was invented in the 1800s, it has been used as a scientific tool. In essence, it’s a database – albeit one that stores shades of light rather than 0s and 1s, and one that didn’t use a hard drive (at least at first).
In 2012, a new type of camera was introduced: the Lytro. Actually, it’s not technically a camera – it’s a “Light Field Imaging Platform”, because it differs from previous cameras in a very significant way. A conventional camera captures the light from a single moment, focused at a chosen distance. Using settings such as shutter speed and aperture, the camera controls how much light reaches and exposes a recording medium. In the past, this medium was light-sensitive chemicals on film; later it was replaced by the electronic sensors in digital cameras and phones. The interesting part is that the entire camera is designed to tune out most of the information in the scene and capture only the parts you care about.
The Lytro is different – it records all of the information it can, completely unfocused and mostly uncompensated. But what use is that? Viewed directly, the information is a mess – it’s too dense. That’s where the difference comes into play. Using software, you tell the system what you want to focus on, and how to focus on it, after you have captured all of the information. (Read more about it here: https://www.lytro.com/about ). Essentially you have every picture you could want from that one field of view; you just need to process it later into whatever focus and light you care about. You can get hundreds of pictures this way from just one shot.
When working with Business Intelligence or Data Warehousing, it’s common to follow an “Extract, Transform and Load” (ETL) process. You find your source data, then decide what formatting, data types, lengths and other tuning that data needs to make it homogeneous. You might load it into staging tables using tools like SQL Server Integration Services (SSIS), or change it into its final form as it streams in. The point is to give the data a shape that is well suited to the reporting and exploration you want to perform on it, because you largely know in advance the type of queries you want to run.
In Data Science, it’s quite the opposite. Any change to the data risks losing fidelity – even normalizing the type of data, such as changing text to numbers, is fraught with peril. In fact, the process changes from ETL to ELT: you Extract, you Load, and only when you query the data do you Transform it. In Data Science, you want the data to stay as true to the source as possible, because you aren’t sure yet what you want to ask of it. You’ll also use the data multiple times, with multiple systems, each of which might have its own processing engine or data shape requirements.
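To make the ELT idea concrete, here is a minimal sketch in Python using the standard library’s sqlite3 module. Everything in it is an assumption made for illustration: the sample rows, the raw_readings table, and the helper names are hypothetical, and an in-memory SQLite database stands in for whatever storage you really use. The point is only that the values are loaded exactly as they arrived, and the text-to-number conversion happens when you ask a question.

```python
import sqlite3

# Hypothetical raw readings, kept exactly as the source delivered them.
RAW_ROWS = [
    ("sensor-a", "21.5"),
    ("sensor-b", " 19 "),
    ("sensor-c", "n/a"),
]

def load_raw(conn, rows):
    """Extract and Load: store every value as untouched text."""
    conn.execute("CREATE TABLE raw_readings (source TEXT, value TEXT)")
    conn.executemany("INSERT INTO raw_readings VALUES (?, ?)", rows)

def to_float(text):
    """One possible Transform; returns None when the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def numeric_readings(conn):
    """Transform at query time, leaving the stored rows unchanged."""
    rows = conn.execute("SELECT source, value FROM raw_readings ORDER BY source")
    return [(source, to_float(value)) for source, value in rows]

def unparsed_sources(conn):
    """A later question that can only be answered because the raw text was kept."""
    rows = conn.execute("SELECT source, value FROM raw_readings ORDER BY source")
    return [source for source, value in rows if to_float(value) is None]

conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ROWS)
print(numeric_readings(conn))
print(unparsed_sources(conn))
```

Had the conversion been applied before loading, the “n/a” reading would likely have been dropped or replaced with a default, and the second question (which sources sent something that wasn’t a number?) could no longer be asked.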
So when you’re thinking about the base data you’ll use in your Data Science projects, think Lytro, not Kodak. Not that there’s anything wrong with Kodak, of course – it’s just that the more you leave the data in its original shape, the more systems you can process it with, and the more options you have for working with it. Storage is cheap, so bring it all in. And leave it alone – for now.