To store or not to store

On BLOBs, data lakes and the risks of storing everything
Hans Sloane, who died in 1752, was a curious man. Born in Ulster in 1660, he was rich enough to buy up most of Chelsea, where Sloane Square and Sloane Street, among others, are named after him. But Hans Sloane was also a collector extraordinaire. His possessions included, “a set of surgeons’ instruments made from fish-skin; inks and inkhorns; face paint; medicinal powders and pills; women’s shoes made of leather and silk; gold and silver pins and needles for the practice of acupuncture; tobacco pipes; several portable Buddhist idols; gilded rhinoceros horns; ‘metallick burning glasses’ and ‘a ball of several colors to be thrown into the fire to perfume a room’.” In their volume and sheer diversity, Hans Sloane’s collections were rather similar to the modern notion of a data lake.

Last month I discussed the concept of data silos and how you might begin migrating your company’s data out of an inaccessible, fragmented environment into more accessible, scalable, consolidated structures. This approach, if properly managed, can allow your organisation to take advantage of previously inaccessible analytics. One of these consolidated structures, very much en vogue right now, is the data lake. Be warned, however: a data lake may not be as inviting as it first appears.

Doing away with preconceptions
The general principle of a data lake is that large volumes of data from different origins can be integrated, in their native formats and models, in a central repository. Real-time streaming data from sources like social feeds or the internet of things can be stored in the lake in its raw format. The raw format is important: it reduces processing time during acquisition and, because everything is kept, limits restrictions on later use. Raw data are typically stored as distinct data objects called BLOBs (Binary Large Objects), whose data models or schemas are determined by the applications accessing them. This is fundamentally different from traditional data warehousing, where the required data model is applied to incoming data before storage.
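The schema-on-read idea behind BLOB storage can be sketched in a few lines of Python. This is an illustrative toy, not any particular product’s API: the “lake” stores opaque bytes untouched at ingest, and each consuming application imposes its own structure only when it reads.

```python
import json

# A minimal "data lake": raw bytes stored as-is, no schema applied on write.
lake = {}

def ingest(key, raw_bytes):
    """Store the payload untouched -- schema-on-write would parse and validate here."""
    lake[key] = raw_bytes

# Two events arrive from a social feed in their native JSON form;
# note they do not even share the same fields.
ingest("event-1", b'{"user": "ada", "likes": 3}')
ingest("event-2", b'{"user": "alan", "likes": 7, "shares": 1}')

def read_likes(key):
    """Schema is applied only at read time, by the consuming application."""
    record = json.loads(lake[key])   # the reader decides the structure
    return record.get("likes", 0)    # and how to handle missing fields

total = sum(read_likes(k) for k in lake)
print(total)  # 10
```

A warehouse would have rejected or reshaped `event-2` at ingest; the lake defers that decision, which is exactly what keeps later analytic options open.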
 
Despite differences like this, within a data lake BLOBs can coexist alongside structured, semi-structured and other unstructured data, all unaltered from their initial or previous forms. And therein lies the appeal of data lakes. From an analytics standpoint, large, diverse data lakes can be invaluable precisely because their contents have not been adjusted before being stored: preconceived notions of what may or may not be valuable, which can artificially direct analytic investigations, are avoided. In fact, analytics are generally cited as the principal drivers behind the development of data lakes, which can provide scalable, fast, highly accessible data.
 
Finding diamonds in the sand
But as much as data lakes can facilitate insights, they can also bog them down. “The let’s store everything strategy hasn’t worked,” argues Adam Wray, CEO and President of Basho, because it makes finding the value in the data more difficult. When everything is collected without priority or category, finding the diamonds in all that sand becomes harder and harder. Another (of several) potential issues highlighted by Wray is that of differences in data privacy legislation between geographic locations. Storing all of your data, in any form you choose, wherever you choose, may expose your business to unforeseen legal risk.
 
Yet another argument against data lakes revolves around the idea that, like it or not, most data is in fact fragmented away in silos. While it is easy to think that the location of your data doesn’t really matter anymore, it actually does, and most companies don’t have the time or resources to migrate large volumes of data. 
 
A somewhat different approach is proposed by Marc Linster of EnterpriseDB. He suggests that data be left where and how it is generated and that, instead of collating data centrally, it be connected using software. EnterpriseDB, a company built around the Postgres RDBMS, has developed a number of foreign data wrappers allowing Postgres to interface with numerous other data storage architectures, including Hadoop, MongoDB and MySQL. By adopting this approach, companies can build themselves virtual data lakes. While this approach seems likely to suffer from latency- or responsiveness-related challenges, it may also reduce the amount of time and money required to get a data lake up and running.
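The federation idea can be illustrated with a small Python sketch. In a real Postgres deployment this role is played by foreign data wrappers (extensions such as postgres_fdw or mongo_fdw); the classes below are hypothetical stand-ins showing the shape of the design: data stays in each source system, and a thin layer presents a unified view at query time.

```python
# Illustrative sketch of a "virtual data lake": nothing is copied into a
# central store; each source is queried in place through a common interface.

class MySQLSource:
    """Stands in for a relational source reached through a wrapper."""
    def rows(self):
        return [{"customer": "acme", "orders": 12}]

class MongoSource:
    """Stands in for a document store reached through a wrapper."""
    def rows(self):
        return [{"customer": "globex", "orders": 5}]

class VirtualLake:
    """Presents all registered sources as one queryable collection."""
    def __init__(self, sources):
        self.sources = sources

    def query(self, field):
        # Data is fetched from each origin at query time, which is why
        # overall latency depends on the slowest source.
        return [row[field] for src in self.sources for row in src.rows()]

lake = VirtualLake([MySQLSource(), MongoSource()])
print(lake.query("customer"))  # ['acme', 'globex']
```

The trade-off in the prose above is visible in the code: there is no ingestion or migration step at all, but every query fans out to the underlying systems.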
 
Obviously no one can glean actionable insights from data they chose not to store, and understandably, the knee-jerk response is to be like Hans Sloane and simply store everything. Before beginning to dig and fill your data lake, however, heeding the warnings and doing some due diligence is in order. That said, with sound data management protocols in place and the appropriate skill base in your team, data lakes may indeed provide the analytic insights your organisation has been lacking.


Reference: “Hans Sloane: Hoarder extraordinaire”, The Economist, 10 June 2017.
