- 23rd May, 2017
- big-data
Does size really matter?
It’s tricky to measure how big, big data really is. At the risk of revealing my age, this exchange from the Australian-American comedy film Crocodile Dundee is not too dissimilar from how conversations about big data can play out.
In a business environment, the exchange may very well go like this:
Manager: Can you help me figure out how to deal with this big data set?
Data Scientist: That’s not a big data set. This is a big data set.
Obviously, that’s more of a comment than an answer and far from helpful, but it happens again and again. The moral of the story is that size is relative: what constitutes an unmanageably large data set in one organisation may be perfectly manageable in another.
In fact, there are plenty of definitions of big data out there; Forbes listed 12 of them back in 2014. They range from descriptions of data that defy traditional storage and analytic approaches to conceptual ideas around the types of insights we might glean from the data. The Wikipedia definition sums it up rather well: big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.
Data, data everywhere but no solutions in sight: Don’t be fooled though. Just because your organisation doesn’t manage distributed storage and computational systems like Hadoop doesn’t mean that you’re not sitting on top of big data sets. Unfortunately, it’s very common for data to be stored in very large volumes, but in ways that make them virtually inaccessible to data science or Business Intelligence analysts.
These isolated repositories, or data silos, arise from a number of drivers. Within organisations, software applications are developed at different times, by different people and for specific departments with their own unique goals in mind. The resulting data repositories reflect these factors in both structure and content. Consider too that departments are often in competition with each other, and are hardly encouraged to share their data or insights. The net result of all of this is large volumes of data, stored and sometimes forgotten, in different places and on different platforms that seem more or less impossible to integrate.
Don’t try to tackle it all at once: If this sounds vaguely applicable to your organisation, don’t fret. There are approaches to integrating your disparate data into more consolidated and accessible structures. The ultimate goal of the endeavour is a truly data-driven business. Resist the urge though, should it strike, to tackle the whole problem at once.
The first step in any data science project lies in identifying objectives, and a key obstacle in the path of data-driven approaches remains organisational buy-in. For this reason, select objectives which are likely to have demonstrably positive impacts on the business. Think of the proverbial ‘low hanging fruit’. This can not only help to convince your execs that it might be worth investing in the approach from the outset, but it can help to guarantee continuing support.
With your objective in mind and an appropriately diverse data science team assembled, begin to collate your company’s data around these ideas. Critically though, remember to plan your data structures around the idea of integration or you may end up with even more silos! As time goes by, continue to investigate more ideas and collate more of the data. At the same time, encourage departments to have more inclusive approaches toward ideas and insights by illustrating how data-driven approaches involving different data sets can help them to increase their productivity.
What Intel did: One case study involves American technology company Intel. The company had at its disposal silos of IT help desk incident reports and PC client event logs from more than 95 000 clients. The event logs alone accounted for up to 19 million events per day. Not only had these data been stored in a largely unusable form, the volumes had become overwhelming and as a result were mostly ignored. Report analysis was done on an ad hoc basis and required a great deal of effort on the part of the support staff.
Building on an earlier success, the company set out to reduce IT related incidents by associating incident reports with event logs. By collating these data - in this case on a distributed Hadoop cluster - and by deploying natural language processing and predictive modelling, the team were able to predict (and avert) 20% of the incidents over the following month, and potentially throughout the year.
Do you suspect that your company may have valuable insights hidden away in silos?
Contact Ixio Analytics today to unleash your data-driven potential.
