Big Data: It’s both about size and technique


We’re in the age of the petabyte, exabyte and zettabyte. We’re being overwhelmed with information, and we don’t know what to do about it.

There has been a data explosion in the internet age: the digital era has lowered the barriers to entry, allowing the masses to publish, to tweet and to build open source applications. Automation has added to the pile, with computers churning out ever more information on their own.

Former Google CEO Eric Schmidt says that every two days we now create as much information as we did from the dawn of civilization up until 2003. That's a crapload of data, about five exabytes to be precise.

It would be a shame to chuck it all in the bin, so how do we make use of this endless stream of ones and zeros? Enter Big Data, a relatively new field dedicated to mining this information with increasingly complex algorithms to make it useful.

Steve Watt is the innovation guy at HP in Texas, and he says we have more or less learnt to filter masses of data using certain “filter patterns”. We can filter with search, and we can filter socially too. Infographics also help us decode information, presenting endless, boring text data visually in a way that makes it easy to grasp.

But what if we need to grasp all this data at a large scale? We need third-party services to help us gather, sort, process and deliver that data in an intelligible and relevant fashion. There are dedicated “data marketplaces” out there that can help with this, such as Infochimps, Factual and Nutch.
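To make that concrete, here is a minimal sketch of what pulling a dataset from one of these marketplaces might look like. The endpoint, API key and response shape below are hypothetical; the real details vary by provider.

import requests

# Hypothetical marketplace endpoint and API key -- the real URL,
# parameters and response format depend on the provider you use.
API_KEY = "your-api-key"
DATASET_URL = "https://api.example-data-marketplace.com/datasets/us-zipcodes"

def fetch_dataset(url, api_key):
    """Download a dataset from a data marketplace as JSON."""
    response = requests.get(url, params={"apikey": api_key}, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    records = fetch_dataset(DATASET_URL, API_KEY)
    print("Fetched %d records" % len(records))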

When the data gets really big it needs to be structured and served from a low-latency environment. This is where it becomes an exercise in deep geek, and you need serious services to achieve it.

Watt sees fantastic applications for Big Data: it lets us capture historical information and draw lessons from it, which in turn help us build models that support decision-making. In short, it creates business intelligence.

To prove his point, he points to a little pet project of his own. Watt analysed CrunchBase, a directory of US tech companies, startups and VCs, to work out whether the country is in a technology bubble or not.

He wrote his own code to analyse which companies got funded, their addresses, how much they received, when they were funded and who funded them. He grabbed all this public data, popped off to Infochimps to get the zipcodes for those companies and cross-referenced the two.
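Watt's exact code isn't public here, but the cross-referencing step is easy to picture. The sketch below assumes the CrunchBase funding rounds and an Infochimps zipcode lookup have already been saved as local JSON files; the file names and field names are illustrative, not the real schemas.

import json
from collections import defaultdict

# Illustrative file and field names -- the real CrunchBase and
# Infochimps exports differ, so adjust the keys to match your data.
with open("crunchbase_funding_rounds.json") as f:
    rounds = json.load(f)      # e.g. [{"company": ..., "zipcode": ..., "amount_usd": ..., "sector": ...}, ...]

with open("infochimps_zipcodes.json") as f:
    zip_lookup = json.load(f)  # e.g. {"94103": {"city": "San Francisco", "state": "CA"}, ...}

invested_by_city = defaultdict(float)
invested_by_sector = defaultdict(float)

# Cross-reference each funding round with its zipcode, then total up
# the money invested per city and per sector.
for r in rounds:
    place = zip_lookup.get(r.get("zipcode", ""), {})
    city = place.get("city", "Unknown")
    invested_by_city[city] += r.get("amount_usd", 0)
    invested_by_sector[r.get("sector", "Unknown")] += r.get("amount_usd", 0)

for name, totals in [("city", invested_by_city), ("sector", invested_by_sector)]:
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print("Top 5 by %s:" % name)
    for key, total in top:
        print("  %-20s $%.0fm" % (key, total / 1e6))

From a table like that, charting the totals is a short step to the kind of visualisation Watt describes.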

The result? A visualisation that accurately answers the question: Are we in a tech bubble? — all from the public data that is out there.

Watt reckons the US is not in a tech bubble. He also found that biotech is the biggest investment sector, followed by software and cleantech, and that more money was invested in San Francisco than in any other major US city. No surprise there, I guess.

That’s an example of intelligence gathered via algorithms, presented visually and in an intelligible fashion. This one used external websites, but the exercise could happen internally too if you are a large corporate with masses of data to mine.

Matthew Buckland: Publisher