With all the talk of “Big” data from vendors and their sales forces, from consultants excited by new opportunities, and from business people grappling with whether or not their revenue will go up, it is vital to understand the “Big” picture, and the best way to do this is to see where everything fits together architecturally.
Understanding the integrated architecture of Big Data is vital if an organisation is to work out how much of its current investment in its information environment, including hardware, software tools and people’s skill sets, can stay as is, needs to be replaced, or needs to be upgraded.
The following areas make up the Integrated Architecture:
1. Data Sources
Essentially it does not matter what source the data comes from; all that needs to be in place is an interface into the MapReduce framework so that the data can be processed and stored.
2. Hadoop Ecosystem
Data Import\Export — Tools such as Sqoop provide a framework that allows data to be transferred between RDBMS and HDFS solutions by generating MapReduce data transfer programs. Data transfer is performed in parallel, without any fault tolerance. Most RDBMS vendors now provide native connectors, such as the Microsoft SQL Server 2012 Hadoop Connector.
High Performance Parallel Data Processing — Writing MapReduce code is more complex than writing traditional T-SQL, but either way code must be written in order to process the data. The final outcome is a blob of mapped and reduced data.
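To make the map-and-reduce pattern concrete, here is a minimal single-process sketch of the idea: a map step emits key-value pairs, a shuffle groups them by key, and a reduce step aggregates each group. Real Hadoop jobs distribute these steps across a cluster; the function names here are purely illustrative.

```python
# Toy MapReduce: count word occurrences across a set of lines.
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce step: sum the emitted values for one key."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle: group all mapped values by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            groups[word].append(count)
    # Reduce each group independently (parallelisable on a real cluster).
    return dict(reduce_counts(w, c) for w, c in groups.items())

print(run_job(["big data is big", "data is data"]))
# {'big': 2, 'data': 3, 'is': 2}
```

The value of the framework is that, because each reduce group is independent, the same logic can be spread over many machines without changing the map or reduce functions themselves.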
Querying Engines — Traditionally Java was the language of choice for querying stored MapReduced data. Tools like Pig offer a simpler syntax: Pig converts its scripts into MapReduce jobs and sends them off to Hadoop to retrieve the data. The jobs may run at roughly half the speed of hand-written MapReduce, but they are far faster to write.
File Storage — HDFS is a distributed file system designed to scale seamlessly by adding more hardware. Data is stored in delimited flat files, and loading data into HDFS is similar to copying a file on an operating system.
NoSQL Database — HBase. Data is grouped in a table whose rows can each have a totally different column structure. HBase serves as an indexed key-value store on top of the HDFS store.
Data Warehouse — Hive. Hive is closer to a traditional RDBMS in that it provides JOIN operations over HBase tables. It maintains a metadata layer for data aggregation and ad hoc querying, with code that resembles T-SQL but is more limited.
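The “rows with different column structures” point is what most distinguishes this model from a fixed-schema RDBMS table. The sketch below is an in-memory toy of that data model, not the HBase API: a table is simply a map of row key to a per-row dictionary of columns, so two rows need not share any columns at all. All names and values are invented for illustration.

```python
# Toy indexed key-value store in the spirit of HBase's data model:
# table = { row_key: { column: value, ... }, ... }
table = {}

def put(row_key, column, value):
    """Write one cell; the row's column set grows as cells are added."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    """Read a whole row, or a single cell if a column is given."""
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

# Two rows with completely different column structures:
put("user:1", "profile:name", "Alice")
put("user:1", "profile:email", "alice@example.com")
put("user:2", "stats:logins", 42)

print(get("user:1", "profile:name"))  # Alice
print(get("user:2"))                  # {'stats:logins': 42}
```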
3. Data Warehouse \ Business Intelligence
When examining the Integrated Architecture it becomes clear that the notion that a data warehouse is no longer required is not in itself correct. Sure, introducing Big Data does not mean that a data warehouse is required, but careful planning and integration will ensure that the outputs, the business insights, remain available.
Social media data consumer Klout has an integrated architecture that allows the end business user to query and analyse data via a Microsoft SQL Server Analysis Services cube. The performance and value of this approach have even led Klout’s architects to describe the functionality as “query at the speed of thought”.
The aim is to leverage what each environment is best at providing. For example, why not use the Hadoop ecosystem to crunch and distil data down to the metrics the business actually needs, then model those outputs dimensionally, keep them historically available in a traditional data mart, and give the analytical tools the functionality to empower the business?
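That division of labour can be sketched in a few lines: a “crunch” step distils raw events down to a small metric set (the kind of job you would push into the Hadoop ecosystem), and the result is then stored against date and product dimensions the way a traditional mart would hold it. All field names and figures here are invented for illustration.

```python
# Toy "crunch then model dimensionally" pipeline.
from collections import defaultdict

raw_events = [
    {"date": "2024-01-01", "product": "widget", "amount": 10.0},
    {"date": "2024-01-01", "product": "widget", "amount": 5.0},
    {"date": "2024-01-02", "product": "gadget", "amount": 7.5},
]

def crunch(events):
    """Distil raw events into one revenue metric per (date, product)."""
    totals = defaultdict(float)
    for e in events:
        totals[(e["date"], e["product"])] += e["amount"]
    return totals

# "Fact table" rows keyed on the date and product dimensions, ready for
# historical storage in a mart and consumption by analytical tools.
fact_revenue = [
    {"date_key": d, "product_key": p, "revenue": r}
    for (d, p), r in sorted(crunch(raw_events).items())
]
print(fact_revenue)
```

The point is the hand-off: the heavy, unstructured crunching happens upstream, and only the small, modelled output lands in the mart where conventional BI tooling already works well.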
4. The People
At the end of the day the skill-set pool in an organisation grows, but this does not mean that everyone will, or can, specialise in everything; it is usually best to stick to specialists, whether in-house or consultants. These include:
And of course don’t forget the person\team who pays for all of this!
In an effort to make the relevant concepts and architectures practical, hopefully considering the two areas of Big Data concepts and the Integrated Architecture can help anyone understand where they fit into the big picture. With this understanding we can hopefully move beyond seeing just the Big Yellow elephant in front of us and realise that there are many more interesting animals in the zoo.