Big Data Architecture

The next few years will be critical for information technology (IT) staff as they attempt to integrate and manage multiple, diverse hardware and software platforms. This article addresses how to meet that challenge as users demand ever-greater ability to analyze growing mountains of data while IT attempts to keep costs down.

The State of Big Data

Business Intelligence (BI) has matured over the past two decades. No longer are big data, data marts and data warehouses new environments requiring non-traditional staff skills and specialized hardware and software. Today, more and more data are being made available within the enterprise.  Information technology (IT) departments have developed and installed multiple technology platforms to store the huge amounts of data that are rapidly created from both traditional operational systems and the new, complex, multi-structured data streams emanating from e-mails, audio and video clips, click streams, sensors and social media.

There are several types of analytics to execute against multiple categories of data. Each combination serves a different business purpose, and is best hosted using a particular hardware and software solution.

Type 0 Data – The Archive

Much of early data processing concentrated on archived data. These were the precursors to the enterprise data warehouse: historical data stored in very large databases. These data were usually processed either in a standalone hardware environment or during slow hours of the operational batch cycle when CPU, memory and disk storage were available. These resources could not be shared with operational systems due to a combination of factors: limitations in analytical software, relatively slow processor speeds, and the expense of maintaining large volumes of quickly-accessible data. Often these archived data resided on tape, requiring either re-loading to disk for fast access or sequential-only processing due to the physical limitations of tapes.

Despite its limitations, analyzing an archive provided the first use cases for discovering and exploring data relationships about the company’s costs, customers, and products. And, as processors became faster and disk storage costs shrank, analysts were able to derive more and better predictions.

The challenges today with type 0 data are availability and volume. The IT enterprise usually archives data for a reason: the data is old or stale, and removing it allows for faster processing and analysis of the remaining data. Many departments consider analysis of current data more relevant for discovering ways to lower costs, increase customer satisfaction or provide actionable intelligence; meanwhile, analysis of archived data may be seen as too little, too late.

Type 1 Data – Structured Big Data at Rest

The next evolution of data analytics might be called classic big data. These data were large in volume, generated quickly by a growing number of operational systems, and transported across networks with greater and greater speed. Initial big data applications concentrated on storing well-understood, structured data from operational systems in a combination of the enterprise data warehouse and a big data datastore. This was often accompanied by a hybrid hardware and software solution called an appliance that allowed for massively parallel processing of the data, sometimes enhanced by user-friendly analytical software. The software allowed the analyst to ask "what if?" questions couched in business language without needing to know how the data was stored, processed or retrieved. The software then converted the request into one or more queries that accessed the data, and the massively parallel hardware provided speed of access. The result was fast queries against big data, permitting analysts to generate more queries and reports in a short period of time.
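
To make this concrete, here is a minimal sketch of such a translation layer. The metric and dimension phrases, along with the sales_fact table and its columns, are hypothetical examples, not any particular vendor's product:

    # A minimal sketch of a business-question-to-SQL translation layer.
    # The sales_fact table and its columns are hypothetical examples.

    METRICS = {"total revenue": "SUM(revenue)", "units sold": "SUM(quantity)"}
    DIMENSIONS = {"by region": "region", "by product": "product_category"}

    def build_query(metric_phrase, dimension_phrase, year):
        """Translate a business-language request into a SQL query string."""
        metric = METRICS[metric_phrase]
        dimension = DIMENSIONS[dimension_phrase]
        return (
            f"SELECT {dimension}, {metric} "
            f"FROM sales_fact WHERE sale_year = {year} "
            f"GROUP BY {dimension}"
        )

    # "What was total revenue by region in 2013?"
    print(build_query("total revenue", "by region", 2013))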

The challenges of early big data solutions were many. Their performance characteristics were poorly understood, with query speeds varying greatly across query types and across industries. Analytics was generally kept simple, as users struggled to understand results being presented to them for the first time. Finally, big data appliance solutions tended to be costly, limiting their adoption to enterprises that could afford them.

As time passed, appliances grew faster, larger and cheaper, and analytical software became more sophisticated. Along the way, IT staff gained experience in maintaining and tuning these solutions, and analysts became more sophisticated in their query choices. Combined with the steadily declining cost of hardware resources, big data solutions have matured to the point that complex analyses against terabyte-sized datastores are now commonplace.

Type 2 Data – Unstructured or Unmodeled

As operational systems matured, companies moved their applications closer to the customer. In-store kiosks, on-line ordering, and internet-based product catalogs became common. As a result, data about customer transactions is no longer generated solely as structured records and files from operational systems. Instead, web logs, social network data, click streams, machine sensors and other sources now routinely generate unstructured data. These data can then be analyzed in new ways, such as reputation management or competitive intelligence gathering.

The challenge of this category of data is that it may require multiple passes through multiple software layers in order to decode or understand it. One phase may parse the data for errors or perform required data transformations, then convert the data to a semi-structured form such as eXtensible Markup Language (XML), or perhaps into raw text. Another phase may then analyze the semi-structured data for form and content relevant to analysis, and forward the data to a datastore for advanced analysis.
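
A minimal sketch of such a two-phase pipeline, assuming a hypothetical web-log format, might look like this:

    # A minimal sketch of a two-phase pipeline for unstructured data.
    # Phase 1 parses raw web-log lines; phase 2 converts the parsed
    # records to a semi-structured (XML) form for downstream analysis.
    import re
    import xml.etree.ElementTree as ET

    LOG_PATTERN = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)"')

    def parse_line(line):
        """Phase 1: parse one raw log line, returning None on errors."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    def to_xml(record):
        """Phase 2: convert a parsed record to semi-structured XML."""
        event = ET.Element("event")
        for key, value in record.items():
            ET.SubElement(event, key).text = value
        return ET.tostring(event, encoding="unicode")

    raw = '192.0.2.1 - - [10/Oct/2013:13:55:36 -0700] "GET /catalog HTTP/1.1"'
    record = parse_line(raw)
    if record is not None:
        print(to_xml(record))  # forward to the analytical datastore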

A further challenge of unstructured data is that each new source or form of data stream may require not only a one-time decision on how to process it, but also a unique hardware or software solution in order to analyze it.

Type 3 Data – Data in Motion

Data in motion is the latest variation of big data processing. Here, data from customer transactions, product shipping and receiving, medical patient monitoring or hardware sensors is analyzed as it is generated, and the results are used as part of the transaction. For example, if an on-line customer orders a product, the transaction may be analyzed before it completes by comparing it to historical customer purchases in order to detect possible fraud. Analysis of multiple customer purchases may allow algorithms to derive product popularity in certain geographical areas, leading to possible price or shipping cost changes.

Data in motion requires analytics at the point of data creation. It may be too late to accumulate these data in a big data appliance or data warehouse for later processing (although this may be done for historical analyses). Instead, the intent is to use the data immediately, perform the necessary analytics, and implement decisions that can both increase customer satisfaction and increase profits.
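
As an illustrative sketch only, a point-of-creation fraud check might score each transaction against the customer's purchase history before the order completes; the 3x-average threshold used here is an arbitrary assumption:

    # A minimal sketch of in-flight analysis: score each transaction
    # against the customer's purchase history before it completes.
    # The 3x-average threshold is an arbitrary, illustrative rule.
    from collections import defaultdict

    history = defaultdict(list)  # customer_id -> past purchase amounts

    def score_transaction(customer_id, amount):
        """Return True if the transaction looks suspicious."""
        past = history[customer_id]
        if past and amount > 3 * (sum(past) / len(past)):
            return True  # hold for review before the order completes
        history[customer_id].append(amount)
        return False

    for cust, amt in [("c1", 40.0), ("c1", 55.0), ("c1", 900.0)]:
        print(cust, amt, "suspicious" if score_transaction(cust, amt) else "ok")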

Required Hardware and Software Infrastructure

With multiple categories of big data and different analytical requirements, it is becoming clear that the IT enterprise must prepare for all of these by implementing multiple hardware and software solutions. These solutions need to be configured and optimized for the type of data they store and the complexity of the expected business analytics.  Here are the most common solutions.

The Big Data Appliance

Appliances are hybrid hardware and software solutions that use massively parallel processing across disk storage to provide quick answers to analytical queries. They are best used for structured data that is analyzed at rest. Some more advanced appliances may use proprietary storage methods or database features such as data compression to further reduce execution time. Although generally expensive, they can be scaled to fit the data required for the use cases IT determines will generate the best return on investment.
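
To see why a technique such as compression reduces execution time, consider dictionary encoding, one common approach: repeated column values are stored as small integer codes, so a scan reads far fewer bytes. A minimal sketch:

    # A minimal sketch of dictionary encoding, one compression technique
    # appliances may use to shrink column storage and speed up scans.
    def dictionary_encode(column):
        """Replace repeated string values with small integer codes."""
        codes, encoded = {}, []
        for value in column:
            encoded.append(codes.setdefault(value, len(codes)))
        return codes, encoded

    region_column = ["EAST", "WEST", "EAST", "EAST", "SOUTH", "WEST"]
    dictionary, encoded = dictionary_encode(region_column)
    print(dictionary)  # {'EAST': 0, 'WEST': 1, 'SOUTH': 2}
    print(encoded)     # [0, 1, 0, 0, 2, 1]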

Apache Hadoop

This software solution permits multiple computer nodes (sometimes thousands) to participate in data storage, retrieval, and query processing. It includes a file system that manages multiple, disparate hardware platforms, several programming frameworks for managing the many servers, and additional file processing and workflow management systems. Hadoop is best used for unmodeled data, where analysts can run exploratory queries whose results can then be used for further analysis.
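
The programming model Hadoop popularized is map/shuffle/reduce. The sketch below simulates that pattern on a single machine to show the idea; a real Hadoop job would distribute the map and reduce steps across the cluster's nodes, and the two-field click-record format here is a hypothetical example:

    # A minimal sketch of the map/shuffle/reduce pattern that Hadoop
    # distributes across many nodes, simulated here on one machine.
    from collections import defaultdict

    def mapper(line):
        """Emit (key, 1) pairs -- here, one per page in a click record."""
        yield line.split()[1], 1  # hypothetical format: "user page"

    def reduce_all(pairs):
        """Shuffle pairs by key, then sum each key's values."""
        groups = defaultdict(int)
        for key, value in pairs:
            groups[key] += value
        return dict(groups)

    clicks = ["u1 /home", "u2 /catalog", "u1 /catalog", "u3 /home"]
    pairs = [pair for line in clicks for pair in mapper(line)]
    print(reduce_all(pairs))  # {'/home': 2, '/catalog': 2}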

The NoSQL Database Alternative

NoSQL database management systems are another option. These systems allow for storage of data amenable to graph analysis. Graph analysis is a method of linking data elements and their relationships to permit analysis of things such as social networks, or comparisons of multiple customer transactions across multiple time periods and geographies. Implementing a NoSQL solution is not easy, and it requires highly-trained users who understand graph analytics.
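
As a minimal sketch of graph analysis, consider a small social network stored as an adjacency list (one way graph-oriented NoSQL stores model relationships) and a breadth-first traversal that finds everyone within a given number of hops:

    # A minimal sketch of graph analysis over a small social network,
    # stored as an adjacency list of "follows" relationships.
    from collections import deque

    follows = {"ann": ["bob", "cal"], "bob": ["cal"], "cal": ["dee"], "dee": []}

    def reachable_within(graph, start, hops):
        """Breadth-first search: who can be reached in <= hops steps?"""
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == hops:
                continue
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
        return seen - {start}

    print(reachable_within(follows, "ann", 2))  # {'bob', 'cal', 'dee'}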

Integration of Your Analytics Ecosystem

Multiple big data categories and analytical methods suggest that your IT infrastructure may require multiple big data application solutions. Each will begin as a specific, hybrid hardware and software environment customized for one or more use cases. Each mini-environment will attract its own combination of datastore administrators, data scientists, and business intelligence analysts. As each environment grows and matures, their data stores will begin to overlap.

This usually begins with the enterprise data warehouse. The warehouse already contains dimensional data such as geographies, product categories, time dimensions, salesforce hierarchies, and the like. It is these dimensions that the analytical software will use for data aggregation, rollup, and drill-down against big data datastores.
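
A minimal sketch of such a rollup, using a hypothetical store-to-region geography dimension and hypothetical fact rows:

    # A minimal sketch of a rollup along a conformed geography dimension:
    # store-level facts aggregate up to the region level. The dimension
    # rows and fact rows here are hypothetical.
    from collections import defaultdict

    store_to_region = {"s1": "EAST", "s2": "EAST", "s3": "WEST"}  # dimension
    sales = [("s1", 100.0), ("s2", 250.0), ("s3", 75.0), ("s1", 60.0)]  # facts

    def rollup(facts, hierarchy):
        """Aggregate store-level sales up to the region level."""
        totals = defaultdict(float)
        for store, amount in facts:
            totals[hierarchy[store]] += amount
        return dict(totals)

    print(rollup(sales, store_to_region))  # {'EAST': 410.0, 'WEST': 75.0}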

The natural next step is the integration of these environments. Customer data may be gathered at the source or from operational systems and routed to one or more analytical platforms. This leads to the next phase in analytics: query routing. As an analyst enters a request, the software decides which platform or big data solution is best suited to process the query. To make this work, the IT staff must first ensure that each big data solution provides sufficient added value to the enterprise. This gives IT confidence that the solution will be retained, at least in the short term, which in turn justifies designing the IT infrastructure for the massive data routing and storage required to support intelligent query routing.
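
A minimal sketch of rule-based query routing follows; the platform names and the routing rules are illustrative assumptions, not a standard:

    # A minimal sketch of rule-based query routing: inspect a query's
    # characteristics and choose the platform best suited to run it.
    # The platform names and rules are illustrative assumptions.
    def route_query(query):
        """Pick a target platform from simple query characteristics."""
        if query["data_shape"] == "graph":
            return "nosql-graph-store"
        if query["data_shape"] == "unmodeled":
            return "hadoop-cluster"
        if query["latency"] == "interactive":
            return "big-data-appliance"
        return "enterprise-data-warehouse"

    q = {"data_shape": "structured", "latency": "interactive"}
    print(route_query(q))  # big-data-appliance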

Summary

With the coming explosion of big data applications and solution implementations, IT staff should step back and prepare for the next step: solution integration. As platforms grow and proliferate, users will gravitate toward business intelligence analytics software that supports their needs. This will require that such software be able to route each user's analytical queries to the platform best suited to providing the answer.

Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
