Big data applications start big and keep growing. As the data being analyzed grows in both size and complexity (think XML data, binary large objects or BLOBs, and URLs of internet page visits), the hardware and software communities have responded with huge storage offerings and massively parallel storage and retrieval mechanisms. The logical next step is for the IT enterprise to take advantage of these technological innovations for things other than classical big data processing.
Understanding Big Data Application Architectures
Big data is no longer new to enterprise information technology (IT) infrastructure. Today, larger and larger data stores are sourced from operational systems and are accumulated and queried by business intelligence (BI) applications. These data are stored on multiple hardware platforms and exist to support different user communities. We typically categorize these architectures according to their data complexity, volatility and usage as follows:
Type 0 Data – The Archive. This is usually the first stage of integrating big data into the enterprise data warehouse (EDW). Daily EDW processes extract information from operational systems and store it in the warehouse. This is usually keyed data referring to accounts, customers, products, and associated dimensions such as sales territories. As the amount of data grows over time, users have a greater ability to detect trends and produce forecasts. This growth also means that the data warehouse (DW) support staff needs to add resources to the system. These usually take the form of hybrid hardware and software ‘appliances’ that combine large data storage capacity with specialty processors.
Type 1 Data – Structured Big Data at Rest. This defines the classic big data application. As with the EDW, operational data is accumulated and stored. In addition to keyed data, a type 1 architecture will include transactional data consisting of product shipping, product purchase and customer interface information. This greatly expands the BI analysts’ ability to query and analyze the complex interplay between customers and their purchases of products and services.
Type 2 Data – Unstructured or Unmodeled. As operational systems matured, companies moved their applications closer to the customer. In-store kiosks, on-line ordering, and internet-based product catalogs became common. As a result, data about customer transactions was no longer generated solely in structured records and files from operational systems. Instead, the norm now is to expect web logs, social network data, click streams, machine sensors and other sources to generate unstructured data. These data can then be analyzed in new ways such as reputation management or competitive intelligence gathering. The challenge of unstructured data is that each selection or manifestation of a data stream may require not only a one-time decision on how to process it, but also a unique hardware or software solution in order to analyze it.
Type 3 Data – Data in Motion. Data in motion is the latest variation of big data processing. Here, data from customer transactions, product shipping and receiving, medical patient monitoring or hardware sensors is analyzed as it is generated, and the results are used as part of the transaction. For example, if an on-line customer orders a product, the transaction may be analyzed before it completes by comparing it with historical customer purchases in order to detect possible fraud. Analysis of multiple customer purchases may allow algorithms to derive product popularity in certain geographical areas, leading to possible price or shipping cost changes.
Data in motion requires analytics at the point of data creation. It may be too late to accumulate this data in a big data appliance or data warehouse for later processing (although this may still be done for historical analyses). Instead, the intent is to use the data immediately, do the necessary analytics, and implement decisions that can increase both customer satisfaction and profits.
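The fraud check described above can be illustrated with a minimal sketch. It assumes that "comparing to historical customer purchases" means testing whether the new order amount is a statistical outlier against that customer's past order amounts. The function name, the threshold and the sample figures are hypothetical and are not taken from any particular product.

```python
from statistics import mean, pstdev

def looks_suspicious(order_amount, past_amounts, z_threshold=3.0):
    """Flag an in-flight order whose amount is far from the customer's history.

    Hypothetical rule: an order is suspicious when it lies more than
    z_threshold standard deviations from the mean of past order amounts.
    """
    if len(past_amounts) < 2:
        return False  # too little history to judge
    average = mean(past_amounts)
    spread = pstdev(past_amounts)
    if spread == 0:
        # Every past order had the same amount; any other amount stands out.
        return order_amount != average
    return abs(order_amount - average) / spread > z_threshold

# Illustrative history for one customer (made-up figures).
history = [42.0, 38.5, 51.0, 45.25, 40.0]
print(looks_suspicious(47.0, history))   # an ordinary order
print(looks_suspicious(900.0, history))  # far outside the usual range
```

A real data-in-motion system would run a check like this inside the transaction path, so the history lookup itself has to be fast enough to finish before the order completes.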
Big Data 2.0
All of these big data categories have one thing in common: they require huge amounts of data storage. This is typically accomplished by implementing a large disk storage array, either as part of the EDW or in a big data appliance. It is this second alternative to which we now turn our attention.
A big data appliance is a specially configured hybrid of hardware and software. It includes hundreds of disks that together provide a considerable amount of storage (hundreds of terabytes or more). In addition, each disk has a dedicated disk controller. Special software is able to move data in parallel across the disks. This massively parallel disk array gives the appliance extremely fast data storage and retrieval speeds. For example, rather than storing a terabyte of data on a single disk, the software can split the data into 1,000 pieces of one gigabyte each and write all the pieces simultaneously to 1,000 disks.
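The split-and-write-in-parallel idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: each "disk" is a plain Python list, threads stand in for the dedicated disk controllers, and the payload is a few kilobytes rather than a terabyte. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_stripes(data, disk_count):
    """Cut data into disk_count consecutive pieces of (nearly) equal size."""
    stripe_size = -(-len(data) // disk_count)  # ceiling division
    return [data[i * stripe_size:(i + 1) * stripe_size]
            for i in range(disk_count)]

def write_striped(data, disks):
    """Write one stripe to each simulated disk, all stripes concurrently."""
    stripes = split_into_stripes(data, len(disks))

    def write_one(pair):
        disk, stripe = pair
        disk.append(stripe)  # stand-in for a real disk write

    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        list(pool.map(write_one, zip(disks, stripes)))

def read_striped(disks):
    """Reassemble the original data by concatenating stripes in disk order."""
    return b"".join(disk[0] for disk in disks)

payload = bytes(range(256)) * 10          # 2,560 bytes of sample data
disks = [[] for _ in range(8)]            # eight simulated disks
write_striped(payload, disks)
print(len(disks[0][0]))                   # size of one stripe
print(read_striped(disks) == payload)     # round trip check
```

The same arithmetic scales to the example in the text: one terabyte divided across 1,000 disks leaves each disk with roughly one gigabyte to write, so the elapsed time approaches that of a single one-gigabyte write.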
As your big data application begins its life, there is a large amount of data storage that is unused. Many shops now realize that this storage can be put to good use immediately for type 0 and type 1 data. Here are some typical choices.
Retain stale data rather than purge/archive. Big data applications contain time-sensitive data that becomes stale or less useful as it ages. It is common to either purge or archive the oldest data on a regular basis. Typical choices are to purge after two years, or after two or three annual cycles. Unused storage in the appliance can now be used for this data. This allows for querying of a greater volume of historical data, as well as forecasting trends across multiple years. When this data is no longer needed, it can eventually be archived or purged.
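As a rough sketch of how a longer retention window changes what gets purged, the code below partitions rows by age against a cutoff date. The row layout, the `sale_date` field and the two-year and five-year windows are assumptions chosen for illustration only.

```python
from datetime import date

def retention_cutoff(today, years_to_keep):
    """Return the oldest date that still falls inside the retention window."""
    try:
        return today.replace(year=today.year - years_to_keep)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years_to_keep, day=28)

def partition_by_age(rows, today, years_to_keep):
    """Split rows into (kept, stale) using the hypothetical 'sale_date' field."""
    cutoff = retention_cutoff(today, years_to_keep)
    kept = [row for row in rows if row["sale_date"] >= cutoff]
    stale = [row for row in rows if row["sale_date"] < cutoff]
    return kept, stale

rows = [
    {"sale_id": 1, "sale_date": date(2024, 1, 15)},
    {"sale_id": 2, "sale_date": date(2022, 3, 1)},
    {"sale_id": 3, "sale_date": date(2019, 11, 20)},
]
today = date(2024, 6, 30)

kept_2y, stale_2y = partition_by_age(rows, today, years_to_keep=2)
kept_5y, stale_5y = partition_by_age(rows, today, years_to_keep=5)
print(len(kept_2y), len(stale_2y))  # two-year window
print(len(kept_5y), len(stale_5y))  # five-year window
```

Widening the window is a one-parameter change, which is the point: the spare capacity in the appliance lets the shop postpone the purge rather than redesign it.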
Incorporate all data warehouse dimensions. Big data applications require dimensional data for analysis. This dimensional data is stored in the data warehouse, and it is usually copied to the appliance to expand the scope of possible BI queries. For example, geographic information such as sales territories, warehouse locations and store locations would be used to drill down on product cost or sales information by territory. So, include the entirety of your dimensional data in the appliance, rather than limiting it to what is needed for your first few applications. As big data applications are added, they may very well need these dimensions.
Incorporate major portions of the data warehouse. Several parts of a data warehouse lend themselves to temporary (or permanent) storage in the appliance. Consider the extract, transform and load (ETL) process that acquires data from operational systems, cleans it, standardizes it, and stores it in a staging area prior to loading to the EDW. Many of these processes are SQL-based, and much of the warehouse staging area is in the form of tables. By storing these tables in the appliance, you take advantage of both the unused disk space and the appliance's very fast data storage and retrieval. This allows you to shorten the time it takes to move data from its source to your warehouse.
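To make the drill-down example concrete, here is a small sketch that joins sales facts to a store dimension and totals them by territory. The table contents, column names and store identifiers are invented for illustration; a real appliance would do this with a SQL join across far larger tables.

```python
from collections import defaultdict

# Hypothetical store dimension: store_id -> descriptive attributes.
store_dim = {
    101: {"territory": "North"},
    102: {"territory": "North"},
    201: {"territory": "South"},
}

# Hypothetical sales facts keyed by store_id.
sales = [
    {"store_id": 101, "amount": 120.0},
    {"store_id": 102, "amount": 80.0},
    {"store_id": 201, "amount": 200.0},
    {"store_id": 101, "amount": 50.0},
]

def sales_by_territory(sales_rows, dimension):
    """Look up each sale's territory in the dimension and sum the amounts."""
    totals = defaultdict(float)
    for row in sales_rows:
        territory = dimension[row["store_id"]]["territory"]
        totals[territory] += row["amount"]
    return dict(totals)

print(sales_by_territory(sales, store_dim))
```

If the store dimension were missing from the appliance, the lookup inside the loop would have nothing to join against, which is why loading every dimension up front pays off as new applications arrive.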
It may also be advantageous to store portions of your EDW in the appliance. Along with the aforementioned faster data storage and access, it is common for new big data applications to require access to portions of the warehouse. Storing the data in the appliance makes installation of new applications go faster.
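The clean, standardize and stage steps of ETL mentioned above can be sketched with an in-memory SQLite database standing in for the appliance. The `stg_customer` table, its columns and the cleaning rules are hypothetical; they only show the shape of a SQL-based staging load.

```python
import sqlite3

def load_staging(conn, raw_rows):
    """Clean and standardize raw customer rows, then load a staging table.

    Returns the number of rows that survived cleaning.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    cleaned = []
    for customer_id, name, country in raw_rows:
        if customer_id is None:
            continue  # clean: drop rows that have no key
        # standardize: trim whitespace, title-case names, upper-case countries
        cleaned.append((customer_id, name.strip().title(), country.strip().upper()))
    conn.executemany(
        "INSERT OR REPLACE INTO stg_customer VALUES (?, ?, ?)", cleaned
    )
    conn.commit()
    return len(cleaned)

# Made-up extract from an operational system.
raw_extract = [
    (1, "  alice smith ", "us"),
    (None, "ghost", "us"),
    (2, "BOB JONES", " de "),
]

conn = sqlite3.connect(":memory:")
loaded = load_staging(conn, raw_extract)
staged = conn.execute(
    "SELECT customer_id, name, country FROM stg_customer ORDER BY customer_id"
).fetchall()
print(loaded, staged)
```

Because the staging area is just a set of tables, moving it to the appliance changes where the tables live, not how the ETL SQL is written.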
Utilize the disaster recovery site. Your first big data applications were probably used for asking what-if questions. However, as more data was added and users became more adept at writing queries, big data began to be considered mission critical. In order to support this, you must eventually ensure that your disaster recovery plans include having an appliance available at the recovery site. As these appliances are large and expensive, do not let them sit idle. Implement some of your archive data or portions of your EDW in this appliance. It can be used for any number of non-critical processes while your primary site handles production applications. Consider its use as:
- A development environment for training users in BI query analysis;
- A non-production environment for testing data warehouse ETL processes;
- A storage area for archived data.
Summary
Big data applications have matured to the point where the typical IT enterprise implements a hybrid hardware/software appliance for data storage and retrieval. Interestingly, these appliances usually contain so much disk storage that it begs to be put to use. Good choices for utilizing this unused space include storing archived data and portions of your enterprise data warehouse. In addition, as your applications achieve higher importance and become mission critical, an appliance becomes a necessary part of the disaster recovery site. This appliance can also be used (in non-disaster situations) for high-speed data storage and retrieval.