Know Your Data (Warehouse)
In general, storage and analysis of Big Data will occur either within or in concert with the Enterprise Data Warehouse (EDW). Integrating all parts of your Big Data solution will require interfaces to every part of the EDW process. Here is a summary of the main subsystems of an EDW. After this, we discuss how they are affected by a Big Data, and how to plan ahead for its implementation.
Data Acquisition and Collection
Data stored in an EDW is acquired from the original source systems. These are typically systems that process business data in real-time. Examples include customer-interfacing systems such as ATMs and bank tellers, online order entry, customer records management, accounting systems, and so forth.
Data Transformation and Movement
While most business data is usable in its original form, some requires transformation, or ‘cleaning’. Typical examples include invalid dates (such as 02-31-2013, 99-99-9999, or spaces), missing data (an order for a non-existent customer), unknown data (Date Account Closed for a currently open account), and the like. The EDW transform process handles the cleaning of the source data in order to store it in a usable form in the database.
Another part of this sub-process is the movement of data from the source system to the EDW. Some possibilities include file transfer, messaging, and even high-speed hardware links.
Data Staging and Keying
It is common in the Enterprise Data Warehouse to store initially acquired data in a staging area, usually temporary files or DB2 tables. From there, the data is accumulated and cross-referenced prior to loading to the EDW.
Keying is a process where entities in the EDW are assigned a surrogate key. Why not use the natural key of the data, such as account number or customer number? The EDW is an accumulation of entities from multiple business systems. In order to compare across systems the keys must have common attributes. Consider “account number”. The enterprise may support many different formats of this data item, each with different data types and lengths. Assigning a surrogate key allows analysis to compare these entities using a common key.
Data Access and Analysis
With the data stored in the EDW, analysis can now begin. Most IT enterprises provide software packages to users that perform detailed analysis. This is sometimes called Business Intelligence Analytics. Some packages access the EDW data directly, while some extract subsets of data for local analysis.
Data Archiving
Eventually the data in the EDW ages or becomes irrelevant. Old data is removed from the warehouse and archived. It is important that EDW staff coordinate with business units doing analysis to determine if the archived data can simply be stored (say, on tape), or if it must be retained for re-loading later due to legal or business requirements.
How Big Data Affects the Warehouse
Big Data processed by the business will involve one or more of the following:
- Large volumes of data captured from source systems
- Data arriving at the source at high speeds
- Semi-structured or unstructured data
This poses an interesting problem: How do you know what Big Data to store, if you haven’t analyzed it yet? Alternatively, how can you analyze the Big Data without gathering and storing it?
This issue is the central one for Big Data implementations. Even a small test or pilot project that involves Big Data analysis requires that the data be acquired and stored. In order to determine the risks and rewards of investing in an analysis solution you must first process the Big Data through the EDW steps of acquisition, cleaning, staging and keying; otherwise, the ability to compare and join to the EDW is extremely difficult.
The next step is to determine how each process within the EDW is affected.
Data Acquisition and Collection
Big Data usually implies just that: a lot of data. You must allocate hardware, software, and storage media for the data. Include a contingency for storing data until it can be moved to the EDW in case there are delays due to hardware slowdowns. Also determine how these new assets and processes will affect your disaster recovery plans.
An initial Big Data implementation will most likely mean testing your new analytical software on the Big Data in-place at the source, bypassing EDW processes. Any initial success should be tempered by the fact that the new analytics belongs integrated in the Data Access and Analysis processes of the EDW.
Data Transformation and Movement
Some Big Data implementations involve unstructured data. Examples include audio, image and video files, document images, faxes, and so forth. (While technically this data is ‘structured’ so that it can be heard or seen, the word structure is used in the Big Data context to mean data that has entities, attributes and relationships. In other words, records, fields, and keys that is stored in DB2 tables.)
Another option is semi-structured data, the most common example being an Extensible Markup Language (XML) stream. Many business applications use XML as a method of encoding data in a commonly-used format. The data can then be read, stored, and processed by multiple applications.
The current version of DB2 allows storage of XML data in native format, without any pre-processing or decoding into DB2 tables. This feature makes storage, retrieval and analysis of XML data much easier.
Data movement raises another issue. Moving a lot of data quickly may require additional resources, perhaps even special software.
Data Staging and Keying
Staging of Big Data requires additional hardware, software, and storage media. Staging is necessary as an interim area in order to perform any transforms or cleaning on the data prior to storage. In addition, the EDW, now supplemented by Big Data, accumulates data over time. Such data is necessarily keyed by date or time. It is also common to take the business data in the EDW and store it in DB2 tables partitioned by date. If the data in these tables is keyed by date, these keys must be added to the Big Data as well.
Another issue is sheer data volume. To help combat this, most Big Data analytic solutions include some form of data compression or proprietary data storage mechanism.
Data Access and Analysis
Finally we have our data integrated into the warehouse. At this point, users can execute the analytics software against the combined Big Data and current data warehouse. This is where you finally get a sense of whether or not the implementation was worth the costs.
Data Archiving
Last we consider data archival. With massive amounts of data being analyzed, the sheer amount of data may take up valuable storage media and make some processes run slowly. IT and business partners must determine how and when old or stale Big Data is to be archived, and whether it must remain available for later use. Here is yet another cost factor to consider.
Summary
The idea of analyzing Big Data in order to add value to your business sounds wonderful. However, there are several steps in the implementation process and they affect all sectors of your Enterprise Data Warehouse process. Take a look at your entire suite of EDW procedures, hardware, and software to determine how your implementation will affect every area. Only in this way can you be prepared for your transition to Big Data.