SHARE

Getting Ready for Big Data: Prepping the Data Warehouse

Written By

Nov 21, 2013

7 minute read

Using advanced analytics to analyze business data is common, especially in large companies with many customer-facing systems. As more and more data is made available the enterprise stages large data stores into the enterprise data warehouse. These Big Data implementations bring their own problems and issues, and will require database administrators and support staff to redesign the data warehouse architecture.

Big Data

In today’s commercial information technology systems, larger and larger volumes of data are being gathered and stored. To be able to capture, store, and analyze this data, most enterprises turn to specialized hardware and software solutions. One common solution category is a special-purpose hardware data store (sometimes called an appliance) coupled with business information analytics software that is tailored to access that appliance. One example is the IBM DB2 Analytics Accelerator (IDAA).

These solutions are expensive. Large data stores require extensive disk and memory arrays. High-performance access requires lots of CPU power coupled with complex data access allowing multiple processes to access various sections of the data in parallel.

Prior to implementing such a solution, the enterprise needs to address and resolve the following issues.

Infrastructure Requirements

The IDAA from IBM is a hybrid hardware and software solution. The hardware consists of a large disk storage array coupled with software capable of massive parallel processing. The support staff designates which DB2 tables are to be copied and stored in the appliance, and how they are to be refreshed. The software then interfaces with the DB2 database engine allowing queries to access the table copies within the appliance, with (hopefully) much faster access.

Apart from standard issues of power and cooling, prior to installing such an appliance the IT staff must answer multiple architectural questions.

Will the IDAA store only production data, or will test data be stored there as well? In other words, how will database administrators and business analysts develop and test their data analysis queries.
How many appliances are required? For example, if the data analysis being executed on the IDAA is deemed mission-critical to the company, will an appliance be required in the disaster recover environment?
While IDAA can store a lot of data, only queries accessing data stored within the appliance can be accelerated. What tables should be chosen for storage?

Specific Use Cases

Extremely fast data analysis seems like a good thing. However, many enterprises have not yet developed specific queries or areas for analysis. This results in much time being spent on data loading and query testing, without tangible results.

Will Speed Translate to Benefits at Reasonable Cost

Most business data analytics consists of a series of steps:

1. The business analyst reviews reports, queries and other data and forms logical questions based on their analysis;

2. They then develop one or more queries that will analyze the large data store;

3. The queries are executed;

4. The analyst reviews and interprets the results.

An appliance can significantly reduce the time it takes to execute step 3 above; however, the other steps still exist. For example, assume that each of the steps above takes one hour, for a total elapsed time of four hours. Installing an appliance may reduce query execution time to a few minutes. While this is a very significant time reduction, the total time has only shrunk to a little over three hours.

In summary, reducing query execution time is certainly a benefit, but perhaps not as significant as previously thought.

Business Data Consumer Communities

Most business data consumers fall in to one of three categories:

1. Technical users running direct queries. These users create their own queries against data tables using structured query language (SQL). They then use an online SQL execution tool to run the query and produce the results in raw data form, which they can then observe directly or download to a spreadsheet program for further analysis. These users know the data tables, have SQL expertise, and use simple tools to refine the results.

2. Sophisticated report analysts. These consumers typically use a sophisticated reporting tool that displays a pictorial data model of the data. They then manipulate the model by dragging and dropping tables and columns to a report window. The tool then creates the appropriate SQL statements based on the model and other parameters, executes the query, and displays the results. These users know the data, usually do not have SQL expertise, and require some advanced query and statistical reporting techniques.

3. Data mart consumers. These users have their own highly-specialized business data analytical software. They directly extract business data from the source and store it on a local server. They then use special-purpose software to analyze the data.

Any big data solution must take these communities into account.

Implementation Issues and Problems

During implementation of an appliance the IT enterprise typically encounters some common issues.

The Catch-22 Problem

What do we store if we haven’t analyzed it yet? What do we analyze if we don’t have data yet? The business does not completely understand what data will be available, and information technology support staff do not know what business data will be most useful to implement in a big data solution.

These two questions are usually a symptom of a lack of specific use cases or lack of communication between IT and the lines of business.

The Bulk Data Load Problem

Most appliances that claim to support a big data solution can hold extremely large amounts of data. One of the most common questions is how long will it take to load that data into the appliance?

Once the data is loaded, other bulk data questions arise: How will we keep this data up-to-date? How will we purge large amounts of stale or old data?

These are not new questions. IT is familiar with similar situations. One of these is disaster recovery (DR) preparation. If a disaster occurs (fire, flood, etc.) at your primary site, the typical disaster recovery site must be ready to function as the primary site within a few hours. With today’s large amounts of business data the most common technical solution is to maintain a full copy of the current business data at the DR site with a network connection and software that “mirrors” the primary site data to the DR site.

With a big data solution, IT must find a way to keep data in the appliance fresh by some combination of data mirroring, regular data loads and regular data archive jobs.

The Disaster Recovery Problem

Most data warehouses are used for analysis and reporting, not for processing business data such as customer transactions. A big data appliance is usually attached to the data warehouse, and so is not usually considered something required at the DR site. However, it may become so! Consider the following scenario:

1. Your company implements a big data appliance;

2. Business analysts and users begin querying the data;

3. Many queries produce results that result in lower costs or better pricing;

4. Queries run quickly, so many analysts begin executing many queries;

5. As more queries produce actionable results, management sees that they have value;

6. One-time queries begin running weekly; some queries become daily reports;

7. The number of valuable, daily reports results in management designating the big data solution and analysis “mission-critical”.

Surprise! IT is suddenly informed that this large data store must be available if a disaster occurs.

To prepare for this happening in your enterprise, review the storage needs, network capacity, hardware capabilities and capacity and software license requirements at the beginning of your implementation. Have this data published and available to management before it becomes critical. This allows your enterprise to budget and plan for its needs in advance.

The Initial Implementation Problem

You may intend to install an appliance with the intention of analyzing some big data volumes. Typically these data are not gathered or stored currently in the data warehouse: the data is too big! Instead, these data stores exist as a part of current operational data. Some examples include voice response recordings and keystroke, on-line customer web site interactions and equipment sensor data.

This leads to an interesting thought: Your first analysis will be on raw production data, rather than that in the data warehouse. This is a tempting idea. You can forgo the costs and time spent acquiring, transforming, and storing a large amount of data in your warehouse. Data can be accessed immediately, without the delay associated with normal data warehouse data staging and loading.

Regrettably, direct production data access has its drawbacks. Some production data may be incomplete or missing, or not in a form easily accessed. Some data may be invalid, like a date of “99-99-9999”, or a dollar amount field that contains letters. Other data may need interpretation, such as a code field containing O, A, or C.

Another issue is that most analysis of big data depends on aggregation of that data across categories called dimensions. For example, customer order data may be summarized by geographic region or by product type. These dimensions exist in tables in the data warehouse. In order for the queries to execute successfully they must act on data that is totally within the appliance. This means that data warehouse data must exist in the appliance in order for the queries to work.

Summary

Most advanced analytics solutions access large data stores called big data. High-speed appliances will assist business users by significantly shortening query times; however, the best architectural solution will require the appliance to be a part of your data warehouse.

Integrating big data appliance solutions into a data warehouse requires preparation and forethought. DBAs and business data consumers must work together both to address the implementation issues above and to meet the needs of multiple business data consumers.

See all articles by Lockwood Lyon

Lockwood Lyon

Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).