Prepping Big Data for High Performance

Big data applications are common in large IT enterprises. The ability to analyze historical data and predict trends delivers value; in addition, business intelligence (BI) analytics can pay big dividends by avoiding outages and resource shortages, reducing service level agreement (SLA) violations, and forecasting customer demand for products and services.

As the holiday season approaches, your organization can expect more BI activity as you take advantage of a significant increase in customer interactions. The smart IT organization should be proactive and prepare now for larger data volumes and more analytics activity by tuning its big data application for high performance.

Where to Start

There are three primary areas where the database administrator (DBA) and IT support staff should concentrate their efforts: disaster recovery, data warehouse performance, and data organization and storage in the big data application.

Disaster Recovery

Most IT support staff do not consider disaster recovery to be related to performance tuning. In a big data environment, this belief is reinforced by the perception that business analytics queries against your big data application are less important than operational systems such as payroll, general ledger, order entry, shipping, and customer service applications.

However, big data applications have matured in the past several years, and the enterprise’s use of business analytics has matured as well. What were once one-time ad hoc queries are now being run as regular reports; the addition of historical data allows queries to compare and analyze larger amounts of data; and business analytics software has made the querying of your big data application easier and faster by assuming the burden of constructing queries. The result is that the big data application of today provides a significant amount of actionable data that results in better customer service, lower costs, and higher profits.

This means that even a small outage of your big data application may cause a ripple effect across your user base as reports are late, queries do not run, and decisions made based on business analytics are delayed or postponed.

The DBA should review DR plans for the big data applications under their control. Items that require attention include the following.

Review recovery procedures.  The busiest time of year is a terrible time for a disaster to strike. Outages will have a significant effect on the enterprise’s bottom line, especially during periods of high transaction rates. DBAs should be ready to assist in recovery without wasted time and effort.
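
One way to be ready is to rehearse. A recovery rehearsal for a single table space might be scripted along the following lines; this is a sketch only, using DB2 for z/OS utility statements with hypothetical object names. Timing such a rehearsal against production-sized data gives a realistic recovery estimate.

    -- Recover the table space from its most recent image copy and the log
    RECOVER TABLESPACE DWHDB.ORDERSTS

    -- Rebuild all indexes on tables in the table space from the data
    REBUILD INDEX (ALL) TABLESPACE DWHDB.ORDERSTS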

Verify recovery times. Many DR plans include a recovery time objective, indicating the maximum time allowed to complete recovery. For a database, this may mean recovering data from a backup file and applying log records from the time of the backup to the recovery point. During peak season data and transaction volumes are higher; consequently, recovery times may be extended. To mitigate this risk, the DBA should consider implementing more frequent backups for critical databases. Since recovering a table usually requires time-consuming index rebuilds, DBAs should also consider DB2's ability to back up and restore indexes.
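
As a sketch of that approach (again DB2 for z/OS syntax, with hypothetical names), an index defined with COPY YES can be image-copied alongside its table space and later recovered directly, avoiding a lengthy rebuild during a peak-season recovery:

    -- Allow the index to be backed up and recovered from image copies
    -- (ALTER INDEX is SQL; the COPY statements run in a utility job)
    ALTER INDEX DWH.ORDERS_IX1 COPY YES;

    -- Take full image copies of the table space and the index space
    COPY TABLESPACE DWHDB.ORDERSTS COPYDDN(SYSCOPY) FULL YES
    COPY INDEXSPACE DWHDB.ORDERSIX COPYDDN(SYSCOPY)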

Data Warehouse Performance

In general, storage and analysis of big data will occur either within or in concert with the enterprise data warehouse (EDW). Integrating all parts of your big data solution will require interfaces to every part of the EDW process. Here is a summary of the main subsystems of an EDW, how they are affected by big data, and how to plan ahead for performance.

Data acquisition and collection.  Include a contingency for storing data until it can be moved to the EDW in case there are delays due to hardware slowdowns. Also determine how these new assets and processes will affect your disaster recovery plans. The period immediately prior to peak season is a favorite time for developers to implement new or enhanced operational systems. Some of this enhanced data may need to be passed to your warehouse and then to your big data application for future querying. Ensure that you are aware of new or changed applications that may require a business analytics component.

Data transformation and movement. Moving a lot of data quickly may require additional resources, perhaps even special software or hardware. Does your network have the capacity to move an increased amount of data from operational systems to the data warehouse and thence to the big data application?
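
As a rough capacity check, consider that moving 10 TB over a dedicated 10 Gb/s link takes at least 10 TB x 8 bits/byte / 10 Gb/s = 8,000 seconds, or about 2.2 hours, and that is under ideal conditions; sustained throughput on a shared network is usually far lower. If the nightly load window cannot absorb peak-season volumes at realistic transfer rates, plan now for additional bandwidth, compression, or staggered loads.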

Data access and analysis.  As data continues to fill the warehouse, users can execute analytics software against the combined warehouse and big data application. Capture data access paths and data distribution statistics and save them for analysis. You will need to determine if there are sufficient system resources (CPU, disk storage, network capacity, etc.) to support the expected query workload.
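
One way to capture an access path is DB2's EXPLAIN facility. The sketch below uses hypothetical warehouse tables and writes the optimizer's choices to PLAN_TABLE, where they can be queried and saved for later comparison:

    EXPLAIN PLAN SET QUERYNO = 101 FOR
      SELECT C.CUST_NAME, SUM(O.ORDER_AMT)
        FROM DWH.CUSTOMER C
        JOIN DWH.ORDERS   O ON O.CUST_ID = C.CUST_ID
       GROUP BY C.CUST_NAME;

    -- Review the join method, access type, and index chosen
    SELECT QUERYNO, METHOD, TNAME, ACCESSTYPE, ACCESSNAME, MATCHCOLS
      FROM PLAN_TABLE
     WHERE QUERYNO = 101
     ORDER BY QBLOCKNO, PLANNO;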

Data archiving.  With massive amounts of data being analyzed, the sheer amount of data may take up valuable storage media and make some processes run slowly. IT and business partners must determine how and when old or stale big data is to be archived or purged, and whether it must remain available for later use.

Big Data Organization and Storage

The first big data applications sold by vendors were usually marketed as plug-and-play, with few, if any, tuning options. The major reason for this was that the applications depended upon a proprietary, hybrid hardware and software solution that used massively parallel storage and I/O to deliver fast answers to analytical queries.

As customer requirements for massive data analysis matured, vendor solutions kept pace by offering alternative options for storing and retrieving data. One example is the ability to specify how keyed records are stored. Consider a big data application that spreads its data across one hundred separate disk drives. The original applications typically distributed records randomly across these drives. Some analytical queries could then be logically split into one hundred separate queries, each accessing a single drive, with the partial results combined into the final answer, executing up to one hundred times faster than the same query run sequentially against a single drive.

However, consider a query that joins two tables based on a key column. Randomly distributing rows from the two tables across the one hundred disk drives no longer provides a performance advantage, since joined rows are not likely to be stored on the same disk drives.

Many current big data solutions therefore offer options and algorithms for distributing table rows across disk drives by key or key range, rather than randomly. When rows from both tables are stored using the same key ranges, joined rows will reside on the same drive. The ability to specify data distribution by key thus offers a significant performance boost.
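
To illustrate the idea, here is a DDL sketch in a DB2-style dialect; the table names, columns, and range boundaries are all hypothetical. Both tables are partitioned on the join key over identical ranges, so rows that join to one another land in the same partition:

    CREATE TABLE DWH.CUSTOMER
      (CUST_ID   INTEGER NOT NULL,
       CUST_NAME VARCHAR(100))
      PARTITION BY RANGE (CUST_ID)
       (PARTITION 1 ENDING AT (250000),
        PARTITION 2 ENDING AT (500000),
        PARTITION 3 ENDING AT (750000),
        PARTITION 4 ENDING AT (999999999));

    CREATE TABLE DWH.ORDERS
      (ORDER_ID  INTEGER NOT NULL,
       CUST_ID   INTEGER NOT NULL,
       ORDER_AMT DECIMAL(11,2))
      PARTITION BY RANGE (CUST_ID)
       (PARTITION 1 ENDING AT (250000),
        PARTITION 2 ENDING AT (500000),
        PARTITION 3 ENDING AT (750000),
        PARTITION 4 ENDING AT (999999999));

A join on CUST_ID can then be resolved partition by partition, in parallel, instead of shipping rows between drives.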

With this in mind, here are some suggestions for performance tuning your big data application.

Review data distribution statistics.  Use the RunStats utility to gather information about table keys and data distribution. This is especially important for indexes that support primary and foreign keys, since related tables are usually joined on these columns.
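
A minimal RunStats invocation might look like the following (DB2 for z/OS syntax; the database, table space, and index names are hypothetical). The FREQVAL option collects the most frequent key values so the optimizer can recognize skewed distributions on join columns:

    RUNSTATS TABLESPACE DWHDB.ORDERSTS SHRLEVEL CHANGE
      TABLE(ALL)
      INDEX(DWH.ORDERS_IX1 KEYCARD FREQVAL NUMCOLS 1 COUNT 10)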

Review data access paths. While many analytical queries are ad hoc in nature, data tables will most likely be accessed and joined using the same or similar access paths across these queries. Capture and analyze these access paths, looking for common join methods. This information, combined with the data distribution statistics above, will help you determine which tables should be distributed by key in your big data application.

Store data access paths for analysis. Expanding upon the suggestions above, you should have a method of capturing and storing the access paths for all analytical queries. The results will show how tables and indexes are accessed, which indexes are used, what sorts are performed, and so forth. As queries access more data and data volumes increase, re-review your historical access paths and compare them with current ones. Factor in growth due to data volumes, and watch for access path changes that may indicate performance problems.
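
One possible scheme, sketched below in a DB2-style dialect with a hypothetical history table, is to append each capture of PLAN_TABLE rows with a timestamp, then compare access types between captures:

    -- Create a history table shaped like PLAN_TABLE plus a timestamp
    CREATE TABLE DWH.PLAN_HISTORY AS
      (SELECT * FROM PLAN_TABLE) WITH NO DATA;
    ALTER TABLE DWH.PLAN_HISTORY ADD COLUMN CAPTURED_AT TIMESTAMP;

    -- Append the current capture
    INSERT INTO DWH.PLAN_HISTORY
      SELECT P.*, CURRENT TIMESTAMP FROM PLAN_TABLE P;

    -- Flag statements whose access type changed between captures
    SELECT H2.QUERYNO, H2.TNAME,
           H1.ACCESSTYPE AS OLD_TYPE, H2.ACCESSTYPE AS NEW_TYPE
      FROM DWH.PLAN_HISTORY H1
      JOIN DWH.PLAN_HISTORY H2
        ON H2.QUERYNO = H1.QUERYNO AND H2.TNAME = H1.TNAME
     WHERE H1.CAPTURED_AT < H2.CAPTURED_AT
       AND H1.ACCESSTYPE <> H2.ACCESSTYPE;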

Summary

Prepare your big data application for the upcoming peak season by reviewing data recovery procedures, improving data warehouse performance, and reviewing performance tuning options in your current big data application. The most likely area for significant performance gains lies in organizing data in your big data application; however, carefully consider the ramifications of possible outages and prepare disaster recovery plans for those eventualities. Even a small outage of your big data application can have a significant effect on the bottom line.


Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
