Tuning Distributed Applications to Access Big Data

Distributed applications are just that: distributed across one or more hardware platforms across the enterprise. These environments are usually separate from the database server's. The database administrator (DBA) has the unenviable task of monitoring these environments and configuring and tuning the database server to meet multiple needs. The advent of big data compounds the DBA's problems, as multiple distributed applications now require access to a very large data store. What tuning options are available?

In the Beginning

The DBA must first address common application bottlenecks. If data availability or performance is already poor, tuning activities geared toward high-performance access to big data will fail. Here is a short list of common tuning issues; the DBA should ensure that processes exist to mitigate these potential problems.

Excessive Locking

In the DB2 environment there are two process classes that can “reserve” data: SQL processes and database utilities. SQL processes include applications issuing static SQL statements and those that issue SQL dynamically. SQL processes issue locks against data, and these locks usually prevent concurrent updates while data is being read. In addition, locks prevent utilities such as Load from loading data that would replace or overlay the data being read.

Utilities issue claims against data. A claim is similar to a lock in that it reserves data for access by the utility and prevents some concurrent SQL access. Generally speaking, locks force outstanding claims to wait, and claims force outstanding SQL statements to wait. This allows the database management system to manage multiple concurrent utilities such as Load and Image Copy without interference from SQL statements.

The most common locking problem is SQL statements that lock too much data. An SQL statement that reads a row usually locks multiple rows as read-only for the duration of the SQL statement. This behavior is controlled in multiple places, including the statement syntax, the database definition, and the use of commit statements by the application.

DBAs should review SQL statement locking behaviors to make sure that the minimal amount of data is locked. Know the lock size of objects, and how applications access the data. Long-running applications may lock data for long periods of time, reducing data availability. Consider row-level locking to minimize the footprint of the SQL, although this may lead to an increase in CPU time used to manage locks.
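For example, on DB2 for z/OS the lock size is an attribute of the table space and can be lowered to row level with a single ALTER. This is a minimal sketch with hypothetical object names:

    -- Move from page-level to row-level locking. The lock footprint shrinks,
    -- but DB2 spends more CPU managing the larger number of locks.
    ALTER TABLESPACE ORDERDB.ORDERTS
          LOCKSIZE ROW;

(On DB2 for Linux, UNIX, and Windows the equivalent is ALTER TABLE ... LOCKSIZE ROW.)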

Application commit logic should also be reviewed. A commit releases locks and makes the data available to other processes; applications that commit infrequently hold locks longer than necessary.
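As a sketch of commit-interval thinking, a long-running purge can be broken into small units of work so that each commit releases its locks promptly. This assumes DB2 for LUW's ability to delete from a fullselect; the ORDER_HIST table is hypothetical:

    -- Delete in batches of 1,000 rows; each COMMIT frees the locks held by
    -- that unit of work instead of holding them for the entire purge.
    DELETE FROM (SELECT 1 FROM ORDER_HIST
                  WHERE ORDER_DATE < CURRENT DATE - 90 DAYS
                  FETCH FIRST 1000 ROWS ONLY);
    COMMIT;
    -- The application repeats this pair until no rows remain (SQLCODE +100).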

Additionally, the DBA should review application and utility scheduling. For example, verify that utilities such as Image Copy are not running concurrently with applications that do database updates.

Poorly Designed Data Access Patterns

If a certain set of rows in a table is accessed frequently, it can become a ‘hot spot’. Consider an order table sorted by order number. Recent orders will tend to be more active as they are processed. As multiple applications and utilities access a small number of rows, the data access footprint will concentrate in a small portion of the database. This commonly leads to performance issues as some transactions lock or claim data that other applications or utilities are attempting to access.

Such hot spots can be predicted during database design. The DBA can embed empty space in the database to ‘spread out’ the data, thus lessening the concentration of activity in one physical spot. Other options include methods that distribute rows across the database. In our order table example above, the DBA may implement the table sorted by geographic location rather than by order number. Thus, new orders (perhaps having sequentially ascending numbers) will not be placed adjacent to one another, but rather distributed across the physical table.
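Both techniques reduce to DDL choices. Here is a hedged sketch in DB2 for z/OS syntax with hypothetical object names:

    -- Embed free space so new rows spread out instead of packing into the
    -- same pages (the change takes effect at the next REORG or LOAD):
    ALTER TABLESPACE ORDERDB.ORDERTS
          PCTFREE 20 FREEPAGE 15;

    -- Cluster the order table by geography rather than by ascending order
    -- number, so sequential new orders land in different physical areas:
    CREATE INDEX ORDGEOIX
        ON ORDER_TABLE (REGION_CODE, ORDER_NO)
        CLUSTER;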

Tuning for Big Data

Big data usually means a large data store that is queried with high-speed data analytics software. Many times these big data implementations co-exist with the enterprise data warehouse. This means that the DBA staff must coordinate with the data warehouse staff to ensure good performance. Some items for consideration are:

  • Big data stored in very large DB2 tables may have special recovery requirements. Consider a large store of transaction data that is analyzed daily. Business managers may deem this analysis critical for their day-to-day operations, leading to the designation of the data as mission-critical. If a disaster strikes, how will this data be recovered? Best practices for a data warehouse usually designate the data as low-priority for recovery, so the DBA must reconcile that default with the mission-critical designation in the disaster recovery plan.
  • Big data stored in DB2 tables may require the DBA to reduce or minimize the number of indexes on the data. While it is typical to add multiple indexes to a table to improve query performance, for very large tables the indexes will be large as well. Disk storage limitations may prevent the DBA from creating some indexes. In addition, a greater number of indexes will slow data insert performance as well as make any database recovery process run longer. (A catalog query after this list sketches a quick check for index-heavy tables.)
  • Big data housed in a proprietary hardware and software appliance must often be accessed simultaneously with data warehouse tables. (This is typically implemented using SQL join statements.) The DBA must coordinate loads of the big data appliance with data warehouse extract, transform, and load (ETL) processes to ensure that all data is available during querying.
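To spot index-heavy tables, the DBA can count indexes per table in the catalog. This sketch queries the DB2 for z/OS catalog table SYSIBM.SYSINDEXES:

    -- Tables carrying the most indexes: candidates for index reduction when
    -- disk space, insert rates, or recovery times become a concern.
    SELECT TBCREATOR, TBNAME, COUNT(*) AS INDEX_COUNT
      FROM SYSIBM.SYSINDEXES
     GROUP BY TBCREATOR, TBNAME
     ORDER BY INDEX_COUNT DESC;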

Tuning for Data Warehouse Access

This last point is most important. The data warehouse ETL process has its own unique performance issues. Data extract processes usually execute as multiple parallel data query processes. The data warehouse team may use high-speed networks to expedite this process. Data transforms require programmatic skills, as operational data may not be in a form that is easily analyzed. Common issues are nulls, missing or unknown data, and even invalid data such as dates with the value “99/99/9999”.

Finally, load processes typically involve multiple concurrent load utilities against warehouse tables. Loads are usually long-running and resource-intensive.

As distributed applications attempt to access big data they will inevitably access data warehouse data as well. Again, DBAs must coordinate this access with data warehouse ETL processes.

One common method of managing this is to architect tables with two partitions, active and inactive. The target table is divided physically into datasets or partitions. One partition is designated as the active partition, and a control table or parameter is set to indicate which partition is active. Distributed queries may now access the active data, allowing load processes to load data into the inactive partition. Once the loads are complete, the designations of active and inactive are switched.
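A minimal sketch of the control-table approach, using hypothetical table and column names:

    -- The control table records which partition distributed queries may read.
    CREATE TABLE ORDER_FACT_CTL
          (ACTIVE_PART SMALLINT NOT NULL);   -- contains 1 or 2

    -- Queries join to the control table so they see only active data:
    SELECT F.ORDER_NO, F.AMOUNT
      FROM ORDER_FACT F, ORDER_FACT_CTL C
     WHERE F.PART_NO = C.ACTIVE_PART;

    -- After the load into the inactive partition completes, flip the flag:
    UPDATE ORDER_FACT_CTL
       SET ACTIVE_PART = CASE ACTIVE_PART WHEN 1 THEN 2 ELSE 1 END;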

Distributed Processes and Big Data

A best practice for tuning distributed access performance is to use resource constraint analysis. The DBA monitors resources such as the disk subsystem and CPU while gathering performance data. Even query and job elapsed times can be considered resources. When the DBA discovers that a resource is constrained, they can then balance other resources to compensate.

For example, consider a big data store that is heavily queried by multiple distributed applications. The DBA may determine that elapsed times (resource #1) are too long. One resource balancing option would be to add more indexes to the tables. This uses disk storage space (resource #2) while simultaneously speeding query times.
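In SQL terms the trade is a single statement; the index and table names below are hypothetical:

    -- Spend disk storage (resource #2) to buy back elapsed time (resource #1):
    CREATE INDEX ORDCUSTIX
        ON ORDER_FACT (CUST_ID, ORDER_DATE);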

Other balancing options include removing indexes, allocating additional memory to DB2, increasing DB2’s sort work areas, query tuning, and so forth. These and other methods are documented in the DB2 performance manuals.

Summary

Big data may mean big performance problems, and access by distributed applications complicates these problems even further. The DBA can approach these issues proactively by considering the following:

  • Database design options (active / inactive partitions, index choices, spreading data across physical datasets);
  • Distributed query tuning using Explain (see the sketch following this list);
  • Coordinating big data access with data warehouse access;
  • Performing resource constraint analysis.
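For the Explain bullet, a minimal DB2 for z/OS sketch: EXPLAIN writes access path rows to the user's PLAN_TABLE, which the DBA then queries. The query number and table names are hypothetical:

    EXPLAIN PLAN SET QUERYNO = 100 FOR
      SELECT CUST_ID, SUM(AMOUNT)
        FROM ORDER_FACT
       GROUP BY CUST_ID;

    -- Review the chosen access path:
    SELECT QUERYNO, METHOD, ACCESSTYPE, ACCESSNAME, MATCHCOLS
      FROM PLAN_TABLE
     WHERE QUERYNO = 100
     ORDER BY QBLOCKNO, PLANNO;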

Distributed applications can be a challenge to the DBA. Begin by addressing current and potential data availability issues, especially in your enterprise data warehouse. Once these concerns are mitigated, the DBA can begin to manage distributed data access to big data.

Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
