Automating Data Warehouse Support

Big data is here, and fits nicely into an organization’s data warehouse infrastructure. Most large IT enterprises have installed one or more big data applications. These applications provide fast access to large stores of data, usually customer or sales data. Your technical staff that supports these applications and the systems that analyze and consume the data didn’t exist ten years ago. In addition, new technologies and new business models using advanced analytics require installation, configuration, and performance tuning.

As budgets tighten, how can database administrators (DBAs) keep up with such a rapidly-changing new technology environment? The solution: automation. In this article we discuss various ways to automate common processes and relieve the infrastructure staff from repetitive tasks, so that then can concentrate on more urgent priorities.

Targets for Automation
Database Autonomics
Big Data Automation
Big Data Storage
Big Data Analytics
Summary
References

Targets for Automation

If your enterprise is big enough to host one or more data marts or data warehouses, then you most likely already own one or more third-party vendor tools for database support. These may include tools for performance tuning, backup and recovery, change management, and other processes, simplifying database support by providing menus and pre-defined scripts for executing most support tasks.

Begin here by utilizing tool features and functionality to pre-script the following tasks.

Database backups … and recovery! Most DBAs have already set up regular database backup jobs. These usually execute on a defined schedule (weekly, monthly, etc.) and provide for basic recovery options. Enhance these jobs in three ways.

First, bake-in intelligence that analyzes databases to ensure that backups are taken often enough. This can range from a simple report of percentage of changed records to detailed partition-level statistics. Use this information to drive the backup schedule. If the number of changed records reaches a certain threshold, automatically execute an additional backup. Do certain partitions (perhaps containing stale data) never change? Consider delaying backups of those partitions. The goal is to automate the backup process so that changes in data due to expansion or high transaction volumes will automatically result in more frequent backups.

Second, automate the recovery process. Most DBAs consider recovery a rare event, and depend upon their tools to help them react to recovery situations. Instead of this, analyze your potential recovery situations and pre-script recovery processes for possible execution. These should be coordinated with your backup jobs; indeed, it is beneficial to have the backup jobs themselves create relevant recovery scripts. This will be extremely important in a disaster recovery scenario, where it is critical to begin execution of recovery processes as soon as possible.

Third, regularly test and time your recovery processes. Most enterprise-wide disaster recovery plans include a recovery time objective that defines the maximum elapsed time from the beginning of the disaster to the point when databases are available. Ensure that you can meet that objective. One easy way is to estimate recovery time as a percentage of backup time. Choose a representative production database that is relatively large and contains tables with multiple indexes. Back up the database then recover it using your automation. Note the elapsed times of backup and recovery, and use the ratio as an estimated recovery time factor.

Update your backup processes to report the backup time and the estimated recovery time based on this factor. For even more flexibility, store the database names, sizes, backup dates, elapsed backup times and estimated recovery times in a DB2 table. Over time you can query this table to get a complete picture of when databases were backed up, and the accumulated and trending backup and recovery times.

Finally, consider ways that you can use the information in the table in your automation. Is total recovery time approaching a threshold? Consider storing backups on faster storage media. Are large database backups running long? Consider re-scheduling backup jobs. Are certain databases changing or growing faster than predicted? Forward this information to your capacity planning group to ensure that you have enough storage for backup and recovery datasets.

Repetitive DBA Analysis Tasks. There are a multitude of processes that DBAs perform every day that can be automated. Typically these are tasks that either verify compliance to standards or perform simple data gathering for developers. Consider the following tasks and automation possibilities.

Executing EXPLAINs of one or more SQL statements. This is usually done by the DBA as part of an application SQL review. The DBA runs the EXPLAIN process, sometimes using an enhanced vendor tool, and provides an interpreted report to developers. This is used to do initial performance tuning of the SQL, as well as ensure that tables contain the correct indexes and other performance features. Automate this by providing the EXPLAIN process to your developers! Have them execute EXPLAIN and do their own interpretation, thus saving several steps.
Verify that new tables and columns comply with shop standards on naming and attributes. This also includes entry of these elements into the enterprise data model. DBAs usually create a table request process where developers submit a list of tables, columns and attributes for creation in the development environment. The DBA then verifies that the table and column names follow shop standards, and that column attributes (data type, length, etc.) are acceptable. Automate this by publishing your standards and delegating initial table creation to a local application DBA, who would then be responsible for standards compliance. Meet regularly with the application DBAs and review upcoming design with all of them so that enterprise knowledge is shared. In addition, develop or buy a tool that can parse table definition SQL and automatically review it.
Monitor production processes for deadlocks and timeouts. DBAs sometimes get calls from developers or support groups about production issues related to deadlocks (two or more processes competing for the same resource) or timeouts (a single process that waits inordinately long for a resource). A common example is when two applications attempt to update the same row of the same table. Automate this by developing or acquiring a tool that monitors the database logs for deadlocks and timeouts and provides notifications to the DBAs. Consider enhancing this process to store the information in a DB2 table, as well as determine who should be notified. Has a new application been moved to production? Set up notifications of deadlocks to the developers and DBAs. How long should a process lock a database? Define a threshold time limit, and report any processes that exceed this limit to the DBA.

Database Autonomics

Autonomics refers to a database being self-healing. Many database management systems (including DB2) contain code that can monitor for a range of issues, report these to DBAs, and sometimes even apply corrective action. The DBA needs to understand what autonomics are available and when to enable them.

Autonomic processes vary across database management systems, with more capabilities arriving in each new version. A complete list is beyond the scope of this article; however, the administration guides and manuals cover these features in great detail.

Some of the more common autonomics include:

Automated database reorganization — Backup processes analyze the data distribution within tables and indicate to what extent they are disorganized due to insert, update and delete activity; the analysis is then used to optionally execute the reorganization utility;
Automated data distribution statistics — The database manager gathers real-time statistics regarding table insert, update and delete activity; these data can then be used as part of EXPLAIN analysis to determine optimal data access paths;
Automated capacity planning — The database manager monitors the size of table and index objects; when insert or data load activity causes a dataset to expand, the database manager uses the growth information to allocate storage.

Big Data Automation

What about your big data applications? There are several areas where automation can be implemented. Let’s look at how data is acquired, stored and queried in the big data application.

Big Data Acquisition. The data part of big data originates in operational systems. The simplest big data applications are fed by loading regular (i.e., daily) snapshots of production data. These processes are sometimes called extract, transform and load (ETL). Each of these sub-processes can be automated to a certain extent.

Standardize extract jobs. These jobs should acquire data from operational systems and present them as flat files. While the data extraction process can be varied based on the system, the result data should be presented to the data warehouse or big data application as a flat file in character form (i.e., no data is in a hardware-specific format).
Standard transform jobs. Flat files coming in can now be processed similarly. One common method is to load the input file into a DB2 table, perhaps performing some transformations during load. Additional transforms can be applied when unloading the data.
Standard load jobs. Final load of data should be done using the database load utility.

Standardized jobs permit you to automate any new data coming in. Adding customer data to your big data application? Begin with templates of the ETL jobs and modify to suit the incoming data. Use the new load job script to create one or more scripts for table and column definitions.

Big Data Storage

A big part of a big data application is capacity planning. The initial implementation was probably based on a set of tables to generate customer analytics. Once the DBAs determined the operational system data sources they were able to estimate how much storage would be needed over time in the data warehouse and big data application.

It’s the next step that causes capacity issues. Once your big data application is up and running, business analytics can provide actionable intelligence about products, prices, placement, customer service, and other factors. There will be one or more requests to expand the application to include additional data sources, and perhaps even more historical information. Thus, the size of your big data application, already big, gets bigger.

Since data enters on a regular basis, begin by automating reporting of data sizes. This includes the original files from operational systems, the size as transformed, and the size as loaded. These will vary wildly by system: operational data will arrive in character form; transformed data will change some data attributes (for example, 10-character dates will be transformed into 4-byte internal format); and final storage in either the data warehouse or a big data application will almost always involve some data compression, but may also include additional disk storage.

Here is an example. A 10 megabyte (MB) file of daily Customer data from an operational system arrives in the data warehouse. The ETL process selects only certain fields, transforms others, resulting in a 5 MB load file. This file is loaded into a data warehouse DB2 table using data compression, resulting in 2 MB of rows. However, table indexes expand due to these additional rows, resulting in an additional 1 MB of storage. The DB2 tables are stored in a RAID disk array that uses redundant disk drives. The 3 MB of table rows and index entries now takes up 6 MB of space. But there’s more! The table and indexes are backed up regularly, so the next backup file will be 3 MB larger. In addition, the RAID array is “mirrored” to another facility for recovery purposes, so an additional 6 MB is required.

These numbers (10 MB input file, 3 MB table/index increase, 3 MB backup file increase, 9 MB disk storage increase) should appear in the report.

Big Data Analytics

Analytics is the reason big data exists. Users employ business intelligence (BI) software to analyze your data in order to derive trends and opportunities. The software queries your data using SQL, and the most common analytics automation opportunity is SQL performance monitoring and tuning. DBAs can use the EXPLAIN tool to analyze SQL statements and report possible SQL and database changes that would enhance performance. In addition, for some big data applications there are additional third-party vendor tools.

There are many performance enhancing options for standard tables, including new or expanded indexes, database reorganization, vertical or horizontal table partitioning, data compression, free-space allocation, and others. However, due to the immense size of big data applications, many of these are not feasible. For example, reorganizing a 10 terabyte database may take several days.

The most common performance enhancement for big data is data clustering of large fact tables. This involves storing the physical table data in a key clustered order to allow faster joins to multiple additional tables.

Automate this analysis by store SQL EXPLAIN information in a DB2 table. As you get access path information for more and more tables, trends should emerge regarding what clustering is best. This process is usually described in the manuals for your database manager and big data application.

Summary

Data warehouses commonly include one or more big data applications. As the complexity of the environment increases and budgets are tightened, DBAs must automate simple and repetitive processes so that they have time to devote to the critical tasks of supporting business analytics and implementing more and more operational data in the warehouse.

References

IBM – Big Data Overview

See all articles by Lockwood Lyon