Big Data: Health Checking Your System

As big data solutions grow in size and complexity, concerns can rise about future performance. Will the current application scale up and out? Are there potential issues with data consistency? Can your big data application co-exist on the same hardware with regular production processing, or is it destined to be implemented in either a stand-alone configuration or in the cloud? One way to assess the potential effects of these issues is to measure your big data application’s health.

In the Beginning

Early big data solutions were presented as stand-alone, turnkey systems that required no performance tuning and promised “crazy fast” query execution times. Very little support from database administrators (DBAs) or systems programmers was required.

For a few years and for most IT shops, this held true. However, the number of analyzable data types grew (think XML, video, click-streams) and the time range of business queries expanded from current month to recent months to year-over-year. Total data volumes in some businesses grew beyond the 100 terabyte range, while at the same time queries that originally were executed only once morphed into regular reports.  Performance began to degrade. Tuning became a necessity.

Big Data on Db2

IBM’s flagship entry into the big data arena is the IBM Db2 Analytics Accelerator (IDAA). This was originally presented as a hybrid hardware/software solution with a proprietary data store and highly parallelized I/O processing. The Db2 Optimizer was enhanced to consider tables stored in the IDAA as alternatives for query data access paths. For example, the DBA may choose to store a Customer table in both native Db2 and the IDAA. Any query that accesses the Customer table is analyzed by the Db2 Optimizer, which then decides which of the table occurrences will provide the best query performance. Thus, real-time queries that require a single row from the Customer table will get their data from the table in Db2, while analytic queries may be directed to the table in IDAA.

Later versions of IDAA allowed the DBA several possible ways to affect query execution, including data clustering and data partitioning options. The DBA became more and more involved in configuring the IDAA, and consequently needed to understand current and potentially future data access options by user queries. Consequently, the DBA was now in a position to resolve performance and health issues.

The latest upgrade to IDAA allows it to run natively on z14 hardware using a logical partition (LPAR) called the z Secure Service Container. Embedding such specialized storage and processing directly into the mainframe configuration means that the hardware and accompanying operating systems software can process both on-line transactions and business analytics queries at the same time, sometimes called hybrid transactional and analytic processing (HTAP).

The Health of Your Big Data System

Today, the DBA is intimately involved with configuration and tuning of the big data application, partly because vendors responded to their concerns by upgrading solutions to include options such as data clustering and data partitioning. Still, performance tuning is only one aspect of the health of a system.

Application or system health is a combination of the user experience and resource utilization. While there are other items that are relevant to application health in general (such as data availability and consistency and various integrity rules), these are rarely an issue in a big data environment as the data loaded there originates from operational systems.

We concentrate on these two points because they may be affected the most as big data configuration options expand, data volumes increase and the needs of the business analysts change over time.

User Experience

Generally speaking, the big data user differs from users of other systems (including the data warehouse) in several ways. Business analysts typically use a specialized software interface that displays big data entities and relationships in one or more visual formats, allowing the user to point-and-click or drag-and-drop choices of data items, selection options, aggregation criteria, analysis time periods and the like. The software then constructs the appropriate query or queries, accesses the data and returns the results, usually in tabular or pictorial form.

The user’s experience is tied directly to their software package, and not to the SQL query generated. This makes query performance tuning somewhat difficult, as users are not usually aware of the underlying query syntax. Further, their software may hide from them any available performance options such as keys, clustering, indexes or other features. What the user notices the most is query response time.

To address the user experience, the DBA can use several methods to monitor the health of their big data application.

Query Tuning: While it may be difficult (or impossible) to predict what queries may be generated by business analytics software tools, the DBA can still make use of historical access patterns. What tables are accessed, and how frequently? What tables are commonly joined? Which tables are dimension tables (used for aggregations, subsetting and sorting) and which are fact tables (used for calculations)? What access paths are typical? Is the starjoin access path (typical in data warehouse applications) used at all?

Typically, the DBA will use a tool to capture historical SQL statements and their access paths. This can be done in Db2 via the dynamic statement cache and the Explain tool. Then, store the results in a set of performance history tables. Repeat on a regular basis. You can then query the results to answer the above questions. Commonly accessed tables might be stored both in native Db2 and IDAA. Commonly joined tables should be clustered and partitioned in a similar manner to take advantage of parallelism. Common data item aggregations can be pre-computed and stored in tables themselves (sometimes called summary tables or materialized query tables), providing for faster queries that aggregate these items.

Query the performance history tables to determine if the same access path(s) are used to access the same tables. If not, are any of the access paths faster, and can objects or queries be tweaked to take advantage of them?

Data Archival: Another method of addressing perceived slowness of queries is to remove stale or old data or place it in a secondary set of tables. For example, consider a table of customer transactions over time. Keep the most current data in a Current Customer Transactions table, move the prior three months of data to a Recent Customer Transactions table, and place the oldest in a Historical Customer Transactions table. By splitting the data this way, queries acting upon current or mostly current data will access fewer tables, and thus save data processing time.

Another method of archival is a form of what is called vertical partitioning. Review the columns in a commonly accessed table and split them into two categories based upon how often they are referenced in queries. Place frequently accessed columns in one version of the table, infrequently accessed columns in another. As with archival, queries that select only frequently accessed columns will reduce the number of I/Os necessary to retrieve what is needed.

Customized Environments: Users across an enterprise can use big data for completely different purposes. Ad hoc users create new queries, or slightly modified versions of current ones. Reporting users run and re-run current queries on a regular basis to provide reports to management. Data mart users take regular extracts of big data and load them into their own local databases for intensive analysis. Each category of user accesses big data in a different way, so each has a different experience.

The ad hoc user requires the most intensive support from the DBA. Here, it is important to gather and maintain historical table usage and access path data. Single query execution speed is the most important factor, and the DBA will most likely be tuning objects (through archival, clustering, partitioning, etc.) rather than tuning SQL. On some occasions, there will be SQL issues. This is because the business analytics software may not be generating the highest quality query syntax. Some software vendors have features that allow your support staff to make minor changes to certain categories of SQL statements, or to certain clauses. Clauses such as OPTIMIZE FOR n ROWS or FETCH FIRST n ROWS can be appended to some queries, while the software can be configured to recognize things such as summary tables.

The reporting user exists in a much more static environment. Here, the reporting queries executed are typically stored in a format that allows for editing. Thus, if the DBA is aware of a better way of coding the query, they can make the appropriate changes.

Data mart users typically extract from the big data application because the data there has already been extracted, cleaned, transformed and loaded from operational systems. While this is a convenience for the data mart user, it may be a faster option to have them extract data directly from operational systems. The processes already exist for big data, and probably also for your data warehouse, and they can be modified appropriately to send data to the data marts as well.

Resource Utilization

Resource utilization drives costs in several ways. Applications’ needs for CPU, memory, disk space and network bandwidth drive the direct costs of purchasing, installing and maintaining these resources. Implementing IDAA as a stand-alone hybrid has the effect of offloading CPU cycles from the mainframe. Extracting data to a remote data mart for processing can have a similar effect.

One common issue regarding disk space is the enormous amount of space provided in big data solutions. Space in the hundreds of terabytes are common. When your big data application receives its first load of data much of that space remains unused. It is therefore a great temptation to use it as storage for other things. Some examples include archive tables, database backups, and space for multiple environments such as development, test and production.

If you allow things other than big data to be stored in the IDAA you must take into account the future usage of the device and the environment. One such consideration is disaster recovery. Most mature big data applications can be accessed directly from production systems for things such as credit analysis, product buy recommendations or fraud detection. As such, your big data environment may need to be replicated in some manner at your disaster recovery site, perhaps thereby doubling the amount of resources you need to acquire. Implementation of your big data in a robust disaster recovery environment is critical for the health of your overall system.

Another issue is software costs. Some software is priced based upon CPU usage. Typical examples in the IBM world are the z/OS operating system, Db2, and Cobol. In this environment, reducing CPU usage may be a major concern. One option is to use the IDAA as a place to store non-big data objects, perhaps frequently queried tables.

Common Big Data Health Metrics

It is best to automate the gathering and reporting of certain statistics that measure certain aspects of your big data environment. These are mostly data-related and are aimed at ensuring that data is complete and consistent. These include the following.

Cross check table sizes – If you have tables existing in multiple places (such as a Customer table in both native Db2 and in the IDAA), do they have the same number of rows?

Business rule integrity – Do data elements have values that are correct per the business rules that define them? For example, do all customers have a valid value for Customer-Name? These rules are usually found in the extract, transform and load logic, but may also be found in the code for operational systems.

Data consistency – Do all data elements have existing, valid values? Consider querying either big data tables or input load files for non-numeric quantities or dates, inter-element consistency (Retire-Date should either be null or be greater than Hire-Date).

Access path consistency – Is table data accessed using the same Db2 access path every time, or are there exceptions?

Aggregation consistency – Are there common aggregations (for example sales summarized by region, or shipping costs by product by date), and can they be augmented by pre-aggregating with summary tables?

Summary

The health of your big data application is dependent upon how you address both the user experience and resource utilization. Along with regular query capture, analysis and tuning, the DBA should plan on means and methods of archiving old or little-referenced data to reduce I/Os by most queries. Customizing the environment is sometimes possible, perhaps by implementing a stand-alone big data solution, sometimes by extracting data to remote locations for analysis (and consequent reduction in local CPU usage). Finally, metrics exist that can raise flags indicating that your data is becoming less consistent or usable. Consider all the above when supporting your big data solution.

See all articles by Lockwood Lyon

Lockwood Lyon
Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles