Big Data Quality Metrics

Big data applications and their associated proprietary, high-performance data stores arrived on the scene a few years ago. With promises of extremely fast queries, many IT shops implemented one or more of these combined hardware and software suites. Results have been generally good: many installations report dramatic decreases in query elapsed times, sometimes by a factor of 100 or more. These improvements translate into faster results from analytical queries as well as a better understanding of corporate data.

However, few IT enterprises have implemented metrics that clearly measure the benefits of these systems. The expected monetary gains from big data applications have not yet materialized for many companies, due to inflated expectations.

The solution: Measure resource usage, and use these measurements to develop quality metrics.

Big Data and the Data Warehouse

Big data applications usually store data in a proprietary hardware appliance that is optimized for fast analytical queries. The logical data architecture resembles that of a data warehouse: the data is mostly static and time-dependent, and the workload consists almost entirely of queries. Given these similarities, it makes sense to begin designing resource measurement points around standard data warehouse flows.

Some sample measurement points are:

Data extract. Data enters the data warehouse from the main IT operational data stores, either by direct access to the data or by shipping extract files. This data is usually collected via snapshot technology at the end of a regular business cycle, typically daily, weekly or monthly.

Measure the data volumes, CPU time and elapsed time used for each means of data transfer, whether it is direct access by SQL, FTP (file transfer protocol), or sequential file.
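
As a minimal sketch of this kind of measurement (assuming the extract is driven by a Python script; the function and field names are illustrative, and the CPU figure covers only the driver process, not the DBMS itself), each transfer can be wrapped in a timer and tagged with its transfer method and data volume:

    import time

    def measure_extract(transfer_method, extract_fn):
        """Run one extract step and record its data volume, CPU time and elapsed time.

        transfer_method -- a label such as "sql", "ftp" or "sequential-file"
        extract_fn      -- a callable that performs the extract and returns the
                           number of rows copied (a stand-in for your own routine)
        """
        cpu_start = time.process_time()    # CPU seconds used by this process
        wall_start = time.perf_counter()   # elapsed (wall clock) seconds

        rows_copied = extract_fn()

        return {
            "step": "extract",
            "transfer_method": transfer_method,
            "rows_copied": rows_copied,
            "cpu_seconds": time.process_time() - cpu_start,
            "elapsed_seconds": time.perf_counter() - wall_start,
        }

    # Example usage with a stand-in extract routine:
    print(measure_extract("sequential-file", lambda: 1_000_000))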

Data transform. Operational data is rarely clean. Some fields may not be known at the time of extract and may contain spaces or some special indicator value such as 999999. Other fields may simply be unfilled, contain invalid data, or contain data that must be translated or converted in order to be stored efficiently in the data warehouse. Examples include strings such as “140101” (meaning January 1, 2014) that must be edited for correctness and transformed into database-specific date fields.

Measure the total volume of data that must be transformed, and the CPU time and elapsed time used. If you find it valuable, also measure how many data values had to be cleaned or converted.
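
As an illustration, here is a small Python sketch of one such transform: it converts "YYMMDD" strings like "140101" into proper dates and counts how many values were missing, invalid or converted. The assumption that all two-digit years fall in the 2000s is for illustration only.

    from datetime import datetime

    SPECIAL_INDICATORS = {"", "999999"}   # blanks or "value not known yet"

    def transform_dates(raw_values):
        """Convert 'YYMMDD' strings such as '140101' into date objects,
        counting how many values were missing, invalid or converted."""
        converted, counts = [], {"converted": 0, "missing": 0, "invalid": 0}
        for value in raw_values:
            if value.strip() in SPECIAL_INDICATORS:
                converted.append(None)
                counts["missing"] += 1
                continue
            try:
                # '140101' -> 2014-01-01
                converted.append(datetime.strptime(value, "%y%m%d").date())
                counts["converted"] += 1
            except ValueError:
                converted.append(None)
                counts["invalid"] += 1
        return converted, counts

    values, counts = transform_dates(["140101", "      ", "999999", "141301"])
    print(counts)   # {'converted': 1, 'missing': 2, 'invalid': 1}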

Data staging and keying. Here, the transformed data is staged into intermediate tables. In addition, a surrogate key is calculated and assigned to key fields.

Measure the total volume of data loaded, and the CPU time and elapsed time used.
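
A minimal sketch of surrogate key assignment during staging is shown below. In production the key would normally come from a DBMS sequence or identity column; the in-memory counter here merely stands in for that mechanism.

    class SurrogateKeyGenerator:
        """Assign a stable surrogate key to each distinct natural key value."""

        def __init__(self, start=1):
            self._next = start
            self._keys = {}            # natural key -> surrogate key

        def key_for(self, natural_key):
            if natural_key not in self._keys:
                self._keys[natural_key] = self._next
                self._next += 1
            return self._keys[natural_key]

    keygen = SurrogateKeyGenerator()
    staged_rows = [{"cust_id": "A-100"}, {"cust_id": "B-200"}, {"cust_id": "A-100"}]
    for row in staged_rows:
        # the same natural key always receives the same surrogate key
        row["cust_sk"] = keygen.key_for(row["cust_id"])
    print(staged_rows)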

Data load. This process step changes somewhat in the big data environment. In a data warehouse, the load process takes staged data and loads it into fact tables in the DBMS (database management system). Other table types exist as well; see the references at the end of this article for examples. In a big data application, this data may also be loaded into the big data appliance, allowing for faster execution of some queries.

Measure the data volumes, CPU time and elapsed time used for loads into both the DBMS tables and the appliance tables.

Data archive. A sometimes forgotten step in data warehouse processing, this step involves purging the data warehouse of data that is old or no longer needed.

Measure the data volume, CPU time and elapsed time used during purge processes.
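
Taken together, these measurement points yield one record per process step per cycle. The sketch below shows one way to roll those records up into a per-cycle summary; the record layout and the numbers are illustrative assumptions, not output from any particular tool.

    from collections import defaultdict

    # One record per process step for a single nightly cycle (illustrative values).
    run_records = [
        {"step": "extract",   "rows": 1_000_000, "cpu_seconds": 120.0, "elapsed_seconds":  900.0},
        {"step": "transform", "rows": 1_000_000, "cpu_seconds": 300.0, "elapsed_seconds": 1500.0},
        {"step": "staging",   "rows": 1_000_000, "cpu_seconds":  90.0, "elapsed_seconds":  600.0},
        {"step": "load",      "rows": 1_000_000, "cpu_seconds": 240.0, "elapsed_seconds": 1800.0},
        {"step": "archive",   "rows":   200_000, "cpu_seconds":  30.0, "elapsed_seconds":  240.0},
    ]

    def summarize(records):
        """Roll up per-step measurements into totals for one warehouse cycle."""
        totals = defaultdict(float)
        for rec in records:
            for field in ("rows", "cpu_seconds", "elapsed_seconds"):
                totals[field] += rec[field]
        return dict(totals)

    print(summarize(run_records))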

Big Data in Action

Now we’re ready to discuss query performance measurement. Big data queries are typically complex, and they access large amounts of data across an extended time period, multiple dimensions, or both. An example might be, “Sum and average the dollars spent on products A, B and C in all geographic regions for the last five years.”

In general, queries arrive as SQL statements, are processed by the DBMS, and assigned an access path. The access path is a list of the objects that the DBMS must read in order to satisfy the query. These objects may include DBMS tables, indexes, and tables stored in the big data appliance.

This simplifies query performance measurement, as we can measure big data queries the same way we normally measure data warehouse queries. The only thing we need to add is a tool that captures the raw measurements we want from the big data appliance.

Here are some standard query measurements for data warehouse queries (a simple capture sketch follows the list):

  • CPU time consumed
  • I/O wait time
  • Number of asynchronous prefetches
  • Objects accessed
  • Total elapsed time
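
The sketch below shows one way to keep these raw measurements together, using an illustrative Python record layout; the actual numbers would come from your monitoring tools, not from this code.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class QueryMeasurement:
        """Raw measurements for one data warehouse or big data query execution."""
        query_id: str
        cpu_seconds: float            # CPU time consumed
        io_wait_seconds: float        # I/O wait time
        prefetches: int               # number of asynchronous prefetches
        objects_accessed: List[str]   # tables, indexes, appliance tables
        elapsed_seconds: float        # total elapsed time

    measurements = [
        QueryMeasurement("Q1", 12.4, 3.1, 180, ["SALES_FACT", "SALES_IX1"], 25.0),
        QueryMeasurement("Q2", 48.9, 9.7, 410, ["SALES_FACT", "REGION_DIM"], 92.5),
    ]

    # A simple report: the most expensive queries by CPU time consumed.
    for m in sorted(measurements, key=lambda q: q.cpu_seconds, reverse=True):
        print(f"{m.query_id}: cpu={m.cpu_seconds}s elapsed={m.elapsed_seconds}s")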

DB2 users should contact database administration to determine which tools are available for gathering and reporting these metrics. These include:

  • IBM Resource Measurement Facility (TM)
  • Data Studio
  • IBM OMEGAMON

Quality Improvements Using Metrics

After you have familiarized yourself with your resource measurement toolset and chosen the areas where you will concentrate your efforts, construct and implement regular resource period monitoring. This means monitoring and gathering data about specific categories of objects over time, so that you can analyze objects in the context of their time dependence (if any) on resource constraints. Review the metrics with your team and with your users. This will help you get an overall view of how your big data is accessed and what resources it consumes.
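
A small sketch of what regular period monitoring might look like once the raw numbers are being collected: group measurements by period, then flag categories whose CPU consumption is trending upward. The data layout and the 25 percent threshold are assumptions chosen for illustration.

    from collections import defaultdict

    # One row per (period, object category): CPU seconds consumed in that period.
    observations = [
        ("2014-W01", "sales_fact_queries", 1200.0),
        ("2014-W02", "sales_fact_queries", 1450.0),
        ("2014-W03", "sales_fact_queries", 1900.0),
        ("2014-W01", "nightly_load",        600.0),
        ("2014-W02", "nightly_load",        610.0),
        ("2014-W03", "nightly_load",        605.0),
    ]

    def flag_growth(rows, threshold=0.25):
        """Return categories whose CPU use grew by more than `threshold`
        between the first and last observed period."""
        by_category = defaultdict(list)
        for period, category, cpu in sorted(rows):
            by_category[category].append(cpu)
        return {
            category: (values[0], values[-1])
            for category, values in by_category.items()
            if values[-1] > values[0] * (1 + threshold)
        }

    print(flag_growth(observations))   # {'sales_fact_queries': (1200.0, 1900.0)}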

Next Steps

Once you implement regular reporting, you can leverage the data to help identify problem areas. These typically include the following:

  • Identifying resource constraints (CPU, elapsed time, etc.)
  • Identifying critical tables and indexes

If you identify a resource constraint, you can perform resource balancing to address the problem. See the references for how to do resource constraint analysis.

Critical tables may be those that are accessed by multiple big data queries but are not yet fully implemented in the big data appliance. Alternatively, they may exist in the appliance, but the expected query performance improvements have not materialized. In this case, review the performance documentation for your specific appliance. There may be a missing tuning or configuration parameter, or the timing of loading updated data may be causing resource constraints in the appliance.
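
One way to surface candidate critical tables from the accumulated query measurements is to count how often each object appears in expensive queries. The sketch below does this over an illustrative query log; the CPU cutoff is arbitrary and would need tuning for your environment.

    from collections import Counter

    # (query_id, cpu_seconds, objects accessed) -- illustrative data only.
    query_log = [
        ("Q1", 12.4, ["SALES_FACT", "SALES_IX1"]),
        ("Q2", 48.9, ["SALES_FACT", "REGION_DIM"]),
        ("Q3", 55.2, ["SALES_FACT", "PRODUCT_DIM"]),
        ("Q4",  2.1, ["CUSTOMER_DIM"]),
    ]

    def critical_tables(log, cpu_cutoff=10.0):
        """Count object references among queries above the CPU cutoff.
        Objects referenced by many expensive queries are candidates for tuning
        or for fuller implementation in the big data appliance."""
        counts = Counter()
        for _query_id, cpu, objects in log:
            if cpu >= cpu_cutoff:
                counts.update(objects)
        return counts.most_common()

    print(critical_tables(query_log))
    # [('SALES_FACT', 3), ('SALES_IX1', 1), ('REGION_DIM', 1), ('PRODUCT_DIM', 1)]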

Dashboards and Quality Measures
Many DBAs fail to realize how much IT management depends on numbers when measuring activity or productivity. In today’s climate, the extended IT enterprise will be forced to increase productivity and reduce costs. How can this happen if management can’t measure what the application does?

There are multiple dimensions to perceived performance:

  • The user dimension includes transaction rates, data throughput, and perceived transaction elapsed times.
  • The system dimension includes resource usage (CPU, DASD, etc.), availability of high-performance access paths (e.g., existence of indexes or data in a big data appliance), and data clustering.
  • The DBA dimension includes scheduling of utilities such as reorg and copy, along with some subjective items such as how well the DBA knows the tables, the SQL, and the application.

Summary

Knowing how management measures and perceives performance will be a priority, especially in an environment where the costs and benefits of big data implementations are being scrutinized closely.

References:

IBM Tools for Database Performance Management

IBM Resource Measurement Facility web site

Data Warehouse Architecture

Resource Constraint Analysis


Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
