Big Data Analytics on Current Data

Typical big data applications have a load phase, in which a snapshot of data is extracted from operational systems, transformed and cleaned, and then loaded into your big data appliance. Analysts and data scientists can then execute business intelligence (BI) queries against the data store in search of patterns, potential marketing opportunities and cost savings. One disadvantage of this architecture is that your appliance becomes more like a large data warehouse, in that it does not contain the most up-to-date data.

IBM now provides an option to configure its Db2 version 11 for z/OS and the complementary IBM Db2 Analytics Accelerator (IDAA) to permit transactional processing of operational data concurrently with analytics processing of data in the appliance. This new feature, zero-latency HTAP (hybrid transactional analytical processing), provides a patented replication process that propagates native Db2 table changes to the IDAA data store. This allows BI queries to act on up-to-date data, providing more value to the enterprise and enabling analytics embedded in operational applications.

State of the Art
Today’s Requirements
Early Issues
Early Solutions
Hybrid Solution from IBM
Summary

State of the Art

Early information technology (IT) systems and applications were fairly simple. Applications read and wrote records to and from keyed files or databases, and this held true for both internal applications (such as accounting, shipping and receiving, and various reports) and external or customer-facing ones, which included order entry, bank teller screens and information kiosks.

Over time application complexity grew along with data volumes. IT began to create data stores that were much more than simple keyed files. Daily extracts from order entry systems were sent to data marts that were analyzed to predict product shortages in stores and warehouses, which then sent data to shipping systems. As historical analysis and reporting became more important to the business, daily extracts were accumulated in a data warehouse, providing customer, account and product information that could be aggregated by region or time period. Analysts could then review these data for trends and make predictions about which products sold best in what regions or during which time periods.

Today, there are many more CPU-intensive and data-intensive operations across IT than ever before. Operational systems have grown from simple, in-house programs to include large suites of software from third-party vendors, including enterprise resource planning (ERP) packages and business intelligence (BI) systems for querying the data warehouse. Extract files from operational systems have grown much larger as analysts requested more and more data. These bulk data files became the source for loading into big data appliances, whose proprietary data formats and massively parallel data processing resulted in greatly reduced query response times.

Today’s Requirements

Now, IT is at a crossroads. The business has two new needs, and these needs are extremely difficult to satisfy simultaneously. They are as follows.

The need for analytical queries to access current data. Advances in big data query speed and the value of timely business intelligence have made it essential that some analytic queries execute against today’s data. This need arises when real-time data is more important or relevant than historical data, or when important events (and the data corresponding to them) are relatively rare. Consider the example of a BI analyst reviewing product purchases. The relevant queries may calculate the total profit realized for each product in every month for every sales region. For popular products with hundreds of orders per day, including today’s orders in the calculation may not make a significant difference; however, for products ordered only once per month per region, today’s data may almost double the totals.
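The difference between popular and rarely ordered products can be made concrete with a small calculation. The sketch below is illustrative only: the order counts, the flat profit per order and the function names are all assumptions chosen to mirror the example in the text, not figures from any real system.

```python
# Hypothetical figures, chosen only to illustrate the point in the text.
PROFIT_PER_ORDER = 10.0


def monthly_profit(orders_before_today, orders_today, include_today):
    """Total monthly profit, with or without today's orders."""
    orders = orders_before_today + (orders_today if include_today else 0)
    return orders * PROFIT_PER_ORDER


def relative_change(orders_before_today, orders_today):
    """Fractional increase in the monthly total when today's orders are added."""
    stale = monthly_profit(orders_before_today, orders_today, include_today=False)
    current = monthly_profit(orders_before_today, orders_today, include_today=True)
    return (current - stale) / stale


# Popular product: 200 orders a day for 20 days so far, plus 200 today.
popular = relative_change(orders_before_today=4000, orders_today=200)

# Rarely ordered product: one order earlier this month, one more today.
rare = relative_change(orders_before_today=1, orders_today=1)

print(f"popular product: +{popular:.0%}")
print(f"rare product: +{rare:.0%}")
```

Under these assumed numbers the popular product's monthly total moves by a few percent, while the rarely ordered product's total doubles, which is why the second case is so sensitive to stale data.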

The need for some operational applications to run analytical queries. This arises mainly in situations where real-time data is far more valuable than historical data. For example, consider health care equipment monitoring a patient. Historical analysis may identify the conditions that tend to precede events such as strokes or heart attacks; however, this is of value to a patient only if the prediction is made immediately as data becomes available. Another example is a supply chain system that predicts product shortages and responds by scheduling shipments; as above, this is most valuable when analyses can make predictions based on current data. A third example is an on-line order entry system that needs to detect potential fraud. Fraud patterns may exist in the historical data stored in the warehouse or the big data appliance, and there may be a business requirement to detect fraud as early as possible, even at the point of initial order entry.

Early Issues

The first attempts to merge operational and analytical systems and data met with several problems. One overarching issue was the different performance and resource usage characteristics of the two environments. Owners of operational systems require strict service level agreements (SLAs) that include limits on maximum transaction elapsed times. To address this, DBAs tend to use indexes and data distribution schemes such as data clustering and partitioning to speed query performance. The analytic environment is quite different. Here, massive amounts of data are stored in a proprietary format that is optimized for fast analytic query operations such as aggregation and subsetting. Indexes and classic partitioning schemes are generally not available, with performance depending more upon parallel access to a large disk array.

Other significant issues included the following.

Data latency. Data extracted from operational systems took time to make its way to the warehouse or big data appliance, mostly because the extract, transform and load (ETL) processes passed all data through multiple steps. For example, today’s orders may need to be shipped from the system of origin to the analytical environment and pass through several transformation and cleaning jobs before being loaded. This delay meant that queries could not access up-to-date data.

Data synchronization. Operational data extract jobs were usually executed on a file or database basis; that is, separate files were created for each of products, customers, orders, accounts and so forth. Since each file was then transformed and loaded separately, the analytic environment might contain the most recent data for only some entities, leading to inconsistencies (such as today’s orders but not today’s customers).
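The synchronization problem described above can be pictured with a tiny consistency check. This is a minimal sketch under assumed data: the customer and order records, their field names and the helper function are hypothetical, and stand in for whatever tables the analytic environment actually holds.

```python
# Hypothetical snapshot of an analytic store after two separate loads:
# today's orders have arrived, but today's customer file has not.
loaded_customers = {"C001", "C002"}
loaded_orders = [
    {"order_id": 1, "customer_id": "C001"},
    {"order_id": 2, "customer_id": "C002"},
    {"order_id": 3, "customer_id": "C003"},  # customer created today
]


def orphan_orders(orders, customers):
    """Return orders whose customer is missing from the loaded customer set."""
    return [o for o in orders if o["customer_id"] not in customers]


missing = orphan_orders(loaded_orders, loaded_customers)
print([o["order_id"] for o in missing])
```

Any query that joins orders to customers would silently drop or mis-attribute the orphaned order until the customer load catches up.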

Infrastructure complexity. Applications across IT usually exist on multiple hardware platforms, with data in differing formats. Adding cross-system interfaces that couple the operational and analytics environments may require sophisticated hardware or connectivity techniques, special performance and tuning processes, and staff who are experienced in a host of different technologies.

Early Solutions

IT management was able to meet the two conflicting needs (queries in the analytic environment run against current data, and operational applications running analytic queries) in some limited cases. They developed methods of quickly loading selected operational data into the warehouse or big data appliance for operational access. These methods usually selected limited amounts of real-time data, bypassed transformations and used various fast load processes.

One possibility was to use data replication technology to extract current data on a transaction by transaction basis and pass it to the analytics environment for immediate publishing, rather than accumulating all transactions for the entire day for a nightly load.
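The contrast between nightly loading and transaction-by-transaction replication can be sketched as follows. This is a toy model, not a description of any replication product: the `AnalyticsStore` class, the transaction records and both load functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AnalyticsStore:
    """Stand-in for the analytic environment: a running quantity per product."""
    totals: dict = field(default_factory=dict)

    def apply(self, txn):
        self.totals[txn["product"]] = self.totals.get(txn["product"], 0) + txn["qty"]


def nightly_load(store, days_transactions):
    """Batch approach: nothing is visible until the whole day is loaded."""
    for txn in days_transactions:
        store.apply(txn)


def replicate(store, txn):
    """Replication approach: each transaction is applied as it occurs."""
    store.apply(txn)


todays_transactions = [
    {"product": "widget", "qty": 2},
    {"product": "gadget", "qty": 1},
    {"product": "widget", "qty": 5},
]

batch_store = AnalyticsStore()
replicated_store = AnalyticsStore()

# Midday: only the first two transactions have happened so far.
for txn in todays_transactions[:2]:
    replicate(replicated_store, txn)

midday_batch_view = dict(batch_store.totals)            # nothing loaded yet
midday_replicated_view = dict(replicated_store.totals)  # already current

# End of day: the third transaction arrives, then the nightly job runs.
replicate(replicated_store, todays_transactions[2])
nightly_load(batch_store, todays_transactions)
```

Both stores end the day with the same totals; the difference is that a midday query against the replicated store already sees the morning's transactions, while the batch-loaded store sees none of them.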

The next logical step was to determine if there was a combined, or federated, solution that could logically merge the operational and analytic environments. Alternatively, was some kind of hybrid solution possible?

In early 2014, Gartner defined a new category of environment that they called “hybrid transaction / analytical processing” (HTAP). They characterized this as a new “opportunity” for IT to “… empower application leaders to innovate via greater situation awareness and improved business agility”.

This description did not specify how to create an HTAP environment. Businesses and vendors attempted several different configurations, each with its own advantages and disadvantages.

A multi-site solution. This is probably the simplest configuration, and is the one most likely to exist currently in your IT shop. Operational and on-line transaction processing execute in one system, while analytic data is stored and queried in another. If the two systems are physically separate, then you can either develop a federated answer (where a new, central hardware/software solution accesses both environments) or develop methods to transfer processing and/or data from one environment to the other. For example, you can copy relevant subsets of analytic data to your operational system so that applications can do analytics locally. Another method would be to develop pre-coded analytic processes that are installed in your analytics environment, then couple them with a high-speed execution and data transfer method. Stored procedures are one possibility for doing this.

A single-site solution. Merging your two differing environments into a single one may seem like a recipe for disaster; after all, they have very different performance characteristics. In addition, you wouldn’t want the CPU cycles consumed by analytics queries to cause performance issues for on-line transactions. Luckily, there are several ways to address this issue. IBM z-series mainframes allow the definition of multiple instances of the operating system, each running in its own logical partition (LPAR). Each LPAR can be defined with specific resources and resource caps. For example, the analytics LPAR could have a limit on the amount of CPU it could use.

A hybrid solution. This is one with special-purpose hardware and/or software coupled with dedicated disk storage and large amounts of memory. The intent is to provide sufficiently specialized environments for transaction processing and for analytical processing, while at the same time defining federated access to processes and data.

Hybrid Solution from IBM

A hybrid solution was recently presented by IBM as a Technical Preview (i.e., not yet generally available). It is a combination of three products: Db2 11 for z/OS, IBM Db2 Analytics Accelerator (IDAA) V5.1 for z/OS and IBM InfoSphere Change Data Capture (CDC) for z/OS V10.2.1.

Db2 11 stores operational data in a standard relational database for use by operational applications. The IDAA stores big data for analytics processing. Finally, CDC is used to replicate completed transactions from Db2 to IDAA. There are several advantages to this solution.

All hardware and software are available from a single vendor, thus avoiding the issues of interfacing and maintaining multiple solutions from multiple vendors;
Db2 11 and IDAA run on IBM z-series hardware that easily supports enterprise-sized applications and data;
Operational and analytic workload processes will not compete for CPU, network and I/O resources, as the database and big data appliance are separately maintained.
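The role described for CDC in this configuration, replicating completed transactions from the operational database to the analytic store, can be pictured with a small log-replay sketch. This is not IBM's implementation and does not use any CDC interface: the log format, the action names and the `committed_changes` function are all hypothetical, and show only the general idea of forwarding changes once their transaction has committed.

```python
def committed_changes(log):
    """Yield row changes only for transactions that reached commit.

    `log` is a list of (txn_id, action, payload) tuples in log order.
    Changes are buffered per transaction and released at commit;
    changes from rolled-back transactions are discarded.
    """
    pending = {}
    for txn_id, action, payload in log:
        if action == "change":
            pending.setdefault(txn_id, []).append(payload)
        elif action == "commit":
            yield from pending.pop(txn_id, [])
        elif action == "rollback":
            pending.pop(txn_id, None)


# Hypothetical transaction log from the operational database.
log = [
    ("T1", "change", {"order_id": 1, "qty": 2}),
    ("T2", "change", {"order_id": 2, "qty": 9}),
    ("T1", "commit", None),
    ("T2", "rollback", None),
    ("T3", "change", {"order_id": 3, "qty": 4}),  # still in flight
]

replicated = list(committed_changes(log))
print(replicated)
```

In this sketch only the change from T1 reaches the analytic side; T2 was rolled back and T3 has not yet committed, so analytic queries never see data that the operational system might still undo.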

IBM’s solution will soon be adapted to run with the latest version of its relational database product, Db2 12 for z/OS.

Summary

The promise of HTAP is that the enterprise can use real-time analytics to increase business value. Analytics added to operational systems for fraud detection, security and real-time monitoring will provide a competitive advantage. Analytics executed against real-time data can give more accurate answers to business questions. The issue that remains is whether these gains will offset the increased resource usage and infrastructure complexity costs that will naturally follow.

Additional Information:

Wikipedia — HTAP

February 2018 – Db2 for z/OS News from the Lab

See all articles by Lockwood Lyon