Analytical reporting against big data has become standard in many companies, along with embedding simple analytical queries in real-time applications. What’s next? Combining big data and on-line transaction processing methods into a single infrastructure, allowing transactional and analytical processing to co-exist.
A Quick History of Analytics
Technologies such as Hadoop, NoSQL databases and database appliances store and retrieve massive amounts of data using parallel I/O channels and proprietary hardware.
The next step in this quest to squeeze value out of the company’s data was to have operational applications do simple analytical queries. Most of these were financial applications making real-time decisions regarding customer credit, presenting possible alternative purchases or attempting to mitigate potential fraud. One common example was for an on-line product sales application to suggest purchases commonly made by similar customers. The program would execute an analytical query against historical data and return potential choices.
At the same time, data scientists began to move beyond the boundaries of their big data solution and analyze current data, perhaps even in real-time. However, there were several technical issues that argued against direct access to production data. These included the possibility of queries locking current data, as well as using CPU and I/O resources that would negatively affect critical production application performance.
The technology infrastructure tended to be split into two groups:
- One set of infrastructures to handle critical production, including dedicated resources (CPU, memory, I/O channels, etc.), performance tuning for quick transaction throughput, and provisions for disaster recovery; and,
- A second infrastructure with a possibly combined data warehouse and big data storage facility, its own independent resources, analytics software, and various performance features that promoted fast analytical queries.
To provide a hybrid infrastructure that supports different kinds of queries against different types of data while still delivering acceptable performance, we first need to briefly review how the IT department supports analytics processing.
The Current State of Analytics Infrastructure
From the outside, an analytics environment looks simple. It consists of a large data store containing (usually) historical data in a proprietary format that provides fast query execution times. From the database administrator’s (DBA’s) perspective, however, there is much more. DBAs must implement several sets of processes, and each of these creates its own issues.
Extract / transform / load (ETL) processing. These are the jobs that feed the big data application. Usable data starts as operational data created and maintained by production applications. ETL jobs extract the data from the production databases or files. They then make any required transformations (filling in missing data, fixing bad data elements such as “00/00/0000”, assigning surrogate keys, and so forth), a step sometimes called “data cleansing”. The final records are then loaded into one or more sections of the data warehouse, the big data application and perhaps several standalone data marts.
ETL jobs tend to be resource intensive. Operational data may begin as a single relational table or file that is then moved and altered through several jobs and intermediate files. All of this consumes disk space, CPU and I/O resources. In addition, since operational systems may run around the clock, ETL jobs tend to execute once per day during slow periods. This leads to both latency issues (the big data store is constantly a day old or more) and consistency issues (where, say, today’s Customer data has been loaded but the Order data has not).
Finally, even when these various processes can be standardized there will exist a delay when DBAs attempt to add a new data source. For example, when a new operational system moves to production it may consist of hundreds of tables, and each of these may require a set of several ETL jobs in order to load the big data application properly. Creating and testing these jobs takes time, causing delays for business users.
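To make the transformation step concrete, a typical cleansing rule can be expressed as a single SQL statement. This is an illustrative sketch only; the table, sequence and column names (CUST_STAGE, CUST_CLEAN, CUST_SEQ, BIRTH_DT_TXT) are hypothetical:

```sql
-- Hypothetical cleansing step: copy staged rows into a clean table,
-- repairing the bad date value "00/00/0000" and assigning a surrogate key.
INSERT INTO CUST_CLEAN (CUST_SK, CUST_ID, CUST_NAME, BIRTH_DT)
SELECT NEXT VALUE FOR CUST_SEQ,          -- surrogate key from a sequence
       CUST_ID,
       CUST_NAME,
       CASE WHEN BIRTH_DT_TXT = '00/00/0000'
            THEN NULL                    -- treat the bad value as missing
            ELSE DATE(BIRTH_DT_TXT)     -- otherwise convert text to DATE
       END
FROM   CUST_STAGE;
```

In a traditional ETL pipeline, rules like this are scattered across several jobs and intermediate files; expressing them in SQL keeps them in one visible, testable place.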
Stale data handling. Analytics is only useful when run against relevant data. Consider predicting products that customers may wish to purchase: you wouldn’t include products that are no longer made or that have been replaced by newer models. The same applies when forecasting product sales revenue.
This means that every big data application must include processes that regularly remove data that is irrelevant, outdated or “stale”. The DBA must define what this means and implement the appropriate jobs. One common variation is to retain relatively recent data in the big data application (for example, data collected within the past two years) and archive anything older in a separate area. This separate area can be a secondary big data application, a database, or even one or more files. The intent is to provide for multiple categories of analytical queries:
- Queries against recent data, which will run against a pared-down database;
- Queries that may require both recent and old data;
- Queries that run solely against old data, where the performance and capacity of the primary big data application are not required;
- Queries arising from regulations or audits that require SQL-level access to the purged or archived data.
Each of these categories may require a specific implementation of data purge and archive to one or more locations. The issues here include not only performance but capacity planning.
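The two-year retention example above can be sketched as a pair of SQL statements, assuming a date-partitioned primary table and a separate history table (ORDERS and ORDER_HIST are hypothetical names):

```sql
-- Hypothetical two-year retention job: copy older rows to a
-- history area, then remove them from the primary store.
INSERT INTO ORDER_HIST
  SELECT *
  FROM   ORDERS
  WHERE  ORDER_DT < CURRENT DATE - 2 YEARS;

DELETE FROM ORDERS
  WHERE  ORDER_DT < CURRENT DATE - 2 YEARS;
```

In practice the DBA would run this against a single date partition (or rotate partitions) rather than scanning the whole table, which is exactly the kind of hand-built job the automatic archive feature described below is meant to replace.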
The New Infrastructure
It is possible to address all of the above issues while allowing operational and analytical processing to co-exist. IBM Db2 for z/OS provides a set of features and services that allow for high-speed ETL processing, automated stale data purge and archive, and hybrid transactional and analytical processing. In the following description, we refer to the IBM Db2 Analytics Accelerator (IDAA) product, sometimes simply called the accelerator. This was originally a separately installed hardware appliance. In IBM’s latest offering, the IDAA hardware is mounted directly in the IBM z14 chassis for maximum I/O throughput.
Automatic Data Archive
The first step is to implement automatic table data archival. One way to do this is with a feature called Db2 archive transparency for operational data. The DBA configures each table as “archive enabled” and creates a second table with the same column definitions called the archive table. In addition, the DBA adds a Timestamp column to each table. The base table timestamp indicates when the row was last changed; the archive table timestamp indicates when the row was archived.
There are a few other settings for the DBA to consider, including under what circumstances rows will be archived (for example, when a row is deleted from the base table, or during a database reorganization). The DBA now has various performance alternatives. Commonly, base tables are partitioned by date, so that purging old data can act against a single partition. With the auto archive option, base tables may now be partitioned by key value. This may provide performance improvements depending upon current SQL statements. The archive table can now be partitioned by the timestamp column.
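In Db2 for z/OS, transparent archiving is set up roughly as follows. The table names are illustrative, and the timestamp columns mentioned above are assumed to already exist in both tables:

```sql
-- Create the archive table with the same column definitions
-- as the base table.
CREATE TABLE CUST_ARCHIVE LIKE CUSTOMER;

-- Associate the two tables; CUSTOMER is now "archive enabled".
ALTER TABLE CUSTOMER
  ENABLE ARCHIVE USE CUST_ARCHIVE;

-- Ask Db2 to move rows to the archive table automatically
-- whenever they are deleted from the base table.
SET SYSIBMADM.MOVE_TO_ARCHIVE = 'Y';
```

The SYSIBMADM.MOVE_TO_ARCHIVE built-in global variable is session-scoped, so the DBA can enable the automatic move only for the jobs that should trigger it.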
There are SQL options to allow applications to specify if they are attempting to access only current data or only archive data or both.
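For example, an application that needs archived rows as well as current ones can set a built-in global variable before querying (this assumes the application’s packages are bound archive-sensitive):

```sql
-- 'Y' makes queries against an archive-enabled base table
-- transparently include rows from its archive table;
-- the default 'N' returns current rows only.
SET SYSIBMADM.GET_ARCHIVE = 'Y';

SELECT CUST_ID, CUST_NAME
FROM   CUSTOMER;    -- now returns base rows plus archived rows
```

The application’s SQL does not change; only the session setting determines whether archived data is visible.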
Multi-site Table Instantiation
The next step is to define each base table and archive table pair in both native Db2 and in the IDAA accelerator. Data in Db2 is added in the normal ways, such as SQL INSERT processing or execution of the Load utility. Data moves to the archive tables based on how the DBA defined archive processing. Both tables are loaded into the accelerator using the normal Load utility or the IBM Db2 Analytics Accelerator Loader (more on this product below).
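Registering and loading tables in the accelerator is typically done through IDAA’s stored procedure interface. The calls below are a simplified sketch: the accelerator name is hypothetical, the XML specifications are abbreviated to host variables, and exact parameters vary by IDAA version:

```sql
-- Register the base and archive tables with the accelerator,
-- then load their data. Real calls pass XML documents that
-- describe the tables and receive an XML message parameter back.
CALL SYSPROC.ACCEL_ADD_TABLES('ACCEL1', :tableSpecXml, :msg);
CALL SYSPROC.ACCEL_LOAD_TABLES('ACCEL1', 'NONE', :loadSpecXml, :msg);
```

Once loaded, the accelerator copies can be refreshed on a schedule or, with replication enabled, kept current continuously.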
Having the tables in both Db2 and IDAA allows for the greatest amount of query flexibility:
- SQL from operational systems will access the base tables in Db2;
- SQL from business intelligence software or from power users, including online analytical processing (OLAP) queries, will access the base tables, the archive tables or both in the accelerator;
- Hybrid queries may run against either.
The Db2 Optimizer is cost-based, and it chooses the appropriate data access path based on the SQL from the application. We now have the data co-existing in both the operational and big data environments with Db2 deciding the most efficient data access paths.
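Query routing can also be influenced per session through the CURRENT QUERY ACCELERATION special register. The query below is a hypothetical example:

```sql
-- Allow Db2 to send eligible queries to the accelerator when that
-- is expected to be cheaper; other settings include NONE (never
-- accelerate) and ALL (accelerate eligible queries or fail them).
SET CURRENT QUERY ACCELERATION = ENABLE;

-- A typical analytical query the optimizer may now route to IDAA.
SELECT PROD_ID, SUM(SALE_AMT)
FROM   SALES_HIST
GROUP  BY PROD_ID;
```

Operational transactions can run with acceleration set to NONE, so short transactional SQL stays on the base tables while heavy analytical SQL is offloaded.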
Upgrading ETL Processing
Last, we turn to our ETL jobs. Originally, these were built to extract operational data, perform the required transformations and cleansing, and load the results into the target database. What do we do with them in our new infrastructure?
This is one purpose of the IBM Db2 Analytics Accelerator Loader software product. It provides not only fast loading of Db2 tables into the accelerator, but now has the capability to load non-relational data directly into the IDAA. The intent is to use a single execution of the loader to replace the multiple ETL jobs that may exist for each data source.
Consider a Customer table. It may exist as a local Db2 table, a Db2 table on a remote system, a table in a non-Db2 database or even a flat file. Common ETL processing would be to extract this data from the source system into a flat file, transfer it to a central hardware platform to execute the transform process, then execute one or more load utilities to put the transformed data into Db2 and the accelerator. All of these steps and multiple files may now be replaced by a single execution of the new loader product using SQL to code the data transformations.
The Infrastructure of the Future
Automatic archive greatly reduces the need for data purge jobs. Using the loader reduces the number of ETL steps and the resources they consume, with a matching reduction in overall elapsed time. Finally, placing tables in both Db2 and in the accelerator allows the Db2 optimizer to choose either transactional or analytical data access paths based upon the SQL.
The result is lower resource consumption and shorter job and query elapsed times, leading to reduced costs. In an age when we must consider building analytical capabilities into operational applications, this is a must.
# # #
See all articles by Lockwood Lyon