How Analytics Began
Business analytics began as the science and art of analyzing large volumes of data to determine trends and correlations that could be used to make business decisions. Analytic processing traces its origins to the emergence of the enterprise data warehouse in the late 1990s. The typical warehouse began as an accumulation of regular (usually daily) snapshots of production data that were transformed and cleansed to remove invalid or missing values. While warehouses were at first co-located with their operational sources, it soon became obvious that the ever-growing volume of data in the warehouse, and the ever more resource-intensive queries against that data, were consuming resources that operational systems needed.
Some years later, many IT enterprises were creating or purchasing hybrid hardware and software systems dedicated to the warehouse alone. This provided several advantages, including system configurations and parameter settings that gave better performance for warehouse processing, which typically consisted of huge data loads (usually overnight) followed by high-intensity, read-only querying during the business day. Large, dedicated data stores and databases became common, as did changes to database and object definitions that promoted warehouse performance. Some of these included special data clustering, indexes that supported analytic queries, and forgoing unneeded referential integrity and data integrity processing.
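As a simple sketch of the kind of definition changes involved (all object names here are hypothetical, and the syntax follows common Db2 conventions), a warehouse copy of an orders table might be created without the foreign-key constraints its operational source enforces and given a clustering index that favors large analytic scans:

```sql
-- Hypothetical warehouse copy of an operational ORDERS table.
-- No foreign keys: referential integrity was already enforced in the
-- operational system before the nightly load, so the warehouse skips it.
CREATE TABLE DWH.ORDERS_FACT
      (ORDER_ID     BIGINT        NOT NULL,
       ORDER_DATE   DATE          NOT NULL,
       CUSTOMER_ID  INTEGER       NOT NULL,
       PRODUCT_ID   INTEGER       NOT NULL,
       QUANTITY     INTEGER       NOT NULL,
       AMOUNT       DECIMAL(11,2) NOT NULL);

-- A clustering index on the date column keeps rows for the same reporting
-- period physically close together, which favors the large range scans
-- typical of analytic queries.
CREATE INDEX DWH.XORDERS_DATE
       ON DWH.ORDERS_FACT (ORDER_DATE, PRODUCT_ID)
       CLUSTER;
```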
Big Data Arrives
By the turn of the century, the phrase “big data” had entered the IT lexicon. It usually referred to large volumes of highly accurate data moving rapidly across multiple platforms. Big data was the data warehouse, only expanded, more rapidly updated, and bigger! Performance concerns continued to drive standalone hybrid solutions. One typical offering was the IBM Db2 Analytics Accelerator (IDAA), a combination of Db2 software running on IBM zSystem hardware with an attached hardware appliance from Netezza. The appliance consisted of hundreds of disk drives that stored data in a proprietary format and could be accessed in parallel. One example would be to split a database table into several hundred parts, store each part on a separate disk drive, execute an SQL query against all parts simultaneously, and then combine the results. This massively parallel storage and processing made analytical queries run dramatically faster, sometimes hundreds of times faster than against a conventional Db2 database.
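The idea can be pictured with Netezza-style DDL. This is only an illustration with made-up table and column names; the appliance handled the physical distribution and parallel scanning internally:

```sql
-- Netezza-style DDL (illustrative): hash-distribute the rows of a large
-- fact table across the appliance's data slices so that each disk holds
-- roughly 1/Nth of the data.
CREATE TABLE SALES_FACT
      (SALE_DATE    DATE,
       STORE_ID     INTEGER,
       PRODUCT_ID   INTEGER,
       AMOUNT       DECIMAL(11,2))
   DISTRIBUTE ON (STORE_ID);

-- A query such as this runs against every data slice in parallel; each
-- slice aggregates its own rows and the partial results are combined
-- into the final answer.
SELECT SALE_DATE, SUM(AMOUNT) AS TOTAL_SALES
  FROM SALES_FACT
 GROUP BY SALE_DATE;
```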
Today’s Big Data Dilemma
Fast forward to today. The IT enterprise has grown: databases are bigger, and companies have more product lines, more services, more customers, and more transactions. Business analysts who were once satisfied with subsets of certain data elements for their queries now want it all. In addition, analysts are clamoring for access to current data, not just historical data. They see the need to analyze current transactions for patterns of fraud, give customers suggestions for additional products and services, react to real-time sales transactions with price updates or preemptive shipping of materials, and the list goes on.
Analysts’ current perception of the state of business analytics is that performance is suffering, even with special-purpose solutions, and that the immediate problem is the constant movement of massive amounts of data from operational systems to the data store. After all, data movement takes time. IT systems move operational data to an intermediate area (usually called an operational data store, or ODS), where it is transformed and cleansed. The data is then loaded into one or more target systems, including data warehouses, big data applications, and other data marts. How can we reduce this data movement, or at least reduce the number of steps involved?
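Each hop in that pipeline is another pass over the data. The following sketch (again with hypothetical object names) shows a single nightly transform-and-load step from the ODS into the warehouse table used in the earlier example; every such step adds elapsed time and consumes CPU and I/O on both systems:

```sql
-- Sketch of one nightly ETL step: copy yesterday's snapshot from the
-- operational data store into the warehouse, applying simple cleansing
-- along the way.
INSERT INTO DWH.ORDERS_FACT
      (ORDER_ID, ORDER_DATE, CUSTOMER_ID, PRODUCT_ID, QUANTITY, AMOUNT)
SELECT ORDER_ID,
       ORDER_DATE,
       CUSTOMER_ID,
       PRODUCT_ID,
       COALESCE(QUANTITY, 0),             -- supply a default for missing values
       AMOUNT
  FROM ODS.ORDERS_STAGE
 WHERE ORDER_DATE = CURRENT DATE - 1 DAY  -- yesterday's snapshot only
   AND AMOUNT IS NOT NULL;                -- discard invalid rows
```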
One proposed solution is to move processing and data closer together. The two major options are:
- Move or copy operational data to the analytics engine in real-time;
- Move analytics data and processing to the same platform as operational applications.
Combining Transactional and Analytics Processing
IBM announced one such hybrid solution in October 2013. This “hybrid analytics solution” used Db2 on a zSystem mainframe hardware platform as the primary database management system. Using an IDAA appliance attached over a high-speed network as an analytics query accelerator, the solution replicated operational data changes to the appliance. The system even provided the capability of attaching multiple accelerators to permit high availability.
In effect, this solution addressed performance concerns by providing options for publishing operational data in real time to multiple big data accelerators. Configuration options included the following:
- Choosing specific accelerators to store certain sets or subsets of data;
- Configuring accelerators with different storage or processing parameters for faster performance;
- Designating one accelerator for mission-critical or high-performance queries of relatively current data, with others positioned for historical data or lower priority processing.
Business analysts could now direct queries to the accelerator best able to execute them and return results within the appropriate timeframe.
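In Db2 for z/OS terms, this routing can be expressed with the query acceleration special registers. The sketch below assumes a Db2 level that supports the CURRENT ACCELERATOR register, and the accelerator names are invented for illustration:

```sql
-- Enable acceleration for this session and prefer the accelerator that
-- holds current, mission-critical data (names are hypothetical).
SET CURRENT QUERY ACCELERATION = ENABLE;
SET CURRENT ACCELERATOR = 'ACCEL01';

SELECT CUSTOMER_ID, SUM(AMOUNT) AS TOTAL_SPEND
  FROM DWH.ORDERS_FACT
 GROUP BY CUSTOMER_ID;

-- Point a lower-priority, historical report at a different accelerator.
SET CURRENT ACCELERATOR = 'ACCEL02';
```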
Some IT enterprises opted to copy significant portions of operational data into their big data solution. Some tables now existed in multiple places, including the original operational database, the data warehouse, and one or more big data applications. With Db2 as the common database manager for all of these, Db2 could accept queries from radically different applications, determine the best place to access the data, and then direct the SQL to execute in the appropriate location. Thus, transactional application processing and business analytics processing became centralized on the same platform, eliminating much of the massive data movement that had consumed resources.
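A sketch of how that choice surfaces to an application, assuming the CURRENT QUERY ACCELERATION special register and the hypothetical table used earlier:

```sql
-- With acceleration enabled, Db2 decides per query whether the SQL runs
-- natively in Db2 or is offloaded to an accelerator copy of the same table.
SET CURRENT QUERY ACCELERATION = ENABLE;

-- An OLTP-style lookup of a single row typically stays in Db2...
SELECT AMOUNT
  FROM DWH.ORDERS_FACT
 WHERE ORDER_ID = 123456789;

-- ...while a scan-and-aggregate analytic query over the same table is a
-- candidate to be routed to the accelerator, where it can run massively
-- in parallel.
SELECT PRODUCT_ID, SUM(AMOUNT) AS REVENUE
  FROM DWH.ORDERS_FACT
 GROUP BY PRODUCT_ID
 ORDER BY REVENUE DESC;
```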
The Accelerator on the Mainframe
Beginning in 2016, IBM announced multiple software and hardware changes that would truly co-locate transactional and analytical processing.
In October 2016, IBM’s Db2 for z/OS version 12 became generally available. Along with the usual assortment of new features and improvements, this Db2 was positioned to take advantage of new high-performance IBM hardware, the IBM z14 server, which became available in late 2017. The z14 had several features important for analytics processing, some of which were:
- New hardware for improved data compression;
- New I/O channels for ultra-fast Db2 log writes and database I/O;
- A soon-to-be-announced hardware feature providing order-preserving compression for Db2 indexes.
In addition, the z14 offered up to 170 CPU cores (up from 141 on the z13) and could be configured with up to 32 terabytes of memory per server.
However, the biggest new feature was the integration of IDAA version 7.1 directly into the z14 mainframe. In effect, the original IDAA appliance was now “in the box” with the server rather than a separate hardware entity.
This combination of Db2 version 12, IBM zSystem z14 hardware, and IDAA version 7.1 allows analytical and transactional workloads to co-exist and execute on the same server. In addition, IBM can include an upgraded version of its replication software (change data capture) to keep the accelerator copies of operational tables current. This patented technique means that analytical queries always access transactionally consistent data without waiting for data loads to complete.
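At the application level, this behavior can be requested per query. The sketch below assumes the Db2 12 CURRENT QUERY ACCELERATION WAITFORDATA special register; the wait value and table name are illustrative:

```sql
-- Ask Db2 to delay an accelerated query (here up to 5 seconds) until
-- replication has applied the operational changes committed before the
-- query started, so the analytic result reflects consistent, current data.
SET CURRENT QUERY ACCELERATION = ENABLE;
SET CURRENT QUERY ACCELERATION WAITFORDATA = 5.0;

SELECT COUNT(*) AS LARGE_ORDERS_TODAY
  FROM DWH.ORDERS_FACT
 WHERE ORDER_DATE = CURRENT DATE
   AND AMOUNT > 10000;
```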
This solution provides unique, enterprise-level hybrid transactional and analytical processing, or HTAP, and it is engineered to scale both up and out.
Prioritizing HTAP
HTAP is the way of the future, but few IT shops have implemented solutions like these. Still, as data volumes and transaction volumes grow, business departments’ needs for good user response times and actionable analytics results will only increase. Your current solutions exist across multiple hardware platforms and encompass a variety of operating systems, software packages, and database management systems. Each of these requires staff with specific skill sets to implement new applications, diagnose and fix problems, and install, configure, and manage software. Centralizing transactional and analytical processing will reduce the wide diversity of required staff skills and ease your maintenance schedule.
A separate but related issue is the services that you delegate to external providers, such as cloud storage, application development, or database management. Delegating some staff-intensive tasks, such as database storage and management, to a cloud provider (sometimes called Database as a Service, or DBaaS) can help you shorten development times; however, having your data in the cloud still involves a great deal of data movement to and from the provider. As data volumes and transaction rates increase, you may want to consider clawing back some cloud services and implementing one or more HTAP variations.