Until recently, business analytics against big data and the enterprise data warehouse had to come from sophisticated software packages. This was because many statistical functions such as medians and quartiles were not available in basic SQL, forcing the packages to retrieve large result sets and perform aggregations and statistics locally. Today, many database management systems have incorporated these functions into SQL. This includes IBM's flagship product, Db2.
Many large IT shops implemented big data solutions over a decade ago. At that time, the science of statistics was well-established. Business analysts already had lots of experience analyzing subsets of data from operational systems as well as time series data in the enterprise data warehouse. These analyses included basic statistical functions such as minima, maxima and means, as well as advanced functions such as percentiles, cubes and rollups.
Along with big data solutions came software packages that allowed business analysts to use a visual interface to select data elements and specify aggregation criteria and statistical calculations. The software then automatically creates SQL to gather the relevant data. However, a critical performance issue arose with big data. While scanning large amounts of data quickly was a feature of big data solutions, advanced statistical calculations could not be executed there.
This required returning huge amounts of data to the analyst’s software, which required a configuration with large amounts of memory and CPU power. This also spawned the idea of creating local data marts to hold subsets of the warehouse and big data in order to run extensive statistical calculations locally. Luckily, database management systems (DBMSs) stepped up and delivered multiple new SQL functions to help business analysts.
Most DBMSs already provided basic statistical operations such as the following:
- Sum, minimum and maximum
- Average (arithmetic mean)
- Standard deviation
- Variance and covariance
In modern DBMSs that support a big data solution (and, to a lesser extent an enterprise data warehouse), it is now necessary to support more advanced functions for usability and performance reasons.
IBM Db2 SQL Enhancements
IBM has implemented multiple statistical functions in its flagship relational database product Db2. These include Median and Percentiles as well as Cube and Rollup.
Calculating an average of a set of numbers seems like a basic operation. However, the term “average” has multiple meanings in statistics. Three of these are the mean, the median and the mode. Even the mean has several variations. The arithmetic mean is the most commonly used mean, and this function has existed in ANSI standard SQL for decades. In recent version of Db2, IBM has expanded its SQL variant to include a median function.
The percentile is an aggregate function that returns the data value within a group of values that corresponds to a given percentile. For clarity, the median value of a group of numbers is the value that is at the fiftieth percentile. If the number of values in the group is even, then the median is interpolated as being between the two nearest values. For example, in the set (1, 2, 3), the number 2 is the median, or fiftieth percentile. In the set (1,2,3,4) the median is calculated as 2.5. Percentiles are a common way of depicting data graphically, the most common example being pie charts.
The percentile function keeps data aggregation, sorting and calculation operations on the host computer, avoiding downloads of massive result sets for local calculation. It also simplifies SQL statements, allowing the database administrator (DBA) the opportunity to capture and tune analytical queries with the goal of increasing performance and decreasing resource usage. For example, if medians and percentiles are required for a particular column value, the DBA might consider an index on that column; alternatively, there are various methods (discussed below) to “pre-aggregate” common calculations.
Cube and Rollup
The cube and rollup functions are similar, in that they are a method for analyzing subtotals of a group of data items, with rollup being a subset of cube. As an example, consider an enterprise accounting system with multiple accounts owned by each department. To analyze account balances across the organization an analyst wishes to calculate subtotals for each department, as well as an overall total. In this scenario, the accounts roll up to departments, which then roll up to a grand total. This might be coded in SQL in this way:
SELECT Department_ID, SUM (Account_Balance)
GROUP BY ROLLUP (Department_ID)
ORDER BY Department_ID
This SQL statement generates a result set with an account balance subtotal row for each department followed by a final grand total row. More complex statements are possible, including multiple levels of subtotals and specification of different grouping characteristics. As with medians and percentiles, including the rollup definition in the SQL allows the DBMS to do a single pass of the data and perform the required calculations efficiently.
The cube function works in a similar fashion by allowing specification of grouping criteria. (See link at the end of this article for details.) Consider the following Account table:
The last three columns are candidates for grouping, as a business analyst may want to review a summary of accounts for specific customer types or account types, or they might want to compare customers across multiple regions. In terms of the rollup function, the need might be to create subtotals for each of these columns, or for combinations of these columns. Some possible requirements might be:
- Average balance for each customer type
- Minimum and maximum balance for each combination of customer type and account type
- Average balance in each region with subtotals for each customer type
Rather than code multiple SQL statements for each possible rollup, the cube function can be used to accomplish this in a single statement:
SELECT Customer_Type, Account_Type, Customer_Region, SUM(Account_Balance)
GROUP BY CUBE (Customer_Type, Account_Type, Customer_Region)
ORDER BY (Customer_Type, Account_Type, Customer_Region)
The result returned by this statement is a result set of rows with the following:
- Subtotal for each combination of (Customer_Type, Account_Type, Customer_Region)
- Subtotal for each combination of (Customer_Type, Account_Type)
- Subtotal for each combination of (Customer_Type, Customer_Region)
- Subtotal for each combination of (Account_Type, Customer_Region)
- Subtotal for each combination of (Customer_Type)
- Subtotal for each combination of (Account_Type)
- Subtotal for each combination of (Customer_Region)
- Grand Total
The ability of this relatively simple SQL statement to deliver multiple rollups is a great boon to both the business analyst and the DBA. Simplified SQL means fewer errors, easier debugging and better awareness of tuning needs.
Tuning for Analytics
The DBA who supports business analysts has several options for reducing system resources while delivering fast query response times. One method is to implement a big data solution such as the IBM Db2 Analytics Accelerator (IDAA), a hybrid of software and hardware that combines a huge disk storage array with massively parallel processing. Allocating Db2 tables in the IDAA permits the Db2 Optimizer to direct SQL statements against these tables to the IDAA, and this usually means extremely fast query execution times. Another option is to store tables in both native Db2 and in the IDAA. The advantage of this option is to provide multiple access paths to a particular table, since native Db2 tables can have indexes defined on their columns.
A third option is to create summary tables, sometimes called materialized query tables (MQTs). The DBA creates these tables by defining an SQL statement that is used to populate the table, and then defining the times when the SQL statement is to be executed. An example will help clarify this.
Consider our Account table defined earlier:
Assume that Account_Table exists in a data warehouse. This means that it is not part of an operational system with ongoing on-line activity and batch processes; rather, it contains rows that are loaded once per day and remain static throughout the day. Let’s also assume that the DBA has captured and analyzed common SQL statements submitted by business analysts and determined that many queries require subtotals by Customer_Region. That is, many queries contain the following SQL syntax:
SELECT SUM (Account_Balance), ...
GROUP BY ROLLUP (Department_ID)
For this static table, the rollup is calculated every time one of these queries executes, with identical results. The DBA can decrease resource usage by creating a materialized query table like this one:
CREATE TABLE Rollup_Acct_Dept_ID
AS (SELECT SUM (Account_Balance), ...
GROUP BY ROLLUP (Department_ID)
After creating this table, the DBA issues the REFRESH command to populate it. Table Rollup_Act_Dept_ID now contains rows with the subtotal information from Account_Table. Since the table data is static in this example, the rollup data need only be calculated once per day. Queries that need the rollup data can now query the MQT directly; alternatively, they can remain coded as-is, and the Db2 Optimizer will automatically access the MQT rather than re-calculate the subtotals.
In the early days of big data, business analysts grew accustomed to getting fast results from their queries. However, tables grew, both in number of columns and in the number of existing rows, and queries became more complex and required lager amounts of data. Eventually, SQL analytical query performance became an issue.
One of the most challenging performance concerns was the increase in complexity of the statistical processes and methods performed on the data. Data-intensive functions that required multiple aggregations and subtotals or other functions that were not supported directly in SQL had to be performed by the analyst’s software. This led to gathering large result sets and transporting them from the big data application across the network to allow the BI software package to complete the calculations.
IBM’s Db2 now includes SQL options that can perform such diverse statistical functions such as percentiles and cubes in the DBMS. This not only greatly reduces the amount of data traversing the network, it also provides the DBA opportunities to tune entire sets of tables and applications rather than one SQL statement at a time.
# # #
For more information:
See all articles by Lockwood Lyon