Big data applications are now in place to support business analytics processing. However, many standard performance tuning options (such as indexes) may no longer apply. Here we investigate the differences between operational systems and big data systems, and present common options for big data performance tuning.
Warehouse Tuning
In most cases, big data must be tightly integrated into your enterprise data warehouse (EDW) environment. Most of the attributes that analytical queries will use for aggregation (summation, averages, etc.) are already defined in the EDW. Some of these include:
- Geography — countries, provinces, states, and other location information relating to customers and product and services locations. Used when comparing measures such as product sales or customer buying habits between regions.
- Time — Months, quarters, years. Used for year-over-year or month-to-month analyses.
- Hierarchies — Organizational (how are different departments performing), products (which product classes are selling well), etc.
In SQL terms, analytical queries will join big data transactional data with warehouse data, then order or group by the above aggregation dimensions. Consequently, analytics performance must take into account the performance of the enterprise data warehouse.
The EDW is designed as a data store to hold regular snapshots of data from operational systems. A typical warehouse has an overnight load cycle that loads one day’s worth of data, followed by query-only access during the day. This type of access allows the database designer much more freedom, especially when doing performance tuning.
Data warehouses are not systems-of-record; consequently, database recovery has a low priority. Also, many features typically used in a relational design such as primary and foreign keys, referential integrity, and triggers are not needed.
The warehouse is typically designed for fast query access and data availability. Commonly this results in what is called a dimensional design. Here, the data is stored in two primary categories of tables. Dimension tables contain selection criteria (these are the categories of data used for groupings or aggregations mentioned above). These tables typically do not change very often, so they can benefit from additional indexes for faster access.
Fact tables contain the data which is to be aggregated or summarized. In a relational design, these tables would all have primary keys, and would represent entities such as customer, product, order, etc. These tables are typically time-partitioned; that is, a separate physical database dataset or partition is used for data in the same time period. For example, each partition of an order table might contain orders for a single day.
Data warehouse processing happens in two phases: extract-transform-load (ETL) and query. Here are some tuning techniques for each phase.
Tuning Techniques for the ETL Phase
During the ETL phase large volumes of data are loaded into the DW. The DBA should take this into account during database design.
The DBA can use key clustering to ensure that inserted rows do not create hot spots in the data. One common way is when new records are assigned a key based on a sequentially ascending value (such as an Identity column or a Sequence), and the table is clustered on this key. The result is that newly-inserted rows are added together in a section of the table.
These options must be coordinated with index design, because physical clustering of the table may depend on designating one of the indexes as the clustering index. Since adding a surrogate key to a table in order to cluster it necessarily means creating a clustering index for that key, this may affect the total number of indexes for the table.
Another technique is using bulk loading methods rather than SQL Insert statements. The most common bulk loading method is the DB2 Load utility. The DBA will ensure that the extracted data is pre-sorted in key clustering sequence before passing it to Load.
If possible, the DBA will schedule DW table loading during a time when queries are minimal, in order to avoid contention with user queries. Another option is to use one of these techniques:
- The Shadow Table — The DBA creates a copy of a table. During ETL, the copy is loaded while the original is available for querying. After the load is finished, the DBA executes a process that switches the names of the tables. The copy now becomes the original, and queries access this version.
- Active / Inactive Partitions — Tables are designed with multiple partitions. A separate control table contains one row for each of the original table’s partitions. These rows contain activity switches that indicate which partitions of the main table are active. The ETL process turns an indicator off; it is now free to load the inactive partition. When done, the indicator is turned on. BI queries are designed to access only active data.
Tuning Techniques for the Query Phase
Analytics users will use tools to construct queries rather than writing SQL statements themselves. These tools provide a visualization of warehouse tables along with descriptions of table relationships, usually using some form of entity-relationship diagram.
Data warehouse queries are sometimes described as ad hoc, or invented for a one-time purpose. Thus, DBAs consider tuning a data warehouse as a complex process that must balance the performance of potentially thousands of different queries.
In reality, the database design of the data warehouse results in highly-predictable access paths. Queries will almost always require one or more columns from a fact table, with aggregations requiring joins to one or more dimension tables. This means that the DBA can be quite proactive about tuning SQL queries against the DW using one or more of the following techniques:
- Standard fact table index — Create an index on the fact table that includes all columns that may be joined to dimension tables. (In relational database designs these would all be foreign key columns.) This standard index will be required in order for the database management system (DBMS) to choose what is called a starjoin access path. This high-performance access path involves pre-joining dimension table rows into a large temporary table before joining to the fact table.
- Additional fact table indexes — Since fact tables contain the largest part of the data returned by BI queries, these tables may require additional indexes in order to provide alternative access paths to the table. These indexes will usually take the form of one dimension column and one or more additional fact table columns that are accessed frequently.
- Gather rich table and column statistics — The DBMS uses data distribution statistics in order to determine the lowest cost access path for a query. Most DBAs will have the system gather and store statistics about table cardinality (number of rows) and column statistics about the leading columns of indexes. Rich statistics means gathering additional information, including frequent column values, column value correlations (e.g., columns like ZipCode and City should correlate), and even histogram statistics.
- Regularly reorganize non-clustered indexes — Each table having one or more indexes has one designated the clustering index. The key columns of this index define the preferred physical order of rows in the table. In a DW the ETL process usually bulk loads new rows into the table in this order, so row clustering is maintained. However, the keys of the remaining indexes (if any) are not in this order. The result: such indexes may become disorganized, leading to poor performance. The DBA should ensure that such indexes are regularly reorganized.
- Consider summary tables for pre-aggregation — Summary tables (also called materialized query tables in DB2) are populated after fact table data load and before BI querying begins. These tables contain common aggregations of the fact table data such as summations over time, minimums, maximums and averages. This provides BI queries with ready-made results, allowing the DBMS to bypass complex and expensive calculations (perhaps several times in one day).
Big Data Application Tuning
With these things in mind, here are some suggestions for performance tuning your big data application.
Review data distribution statistics. Use the RunStats utility to gather information about table keys and data distribution. This is especially true for indexes that support primary and foreign keys, as related tables are usually joined on these columns.
Review data access paths. While it is true that many analytical queries are ad hoc in nature, data tables will most likely be accessed and joined using the same or similar access paths across these queries. Capture and analyze these access paths looking for common join methods. This information, plus that from data distribution statistics, will assist you in determining which tables should be distributed by key in your big data application.
Store data access paths for analysis. Expanding upon the suggestions above, you should have a method of capturing and storing the access paths for all analytical queries. The results will show how tables and indexes are accessed, which indexes are used, what sorts are performed, and so forth. As queries access more data and data volumes increase re-review your historical access paths and compare. Factor in increases due to data volumes, and watch for access path changes that may indicate performance problems.
Summary
Big data applications are tightly coupled to the enterprise data warehouse. Any study of overall performance, especially that of analytical queries, must take this into account. The first step in performance tuning big data is to ensure that the data warehouse is performing well.