Early data warehouse implementations began as collections of financial and customer data that were accumulated over time. This allowed for historical reporting and some trend analysis. Modern warehouses have evolved into complex and elegant enterprise analytics platforms, hosting a broad collection of multiple data types, queried by advanced business intelligence software.
Every day more data arrives, the total amount of data grows larger, and analysts across the organization are presenting the warehouse with higher volumes of complex queries. As the warehouse environment becomes more valuable, capacity planning becomes critical.
In this article we present several strategies for managing data warehouse capacity planning and performance tuning.
General Principles
Best practices for managing data in enterprise databases require adherence to a hierarchy of rules that guides the database administrator in prioritizing tasks. In order from highest priority to lowest, these rules state that data must be:
- Recoverable;
- Available to applications;
- Secured;
- Quickly accessed.
In the context of a data warehouse, the priority of these rules changes somewhat.
Data recoverability. In reality, few data warehouse implementations are mission-critical. If disaster strikes, the importance of recovering the warehouse takes a back seat to accounting systems, external customer-facing systems, and the like. It is common for warehouses to have a disaster recovery designation of “recoverable within 3 business days”. In addition, warehouse extract-transform-load (ETL) processes are usually executed at high speed using processes that do not back up the data. For these and other reasons, data recoverability is not the primary consideration in the data warehouse environment.
Data availability. This is the paramount issue in a data warehouse environment. If the data isn’t available for loading or for analytics, then you aren’t serving your customers. The database design usually does not contain primary and foreign key definitions, since these already exist in the source systems. There is little need to enforce business rules such as data validation; again, this is already done at the source.
Security. Oddly, this is commonly not much of a concern in the warehouse. Why? Data warehouses typically contain facts that are gathered from existing systems, analyzed and aggregated, along with dimensions used for selection and subsetting. For example, “analyze account balances across geographic regions X, Y, and Z; determine if average balances are higher in certain sub-regions.” Here, the facts are the account balances and the dimensions are the regions. With this type of analysis one does not find highly sensitive identity data such as birthdates or ID numbers. True, facts such as account balances should be protected; however, there is rarely a complex, column-based and accessor-based security mechanism. Rather, the warehouse is typically given an overall security designation such as Internal Use, and users are granted access depending upon their need to use the data.
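As a minimal sketch of this coarse-grained approach (the role, schema, and table names are illustrative, and GRANT syntax varies by DBMS), access might be granted at the table level to a reporting role rather than column by column:

```sql
-- Hypothetical example: coarse-grained, role-based access for an
-- "Internal Use" warehouse, rather than column-level security.
CREATE ROLE dw_analyst;

-- Read-only access to the fact and dimension tables
GRANT SELECT ON dw.fact_account_balance TO dw_analyst;
GRANT SELECT ON dw.dim_region           TO dw_analyst;

-- Individual users are added to the role based on their need to use the data
GRANT dw_analyst TO user_jsmith;
```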
Performance. Performance, like data availability, is important in the data warehouse. It is usually factored into two distinct workloads: bulk loading of current data, and business analytics queries.
Warehouse Capacity Issues
Resource capacity analysts must consider the warehouse and its infrastructure as a set of resources that need to be managed. These resources are:
Data storage. This usually includes current disk arrays for data storage, storage for database backups (disk and/or tape), storage for table indexes (important for query speed), and storage for archive data. This last can be important. For example, consider a warehouse containing customer financial data. Warehouse queries and business analytics may act on only the most recent data; however, the enterprise may be required for either legal or compliance reasons to maintain a 7-year history of such data. If this data needs to be queried, it must be available in some form for either immediate analysis or loading into the data warehouse.
Data movement and analysis. Data moves into and out of the warehouse via various processes:
- Regular data load processes (ETL) take current operational system data and load it into the warehouse;
- Regular purge or archive processes may remove selected data;
- Queries against the warehouse return results to users;
- Other processes may ‘publish’ subsets of the data warehouse to other warehouses or data marts for more detailed analysis.
This data movement requires disk storage for transmission and reception, network hardware and software for data transport, and CPU power at all nodes to ensure that data moves quickly and accurately. The ability of external software packages and SQL queries to access the warehouse is also a resource. As the number of query users and the size and complexity of query SQL grow, database processing may become a constraint.
Capacity Planning Specifics
As with planning for a big data application, enterprise data warehouse capacity planning revolves around managing the resource categories mentioned above. Performance tuning is an important part of the equation, as increased performance (in terms of resource utilization) can sometimes mitigate the need for adding resource capacity.
The Warehouse Datastore
Consider a data warehouse containing 100 terabytes (TB) of data. Sounds pretty big, doesn’t it? Yet the numerical size of the warehouse has little meaning without some measure of data movement and data retrieval. For example, compare two data warehouses whose performance (daily data load times and query data retrieval times) is similar. If one contains 1 TB of data and the other 100 TB, why is performance nonetheless comparable? The point is that size by itself may not be a major factor.
In the case where we have a lot of data there are, of course, potential issues that the database administrator, DW process manager and BI Analytics manager must face. From the physical database perspective, the size of the warehouse probably prohibits executing tablespace reorgs, and maybe index reorgs as well. This naturally leads to database design that incorporates bulk loads of data pre-sorted into clustering sequence. Execution of full image copies may also strain media storage resources.
Another consideration is that from the DW process manager’s perspective, a large datastore is commonly accompanied by a large number of physical and logical objects. These include tables, indexes, columns and other data architecture items such as stored procedures, triggers, and primary and foreign key relationships.
Warehouse Datastore Recommendations
Object naming standards. Smart table and column naming conventions are a must, as is a metadata dictionary. These will support business analytics software engines that create SQL queries based upon the warehouse table design.
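For illustration only (the prefixes, table names, and column names below are hypothetical), a convention might distinguish fact and dimension tables and keep key column names identical across both, so that BI engines can generate joins predictably:

```sql
-- Hypothetical naming convention: dim_/fact_ prefixes, consistent key names.
CREATE TABLE dim_region (
    region_key   INTEGER      NOT NULL PRIMARY KEY,
    region_code  VARCHAR(10)  NOT NULL,
    region_name  VARCHAR(50)  NOT NULL
);

CREATE TABLE fact_account_balance (
    account_key   INTEGER        NOT NULL,
    region_key    INTEGER        NOT NULL,   -- same name and meaning as in dim_region
    balance_date  DATE           NOT NULL,
    balance_amt   DECIMAL(15,2)  NOT NULL
);
```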
Standardize ETL processes. Create a standard process template for operational systems data extract, transformation and data cleaning, and warehouse data load. This speeds up the creation and population of new warehouse tables, as well as making it easier for warehouse support staff to create and maintain ETL logic. One typical method is to create processes that are SQL-based; that is, create SQL statements that extract data from current systems to create loadable data files. This minimizes the number of tools that staff must understand, and speeds onboarding of new staff as well.
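A sketch of the SQL-based approach described above, using hypothetical source and staging names; because unload-to-file mechanics and date arithmetic are vendor-specific, a staging table stands in here for the loadable extract file:

```sql
-- Extract step: pull yesterday's activity from the operational system
-- into a staging table (a stand-in for a loadable extract file).
INSERT INTO stg_account_balance (account_id, region_code, balance_date, balance_amt)
SELECT a.account_id,
       a.region_code,
       a.balance_date,
       a.balance_amt
FROM   ops.account_balance a
WHERE  a.balance_date = CURRENT_DATE - INTERVAL '1' DAY;

-- Transform/load step: map operational codes to warehouse keys
-- and load the fact table.
INSERT INTO fact_account_balance (account_key, region_key, balance_date, balance_amt)
SELECT s.account_id,
       d.region_key,
       s.balance_date,
       s.balance_amt
FROM   stg_account_balance s
JOIN   dim_region d ON d.region_code = s.region_code;
```

Keeping every feed to this shape means a new warehouse table needs only a new pair of statements, not a new tool.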
Long-running ETL processes may lock data, preventing query access until loads are complete. There are options to mitigate this risk, including so-called on-line loads. On the other hand, eager users who get in early and begin running queries may lock out ETL processes. Depending upon the size of the user base and the bandwidth of ETL processes, it may be necessary to invent ways for the two to co-exist.
One option is a variation of the shadow table concept. A frequently-used DW table is designed as two partitions. (For tables already partitioned by date range, simply double the number of partitions, reserving two for each range.) At any given time one partition is designated active and the other inactive, with a separate control table indicating which partition is currently active. ETL processes load the inactive partition while queries read the active one; when the load completes, the control table is updated to swap the designations.
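One way to sketch the control-table mechanism (all names are hypothetical, the fact table is assumed to carry a partition indicator column, and the actual partition-switching syntax varies widely by DBMS):

```sql
-- Control table: one row per shadowed warehouse table,
-- recording which partition queries should currently use.
CREATE TABLE dw_active_partition (
    table_name        VARCHAR(128) NOT NULL PRIMARY KEY,
    active_partition  SMALLINT     NOT NULL   -- 1 or 2
);

-- Queries (or views generated for the BI tool) read only the active partition.
SELECT f.*
FROM   fact_account_balance f
JOIN   dw_active_partition c
       ON c.table_name = 'FACT_ACCOUNT_BALANCE'
WHERE  f.partition_no = c.active_partition;

-- ETL loads the inactive partition, then flips the flag when the load completes.
UPDATE dw_active_partition
SET    active_partition = CASE active_partition WHEN 1 THEN 2 ELSE 1 END
WHERE  table_name = 'FACT_ACCOUNT_BALANCE';
```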
Enterprise data directory. This documents the names, attributes, and usages of data elements within the warehouse. As the warehouse grows, business analysts must understand more and more of the warehouse data in terms of its use in the business. Does an account number refer to a customer account or a general ledger account? Can an account balance be negative? Can an account maturity date be missing? Answers to these questions will affect the results of analytical queries.
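As a minimal sketch (the columns and sample annotations are illustrative), the directory itself can be kept as a queryable table alongside the warehouse, so analysts and BI tools can look up the business meaning of each element:

```sql
-- Hypothetical enterprise data directory table.
CREATE TABLE dw_data_directory (
    table_name     VARCHAR(128)  NOT NULL,
    column_name    VARCHAR(128)  NOT NULL,
    business_name  VARCHAR(200)  NOT NULL,  -- e.g. 'Customer account balance'
    nullable_rule  VARCHAR(200),            -- e.g. 'Maturity date may be missing for demand accounts'
    value_rule     VARCHAR(200),            -- e.g. 'Balance may be negative (overdraft)'
    source_system  VARCHAR(128),            -- operational system of record
    PRIMARY KEY (table_name, column_name)
);
```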
Warehouse Data Movement
The total size of the data warehouse may not be great. Indeed, if you are implementing your first data warehouse, it starts off empty. A lot now depends upon how much data must be extracted from operational systems, transformed and cleaned, then loaded into the data warehouse. The rate of data retrieval from external systems may depend upon the ability of batch jobs to extract data, FTP connections to transmit files, and external network feeds to bring in data from outside the organization. There may be multiple steps involved in transforming the data received: How are dates and times stored? What is to be done with missing or unknown data? Should NULLs be employed? Should defaults be supplied for missing data so that BI queries work correctly?
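A sketch of how some of these decisions might look in SQL during the transform step (names are hypothetical, and the policies shown are one choice among many; here unknown region codes get a default and a sentinel date stands in for NULL so that BI queries behave predictably):

```sql
-- Transform step illustrating date conversion, defaults for missing data,
-- and an explicit policy on NULLs (names and policies are illustrative).
SELECT s.account_id,
       COALESCE(s.region_code, 'UNK')               AS region_code,   -- default for missing region
       CAST(s.balance_date_text AS DATE)            AS balance_date,  -- standardize date representation
       COALESCE(s.maturity_date, DATE '9999-12-31') AS maturity_date, -- sentinel instead of NULL
       s.balance_amt
FROM   stg_account_balance_raw s;
```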
Warehouse Data Movement Recommendations
As noted previously, data movement to and from the data warehouse occurs in three separate stages: the ETL processes that load data; the purge/archive processes that remove data; and the SQL query and analytics processes that compute and return results to users.
Standardize data archival. Eventually, old data must be removed. The data warehouse and its processes must be designed so that archiving and purging are low-cost. One common method is partitioning fact tables by date or date range. This allows a purge of the partition by the simple expedient of a Load utility using no data (or the equivalent truncate table statement, depending upon your database management system vendor and version). A variation on this method is to use rotating partitions. If DW process management can set a standard for data retention (say, 10 years), then database administration can create tables partitioned so as to hold exactly 10 years of data. At the end of that time, rather than simply deleting the oldest data, the partition is instead re-used by loading in the newest data. This method requires a second control table that specifies which partitions are assigned to each date or date range.
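For example (partition DDL and the truncate-partition statement vary by vendor, and all names are illustrative), purging the oldest data then reduces to emptying or re-using a single partition rather than deleting rows:

```sql
-- Fact table range-partitioned by quarter so that purge and archive
-- touch only one partition. Partition syntax is vendor-dependent.
CREATE TABLE fact_account_balance (
    account_key   INTEGER        NOT NULL,
    region_key    INTEGER        NOT NULL,
    balance_date  DATE           NOT NULL,
    balance_amt   DECIMAL(15,2)  NOT NULL
)
PARTITION BY RANGE (balance_date) (
    PARTITION p2023_q1 VALUES LESS THAN (DATE '2023-04-01'),
    PARTITION p2023_q2 VALUES LESS THAN (DATE '2023-07-01'),
    PARTITION p2023_q3 VALUES LESS THAN (DATE '2023-10-01')
);

-- Purge the oldest quarter by emptying its partition rather than running a mass DELETE.
ALTER TABLE fact_account_balance TRUNCATE PARTITION p2023_q1;

-- Rotating-partition variation: a second control table records which date
-- range each partition currently holds, so the oldest can be re-used.
CREATE TABLE dw_partition_rotation (
    table_name       VARCHAR(128) NOT NULL,
    partition_name   VARCHAR(128) NOT NULL,
    date_range_start DATE         NOT NULL,
    date_range_end   DATE         NOT NULL,
    PRIMARY KEY (table_name, partition_name)
);
```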
Apply standard SQL tuning techniques. The data warehouse differs from operational systems in several ways. The biggest difference is that queries against the warehouse usually do not interfere with simultaneous data load or update processes. This means that warehouse tables and indexes can be tuned in ways that might be inappropriate for operational systems. There are two major tuning mechanisms available to the data warehouse administrator: data partitioning and index creation.
It is typical to partition operational tables so that ‘hot spots’ are avoided. These are physical locations within the database where simultaneous updates, deletes, inserts and queries take place. Queries must wait until database updates are committed or rolled back in order to ensure that the data is consistent. In the data warehouse, most queries take place after ETL has loaded data. In effect, the warehouse looks like a query-only environment. This means that tables can be partitioned to make queries execute more efficiently. For example, an account table may be partitioned by account creation date. This groups recently created accounts into a single partition, allowing for high-performance access to these accounts.
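As a brief illustration (names and partitioning DDL are hypothetical and vendor-dependent), partitioning the account table by creation date lets a query against recent accounts touch only the newest partition:

```sql
-- Account table partitioned by creation date; queries restricted to recent
-- creation dates are pruned to the newest partition.
CREATE TABLE dw_account (
    account_key    INTEGER  NOT NULL,
    customer_key   INTEGER  NOT NULL,
    creation_date  DATE     NOT NULL,
    account_status CHAR(1)  NOT NULL
)
PARTITION BY RANGE (creation_date) (
    PARTITION p_older   VALUES LESS THAN (DATE '2023-01-01'),
    PARTITION p_2023_h1 VALUES LESS THAN (DATE '2023-07-01'),
    PARTITION p_2023_h2 VALUES LESS THAN (DATE '2024-01-01')
);

-- Only p_2023_h2 needs to be scanned for this query.
SELECT COUNT(*)
FROM   dw_account
WHERE  creation_date >= DATE '2023-07-01';
```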
Warehouse indexes can speed queries by providing multiple access paths to table data, even allowing for index-only access in some cases. In operational tables a proliferation of indexes can greatly slow down insert, update, and delete operations. In the warehouse, there is no such activity. Inserts are segregated to the ETL process, and deletes are segregated to the purge and archive process. Hence, the warehouse administrator can add more table indexes as necessary to enhance query performance.
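For instance (index and column names are hypothetical), a composite index that carries every column a common query touches can allow index-only access, and the usual operational worry about slowing inserts does not apply here:

```sql
-- Hypothetical covering index: a query that reads balance_amt by region and date
-- can be satisfied from the index alone, without touching the table.
CREATE INDEX ix_fact_bal_region_date
    ON fact_account_balance (region_key, balance_date, balance_amt);

-- Example query that can be answered via index-only access using the index above.
SELECT region_key,
       AVG(balance_amt) AS avg_balance
FROM   fact_account_balance
WHERE  balance_date >= DATE '2023-07-01'
GROUP BY region_key;
```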
In Conclusion
One additional area of concern for the growing data warehouse is the support staff. Warehouse analysts must expand their horizons to become aware of the data now stored in the warehouse. They must also be aware of any new BI analytics software in order to assist users in developing reports and queries. Data scientists who understand the data as well as the database management system and the warehouse will be needed to monitor and tune performance. Finally, experts knowledgeable in source and operational systems will be called upon to determine what enterprise data should be added to current data feeds to the warehouse.