Tricks for Integrating Big Data into Your Data Warehouse

The promise of big data is the ability to use high-speed advanced analytics on large stores of relevant customer data to discover new relationships that lead to a deeper understanding of customer needs, optimal pricing, better product and service delivery, and ultimately higher profits.

The ability of these analytics to deliver quality solutions depends not only on the large stores of raw data analyzed, but also on integration with the enterprise data warehouse.  For example, your big data application may contain incredibly detailed transaction information on all customer purchases; however, it is the data warehouse that contains entity identification (product and service names), summarization and aggregation categories such as geography (what stores are in what areas), product categories, sales territories, management hierarchies, pricing data, customer categories and profiles, and so forth.

In many cases, the success of a big data application can be traced to how well it is integrated into your enterprise data warehouse. This article presents several ways to get this done quickly and efficiently from the beginning.

The Big Data Appliance

In today’s commercial information technology systems, larger and larger volumes of data are being gathered and stored. To be able to capture, store, and analyze this data, most enterprises turn to specialized hardware and software solutions. One common solution category is a special-purpose hardware data store (sometimes called an appliance) coupled with business information analytics software that is tailored to access that appliance. One example is the IBM DB2 Analytics Accelerator (IDAA).

Configure appliance for high-capacity.  Appliances are able to execute high-speed queries because they have a proprietary high capacity disk storage array that allows for massive parallel processing. This process only works if all tables referenced in queries are present and stored in the appliance. If your big data application is referenced by multiple user communities issuing complex queries that span many business entities and relationships, you should accept the fact that all of your data warehouse tables will be stored in the appliance.

Note that your data warehouse tables should also exist within your current database management system (DBMS); this is because your current operational and common warehouse queries can execute more quickly in the DBMS.

Limit initial access to the appliance. Most big data applications are installed to support specific use cases. These are proposed or current analytics which are expected to provide immediate savings or profits. It is best to limit appliance access to these users, especially if they are using a standard suite of business intelligence (BI) tools for query constructions.  BI tools that are customized for big data allow the users to construct queries in block or diagrammatic form, while building the actual SQL queries in the background. This allows the software designers to take advantage of the appliance by using appropriate SQL syntax such as star join access paths.

An additional reason for initially limiting appliance access to a small number of consumers is related to performance tuning. The first big data appliances were delivered to customers with no performance tuning capabilities. This was because the state-of-the-art hybrid software and hardware appliances were deemed to be so extremely fast that no performance tuning was possible or even desired.  Now, as more and more users acquired experience with appliance data storage and query access paths, newer appliances have appeared with performance tuning options.  Limiting appliance access will allow your appliance support team to get experience with tuning the appliance in a semi-static environment.

Determine how you will handle changed data. In order for an appliance to return useful information, the data must be up-to-date. This includes the data warehouse data, which we have already recommended be instantiated into the appliance. However, since most data warehouse extract-transform-load (ETL) processes execute daily, you must now coordinate loading this data into the appliance as well.  ETL usually involves a final step of executing a database load utilities to get the transformed data into the data warehouse. Now, there is an extra step required, loading the data into the appliance. The danger of not loading new data into the appliance is that queries that attempt to access data will get one answer from the data warehouse tables, and a different answer from the appliance tables.

The simplest way to handle this is to add steps to your warehouse ETL process that loads data into the appliance immediately after loading it to the database. However, there are many data load options.  What if the ETL process is adding records to a table, rather than replacing the entire table?  Appliance load processes usually allow a specification of ‘load changed data’ that allows the appliance to load only data that has changed recently. This method also works for partitioned tables, where only one partition is loaded at a time.

Preparing the Warehouse

Upgrade the data warehouse architecture. The enterprise data warehouse already serves as a platform for business intelligence queries. Data warehouse business analysts submit queries that are used to analyze data in the warehouse; hence, the warehouse already contains category and dimension tables used for subsetting, summations, and aggregations. Such category data are usually stored in dimension tables such as product category, customer type, geographic areas, and so forth.

With the big data use cases in mind (BI queries that have already been designated as high-return with  the big data implementation), review your warehouse to ensure that all appropriate category and dimension tables exist, and are populated with the proper data. Further investigation may be needed to determine if the original source system data is correct and timely. Review your enterprise data dictionary and metadata to confirm that data element definitions are correct and relevant.

Speed up ETL processing, especially data warehouse data loads.  As the number and complexity of queries increases, BI users will demand greater use of the big data repository. However, since big data requires integration with the data warehouse, this naturally translates into a requirement for faster data processing of warehouse data. Analyze the entire data  transformation chain, include acquisition of source data, data cleansing and transforming, and the final data warehouse load process. Consider capacity upgrades of the network and associated servers in order to handle the increased data volumes. Analyze old or stale warehouse data, and purge or archive that data which is not used. Review data load procedures for speed enhancements, including parallel table loading, or partitioning schemes that allow loading of new data while old data is queried simultaneously.

Begin training staff for their new jobs. Your big data application and enterprise data warehouse will become integrated and grow together. This will change the makeup of the data warehouse support staff. Warehouse analysts must expand their horizons to become aware of data now stored in one or more big data applications. They must be aware of any new BI analytics software in order to assist users in developing reports and queries.

As big data consumption expands, performance will become an issue. Data scientists that understand the data as well as the database management system and appliance will be needed in order to monitor and tune performance. Finally, experts knowledgeable in source and operational systems will be called in to determine what enterprise data should be added to current data feeds to the warehouse and appliance.

Summary

It is clear that implementing a big data application requires significant hardware, software and staffing resources. In order to ensure that this application integrates well with your data warehouse, consider the suggestions mentioned:

  • Configure the appliance for high-capacity, and plan to include your entire data warehouse within it;
  • Limit initial use of the appliance to allow your analysts to gain familiarity with the data and typical BI queries;
  • Review current plans for purging stale or old data;
  • Increase the speed of your ETL processing, especially data warehouse table loading;
  • Finally, prepare your staff for their greater responsibilities in the new big data environment.

See all articles by Lockwood Lyon

Lockwood Lyon
Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles