Big Data implementations are more than just lots of data. Of equal importance is the analytics software used to query the data. Analyzing business data using advanced analytics is common, especially in companies that already have an enterprise data warehouse. It is therefore only natural that your big data application must be integrated with the existing warehouse.
Every new big data application means more and more data arriving quickly. Data must be extracted from source systems, verified, transformed, and loaded into databases and, optionally, a big data appliance (such as the IBM DB2 Analytics Accelerator, or IDAA). A natural outcome of adding these new data and processes is the expansion of the hardware and software infrastructure and support staff. This typically means:
- Network upgrades (to handle the large increase in data transport);
- Storage media upgrades for big data storage and backup and recovery;
- Additional staff with skills in managing big data hardware and software; and,
- Upgrades to the data warehouse environment.
Of these, the last item is the focus of this article.
Upgrading the Data Warehouse
Consider your current enterprise data warehouse. It consists of a large (or very large) database, a database management system and processes to perform data extract, transformation and load. A complement to this is the analytics or reporting software that exists across departments that consume warehouse data.
When you begin your first big data project it is common to consider an appliance like IDAA. As part of your feasibility study, you should consider the following:
- Will the IDAA store only production data, or will test data be stored there as well? This affects how much storage you allocate, how many non-production environments (such as development, test and user acceptance) will require an appliance, how many database administrators and business analysts will be needed that have special skills, and so forth.
- How many appliances are required in the short and long term? Many enterprises implement big data applications and analytics that provide valuable information about customers, sales trends, product delivery, marketability, and more. The result: lines of business now consider the big data application to be mission-critical. In other words, you must now implement a disaster recovery environment … complete with a backup appliance.
- Will all data reside in the appliance, making it eligible for high-speed analytics? If not, what data will be stored there? Current data or historical data or both? Static data, dynamically changing data (such as transactional information) or both? This choice will affect the performance of data load processes and analytics queries.
Communicating with Warehouse Users
There are three categories of data consumers for the data warehouse in general, and big data in particular.
Technical users running direct queries. These users create their own queries against data tables using structured query language (SQL). They then use an online SQL execution tool to run the query and produce the results in raw data form, which they can then observe directly or download to a spreadsheet program for further analysis. These users know the data tables, have SQL expertise, and use simple tools to refine the results.
Sophisticated report analysts. These consumers typically use a sophisticated reporting tool that displays a pictorial data model of the data. They then manipulate the model by dragging and dropping tables and columns to a report window. The tool then creates the appropriate SQL statements based on the model and other parameters, executes the query, and displays the results. These users know the data, usually do not have SQL expertise, and require some advanced query and statistical reporting techniques.
Data mart consumers. These users have their own highly-specialized business data analytical software. They directly extract business data from the source and store it on a local server. They then use special-purpose software to analyze the data.
Any changes to the enterprise data warehouse infrastructure must take these communities into account. They should get advance notice of upcoming changes, especially the availability of advanced analytics packages. This is usually done as a part of the feasibility study for big data; the more use cases you have for obtaining actionable results from analytics, the greater your ability to sell the big data implementation to management.
Some Data Warehouse Challenges
Integrating big data into a data warehouse involves some risks. Here are some common issues.
Big data initial load. Appliances hold lots of data. How long will it take to load your initial data? Once it is loaded, how will you keep it up to date? How will you purge old or stale data? How will you back it up for recovery purposes? Finally, will any of this activity affect query performance?
Disaster recovery. Most data warehouses exist for analysis and reporting, not for processing business data or customer transactions. Consequently, most data warehouses are rated as a low priority for recovery in the event of a disaster. However, as noted above, a highly successful big data application soon becomes mission-critical. How, then will you prepare for the recovery of this huge amount of data in the event of a disaster?
Performance tuning. Most big data appliances are implemented as-is, with very few ways to configure or tune performance. While this is fine during the initial implementation, when data volume grows and the amount of querying, reporting analytics rises, is the appliance up to the challenge?
Managing customer expectations. Current data warehouse consumers typically do not have advanced analytical tools, at least for direct access to the warehouse. This is a direct result of large data volumes and the complexity of analytical queries. As the number of tables accessed rises, performance falls. Queries that originally took seconds now take minutes, or sometimes hours.
With the advent of the appliance, “crazy fast” queries now become a reality. Naturally, consumers will then increase the number and complexity of their queries and reports, access more data across more dimensions and extended historical time periods. Queries that originally accessed customer data for a single geographic area for the current month now span all geographic areas for the past five years.
The result will be an increase in appliance usage leading to a decrease in performance, typically a steady increase in query elapsed times. Data warehouse owners must manage their consumer’s expectations, perhaps even limiting their access or capping resource usage.
The last major change will be the makeup and management of data warehouse support staff. This must expand to include business analysts that know warehouse data, are familiar with the new analytics software, and can interface with consumers to help them find solutions to their problems.
In addition to being familiar with the warehouse database design and big data application, business analysts must become adept at implementing standard processes. These include:
- Bulk data transfer from operational systems into the warehouse staging areas;
- Data cleaning operations to remove or fix invalid fields and deal with missing data;
- Surrogate key algorithms for keying important fields such as account numbers;
- Loading data from staging areas into dimension and fact tables;
- Coordination and implementation of data extracts sent to external systems.
Thus, the analysts become quite valuable members of the data warehouse team. They must respond quickly to business needs, balance requirements against implementation options using their technical knowledge, and build the requisite structures and processes.
Big data has been presented to information technology departments as a wonderful, fast new solution. While big data provides many potential benefits, the inevitable integration into the enterprise data warehouse means you should proceed with caution. Just as adding a large engine to a small car requires strengthening the frame, transmission, and brakes, implementing a big data application means strengthening your data warehouse infrastructure.