Big data applications are now fairly commonplace in large organizations. They typically begin as part of an information technology (IT) project that will extract, store and analyze large amounts of existing data in order to reduce costs, predict customer buying patterns, speed product time-to-market, and forecast raw materials and production capacity requirements.
It is, however, difficult to simply ‘drop’ these applications into an existing IT infrastructure and expect to run smoothly. In addition to energy and cooling requirements for new hardware to support the new big data application, other IT areas need to prepare. The major factors that determine whether enhancements will be needed to existing applications include large data storage needs, larger data transmission capacity, and the demands these will place on existing hardware and software.
Big Data Basics
Big data involves the acquisition, transformation, and storage of large volumes of data and its subsequent analysis. To do this, most organizations acquire one or more ready-made solutions from vendors. A popular choice is the IBM DB2 Analytics Accelerator (IDAA), a hybrid hardware and software solution from IBM. This hardware (sometimes called an appliance) includes a multiple terabyte disk storage array in a special-purpose hardware chassis, as well as high-speed networking cables to transfer data from enterprise storage. Once the data is stored in the appliance it can be accessed as if it were a database.
Once the appliance is installed, data from production systems is loaded into it. The appliance and the database management system (DB2 in this case) will coordinate handling data queries. If a query accesses data that is stored in the appliance, the query will run extremely fast. It is this performance advantage that drives an organization’s decision to purchase an appliance. Large amounts of data take a long time to analyze. Many such analytical queries that run for hours accessing standard database tables may run in minutes (or seconds!) when accessing data in an appliance.
This complex and large-scale hardware solution requires a lot of electrical power. Your infrastructure team must review the power requirements and determine if current electrical circuits will be able to handle this additional load. One more point: most IT hardware installations mitigate the risk of power outages by installing backup generators. These generators must be reviewed to ensure that they can provide the additional power required for the appliance.
In addition to the power concerns mentioned above, IT staff must also address the following questions.
- Will the appliance store only production data? If so, does this mean that analytical queries must be tested on production data? If not, how will such queries be tested, and against what data?
- If your big data solutions are (or become) mission-critical to your organization, will you need to consider installing an appliance in your disaster recovery environment? If so, how will you keep the large amount of data at the disaster recovery site in synch with current data?
- If you choose a solution that can store only a portion of your current production data, what criteria will be used to load data into the appliance? In other words, if data in the appliance allows for high-speed querying, what data tables will be placed there?
Other architecture requirements center around the need to process and move large amounts of data. Data must be extracted from source systems, verified, transformed, and then loaded into databases and into the appliance. Naturally, larger amounts of data movement will lead to the following:
- Network expansion, including possible additional parallel data channels to transport data;
- New and bigger storage media, usually in the form of disk arrays, for storing primary data as well as backups;
- Upgrades to database administration automated processes such as database backups, restores, reorganizations, index maintenance, and the like;
- Additional staff to manage and monitor all of the above.
Upgrading Data Warehouse Architecture
Sometimes forgotten during the initial analyses of new big data solutions is the need to review and upgrade your current data warehouse environment. Data warehouse business analysts already know this: business analytical queries that are used to analyze large amounts of data typically require breakdowns of the data by category or dimension. It is not enough to forecast what products will sell; you must have the data summarized by geographic area, customer type, product category, time period (such as season), and so forth. These dimensions already exist in the data warehouse; further, your current analytics environment (which includes regular reporting and ad hoc users) already contains many queries that utilize these dimensions.
Any big data analytical solution will require integration with the data warehouse, and with this integration comes some issues.
- Appliance initial data population. Appliances hold a lot of data. How long will it take to load the data into the appliance? Will successive updates to the data be made synchronously, or will they be loaded in some nightly batch cycle?
- Stale data archive. What is the current state of your data warehouse archive process? How often is old, stale, or unused data purged or archived? How will this affect data in the appliance? Are there ways to purge large amounts of data from the appliance?
- Disaster recovery. How will you back up data in the appliance? Will you have enough room on other storage media to hold all of this data? In the case of a disaster, how long will it take to recover the data?
- Performance and growth. Users will be pleasantly surprise by the quick execution time of their queries. Perhaps so much so that they begin to submit more and more queries against larger and larger data volumes. As larger numbers of users submit more and more queries, will the appliance continue to perform well? How will you monitor this performance, and what performance tuning options exist?
A Plan for Integration
Preparing your enterprise for big data requires a plan. One common method is to begin with the largest integration issue, the data warehouse. Consider how data flows through the warehouse. One common way of mapping this flow is extract, transform and load. Extract refers to the acquisition of data from source systems, transform includes all changes, fixes, and summarizations of the data, and load involves loading data into the warehouse. Consider how a big data solution will affect each point.
The Source Systems
These include your core operational systems. These systems process orders, do accounting, execute customer transactions, do billing, and so forth. These systems produce transactional data, some of which is extracted and sent to the data warehouse. A big data solution may involve including additional information in these extracts, or it may involve totally new data sources.
The IT organization may decide to implement a big data solution at this point, extracting data directly from production systems for immediate storage and analysis. However, there are several disadvantages to this idea. Data has not yet been transformed, so many data element values may be invalid or missing. In addition, the data has not yet been stored in the data warehouse, so it may be difficult to match data in the big data solution with warehouse data. Last, staff that support production applications may not have the expertise to assist in analytics.
If you must implement a big data solution at this point, involve production systems support staff from the beginning. Include them in discussions and design decisions, including how the data will be compared with and merged into current data. Insist on documenting all fields and transformations in a data dictionary.
Data Movement and Transformation
As data elements flow from source systems they may require fixing or ‘cleaning’. Some issues include missing or invalid data items. What do you do with a date field that contains all zeroes? The field value cannot remain this way; analytical queries will query by month or by date range. Clearly a default value is needed for this case. Other issues include missing data, or data that requires validation against another system. One example will be a customer order that contains key values such as product number.
Another concern is external data. This includes data acquired from external vendors, web page data and form captures, and so forth. Some common issues are missing data, non-numeric characters in numeric fields, and free-form text fields that must be parsed to obtain details (such as address fields containing street address, city, state and zip code). These data fields require transformation logic to resolve data issues and assign default values.
Your data warehouse has a standard process for data transformation. Many transformations are standardized across multiple data fields from multiple sources. (For example, missing or invalid expiration date fields may be set to 12-31-2099.)
Any big data solution implemented at this point must have some process for mirroring or duplicating the standard warehouse data field transformations already in-place. The key to getting this right will be good documentation of the current logic.
Data Load to the Warehouse
This is the point where most big data solutions are implemented. IT specialists install the appliance where it can be accessed by warehouse processes. As data comes into the warehouse, current database load processes are enhanced to include loading of data to the appliance. Data is now available for analysis.
Coordination is required among the major processes: data load, appliance load, and user analytics. Even with high-speed hardware and high-volume data stores, loading data will take time. During this time, will the new data or other data in the database be available simultaneously? Will there be defined a loading ‘window’, during which time warehouse or appliance access is not permitted?
These and other questions covered above are the central concerns regarding integrating a big data solution into an existing IT enterprise. While planning for the physical implementation of big data hardware is important, the essential ingredient is planning for integration into the existing data warehouse. Analytical users already query the warehouse, so expansion of their possible field of data is a natural one for them. Warehouses already store large amounts of data, so maintaining another large data store is usually not a major issue for database administration staff.
The central point of integrating a big data solution into the data warehouse is that big data solutions always require comparison to and aggregating against existing warehouse dimension data such as geographic area and time period. Close coordination between implementers and the data warehouse support staff will be a critical factor for success.