The end-of-year holiday season is an important one for grocers, retailers, manufacturers, shipping companies and financial institutions. Millions of customers will buy billions of products, many of which will be purchased electronically and delivered to customers’ homes or offices. This is also a critical time for your big data applications and new business analytics. They should have already proved themselves this year by improving customer service, decreasing time-to-market, and more efficiently distributing products to channels throughout your geographic footprint.
What’s the next step? Your big data repository won’t simply add another twelve months of data over the next year. More data is coming, more categories of data will be created, and your analytical environment must expand to fit future needs. Yes, it’s already big, and it’s going to get bigger!
But size won’t be your only problem. In the rush to accumulate enough valuable data and implement a business analytics environment that can produce usable results, several items may have been ignored, postponed, or simply forgotten. These missing details can make or break your company in the future.
Will Your Big Data Application Scale Out?
With any new data-driven process, the question arises whether the application and its data storage can grow to meet future needs. This is usually asked in the form: Will the application scale up? The term scaling up usually refers to raw storage and processing capacity. More data means more storage, and processing larger amounts of data requires either more processing power or more time to execute.
This leads business and IT staff to forecast future needs from simple raw capacity numbers. Will we double the amount of data we store in the next year? If so, we will need to double the size of our storage media and double the number of central processing units or servers used to process the data.
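The raw-capacity reasoning above amounts to simple compound-growth arithmetic. The sketch below is illustrative only: the function name, the 100 TB starting point, and the 100 percent annual growth rate are assumptions chosen to match the "double every year" example, not figures from any real installation.

```python
def project_capacity(current_tb: float, annual_growth: float, years: int) -> list[float]:
    """Project storage needs (in TB) at the end of each year.

    annual_growth is a fraction: 1.0 means the data volume doubles each year.
    All inputs are hypothetical planning assumptions.
    """
    return [current_tb * (1 + annual_growth) ** year for year in range(1, years + 1)]


# Assumed example: 100 TB today, doubling every year for three years.
projection = project_capacity(current_tb=100.0, annual_growth=1.0, years=3)
print(projection)  # [200.0, 400.0, 800.0]
```

Under this scale-up view, processor or server counts would be multiplied by the same factors.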
However, these rules are predicated on the assumption that the data is typical business data: orders, customers, products, and so forth, and that the data arrives in well-known data types such as currency, dates, and text fields.
Today’s big data applications do not fit this mold. We process new and complex data types such as Extensible Markup Language (XML), images, video and audio files, and large objects (LOBs). These require more than simply adding CPUs and disk storage arrays. Big data means re-architecting standard enterprise data models and integrating multiple dissimilar data structures across multiple locations in the network.
This is scaling out. It is no longer sufficient to plan for more disks and CPUs; instead, we must plan for storage and retrieval of new data types, efficient access to multiple hardware and software platforms across the network, and perhaps the addition of special hybrid high-performance hardware or software.
This enterprise-wide data integration will be a requirement of your future big data applications. Start by updating your enterprise data model and data dictionary to include these new data types. Do you have a company web site? So-called click streams from site visitors provide data to analyze. Do you have a telephone voice response system that customers may call for automated information and advice? Customer usage patterns of this resource are valuable indicators of product choices and customer satisfaction. Do you accept financial data from outside institutions or vendors? Increasingly, this data arrives in XML format.
These data streams are only a few examples of the new categories of data available for review. These will require integration of the data into your big data applications, as well as changes and enhancements to business analytics in order to analyze them.
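As one illustration of what integrating such a stream involves, the sketch below flattens a small XML payment feed into rows that could be loaded alongside conventional business data. The feed layout, element names, and the flatten_payments function are all hypothetical; a real vendor feed would follow whatever schema the vendor publishes.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical vendor feed; the element and attribute names are assumptions.
SAMPLE_FEED = """<payments>
  <payment id="p1"><vendor>Acme</vendor><amount currency="USD">125.50</amount></payment>
  <payment id="p2"><vendor>Globex</vendor><amount currency="EUR">80.00</amount></payment>
</payments>"""


def flatten_payments(xml_text: str) -> list[dict]:
    """Turn each <payment> element into a flat row with typed fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for payment in root.findall("payment"):
        amount = payment.find("amount")
        rows.append({
            "id": payment.get("id"),
            "vendor": payment.findtext("vendor"),
            # Decimal avoids binary floating-point rounding of currency values.
            "amount": Decimal(amount.text),
            "currency": amount.get("currency"),
        })
    return rows


for row in flatten_payments(SAMPLE_FEED):
    print(row)
```

Each new field that appears in a row like this is also a candidate entry for the enterprise data model and data dictionary.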
Purge or Archive Stale Data
As noted above, your big data application will only get bigger. Typically, a big data application stores data covering different time periods, allowing analysis of trends over time. Data acquired today must be accumulated and stored in order to answer questions tomorrow. However, everything has limits. As data becomes older, it becomes less relevant to current trends and analyses. To prevent filling storage with unusable data, you should begin planning for a combined purge and archive of the data.
First you must examine your data needs. If old or stale data will never be used again, it can be deleted. On the other hand, you may have legal or compliance demands that require possible reuse of the data. In this case, you must choose a data archive method.
One option is to archive large volumes of stale data on tape or some other high-volume storage medium. The data need not be in an easily readable format, but it must be reloaded into the big data application before it can be reused. This choice is favored if you are required to retain the data but do not reasonably expect to process it again.
An alternative is to store the stale data on high-volume media that is low-cost or low-performance. This removes the data from your big data application, while allowing for potential query or analysis of the stale data, albeit with a performance penalty.
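The purge-or-archive decision described above can be written down as a small rule. The sketch below is one possible encoding, not a prescribed policy: the function name, the two-year active window, and the two flags (must_retain for legal or compliance holds, may_requery for data that might still be analyzed) are assumptions made for illustration.

```python
from datetime import date


def retention_action(record_date: date, today: date, active_days: int = 730,
                     must_retain: bool = False, may_requery: bool = False) -> str:
    """Classify a record as 'keep', 'purge', 'tape', or 'nearline'.

    active_days is an assumed cutoff for data still relevant to current analyses.
    """
    age_days = (today - record_date).days
    if age_days <= active_days:
        return "keep"        # still relevant: stays in the big data application
    if not must_retain:
        return "purge"       # stale and never needed again: delete
    # Stale but retained: choose the archive method by expected reuse.
    return "nearline" if may_requery else "tape"


today = date(2024, 1, 1)
print(retention_action(date(2023, 6, 1), today))                      # keep
print(retention_action(date(2020, 1, 1), today))                      # purge
print(retention_action(date(2020, 1, 1), today, must_retain=True))    # tape
print(retention_action(date(2020, 1, 1), today, must_retain=True,
                       may_requery=True))                             # nearline
```

Here "tape" stands for the reload-before-reuse option and "nearline" for the low-cost, low-performance media that can still be queried.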
Preparing for Disaster Recovery
Every mature IT organization goes through a regular round of disaster recovery (DR) exercises. These are usually legally mandated, required for compliance with standards, or part of contracts with outside vendors. If your big data application was not included in your most recent disaster recovery test, you are not alone. Most enterprises consider big data analytics a supplement to marketing, pricing, and customer relationship systems, rather than an integral part of operational systems such as purchasing, order entry, shipping or general ledger. After all, big data isn’t mission-critical, is it?
Be prepared for this to change. Analytics against big data can provide a wealth of actionable information. As the data store grows it becomes more valuable; as data accumulates over time more trends can appear. Analytical queries that ran for hours or days before you implemented the big data application now run in minutes or seconds. Fast query execution means that analysts can submit more queries per day, and get more usable results.
The most valuable of the queries then become regular monthly reports. Then weekly reports, or even daily reports. Many of these reports result in cost savings, increased customer satisfaction, and higher profits. At some point, management deems that the big data application is critical to the business. Should a disaster occur, the application must be available, and soon!
Begin by documenting the requirements for a disaster recovery solution for your big data application. If you implemented special-purpose hybrid hardware or software for your application, these may also need to be installed in your disaster recovery environment. Document the data transfer speeds and volumes available at the DR site, including storage media and network cabling. Will upgrades be required in order to store and query the big data application? Finally, coordinate with IT support teams to determine how the big data application will be either loaded from or synchronized with the current operational application.
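Documented transfer speeds and volumes translate directly into an estimate of how long an initial load at the DR site would take. The sketch below shows that arithmetic; the function name, the 50 TB volume, the 10 Gbps link, and the 70 percent efficiency factor are assumptions for illustration, and real throughput depends on the network and storage actually in place.

```python
def transfer_hours(volume_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to move volume_tb terabytes over a link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second).
    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    total_bits = volume_tb * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return total_bits / effective_bits_per_second / 3600


# Assumed example: 50 TB over a 10 Gbps link at 70 percent efficiency.
print(round(transfer_hours(50, 10), 1))  # 15.9
```

If the result exceeds the recovery time the business expects, that points toward upgrading the DR link or keeping the DR copy continuously synchronized rather than loading it after a disaster.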
Summary
The issues and steps noted above should feel familiar to the IT professional: implementation of a new big data application must be accompanied by planning and collaboration across the enterprise. Even though a big data application appears to be an independent, stand-alone implementation, best practices still apply, including considerations for data backup and recovery, data availability and throughput, and process performance.