Like management information systems, enterprise resource planning, and relational databases, big data is now a standard part of information technology architecture for most large organizations. Common applications include storage and analysis of customer data, web interactions, machine sensor readings, and much more.
As a database administrator, with the holiday season upon us, I have the following items and notions on my holiday wish list. Here’s hoping that I am gifted one or more of these; each one gives me something that I want or need. Some are wishes for others; perhaps with a little gentle prodding and instruction, we all can benefit.
Understand the Difference Between Scaling Up and Scaling Out
Throughout history, mankind has dealt with big data projects. From heavy industries like shipbuilding to mass printing of books to construction of the Panama Canal, man used basic ideas and information structures to manage and scale large projects.
Some of the basic assumptions about data and information include:
- Business data is hierarchical. One order has multiple line items, one assembled product contains multiple parts, one account has many transactions, and so forth.
- Business data comes in a few well-known types such as dates, names, identifying numbers, and amounts.
- Business data is usually stored as-is, with little transformation.
Hence, when we encounter a problem that contains a lot of data, our tendency has been to think of it as a scaling up issue. We need to gather and store more data, process more data, and make more decisions based on that data. Scaling up is a resource issue; we need more pencils, pens and paper, more filing cabinets, and more administrators to move paper around.
Today, big data applications are far different. We are faced with new and complex data types such as large objects (LOBs), data that contains its own description (XML) and data with dense and proprietary internal structure (images, audio and video files, and web click-streams). These all require more than buying more filing cabinets or disk storage arrays. Big data today involves re-architecture of data models and data storage, and data integration of multiple dissimilar architectures. Further, these data will exist on multiple hardware platforms spread across internal and external networks.
It is no longer enough to think that the only solution needed for a big data application is to buy lots of disk storage and high-performance hardware and software. Scaling out to multiple architectures and integrating the results requires a combination of resource capacity planning, data modeling and architecture, and enterprise-wide data integration.
Don’t Skimp on Non-production Environments
Most organizations implement multiple non-production environments, called variously development, test, load testing, user acceptance, pre-production, and so forth. Many of these have specific purposes. For example, the development environment will contain files and tables of relatively small size, and is only suitable for unit testing single programs. In contrast, the load testing environment will usually contain production-sized databases and resources such as disk space and CPU power.
This changes for big data applications. There is no sense in creating or testing a big data application in a small development environment. Big data applications, by their nature, require large data volumes in order to work properly. For example, consider an analytical application that reads a store of customer data and searches for trends. Without a nearly production-sized database to analyze, results would be meaningless.
A common trend is to install special-purpose hardware and software to store and analyze large volumes of data. Such solutions are sometimes called appliances. One example is the IBM DB2 Analytics Accelerator (IDAA), which stores large DB2 tables in a proprietary format that allows extremely fast querying. If used, such appliances need to be attached to a non-production environment having a production-sized data store. This permits developing applications and queries that act on the big data stored therein.
Pre-plan Staff Training
A big data implementation usually requires the addition of staff with new areas of expertise. Examples of these include:
- Experience with large-scale data analytics
- Expertise with any special-purpose hardware or software such as appliances
- Experience with high-volume data movement methods, such as data replication
These experts can be rare. How will you find them, and how will they fit in your organization? One obvious alternative is to train current staff. The big question is how long it will take them to learn these new technologies.
Whichever route you go to obtain the required expertise, plan ahead. If you would consider hiring a new employee with two years of experience in big data analytics, the alternative is to give some current employee two years of experience. Start early!
Understand Your Operational Data
Interestingly, some organizations consider implementing their first big data application directly against production source systems, bypassing non-production altogether. Why? Current production sources of big data are too big to be stored in the data warehouse. Since the big data exists only in production, production is the place where the application gets tested.
A successful big data implementation requires a thorough understanding of your operational data and metadata. Regrettably, internal staff with business knowledge of operational systems may not have experience with big data analytics; conversely, your staff with analytics and big data appliance skills may not have knowledge of current systems. Collaboration and cooperation will be required between these two groups. To make this happen, involve your data warehouse experts from the beginning. They already know about the source systems; after all, that’s where warehouse data comes from. Further, they understand data issues such as handling missing data, data cleansing, and any required data transformations, all of which are usually implemented with data warehouse extract, transform, and load (ETL) processes.
Despite being semi-structured or multi-structured, data acquired for a big data application may still need transform logic.
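As a rough illustration of the kind of transform logic meant here, the following Python sketch cleanses one semi-structured record before it is loaded. The field names (customer_id, order_date, amount), the accepted date formats, and the rule of rejecting incomplete records are all assumptions made for the example; real rules would come from your data warehouse and source system experts.

```python
from datetime import date, datetime
from typing import Optional

# Assumed date formats found in the source systems (hypothetical).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text: str) -> Optional[date]:
    """Try each assumed format in turn; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_record(raw: dict) -> Optional[dict]:
    """Normalize one raw record, or return None if it cannot be used."""
    customer_id = str(raw.get("customer_id") or "").strip()
    order_date = parse_date(str(raw.get("order_date") or ""))
    if not customer_id or order_date is None:
        return None  # missing or unreadable key fields: reject the record

    # Strip thousands separators; treat a missing amount as zero (assumption).
    amount_text = str(raw.get("amount") or "0").replace(",", "")
    try:
        amount = float(amount_text)
    except ValueError:
        return None  # amount is present but not numeric: reject the record

    return {"customer_id": customer_id, "order_date": order_date, "amount": amount}
```

Whether a bad record should be rejected, repaired, or routed to an exception file is a business decision; the sketch simply rejects it to keep the example short.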
Design a Robust Purge/Archive Process
After a successful big data application moves to production, a funny thing happens: the data gets bigger! It is common to analyze a large data store by comparing different time periods and looking for trends. Big data acquired today needs to be accumulated over time in order to answer the questions of tomorrow.
There is a limit to everything. Big data, like all other data, becomes less useful or stale over time. To avoid completely filling your storage arrays, plan ahead for a combination of purging and archiving data.
There are several alternatives. You can partition tables by time period (year, month, week) and empty the “old” partitions at the end of the current period. Another method is to store only the most current data in an appliance, allowing for extremely fast queries. Older data is then stored in the data warehouse for query if necessary.
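The two alternatives above can be combined. The Python sketch below is a simplified in-memory model, not a database implementation: rows are grouped into monthly partitions, and at the end of a period any partition older than a retention window is emptied from the fast store and moved to an archive. The class name PartitionedStore, the keep_months parameter, and the dictionary standing in for the data warehouse are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def partition_key(d: date) -> str:
    """Monthly partition key, e.g. '2014-12'. Keys sort chronologically."""
    return f"{d.year:04d}-{d.month:02d}"


class PartitionedStore:
    """Toy model: recent months stay 'hot', older months are archived."""

    def __init__(self, keep_months: int):
        self.keep_months = keep_months
        self.partitions = defaultdict(list)  # stands in for the fast store
        self.archive = {}                    # stands in for the data warehouse

    def insert(self, row_date: date, row: dict) -> None:
        self.partitions[partition_key(row_date)].append(row)

    def roll_off(self, today: date) -> list:
        """Move partitions older than the retention window to the archive."""
        # Count months from year 0 so the window can cross a year boundary.
        oldest_kept = today.year * 12 + (today.month - 1) - (self.keep_months - 1)
        cutoff = f"{oldest_kept // 12:04d}-{oldest_kept % 12 + 1:02d}"
        moved = []
        for key in sorted(self.partitions):  # sorted() copies the keys first
            if key < cutoff:
                self.archive.setdefault(key, []).extend(self.partitions.pop(key))
                moved.append(key)
        return moved
```

With keep_months=3 and a roll-off run in January 2015, the November, December, and January partitions stay in the fast store and anything from October 2014 or earlier moves to the archive. In a real DBMS the same idea is usually carried out with the product's own partition maintenance utilities rather than row-by-row deletes.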
Give a Thought to Disaster Recovery
Most organizations consider analytical data and queries to be low priority during a disaster. Clearly, they reason, customer-facing systems, accounting systems, order entry applications, shipping and receiving, and so forth need priority. While big data analytics is nice to have, it is not mission-critical. But can it be?
Consider the following scenario. You implement a big data application that analyzes trends in your customers’ buying habits. Internal analysts execute queries against the big data, perhaps through an appliance that gives them extremely fast turnaround. Crazy fast query execution means that more analysts can run more and more queries in a day.
As your application matures, various queries and reports provide valuable information. This is good! So good, in fact, that management implements regular execution of these reports. More and more valuable queries and reports begin to run monthly. Then weekly. Then daily. All providing valuable information.
At some point, management may decide that the valuable and profitable decisions they make based on these regular reports are critical to the business. Yes, critical. At that point, you are now faced with pressure to ensure that this application is available if a disaster occurs. This may occur with no prior warning. Hence, I recommend considering disaster recovery options during implementation of any big data application. While it may not be needed in the immediate future, the potential exists for such a requirement. Pre-planning will allow you to foresee hardware, software, storage and network requirements for various disaster needs.
Holiday wishes aside, each of the topics presented here follows a familiar pattern: big data is a new method, a new process, a new IT paradigm for storage and retrieval of data. Newness means change, and some staff may resist change. Mitigate these problems by planning ahead for staff training, user training, and collaboration across teams. Finally, apply IT best practices to any big data implementation: consider backup and recovery, review and project performance needs, and ensure that any special-purpose hardware and software are integrated correctly into your current architecture.