Exploding the Myths of Big Data

Many companies have implemented big data applications. These applications consist of a very large data store, hybrid hardware and software to store and access the data, and a sophisticated software interface that accepts the queries of business analysts, accesses the data store, and provides answers that can be used to understand customer needs, simplify business transactions, and increase profitability.

As success stories (and failures) have appeared in the news and technical publications, several myths have emerged about big data.  This article explores a few of the more significant myths, and how they may negatively affect your own big data implementation.

Myth #1: Big Data Applications Can Stand Alone.

False. Your big data application certainly contains a lot of data. However, of equal importance is the analytics software used to query the data.  Analyzing business data is common, especially in companies that already have a data warehouse. The data warehouse contains time-dependent snapshots of operational data, and your current data marts and analytical reports depend upon dimensions in the warehouse.

Dimensions are entities by which an analyst would subset or categorize information. These include time, geography, customer type, store, department, and so forth.  A query that sums customer purchases of electronic items for retail stores in several states during the Christmas holiday season includes dimensions of product type (electronic items), stores, geography (state), and time (Christmas holidays). Each dimension gives a different way to summarize data, and may provide clues regarding customer preferences, item availability in stores, or profitability.
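
In SQL terms, such a query might look like the following sketch against a star schema. All table and column names here are illustrative, not taken from any particular warehouse:

    -- Sum electronics sales by state over the holiday season;
    -- the schema and names are hypothetical.
    SELECT p.product_type,
           s.state,
           SUM(f.sale_amount) AS total_sales
    FROM   sales_fact f
           JOIN product_dim p ON f.product_key = p.product_key
           JOIN store_dim   s ON f.store_key   = s.store_key
           JOIN date_dim    d ON f.date_key    = d.date_key
    WHERE  p.product_type = 'ELECTRONICS'
      AND  s.state IN ('MI', 'OH', 'IN')
      AND  d.calendar_date BETWEEN '2017-11-24' AND '2017-12-26'
    GROUP BY p.product_type, s.state;

Each of the joined dimension tables corresponds to one of the dimensions named above; changing the GROUP BY clause changes the way the data is summarized.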

Big data applications require such dimensions as well. Since this data is already stored and maintained in your data warehouse, it is natural to integrate the data models of your warehouse and your big data application.

A natural outcome of this integration is that you will upgrade your data warehouse so that analytical queries can encompass the warehouse data as well. A good enterprise data model and a comprehensive data dictionary are necessities.

Warehouse upgrades will include adding new dimensions, incorporating data from new operational systems, and storing large objects such as scanned images and XML. This last item is especially important, and comes up again in the discussion of budgeting below. Large, complex objects may not be directly analyzable by your business intelligence software package, but basic information about them can be stored in the data warehouse. For example, some database management systems can decode XML documents and store their contents as relational tables. This table data may then be analyzed by the BI software.
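
As one illustration, DB2 (among other DBMSs) provides the SQL/XML XMLTABLE function for this kind of shredding. The sketch below assumes a hypothetical ORDERS table with an XML column ORDER_DOC; the document structure is invented:

    -- Shred stored XML order documents into relational rows
    -- that BI software can aggregate; all names are hypothetical.
    SELECT o.order_id, x.item_id, x.quantity
    FROM   orders o,
           XMLTABLE('$d/order/line' PASSING o.order_doc AS "d"
                    COLUMNS
                        item_id  VARCHAR(20) PATH 'itemId',
                        quantity INTEGER     PATH 'qty') AS x;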

Myth #2: The Only New Budget Items are Big Data Hardware and Software.

False. Despite some vendors' claims, any IT enterprise implementing a big data application will incur significant costs beyond the investment in big data hardware and software.

First, plan for the near future. Your big data application must have the ability to scale up: to handle larger volumes of data, faster data transmission speeds, and increasing numbers of users of data-consuming applications. Initial symptoms of inadequate scaling will be slower perceived response times, long job run times, and elongated transaction times.

For many applications these issues would be perceived as capacity-related, and the response would be to add more CPUs, more memory, and more disk storage. However, in a big data environment more power may not be the answer. Most vendor-provided hybrid hardware and software for big data depends on proprietary data storage methods, including data compression, massively parallel processing, and coordination with the base database management system (DBMS). Scaling up in this environment requires re-thinking the way your data is architected and stored, including possible denormalization of data, logical partitioning, more intelligent query rewrite, and closer attention to SQL performance analysis.
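
As one example of such restructuring, range-partitioning a large fact table by date lets queries and purge jobs touch only the partitions they need. The DDL below is a DB2-flavored sketch with invented names; exact partitioning clauses vary by DBMS:

    -- Range-partition the fact table by quarter; names are illustrative.
    CREATE TABLE sales_fact (
        sale_date    DATE          NOT NULL,
        store_key    INTEGER       NOT NULL,
        product_key  INTEGER       NOT NULL,
        sale_amount  DECIMAL(11,2)
    )
    PARTITION BY RANGE (sale_date)
        (STARTING ('2017-01-01') ENDING ('2017-12-31') EVERY (3 MONTHS));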

Next, plan for the medium term by budgeting for scaling out. Big data stores are fed from operational systems, and such systems today produce far more than simple character and numeric data. Some contain complex data types such as extensible markup language (XML) documents, audio and video data, scanned images, and large objects (LOBs). Your big data application may need to analyze these data types while doing aggregations and other operations.

To implement this, you must budget for staff time. Of primary importance is an enterprise data model that spans hardware architectures, as well as data integration across your enterprise.

Another budget issue is the non-production environment. Typical non-production environments are used for software application development, testing, user acceptance testing, and system load testing. One of these should include a test version of your big data application.

Why? The learning curve for business analysts in a big data environment is steep. To query a big data application effectively, most shops acquire a business intelligence (BI) software product. These products display data and relationships in the form of common data architecture diagrams, and use a point-click-and-drag interface that lets the user specify which data elements are to be aggregated and which are used as dimensions. The interface is usually non-trivial, and almost always requires that the user be very familiar with the business data and its architecture.

The non-production environment is the best place for such ongoing user training.

Another consideration is disaster recovery planning. While analytical systems are usually deemed non-critical, big data applications have a way of growing in usefulness to the point where many users consider them mission-critical. Plan ahead by treating the test environment that hosts your big data application as a candidate disaster recovery environment.

The last big budget item is staff training. Your staff will be responsible for maintaining the big data application environment, adding new data feeds and storage, purging or archiving stale data, supporting query users, and perhaps supporting analytical software. To assist users properly, your staff will need broad and deep knowledge of most of your major operational systems and their data.

In addition to current employees, you may need to acquire additional staff or consulting services. Typical uses of consultants include assisting users, performance monitoring, and capacity planning.

Plan your budgeting by looking ahead. Review these items and determine how and when each may arise.

Myth #3: Big Data Applications Require Little or No Performance Tuning.

False. Big data applications are indeed advertised as having extremely fast access times. The promise of the technology is the ability to quickly and easily analyze large amounts of data and to derive from that analysis changes to customer-facing systems. Management believes that this analysis and the subsequent changes will drive up customer satisfaction, market share, and profits.

The key to big data performance lies in the data itself. IT systems must still acquire data from operational systems, transform it, and load it into your big data application. The more data you need, the more work the supporting systems must do to provide up-to-date data.

Data acquisition from operational systems consists of data copies, files, and various database extracts. Some data items may be invalid (e.g., a date of 00-00-0000), and some may be missing entirely. Each system has its own data cleanliness issues, as well as specific times when data can conveniently be made available for extract. All of these processes require performance tuning as data volumes increase.
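
A transform step typically screens out or repairs such values before the load. The sketch below assumes a hypothetical staging table in which the raw extract value is held as character data:

    -- Flag rows whose source date is missing or a known invalid
    -- sentinel so the load step can skip or repair them;
    -- the table and column names are invented.
    UPDATE stage_orders
    SET    load_status = 'DATE_INVALID'
    WHERE  order_date_raw IS NULL
       OR  order_date_raw = '00-00-0000';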

Another issue is bulk loading of data into your application.  As the amount of input data acquired daily increases, load times also increase, and data loading is extremely I/O-intensive. You may need to look for specific vendor solutions for high-performance data loading.
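
Several DBMS vendors provide bulk load utilities that bypass normal SQL insert processing for exactly this reason. The sketch below is a DB2-flavored LOAD command; the input file, message file, and table names are invented for illustration:

    -- Bulk-load a delimited extract file directly into the fact
    -- table, bypassing SQL insert processing; names are hypothetical.
    LOAD FROM /extracts/sales_daily.del OF DEL
         MESSAGES load_sales.msg
         INSERT INTO sales_fact
         NONRECOVERABLE;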

Query tuning is also a requirement. While it is true that big data applications are built for fast querying, as your number of users grows so too will the number of queries they execute per day. Queries may access not only the big data store but also your data warehouse.  If you load your data warehouse into the big data application, you have yet another bulk load operation.

To accommodate multiple potential access paths to the data, most DBMSs have an optimizer that estimates the cost of a query prior to execution. In a big data environment this software estimates the cost of accessing data in the big data application as well as the cost of accessing it in the base DBMS. While the cost of execution in the big data application may be lower, directing low-priority queries to the base DBMS may be more cost-effective. To do this you will need to capture user queries and their estimated costs, then review them with users.
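
Most DBMSs expose these estimates through an EXPLAIN facility. The following is a DB2-flavored sketch; the query and table names are illustrative, and other DBMSs use different explain tables and cost units:

    -- Ask the optimizer for its cost estimate without running the query.
    EXPLAIN PLAN FOR
    SELECT store_key, SUM(sale_amount)
    FROM   sales_fact
    GROUP BY store_key;

    -- Review the estimated cost (in timerons, DB2's internal cost unit).
    SELECT statement_text, total_cost
    FROM   explain_statement
    WHERE  explain_level = 'P';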

Summary

Big data applications do not exist in a vacuum. In order to reach their maximum potential they must be integrated with your data warehouse, supported by trained staff, and monitored for query performance and capacity planning. You will need an enterprise data model and data dictionary, staff training on the BI analytical software, and a budget that can support all of the above.

Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
