Most large organizations have implemented one or more big data applications. As these applications accumulate more and more financial transactions, sales data, and customer interactions, organizations can generate more accurate sales forecasts and predict trends in customer needs. This translates into larger market share and higher profits.
However, as more data accumulates, internal users and analysts execute more reports and forecasts, which in turn spawn additional queries and analysis. The cycle feeds itself: data growth enables better analysis, and better analysis generates still more reporting and querying. Eventually the big data application swells with so much data and query activity that performance suffers.
How to avoid this?
Big Data Information Acquisition and Storage
There are several areas where information technology (IT) support staff can proactively address big data application performance issues. The first is data acquisition and storage, sometimes called extract, transform, and load (ETL).
Big data applications store their data in either a large database or a special hybrid hardware/software appliance; sometimes both solutions are used. Large databases can be relational databases such as IBM's DB2 or database machines such as Teradata. Appliances are a newer breed of data storage that combines massively parallel I/O channels with very large disk arrays. In many cases these can complement an existing data warehouse, since much of the organization's current analytical data may already exist there.
The final resting place of the data helps determine the amount of preprocessing, sometimes called cleaning, required for the operational data. Examples include handling missing or invalid data, substituting codes for descriptive category data, and assigning surrogate keys to entities in place of their natural keys.
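As a concrete illustration, the following SQL sketch shows a simple staging-to-warehouse transform that defaults a code for missing region data and substitutes compact codes for descriptive category text. All table and column names here are hypothetical, and exact syntax varies by DBMS.

    -- Illustrative transform: default a code for missing regions and
    -- substitute numeric codes for descriptive category text.
    INSERT INTO sales_fact (sale_id, sale_date, region_code, category_code, amount)
    SELECT s.sale_id,
           s.sale_date,
           COALESCE(r.region_code, 'UNK'),   -- handle missing region data
           c.category_code,                  -- code replaces the description
           s.amount
    FROM stg_sales s
    LEFT JOIN region_lookup r ON r.region_name = s.region_name
    JOIN category_lookup c ON c.category_name = s.category_name;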
Data comes to the application from operational systems and external data sources. IT must ensure that the correct data elements flow from the operational systems in a form that is suitable for bulk loading. Data channels must both be fast enough to transfer the data and wide enough to allow parallel transfer of multiple data flows. The daily growth of your big data application gives you the base number of gigabytes added per day. Ensure that this data can flow quickly and efficiently.
Most database management systems (DBMSs) and appliances have proprietary utilities they use for data loading. These utilities sometimes require that input data be in a fixed format. Review these requirements and any other options to determine whether there are ways to enhance data load speeds. One common method with relational databases is to use partitioned tables. For example, a table can be defined with four partitions for one day of data, allowing IT staff to design a load process that loads the four partitions concurrently.
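A minimal sketch of this approach follows, using DB2-style range partitioning with hypothetical table and column names (load utility syntax differs by product). The extract step assigns each row a bucket number from 1 to 4, and four load jobs then run concurrently, one against each partition.

    CREATE TABLE daily_sales (
        load_bucket SMALLINT      NOT NULL,  -- 1-4, assigned during extract
        sale_date   DATE          NOT NULL,
        store_id    INTEGER       NOT NULL,
        amount      DECIMAL(11,2) NOT NULL
    )
    PARTITION BY RANGE (load_bucket)
       (PARTITION 1 ENDING AT (1),
        PARTITION 2 ENDING AT (2),
        PARTITION 3 ENDING AT (3),
        PARTITION 4 ENDING AT (4));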
One last area is data archive and purge. While it may seem that more data naturally leads to better analysis, some historical data can become old or stale. Examples include sales data for products that are no longer sold, and customer data for geographical areas no longer served. Such data can be purged or archived. Purging reduces the total amount of data storage required, while archiving allows the data to be re-loaded later if it is needed again. In either case, reducing the size of the working database will speed up queries that no longer require the outmoded data.
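As a sketch, an archive-then-purge step might look like the following, assuming a hypothetical sales_fact table, a matching sales_fact_archive table, and a product table that flags discontinued items:

    -- Copy stale rows to the archive table, then remove them from the
    -- working table so queries no longer scan outmoded data.
    INSERT INTO sales_fact_archive
      SELECT * FROM sales_fact
      WHERE product_id IN
        (SELECT product_id FROM product WHERE status = 'DISCONTINUED');

    DELETE FROM sales_fact
      WHERE product_id IN
        (SELECT product_id FROM product WHERE status = 'DISCONTINUED');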
Big Data Users
Who are your users, and how do they query your data? The typical user is a business analyst and subject matter expert who understands your business data and has the technical knowledge to access the big data application intelligently and efficiently. This may mean they are experts in query languages such as SQL. Another option is to install analytical software that presents a pictorial view of business data and constructs queries based on user input.
In many organizations, the first big data application was installed to solve specific problems or to analyze predetermined issues, called use cases. Analysts already had ideas of which queries would deliver useful and actionable intelligence.
Next, the results of the first analysis generated “what if?” questions, leading to more queries of varying types accessing more and more data. As these queries provided even more useful results, management naturally began scheduling them as regular reports.
The IT support staff must plan for growth in the number and complexity of queries. This is more than simply planning future data storage capacity. Complex queries require the database management system to determine an efficient access path to the data. This may require adding performance-enhancing features such as indexes or data partitioning to the database. It may be that multiple queries use a common aggregation of data such as sales data by region or customer data by customer type by month. In these cases it may be possible to pre-calculate these aggregations and store them separately, thus providing a query performance boost.
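One way to store such a pre-calculated aggregation is a summary table built from the fact table, sketched below with hypothetical names (many DBMSs also offer materialized views or materialized query tables for the same purpose). Queries needing regional monthly totals then read this small table instead of scanning the full fact table.

    CREATE TABLE sales_by_region_month (
        region_code  CHAR(3)       NOT NULL,
        sale_year    SMALLINT      NOT NULL,
        sale_month   SMALLINT      NOT NULL,
        total_amount DECIMAL(15,2) NOT NULL,
        txn_count    INTEGER       NOT NULL
    );

    -- Pre-calculate the aggregation once per load cycle.
    INSERT INTO sales_by_region_month
    SELECT region_code, YEAR(sale_date), MONTH(sale_date),
           SUM(amount), COUNT(*)
    FROM sales_fact
    GROUP BY region_code, YEAR(sale_date), MONTH(sale_date);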
Another area of concern is system resources. Some resources, such as CPUs or disk storage, may be limited or performance-constrained. The availability of these and other resources determines the most visible performance metric: query elapsed time.
IT staff should monitor resource consumption and gather regular performance measures. Charting these metrics over time may indicate trends. In some cases it may be possible to relieve constraints in one resource by utilizing another. For example, if CPU is constrained and causing queries to execute slowly, it may be possible to add indexes to tables. This uses more disk storage, but allows for faster access paths to the data. Other methods include adding disk storage dedicated to sort processing, and adding memory for the DBMS to use for buffering data.
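For example, an index supporting a frequently filtered column trades disk space for CPU, since the optimizer can locate qualifying rows directly instead of repeatedly scanning and filtering the entire table. A hypothetical sketch:

    -- Uses additional disk storage, but gives queries filtering on
    -- customer_type and sale_date a faster access path than a table scan.
    CREATE INDEX ix_sales_custtype_date
        ON sales_fact (customer_type, sale_date);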
Big Data Queries
This naturally leads to query tuning methods. Most DBMSs have a feature for analyzing query access paths called explain. Explain takes a query as input, analyzes the multiple possible data access paths, assigns a cost to each one, and reports the final access path with the lowest cost. Here, cost is a function of the CPU usage and disk I/O required for data retrieval.
This query path optimization requires that the DBMS have a statistical profile of the data stored in the database or the appliance. Such statistics include high and low values for each data element, average values, most common values, and other data distribution information.
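A DB2-style sketch of the workflow follows, with illustrative names (other DBMSs expose similar explain facilities with their own output formats): refresh the statistics, explain the query, then inspect the chosen access path.

    -- 1. Refresh the optimizer's statistical profile first (in DB2 this
    --    is the RUNSTATS utility, run as a separate utility job).
    -- 2. Ask the optimizer to record its lowest-cost access path:
    EXPLAIN PLAN SET QUERYNO = 1 FOR
    SELECT region_code, SUM(amount)
    FROM sales_fact
    WHERE sale_date = '2014-01-01'
    GROUP BY region_code;

    -- 3. Review the access path the optimizer selected:
    SELECT queryno, accesstype, accessname, matchcols
    FROM plan_table
    WHERE queryno = 1;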
Consider a table containing sales data by date, with transaction dates ranging from 01-01-2014 to 12-31-2014. A query accessing data for only 01-01-2014 would only need access to a small portion of the table; therefore, in this case an index access path is likely.
Similarly, for the above table, consider that the data is partitioned into twelve datasets or partitions, and each partition corresponds to a month. A query against January data would only need to access a single partition, rather than scan the entire table for qualifying data.
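Under that month-partitioned design, a query like the following touches only the January partition and, given an index on the transaction date, can use an index access path as well (table and column names hypothetical):

    -- Partition pruning: only the January partition is read; the
    -- remaining eleven partitions are never touched.
    SELECT store_id, SUM(amount)
    FROM sales_fact
    WHERE sale_date BETWEEN '2014-01-01' AND '2014-01-31'
    GROUP BY store_id;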
Clearly, the requirements of the analysts greatly affect the types of queries they generate; at the same time, the storage methods and data access paths available will affect query performance. The obvious conclusion is that IT support staff must work closely with analysts, meeting regularly to review reporting requirements and document probable data access strategies. IT staff should also develop a method of capturing all submitted queries so that they can be analyzed for common patterns.
Summary
To avoid potential problems with big data application performance, IT staff should be proactive in coordinating with analysts and gathering data access metrics. On the technical side, document the extract, transform, load (ETL) process to determine whether there are methods of increasing data acquisition speed or handling larger volumes. Consider purging stale data, and determine whether common database performance tactics such as partitioning or indexes will be useful.
Develop a profile of the user base. How many users are there, how skilled are they at query building, and how often do they submit queries? Will these numbers grow in the future?
Capture user queries, execute the explain process, and save the access paths. Analyze this data for performance problems and trends.
Monitor system resources used to support the big data application. Forecast trends, especially for resources that are or may become constrained. In those situations, develop methods of using available resources to offset some of the constraints.
Finally, meet regularly with business analysts to discuss your findings and possible future changes. Such communication and coordination are essential to maintaining a good relationship with your users.