Big Data is here, or will soon be here for the majority of IT enterprises. Database Administrators must now deal with large volumes of data and new forms of high-speed data analysis. If your responsibility includes performance tuning, here are the areas to focus on that will become more and more important in the age of Big Data.
Source Data Acquisition
For their first Big Data implementation, IT departments usually concentrate on large volumes of data that already exist in current production systems. Examples include semi-structured objects such as images or video and click streams indicating web page visits and actions. Other examples include raw transaction data such as credit card transactions or internet orders.
The Big Data process will attempt to access this data directly (or at least to store it for analysis), bypassing standard processes such as data cleaning and transformation. This happens because Big Data data sources have not yet been integrated into a data warehouse-like architecture. Indeed, these data sources may be brand new or never accessed.
Consider a business with a telephone automated voice response system. Here, the data could be the strings of touch tone digits pressed by a customer, along with coded information about the voice responses. Those who support the current system may not have the expertise to assist in the Big Data implementation.
The biggest performance issue in this area is that of data availability. The new Big Data system must extract large amounts of data without contending with current operations; especially data load or data update processes. Here are some tuning tips.
Avoiding hotspots. Database locations where current or newly-entered data exist can be troublesome. Transactions adding, updating or loading data must lock the associated table rows until the update is complete. This prevents concurrent access to the new data, including any select access for Big Data.
One solution is to break the Big Data extraction process into multiple extracts. Each extract can select data for a particular time period, geographic area, or any other subdivision that limits access to current data to only one of the smaller extracts. This reduces the number of processes that may content with current production applications, and allows for parallel execution of the multiple processes.
Selected partition access. Most large data stores that are stored in a database management system (DBMS) reside in a database that is physically partitioned into multiple datasets. Each dataset typically holds data rows having some common value-based characteristic: similar dates, similar departments, similar customers, similar geographic areas, and so forth. Performance varies across DBMSs for queries that access data in multiple partitions.
Consider a table of business transactions that is partitioned by date. A query that extracts data for a single day will access a single partition, while a query requiring a month of data will access between 28 and 31 partitions. Performance may vary for queries accessing multiple partitions depending upon how this is coded in SQL statements. For example, do you specify a date range (WHERE OPEN_DATE BETWEEN 01/01/2013 and 01/31/2013), or a partition range (WHERE PART# BETWEEN 1 AND 31)?
Consult the documentation for your DBMS to ensure that Big Data data extraction queries correctly specify partition ranges for best performance.
ETL Processing
Extract, Transform and Load (ETL) processes involve acquiring data from source systems, cleaning the data, and then loading it to a staging area. From there the data typically moves into a data warehouse. Processes that move and clean data are I/O- and CPU-intensive, so large volumes of data may strain system resources.
Despite being sometimes semi-structured or multi-structured, data acquired in a Big Data implementation will still require transformation logic. The key to addressing this will be good documentation of current ETL logic. If rules exist for cleansing data in the current systems, that logic may be used (with modifications) in data acquisition for the Big Data implementation.
Data Staging and Keying
Big Data is eventually brought into a staging area and keyed. This usually involves assigning surrogate keys to entity records. Typically, the staging and keying process will use a randomization or sequential key process to assign unique keys to unique accounts. Identity columns and sequences are common ways to do this. The result is that identification of much of the data is done using these surrogate keys. Hence, most of the SQL used to analyze these tables will Join and subSelect on these keys.
There is a potential for disaster here. It may be difficult to embed the keying process into the Big Data acquisition logic; for most special-purpose hardware built for fast analytics, the data loading process can be quite elongated. Keying will add some elapsed time to that process. On the other hand, many tables in the core data warehouse will not contain natural keys for the data. Will the DW support staff have to add the natural keys back into the core tables, making them bigger and adding more elapsed time and CPU usage to the normal DW extract, transform and load processes?
Big Data implementers should look ahead to the time when their implementation must be consolidated with other data stores such as data marts and data warehouses. Having a standard keying process will make this task easier.
Data Archive
At some point the Big Data implementation must deal with whether its oldest data is still useful, or ‘stale’. Another factor in this decision is that some data must be retained for a long period due to contracts or regulations. Stale data should be archived. This process may be permanent or temporary, the data may be stored locally or remotely, and the data may or may not be available for immediate use. All of these items come into play when deciding how often to archive, what data to archive, and the future need (if any) to retrieve the data for re-use.
Big Data implementations may have data needs that require access to large amounts of archived data. This may mean policy changes, direct access to the archived data, or sometimes removing archiving altogether, thus requiring budgeting for additional storage media.
Performance is always faster if data volume is limited. Consider data partitioning schemes where older or ‘almost stale’ data is stored separately. Another option is a tiered storage architecture where the currency of the data determines its storage location. You now have the capability of tuning performance for the data that matters. If current data is accessed more frequently you can store it on high-performance disk drives or even in flash memory. Stale data can be placed on slower bulk media, perhaps even on tape.
Test and Sandbox Environments
Your first Big Data implementation will probably be a pilot project where you assess the technical viability, costs and benefits of your solution. In these cases IT shops usually concentrate on the initial production implementation. This results in a quick implementation on a limited budget, allowing management to prove that Big Data provides benefits to the enterprise.
Forgotten in this effort are some best practices. Where will developers create and test new applications? How will new hardware and software versions be implemented? These things are usually accomplished by defining and maintaining multiple non-production environments variously called sandbox, development, test, user acceptance, or quality assurance.
The need for non-production environments exists for Big Data implementations as well. The DBA will need such environments for initial capacity planning and application performance tuning.
Summary
As Big Data implementations become more prevalent, database administrators must extend their performance tuning efforts to include analysis of how large data volumes are processed. I/O and CPU constraints will become more common. Use the techniques mentioned in this article to keep up-to-date in this new world.
References
IBM – What is Big Data?
“www.ibm.com/software/data/bigdata/
IBM developerWorks – Big Data Overview
www.ibm.com/developerworks/bigdata/
Three Success Factors for Handling Big Data and Data Warehouse Performance – Dave Beulke 2012
https://www.ibm.com/developerworks/community/blogs/DB2PerfTune/entry/three_success_factors_for_handling_big_data_and_data_warehouse_performance_part_12?lang=en