The term Big Data is very common, and is used to connote large data stores (typically in an enterprise data warehouse) that are queried with high-speed data analytics software. To understand what this means, we break down the components of Big Data and discuss the options you must consider if you wish it to succeed in your company.
Some big data implementations are described as ‘mission-critical’, a phrase that is often misused. By definition, data warehouses are historical stores of transformed operational data, and are typically not used by operational or real-time systems. This is primarily due to two reasons:
- Data warehouse data takes a long time to extract, transform, and load (or ETL) from the source operational data; and,
- Data warehouse data is typically a subset of operational data used for analytics, not real-time processing.
One essential characteristic of operational data is that it is recoverable. The database administrator usually implements database image copies as backups in order to recover operational data in case of a disaster. On the other hand, some data warehouse data is rarely backed up, especially data that is staged or in-transit to analytical tables.
We can see from the above that big data implementations that are described as mission-critical are essentially large operational data stores, rather than data warehouses. This means that the priorities for such systems are recoverability, data availability and security. On the other hand, the top priority for a ‘big data’ data warehouse implementation is performance.
We now break down the term Big Data into its constituent parts.
Large data warehouse implementations in today’s enterprise typically reach or surpass 100 terabytes. Sounds big, doesn’t it? However, the raw size of the stored data is not useful in and of itself as a predictor of performance. Other factors must be considered, including the number of concurrent users, complexity of analytical queries, and system resource constraints such as CPU, memory, and disk space. Enterprise requirements such as elapsed times and service level requirements must be factored in as well.
With a large amount of data, the most common issues are management-related:
- The database administration team must coordinate with the data warehouse staff to ensure that performance enhancers such as indexes are considered; in addition, they will design databases that favor key-sequenced loading and minimal system resource use.
- The data warehouse process manager must maintain a metadata catalog, since the number of data warehouse objects will be large and must have standard names, etc.; in addition, standard extract, transform, and load processes must be defined.
- The manager in charge of business analytics must understand data warehouse environment access issues, including how to code high-performance data access queries, standard data models, and possible resource contention issues.
Big Data Movement
Somewhat independent of the raw size of the data warehouse is the amount of data that must be loaded, transformed, and extracted. Each stage has its own unique performance issues.
Data load involves extracting data from source systems and loading into a ‘staging area’. These processes are sometimes executed as batch extract jobs and sometimes by direct file transfer. The data warehouse team will execute batch load jobs in parallel and use high-speed networks (if available) to expedite these processes.
Next is transformation. Operational data may not be in a form that is easily analyzed. One common issue is dates. Dates stored in the source system may be stored as raw character strings with special values embedded as indicators (e.g. ’99/99/9999′). Such data must be stored in a form that allows (for example) date range comparisons. To do this, the data warehouse analyst must develop a list of transforms for each data element. What is done if data is missing or NULL? The data warehouse analyst develops a list of default values.
Last is data loads. This usually involves combining multiple data sources into a single data warehouse object. Here, performance considerations are paramount as the loads usually involve comparing and matching of multiple master files, as well as sorting.
The final big data store issue is data archiving and purging. Eventually, stale, old, or unused data must be removed from the data warehouse, if only to keep its total size manageable. One common implementation involves splitting tables into multiple physical datasets, or partitions, with each partition holding rows from a particular date range. Data archiving or purging can not be accomplished by copying or deleting a single dataset rather than by executing a data-intensive process.
Analytics are user-written queries that combine historical data in the data warehouse in order to answer business-related questions. Most of these ‘what if?’ questions involve grouping the data by geographic region, by time, by company, or by some other subsetting criteria, followed by statistical analysis and comparison across the groups. Sometimes called business intelligence or BI, such queries tend to use a lot of data and CPU resources to retrieve large amounts of data, sort it, group it into sets, and derive statistical measures.
In one sense, big analytics is a good problem to have. The more operational data you store and accumulate, the more questions can be asked over longer time periods and wider groupings. This can be quite valuable to the enterprise. As BI users proliferate and the warehouse grows in size, management must coordinate data models and query performance tuning.
One example is long-running load processes. As large amounts of data are loaded into the warehouse queries may lock out or slow down the data loads. The reverse can be true, as mass loading prevents queries from accessing data until the loads are complete.
One common way to permit coexistence of mass loads and user queries is a table design called active/inactive. The target table is divided physically into two datasets or partitions. One partition is designated as the active partition. BI queries access the active data, allowing the load processes to load data to the inactive partition. Once the loads are complete, the designation of active/inactive partitions is switched.
One relatively new option for dealing with big analytics is special-purpose hardware. Once such choice is the IBM Smart Analytics System (see http://www-01.ibm.com/software/data/infosphere/smart-analytics-system/ for more information.)
The best way to approach performance tuning in the big data environment is through analysis of resource constraints. The most important resources are the disk subsystem for accessing data and CPU for statistical analysis. Since a typical day in the life of a big data implementation involves an ETL phase followed by an analytics phase, systems support and database administration can emphasize disk performance during the former and CPU performance during the latter.
Another aspect of performance tuning involves query tuning. The Explain command provides basic access path information that can be used by database administration. Some possibilities will include modern access paths like the starjoin, database design options such as partitioning, and additional tuning tactics such as indexes and data distribution statistics.
Big data can be broken down into four main areas: datastore, data movement, analytics and performance. Each area involves some management and coordination across support staff and managers alike. In order to plan a successful big data implementation, designers and managers must look to each of the four areas and ensure they get the attention they deserve.