Is It Just Big Data?
Current use of the term big data is a result of the confluence of several industry trends over the last decade. These trends include accumulation of large amounts of data in databases, the advent of the data warehouse, multiple means of high-speed data transfer, and an explosion in the design and use of sensors for gathering data.
Database designs have evolved to meet the demands of this new environment. Database administrators (DBAs) have developed multiple logical and physical design methods; database management systems (DBMSs) now provide additional storage methods and access paths to support big data storage and retrieval.
Application developers must understand how these database designs will affect how their applications are built and tested.
Database Designs for Big Data
In order to store and process massive amounts of data efficiently, database administrators have several physical database design options available to them. Here are the most common.
Horizontal Partitioning. This is the most common option. Data for a particular database or table is split into multiple physical datasets called partitions. Each partition contains rows with a common key or key range. One method is to have each partition represent a particular day (or month or year). This is common in data warehouses.
This results in several advantages. With each partition representing a specific date range, queries that analyze data for a particular time period will find the relevant rows stored contiguously in a small physical area, rather than being spread across the database. This also allows one partition to be loaded while others are being queried; that is, contention between the load process and analytics is minimized.
Another common horizontal partitioning scheme is active/inactive. A table's data is stored in two physical partitions, one designated as active. Data in the active partition is available for querying. New data may be loaded into the inactive partition without disturbing querying of active data. Once the new data load is complete, the switch or flag indicating which partition is currently active is updated, thus reversing the roles of the partitions.
Vertical Partitioning. In this method, the database administrator analyzes the column data in a table to determine if certain columns are updated or queried more or less frequently. The table data is then split into two tables, each with the original table's key. One table contains the frequently-used columns, the second table the remainder. This can greatly increase query performance, since SQL accessing frequently-used columns now accesses only one of the tables, which is now significantly smaller.
There are other criteria for splitting rows. These include:
- Security. Columns containing personal or sensitive data are segregated to a separate table having more strict access requirements.
- Performance. Columns containing numeric fields that are commonly summed or aggregated, such as account balances or on-hand quantities, can be segregated to a smaller table, making calculations faster.
- Recoverability. Some column data may be critical to the enterprise: think account numbers and balances displayed on an ATM. These could go to a separate table that would be backed up more often than other data such as addresses or telephone numbers.
Minimal Index Coverage. In a sense, indexes contain data just as tables contain data. If a large table has several indexes the result may be that the indexes take up more storage than the table! This happens more often that you would think. Like most database management systems, DB2 permits table data to be compressed. For text and numeric data, such as is found in most business applications, a compression of about 75% is possible. One typical financial database from a Fortune 1000 corporation contains about 1 terabyte of raw data. Due to compression, it only takes up about 300 gigabytes of storage. However, the indexes on the data take up about 500 gigabytes of storage.
Another concern is performance. During a table load or row insert process, each row added to a table must be accompanied by updating each index. The more indexes you have, the more I/O is involved. In a big data environment, this can be very costly in terms of loading elapsed times and CPU usage. Consequently, big data tables tend to have a minimal number of indexes.
Concurrent Access by Queries and Utilities. Of course, if you have a lot of data to load into a table it may take a long while. Some of the partitioning schemes mentioned earlier mitigate this somewhat, as it may be possible to load multiple partitions simultaneously. Still, many IT organizations must support application access during the entire 24-hour day. This means that data load processes must co-exist with application data access.
Infrequent Backups. A big data table has a lot of data. Creating a backup copy will take some time to run and will require lots of storage for the copy. Consequently, database administrators will tend to back up big data tables less frequently. For the partitioning schemes mentioned earlier, it is common to back up only the partition recently loaded. This technique means that successful full table recovery depends upon the existence and retention of backups of other partitions made in the past.
The implications of big data for applications are many and varied. Here are some of the most common concerns and recommended best practices.
Ensure Access to Correct Partitions. This advice seems obvious. However, most application SQL statements that retrieve data from a partitioned table (or set of vertically-partitioned tables) may be unaware of the partitioning scheme. To avoid scanning the entire table for data that resides in a single partition, application SQL must contain explicit partition selection criteria. For tables partitioned by date, this criteria may be in the form of a BETWEEN predicate specifying a date range.
In data warehouse environments it is common to have a table designated as the time dimension table. This table contains columns indicating which partition numbers contain data for which date ranges. Application designers then code SQL with a join between the base table and the time dimension table specifying the date range on the appropriate time dimension table columns.
Ensure Read-Only Access. There are several cases where what appears to be a simple query will be treated as a potential update. One common issue is the ambiguous cursor. This is an SQL statement that may be used to retrieve a table data row for the purposes of later updating or deleting it. If DB2 detects that a cursor is ambiguous, it may lock data in preparation for a possible later update. This locking behavior may conflict with other application access to the data. This issue can be easily addressed by reviewing cursor coding options including the FOR FETCH ONLY clause.
Perform Frequent Commits. Commits are used to release locks and "claims". Most application developers are familiar with the concept of data locking, where an application locks or reserves data for its exclusive use. For query-only applications, locking can be eliminated using the SQL syntax WITH UR, meaning uncommitted read. This allows the application to avoid locking data; however, there is another issue. DB2 controls simultaneous application and utility (e.g. LOAD) access using a facility called a claim. SQL access to a table results in a claim on the data. utilities are prohibited from executing on the data until the claim is released. For an application, a commit will release a claim.
This has implications in a big data environment. Recall earlier the discussion about 24-hour simultaneous access to data by applications and utilities alike. It is common for data load utilities to be running at the same time as application queries. Consequently, applications must be coded to minimize their potential blocking effects on such utilities.
Another issue is that of table split into active and inactive partitions. In order to access the active data an application must query the switch or flag that indicates the active partition. It is common to have a flag table that has one row for each table partitioned this way; hence, accessing the flag requires a table access and causes a claim on the flag table. Unless this claim is released, the process loading the inactive partition will be unable to change the value of the switch.
In summary, applications should take frequent commits to release held locks and claims.
Take Care with Big Data Queries. Queries on a big table may return a lot of data, sometimes by accident! Since the downside of an SQL coding error is now much more severe, application developers should take much more care to understand the performance of their queries. This involves two well-known best practices.
First, use the Explain facility to determine the access paths used by your queries. Remember that the big data database design may include partitioned tables and minimal indexes. This may result in an increased number of table joins when a decreased availability of index-based data access paths. In addition, there are now access paths available that may be new to the developer. One of these is parallelism, where the DBMS splits the query into multiple parts and executes them in parallel. Another access path is called star join, which involves a special table joining method.
Second, developers should insist on increased load testing. This involves both query Explains and actual execution in an environment with production-like volumes of data. Since the Explain process uses data distribution statistics to determine the lowest cost access path, a load test environment is essential if you want to ensure your production access paths are efficient.
Load with no Logging. Populating a large table with a load utility involves I/O against the table and against all of the table's indexes. In addition, the system default is that all table changes are logged. This means that the DB2 logs will contain row images that can be used for table recovery purposes.
In a big data environment, database administrators commonly turn off the logging feature during a table load. This is because most load processes are partition loads. The preferred method is to first load the data partition with no logging. This is then followed by executing an image copy. This practice has the advantage of minimizing the elapsed time and I/O resources required by the load while maintaining recoverability. Note that the timing of the image copy may be to execute it later that day, or on the weekend.
This affects the application developer in the following scenario. One method of accessing table data without interfering with other applications and without holding locks or claims is by unloading data from the image copy dataset. There are several utilities available for doing this. If applications consider this data access means, they must coordinate with the database administrator.
A big data environment means a change in the way database administrators design and manage corporate data. These changes will affect the way applications must be coded and tested in order to ensure data availability and application performance.
ETL, data cleaning
Managing DB2 Performance
Star Schema, fact table, dimension table
See all articles by Lockwood Lyon