Big data is the latest craze. Hardware and software vendors have overwhelmed IT departments with high-speed analytical software, proprietary high-performance hardware, and columnar-based data stores promising quick access and lightning-fast answers to ad hoc analytical queries.
Forgotten in this blast of technology are the database administrators' most important responsibilities: backup and recovery.
Are Backups Necessary?
Some vendors (and users!) argue that backup and recovery of big data is not an important issue. Some of the reasons are:
- After a disaster, recovery of the operational data (accounts, orders, customers, and so forth) has a much higher priority;
- Big data solutions are not mission-critical; further, since analytics are performed on a wide range of time series, recovery of big data need not be completely up-to-date;
- Big data is, well, big! The cost of the storage media required to back up vast amounts of data is prohibitive;
- The processing and data channel capacity required to read that storage and reload the big data tables is costly; indeed, completely recovering the entire data store may take days or weeks.
These reasons may not apply in your situation. Many customer-focused systems employ big data analytics as a part of operational processing. This means that a failure of a big data implementation may directly affect the company's ability to serve customers. These big data applications have become mission-critical, making data recoverability a high priority.
Prudent disaster recovery planning involves the ability to recover your data to a consistent point (the recovery point objective, or RPO) within a limited time period (the recovery time objective, or RTO). The combination of backup strategies you choose is driven by these objectives. Several different backup methods exist; they fall into the following categories:
- Local and remote copies
- Virtual snapshots
- Replication
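As a rough, hypothetical illustration (the class and the numbers below are invented, not part of any DB2 utility), the way backup frequency and log-apply speed combine to determine whether an RTO can be met might be sketched like this:

```python
from dataclasses import dataclass

@dataclass
class BackupPlan:
    backup_interval_hrs: float   # time between full backups
    restore_hrs: float           # time to restore one full backup
    log_apply_rate: float        # hours of logged changes applied per elapsed hour

def worst_case_recovery_hrs(plan: BackupPlan) -> float:
    # Worst case: the disaster strikes just before the next backup,
    # so a full backup interval of log records must be re-applied.
    return plan.restore_hrs + plan.backup_interval_hrs / plan.log_apply_rate

def meets_rto(plan: BackupPlan, rto_hrs: float) -> bool:
    return worst_case_recovery_hrs(plan) <= rto_hrs

plan = BackupPlan(backup_interval_hrs=24, restore_hrs=2, log_apply_rate=6)
print(meets_rto(plan, rto_hrs=8))  # 2 + 24/6 = 6 hours -> True
```

If the data volume grows so that the restore takes longer, or transaction volume grows so that more log must be replayed, the same arithmetic shows when the plan falls outside the RTO and a faster backup method is needed.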
Local and remote copies consist of a combination of disk file backups and database utility backups. IT support personnel execute jobs that make media-to-media copies of the operating system files that contain DB2 data, while DB2 database administrators run IBM or third-party tools that access the data through a DB2 interface. Disk file backups tend to be cheaper, since the proprietary vendor software products carry licensing costs; however, the vendor tools (such as the DB2 Unload and Image Copy utilities) tend to be much faster, and also offer a speed advantage during recovery processing.
Virtual snapshot is a hardware solution that allows the storage media to create a virtual backup of an entire system. Database writes are disabled for a short period, while the hardware managing the storage subsystem takes internal copies of all files. This copy process can be extremely fast, sometimes completing in seconds. After copying is complete, the database management system is allowed to resume write operations.
Snapshots provide extremely fast recovery times, assuming that recovery to the point in time the snapshot was created is acceptable. Recovery to any other point requires some method of applying all subsequent database changes (captured in logs) to the snapshot. Another issue is storage capacity: a snapshot can require double the currently used storage. And if a disaster occurs and the snapshot becomes the current data, yet another snapshot area must be allocated to guard against a further disaster.
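The suspend-copy-resume sequence behind a virtual snapshot can be sketched as follows. This is a minimal illustration, not a real storage interface: the `Database` class and its methods are hypothetical stand-ins for the DBMS and storage subsystem controls.

```python
from contextlib import contextmanager

class Database:
    """Hypothetical stand-in for a DBMS whose write activity can be suspended."""
    def __init__(self):
        self.writes_enabled = True
    def suspend_writes(self):
        self.writes_enabled = False
    def resume_writes(self):
        self.writes_enabled = True

@contextmanager
def write_suspension(db):
    # Writes are disabled only for the brief window the snapshot needs;
    # the finally-block guarantees they resume even if the copy fails.
    db.suspend_writes()
    try:
        yield
    finally:
        db.resume_writes()

def take_snapshot(db, files):
    """Copy every file while the database is quiesced, yielding a consistent image."""
    with write_suspension(db):
        snapshot = {name: bytes(data) for name, data in files.items()}
    return snapshot
```

The essential property is that all files are captured inside one quiesce window, so the snapshot represents a single consistent point in time.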
Replication involves creating a set of complete database images in storage at another site, and then capturing all local database changes that need to be made at the replication site. This second site would then become the primary site should a disaster occur. The advantage of this method is that the RPO can be achieved almost to the point-in-time of the disaster.
Assume that the second (recovery) site has been loaded with a copy of local data. The DBA now implements one of several methods to capture any database changes made locally. These changes are then sent to the recovery site and applied there. One method is to capture any Insert, Update, or Delete SQL statements. Another is to capture all log records made by the database management system. Multiple vendor tools exist that can automate this process.
Automating Recovery Processes
When it's time to recover multiple databases, every second counts. This is why processes, scripts, and job control language (JCL) for all the various combinations of database recoveries should already exist. The DBA creates these at the same time the database is created. As tables and indexes are added to the database, the DBA updates the backup and recovery processes.
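The idea of generating recovery steps ahead of time from an inventory of database objects can be sketched as below. The step text is illustrative pseudo-commands, not real JCL or DB2 utility syntax:

```python
def build_recovery_jobs(objects):
    """Generate one ordered list of recovery steps per database.

    `objects` maps a database name to its tables and indexes. The point is
    that these scripts exist *before* a disaster and are regenerated
    whenever tables or indexes are added.
    """
    jobs = {}
    for db, contents in objects.items():
        steps = [f"RESTORE TABLE {t} FROM LATEST IMAGE COPY"
                 for t in contents["tables"]]
        steps += [f"REBUILD INDEX {i}" for i in contents["indexes"]]
        steps.append(f"APPLY LOG RECORDS FOR {db} TO CURRENT")
        jobs[db] = steps
    return jobs

jobs = build_recovery_jobs(
    {"SALESDB": {"tables": ["ORDERS", "CUSTOMERS"], "indexes": ["XORD1"]}}
)
```

Regenerating the jobs from the object inventory, rather than hand-editing them, is what keeps the recovery processes current as the schema grows.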
These processes require regular review. As data volumes increase, backup times and recovery times may increase as well. As recovery times extend, they may exceed the RTO for the application.
Another part of the review deals with the RPO. The process of recovering to a particular point in time begins with restoring data from backups. Next, the DBA runs a process that applies log records to the backup. These log records are the ones created after the time of the last backup, and represent table changes since that time. As transaction volumes and data volatility increase, the number of log records since the backup increases as well. Consequently, the DBA may consider more frequent backups, backups of indexes, or even moving to high-speed backup options such as virtual snapshots.
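A minimal sketch of that restore-then-apply sequence, assuming invented backup and log-record structures rather than real DB2 log records:

```python
def recover_to_point_in_time(backup, log, target_ts):
    """Restore the backup image, then apply only the log records written
    after the backup and at or before the target timestamp."""
    table = dict(backup["rows"])  # start from the restored image
    for ts, op, key, value in sorted(log, key=lambda rec: rec[0]):
        if ts <= backup["taken_at"] or ts > target_ts:
            continue  # already in the backup, or beyond the recovery point
        if op == "DELETE":
            table.pop(key, None)
        else:  # INSERT or UPDATE
            table[key] = value
    return table

backup = {"taken_at": 100, "rows": {1: "a"}}
log = [(90, "INSERT", 1, "old"),   # before the backup: already included
       (110, "UPDATE", 1, "b"),    # after the backup: must be applied
       (130, "INSERT", 2, "c")]    # after the target: must be skipped
print(recover_to_point_in_time(backup, log, target_ts=120))  # {1: 'b'}
```

The loop makes the cost relationship visible: the more log records fall between the last backup and the recovery point, the longer this phase runs, which is what drives the move to more frequent backups or snapshots.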
Because almost every IT application environment changes over time, recovery processes must be tested regularly. Testing allows the DBA to confirm that processes are up-to-date, measure recovery times, and plan for future capacity changes. In addition, recovery practice is a great way to train new or inexperienced staff in recovery procedures. One positive result of this training is a successful disaster recovery test, which some enterprises require for regulatory compliance or service agreements.
What Do You Need?
DBAs should ensure they have all of the following:
- Documentation on the recovery status of all production objects;
- Regular measurements of required recovery times for objects belonging to critical applications;
- Development of alternative methods of backup and recovery for special situations (such as image copies of indexes, data replication to a recovery site, and DASD mirroring);
- Regular development, improvement, and review of data recoverability metrics.
Big data applications, like operational systems, must be supported by a robust and rapid recovery process. Creating these processes requires designers to begin at database design time and follow through to the first production implementation.
Recovery requirements are an integral part of any application or database solution. Database designers and administrators can ensure that recovery time objectives are met by utilizing the appropriate combination of backup and recovery processes discussed here.
See all articles by Lockwood Lyon