Classic disaster recovery planning from the DBA’s point of view consisted of making regular database backups and storing them off-site. If a disaster occurred, these backups could then be used to restore databases at a secondary site. Sometimes this disaster recovery (DR) site was maintained by a third-party vendor for multiple clients to use.
Today, the vast amounts of data companies use and maintain make this method unworkable. Databases may be too large to back up conveniently, and large databases may take hours, days or longer to restore at the DR site. In addition, big data applications have changed the way that organizations view mission-critical systems.
What elements make up current-day disaster recovery planning, and how are large mainframe and big data applications recovered?
Recent Changes in Information Technology Infrastructure
In the last decade we observed incredible changes in the amount of data we process and store, and the amount of processing power available.
- Companies now have the ability to store data across multiple data centers or in the cloud;
- CPU speed increases and massive network bandwidth in many locations made audio and video available for presentation to users;
- Big data and business analytics drove organizations to implement massive data stores with historical data;
- Large enterprises needed to ensure that potential disasters would not adversely affect business continuity.
All of these factors combined to create an environment where terabytes of data (or more) must be available with quick response times to applications and users. Hardware and software evolved to meet this challenge, including massive disk arrays, high-capacity tape storage and even hybrid hardware and software solutions for big data.
The DBA’s Priorities
Despite a host of changes in environments, infrastructure and applications, the DBA still has the same priorities. The DBA must ensure that enterprise data is:
- Recoverable;
- Available to applications and users;
- Quickly accessible.
We concentrate on the first of these, data recoverability. The DBA must plan for contingencies surrounding multiple categories of disasters, from a single file corruption to a disk failure to complete site loss. In times past, things were straightforward: the DBA made frequent database and file backups to tape, and these were transported and stored off-site. When a disaster occurred, the backups could be retrieved and used as input to rebuild files or databases.
As data volumes grew, the time required to rebuild one or more databases became overly long. Companies could no longer afford several days of downtime after a disaster. DBAs needed better options. Backing up all of the data became onerous and prolonged, and consumed a great number of tapes and other backup media. Some backup tactics provided relief, such as using data compression algorithms to minimize backup file sizes or frequent backups of recently changed data combined with infrequent backups of read-only data such as the historical portion of the data warehouse.
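The tiered backup tactic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TableInfo` class, the table names, the 50% compression ratio and the daily and 30-day intervals are all hypothetical values chosen for the example, not recommendations or figures from any particular DBMS.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    name: str
    size_gb: float
    read_only: bool  # e.g. the historical portion of the data warehouse


def plan_backups(tables, compression_ratio=0.5,
                 active_every_days=1, read_only_every_days=30):
    """Return (table name, backup interval in days, compressed GB per backup)."""
    plan = []
    for t in tables:
        # Read-only data is backed up rarely; changing data is backed up often.
        interval = read_only_every_days if t.read_only else active_every_days
        plan.append((t.name, interval, t.size_gb * compression_ratio))
    return plan


def backup_volume_gb(plan, days=30):
    """Total compressed volume written to backup media over `days` days."""
    return sum(gb * (days // interval) for _, interval, gb in plan)


# Hypothetical tables: a small active table and a large read-only history table.
tables = [
    TableInfo("orders_current", 200.0, read_only=False),
    TableInfo("orders_history", 4000.0, read_only=True),
]

tiered = plan_backups(tables)
everything_daily = plan_backups(tables, read_only_every_days=1)

print(backup_volume_gb(tiered))            # 5000.0
print(backup_volume_gb(everything_daily))  # 63000.0
```

With these assumed numbers, backing up the read-only history once a month instead of daily cuts the 30-day backup volume from 63,000 GB to 5,000 GB. It does not shorten the time needed to restore the large table after a disaster, which is the limitation the following paragraphs address.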
The most common large enterprise disaster planning solutions rely on a combination of a secondary site with sufficient disk storage, some form of disk mirroring between sites and a high-speed network. Any changes on disks at the primary site are sent and applied to disks at the secondary site.
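As a rough mental model of mirroring between sites, the following Python sketch queues every write made at the primary site and later applies the queued changes, in order, at the secondary site. It is a toy in-memory model, not a description of any vendor's mirroring product; the `MirroredVolume` class and its method names are hypothetical.

```python
class MirroredVolume:
    """Toy model of a disk volume mirrored from a primary to a secondary site."""

    def __init__(self):
        self.primary = {}    # block id -> data at the primary site
        self.secondary = {}  # block id -> data at the secondary site
        self.pending = []    # changes written locally but not yet shipped

    def write(self, block_id, data):
        """Apply a change at the primary site and queue it for the secondary."""
        self.primary[block_id] = data
        self.pending.append((block_id, data))

    def ship_changes(self):
        """Apply queued changes at the secondary site in their original order."""
        shipped = len(self.pending)
        for block_id, data in self.pending:
            self.secondary[block_id] = data
        self.pending.clear()
        return shipped

    def in_sync(self):
        """True when nothing is queued and both sites hold identical blocks."""
        return not self.pending and self.primary == self.secondary


volume = MirroredVolume()
volume.write(1, "customer row v1")
volume.write(2, "order row v1")
volume.write(1, "customer row v2")

print(volume.in_sync())       # False: three changes are still queued
print(volume.ship_changes())  # 3
print(volume.in_sync())       # True
```

Anything still in the `pending` queue when the primary site is lost never reaches the secondary site, which is why a high-speed network between sites is part of the solution.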
The mirrored storage solution has one major drawback: it requires that the secondary site have sufficient disk storage for all mission-critical corporate data, and sufficient CPU power available to run critical applications. The advent of big data made this even worse.
Big Data Disaster Recovery
A typical big data solution promises massive data storage coupled with fast query response times. One architecture by IBM, called the IBM Db2 Analytics Accelerator (IDAA), consists of an array of disk drives that store Db2 table data in a proprietary format. This is coupled with analytics software that takes advantage of both data storage and parallel disk processing by storing data across several hundred disk drives and then querying every drive at the same time. Because data is spread across multiple drives and a query is processed against all drives simultaneously, requestors experience extremely fast response times. Such hybrid systems are generally called appliances.
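The idea of spreading data across drives and querying every drive at once is a scatter-gather pattern. The Python sketch below illustrates it with in-memory lists standing in for drives. It makes no claim about how IDAA or any other appliance is implemented internally; the function names, the row layout and the choice of eight "drives" are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, n_drives):
    """Spread rows round-robin across n_drives in-memory 'drives'."""
    drives = [[] for _ in range(n_drives)]
    for i, row in enumerate(rows):
        drives[i % n_drives].append(row)
    return drives


def scan_drive(drive, predicate):
    """Partial result for one drive: sum of 'amount' over matching rows."""
    return sum(row["amount"] for row in drive if predicate(row))


def parallel_sum(drives, predicate):
    """Scatter the scan to every drive at once, then gather the partial sums."""
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        partials = pool.map(lambda d: scan_drive(d, predicate), drives)
        return sum(partials)


# Hypothetical transaction rows, alternating between two regions.
rows = [{"region": "east" if i % 2 == 0 else "west", "amount": i}
        for i in range(1000)]
drives = partition_rows(rows, n_drives=8)

total = parallel_sum(drives, lambda r: r["region"] == "east")
print(total)  # 249500
```

Each worker scans only its own slice, so the answer is assembled from eight small partial sums instead of one long scan. Python threads are used here only to show the pattern; an appliance gains its speed from dedicated hardware working in parallel.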
When big data applications first appeared, they were not considered part of disaster planning. Most (if not all) queries against the big data database were ad hoc or “what if?” questions looking for patterns in historical data. However, as applications matured, and the database grew, users found that some previously submitted queries were consistently useful. These became regular reports. In addition, some operational applications began accessing big data as part of normal processing. For example, an on-line application might query the appliance data to analyze transaction history to make purchasing recommendations or to analyze patterns of purchases for potential fraud.
As connections to operational systems and regular analytical reporting became more common, it became necessary to include the big data application in disaster planning. Again, these huge volumes of data take too long to back up and too long to recover, and they require a secondary site with complete hardware and software available. Modern enterprises must consider implementing the next generation of information technology: cloud services.
The idea of the cloud began with the internet. Connecting multiple users to share a common work goal started with academics sharing and reviewing papers and then progressed to centralized applications that used connected hardware to share processing power. Presently, many vendors provide services that allow the enterprise to offload some of its processes to vendors’ sites. These include application execution (called Application as a Service, or AaaS) and database storage (called Database as a Service, or DBaaS). By judiciously choosing the appropriate services, you can allow third parties to worry about running your applications and storing your data.
To extend these options to your disaster recovery planning, you may choose services that promise specific recovery options in case of a disaster, or you can work with multiple vendors and manage backup and cutover of storage and processing yourself.
The greatest benefit of cloud services is that they relieve you of hosting, managing and staffing multiple data centers and their associated high-speed networks. This can speed application development, particularly when you use DBaaS for database definition and creation. This benefit can also be a disadvantage, as you may lose control of certain aspects of your application and database administration such as performance tuning and software updates.
Another issue arises when you use such services to host your data warehouse and big data, because the warehouse and the big data application are commonly closely integrated. Many of the dimension tables in the warehouse are used in queries against big data to do aggregations and grouping. It is common to maintain two copies of dimension tables, one in the data warehouse and one in the appliance. This is because the appliance software depends upon local storage of all data in its proprietary format in order to deliver fast query responses.
If you are considering DBaaS for either your big data or your data warehouse, you may want to implement both in the cloud. This may be a good solution if you are already doing rapid application development in a cloud environment and some of your applications require access to this data. Keep in mind that query performance will depend not only on the specific hardware and software choices made by the DBaaS provider but also on your network connection to their servers.
Disaster Recovery Planning
If you choose to implement applications or data storage through cloud services, this radically changes your planning for disasters. Planning is now a multi-company effort, involving both your staff and the staff of your service providers. Your industry or applications may require formal disaster recovery testing. This is particularly important for financial applications, which are subject to legal restrictions and regulations.
If your systems must be continuously available (that is, 24 hour, 7 day availability), then a disaster recovery test will take place while your operational systems are running. This will require strict segregation of testing hardware and data, including your own and that of the cloud service providers.
You should automate as many recovery processes as possible to avoid human error and delays. Most importantly, measure and review the time required for recovery. You should have a formal goal for complete recovery of data and application processing called the recovery time objective (RTO). As data volumes grow and the number of applications increases, your total recovery time may reach or exceed your RTO. Record and regularly review recovery time results in order to foresee potential future problems.
Most large enterprises have one or more big data applications and an enterprise data warehouse. As data volumes increase, recovering from a disaster becomes more expensive and time-consuming. Consider integrating cloud services into your mix of applications and database management. This can allow you to avoid some of the costs associated with disaster recovery, including multiple large data centers and high-speed networks. Take care to include a discussion of disaster recovery planning and testing with potential service providers, keeping in mind that data recoverability is the primary task of the DBA.