Data Archiving, Purging and Retrieval Methods for Enterprises
January 18, 2011
Enterprise growth leads to information explosion and vice versa. With various types of archiving, purging and retrieval methods available, the question becomes, 'what are the advantages/disadvantages in archiving and purging data.' Read on to learn the pros and cons of the different methods.
Enterprise growth leads to information explosion and vice versa. Corporations never used to see such great amounts of data flow. They used to handle and process data in only Kilobytes and Megabytes; now, almost all of the companies are starting to handling gigabytes and terabytes of data. Retaining current and historical data is inevitable, whether it is a data warehouse, Online Transaction Processing data, data from a decision support system or from an application.
Significance of Archiving and Purging Data
There are many advantages in archiving and purging data. It helps your current OLTP or application to perform faster. When archiving into NAS or tape storage, the current OLTP or application doesn't require more storage. Archive and purge processes usually happen at night or on weekends so that it does not take up resources during business hours.
How Much Data Should I Keep?
The amount of data that is required to be kept depends on the industry you are in and the company and government policies. Some companies keep records safe from the day the company started and others keep records for 7, 10, or 15 years and so on. This is driven by compliance or tax related policies or just company policies.
What Data Should We Keep?
The question "What data should we keep?" arises when you decide to archive. Before deciding what data should be archived, we need to analyze different criteria. For example:
a. Massive amounts of data will drag the current application's performance. Is it possible to separate actively used data and infrequently used data.
b. If we do separate the infrequently used data, we need to decide how often it needs to be retrieved. Should that data be kept in the current application or moved somewhere?
c. If we move the infrequently used data to another line of storage, how long will it take take retrieve them?
d. Should we move the archived data to an offsite location?
e. Is the data being stored for mere retrieval in case of recovery from a disaster?
f. Should we just archive old data based on month or quarter or semi-annual?
Separation of active, less active and inactive data is essential when it comes to archiving and purging. There is always cost, performance and design effort attached to it. Additional man-hours, storage, hardware or third party vendor costs are associated with archiving; it is the necessary spending to improve performance and secure the data. Without determining what to archive, you cannot archive the data. Even if you do archive such data, it just occupies too much space.
How to Separate Data?
No matter what kind of data you have, the only way you can separate active data from the infrequently used data is by monitoring the data access. Most of the RDBMS that we use now have some kind of tool to monitor the activity of the data. If your data is not in relational form, then either you need to create an application to monitor the data or you need to buy vendor products to monitor the data to separate the active and infrequently used data.
How Frequent Should I Archive and Purge?
Depending on the amount of data flow in your application, you can archive monthly data, weekly data, quarterly or semi annually, etc. Again, it all comes down to what is being used most. One archive process in one application will not be the same for another application. Some applications do not require an archive at all while others need partial or complete archives. It is not necessary to have both archive and purge on the same schedule. Purging may be slower on certain systems, so it can be run quite often. Make sure that your purge and archive process is scheduled only during off hours, during the nighttime and weekends. Writing proper T-SQL/ PL SQL and stored procedures are necessary for archive, purge and retrieval.
Media for Archiving
Costs related to archiving determine the media for archiving. SAN is more expensive that NAS. NAS is more expensive than tapes. Some companies store their archived data on DVDs and compact discs. Then again, it all depends on the current market and the latest innovative technology in storage.
Types of Archiving
There are different types of archiving in different companies and even within same company. The following are a few examples of archiving and purging methods. The databases and tables that are discussed here could be any RDBMS like Oracle Database or SQL Server database.
Database Backups: In many companies, archiving is simply taking a backup of the full database daily and moving the backups to a different storage media like NAS or tapes. The retrieval process on this type of archiving is a pretty long process.
Archive Tables: In certain RDBMS, such as SQL Server, features like table partitioning helps in archiving old data to various file groups and allows you to take a backup of those file groups. The retrieval process is pretty straight forward if the naming conventions of the file groups have decent standards. Please refer the Microsoft white paper on partition tables.
Archive Database: A few companies use a dedicated database on the same server or on a different server for archiving. These databases are named based on the archival schedule, so the retrieval process is easy and faster.
File Backups: If your source of data is not RDBMS, then you can covert data to files and take a backup of those files on to NAS or tape storage.
Summary table or database: Sometimes, if the granular data is not important for any future reference, you could summarize the data and purge the granular data. Those summary data could be on a separate table on an archive database.
Archived data should be secured as well. The data is always valuable and sometimes it contains sensitive data that you don't want to be mishandled. Always protect the data by encryption or with a password. That should also be part of your archive process.
It is not always necessary to archive the data in the same form as in the source database. In OLTP or DSS, the data is de-normalized in many cases to improve performance. However, when archiving you can reduce the storage space by normalizing the data, by changing the data type or by compressing the column data. Keep in mind that when retrieving the archived data you have to de-normalize the data the way it was before, so that your application can access it with any issue.
Compression of archived data is also important in the archival process. Compression can be at the source or at the target or both. Products like Microsoft SQL Server have compression features available at row level and page level and backup level. If such features are not available you can always buy vendor products like WinZip or use open source products like GZip, etc. to compress your data.
Data Retrieval Process
When demand arises, we need to get the data from archive. Depending on the archive process, the retrieval process may include the following:
c. Merging many granular archived data to one
d. Attaching data back to the current data
Retrieval would also include writing T-SQL/ PL SQL and stored procedures.
Writing T-SQL and Stored procedures are more efficient when it comes to automating. Keep in mind that purge takes more time and resources if the query is not optimized. Also, try to run the purge in batches to reduce bulk resource usage.
This article discussed and described the various types of archiving, purging and retrieval methods available. Based on your company's need and data archive policies, you could choose any of these methods to develop and implement the archive, purge and retrieval process.