Enterprise growth leads to information explosion and vice versa. With various types of archiving, purging and retrieval methods available, the question becomes, ‘what are the advantages/disadvantages in archiving and purging data.’ Read on to learn the pros and cons of the different methods.
Enterprise growth leads to information explosion and vice versa. Corporations
never used to see such great amounts of data flow. They used to handle and
process data in only Kilobytes and Megabytes; now, almost all of the companies
are starting to handling gigabytes
and terabytes of data. Retaining
current and historical data is inevitable, whether it is a data warehouse, Online
Transaction Processing data, data from a decision support system or from an
Significance of Archiving and Purging Data
There are many advantages in archiving and purging data. It helps your
current OLTP or application to perform faster. When archiving into NAS or tape
storage, the current OLTP or application doesn’t require more storage. Archive
and purge processes usually happen at night or on weekends so that it does not
take up resources during business hours.
How Much Data Should I Keep?
The amount of data that is required to be kept depends on the industry you
are in and the company and government policies. Some companies keep records
safe from the day the company started and others keep records for 7, 10, or 15
years and so on. This is driven by compliance or tax related policies or just
What Data Should We Keep?
The question "What data should we keep?" arises when you decide to
archive. Before deciding what data should be archived, we need to analyze
different criteria. For example:
a. Massive amounts of data will drag the current
application’s performance. Is it possible to separate actively used data and
infrequently used data.
b. If we do separate the infrequently used data, we
need to decide how often it needs to be retrieved. Should that data be kept in
the current application or moved somewhere?
c. If we move the infrequently used data to another
line of storage, how long will it take take retrieve them?
d. Should we move the archived data to an offsite
e. Is the data being stored for mere retrieval in
case of recovery from a disaster?
f. Should we just archive old data based on month
or quarter or semi-annual?
Separation of active, less active and inactive data is essential when it
comes to archiving and purging. There is always cost, performance and design
effort attached to it. Additional man-hours, storage, hardware or third party
vendor costs are associated with archiving; it is the necessary spending to
improve performance and secure the data. Without determining what to archive,
you cannot archive the data. Even if you do archive such data, it just occupies
too much space.
How to Separate Data?
No matter what kind of data you have, the only way you can separate active
data from the infrequently used data is by monitoring the data access. Most of
the RDBMS that we use now have some kind of tool to monitor the activity of the
data. If your data is not in relational form, then either you need to create an
application to monitor the data or you need to buy vendor products to monitor
the data to separate the active and infrequently used data.
How Frequent Should I Archive and Purge?
Depending on the amount of data flow in your application, you can archive
monthly data, weekly data, quarterly or semi annually, etc. Again, it all comes
down to what is being used most. One archive process in one application will
not be the same for another application. Some applications do not require an
archive at all while others need partial or complete archives. It is not
necessary to have both archive and purge on the same schedule. Purging may be
slower on certain systems, so it can be run quite often. Make sure that your
purge and archive process is scheduled only during off hours, during the nighttime
and weekends. Writing proper T-SQL/ PL SQL and stored procedures are necessary
for archive, purge and retrieval.
Media for Archiving
Costs related to archiving determine the media for archiving. SAN is more
expensive that NAS. NAS is more expensive than tapes. Some companies store
their archived data on DVDs and compact discs. Then again, it all depends on
the current market and the latest innovative technology in storage.
Types of Archiving
There are different types of archiving in different companies and even
within same company. The following are a few examples of archiving and purging
methods. The databases and tables that are discussed here could be any RDBMS
like Oracle Database or SQL Server database.
Database Backups: In many companies, archiving is simply taking a backup
of the full database daily and moving the backups to a different storage media
like NAS or tapes. The retrieval process on this type of archiving is a pretty
Archive Tables: In certain RDBMS, such as SQL Server, features like
table partitioning helps in archiving old data to various file groups and
allows you to take a backup of those file groups. The retrieval process is
pretty straight forward if the naming conventions of the file groups have
decent standards. Please refer the Microsoft
white paper on partition tables.
Archive Database: A few companies use a dedicated database on the
same server or on a different server for archiving. These databases are named
based on the archival schedule, so the retrieval process is easy and faster.
File Backups: If your source of data is not RDBMS, then you can
covert data to files and take a backup of those files on to NAS or tape
Summary table or database: Sometimes, if the granular data is not
important for any future reference, you could summarize the data and purge the
granular data. Those summary data could be on a separate table on an archive
Archived data should be secured as well. The data is always valuable and
sometimes it contains sensitive data that you don’t want to be mishandled.
Always protect the data by encryption or with a password. That should also be
part of your archive process.
It is not always necessary to archive the data in the same form as in the
source database. In OLTP or DSS, the data is de-normalized in many cases to
improve performance. However, when archiving you can reduce the storage space
by normalizing the data, by changing the data type or by compressing the column
data. Keep in mind that when retrieving the archived data you have to
de-normalize the data the way it was before, so that your application can
access it with any issue.
Compression of archived data is also important in the archival process.
Compression can be at the source or at the target or both. Products like
Microsoft SQL Server have compression
features available at row level and page level and backup level. If such
features are not available you can always buy vendor products like WinZip or use open source products like GZip, etc. to compress your data.
Data Retrieval Process
When demand arises, we need to get the data from archive. Depending on the
archive process, the retrieval process may include the following:
c. Merging many granular archived data to one
d. Attaching data back to the current data
Retrieval would also include writing T-SQL/ PL SQL and stored procedures.
Writing T-SQL and Stored procedures are more efficient when it comes to
automating. Keep in mind that purge takes more time and resources if the query
is not optimized. Also, try to run the purge in batches to reduce bulk resource
This article discussed and described the various types of archiving, purging
and retrieval methods available. Based on your company’s need and data archive
policies, you could choose any of these methods to develop and implement the
archive, purge and retrieval process.