Introduction
There are two approaches to loadrefresh data from a source system to a data warehouse; the first approach is to load the full set of data every time and the second approach is to load the full set of data for the first time and then subsequently load only the changed data set as incremental data load. As you might have guessed, the first approach is not feasible especially in cases where you have humongous volume of data in your source table whereas in case of the second approach you need to have some mechanism to identify the changed data set in the source table after the last data pull so that only those changed data sets can be considered while pulling it from the source table and to load into the data warehouse.
Prior to SQL Server 2008 there was no in-built support to identify changed data set for incrementally pulling data from a source table and hence we had to write our own custom logic (for example by adding Last Created Date or Last Modified Date columns in the source table and updating it accordingly or by some other means) so that changed data sets can be identified for incremental data pull. Starting with SQL Server 2008 we have two different in-built mechanisms (please note, you don’t need to write code for leveraging these features though you just need to enable it accordingly as per your need) to identify DML changes (INSERT, UPDATE, DELTE) happening at the source table so that only changed data sets can be considered for data pull from the source table and to load into the data warehouse.
These two in-built mechanisms are Change Data Capture (CDC) and Change Tracking (CT). In this article series, I am going to talk about CDC in detail.
Understanding Change Data Capture (CDC)
Change Data Capture (CDC) is an Enterprise edition (available in Developer and Evaluation editions as well) feature and once enabled for a table, it captures DML changes (insert, update and delete activities) on a tracked table. The captured information then becomes available in a relational format for consumption. When you enable CDC on a table, SQL Server creates a table that contains same columns as a sourcetracked table along with the metadata needed to understand the changes that have occurred. Table-valued functions are created to systematically access the change data from the CDC table over a specified range, returning the information in the form of a filtered result set.
Change Data Capture (CDC) vs Change Tracking (CT)
Change Data Capture is an asynchronous process which reads the transaction log asynchronously in the background to track and record the DML changes (complete history of changes) whereas Change Tracking is a light-weight synchronous process, which tracks what has happened with the last changed data (no history). As the Change Data Capture feature captures the complete history of the changes it requires more storage space than Change Tracking. Change Data Capture requires SQL Server Agent to capture the information from the SQL Server transaction log as it works in asynchronous manner whereas Change Tracking captures information about the changes in synchronous manner as part of the user transaction itself.
How CDC Works
Once CDC is enabled on a tracked table, SQL Server uses an asynchronous capture mechanism that reads the SQL Server transaction logs and populates the CDC table (table which keeps track of history of changes along with meta-data about changes) with the row’s data that changes. This feature is entrenched in transaction log architecture, thus a lot of the metadata in CDC is related around the concept of a Log Sequence Number (LSN, every record in the transaction log is uniquely identified by a LSN. LSNs are ordered such that if LSN2 is greater than LSN1, the change described by the log record referred to by LSN2 occurred after the change described by the log record LSN1).
To enable Change Data Capture for a table, you first need to enable it at the database level using sys.sp_cdc_enable_db system procedure, which creates the change data capture objects that have database wide scope, including meta-data tables and required Data Definition Language (DDL) triggers. It also creates the CDC schema and CDC database user. You can verify if the database is enabled for CDC or not by looking at the is_cdc_enabled column for the database entry in the sys.databases catalog view.
Next you can enable CDC for the required table using sys.sp_cdc_enable_table system stored procedure. When CDC is enabled for a table, a CDC table (a table that keeps track of history of changes along with meta-data about changes) and one or two table-valued functions are generated along with capture and cleanup jobs for the database if this is the first table in the database to be enabled for CDC. You can verify if the table is enabled for CDC by looking into the is_tracked_by_cdc column of the sys.tables catalog view.
Once enabled, each DML change to the tracked table is captured from the SQL Server transaction log and it writes changes to the CDC table that is accessed by using a set of table-valued functions.
By default, all of the columns in the tracked table are considered for capturing changes though you can specify only a subset of columns, for privacy or performance reasons, using the @captured_column_list parameter of sys.sp_cdc_enable_table system stored procedure. Also, by default, the CDC table is created in the default filegroup of the database though you can specify to create it on another filegroup using @filegroup_name parameter.
It’s not mandatory to have SQL Server Agent service running when enabling a databasetable for CDC but it must be running for CDC to work properly. Please note, if the SQL Server Agent job is not running then transaction log will keep on growing and will not get truncated after transaction log backup and hence you need to ensure SQL Server Agent job is running to ensure changes are read from transaction log and written to the CDC table.
What about CDC Tracking Table Growth?
You might be wondering about the CDC table (table which keeps track of history of changes along with meta-data about changes), will it keep on growing? The answer is NO, there is an automatic cleanup process that occurs every three days by default (and this is configurable and can be changed as per specific need). For more intense environments, where you want to directly manage the CDC table cleanup process, you can leverage the manual method using the system stored procedure sys.sp_cdc_cleanup_change_table. When we execute this system procedure you need to specify the low LSN and any change records occurring before this point are removed and the start_lsn is set to the low LSN we specified.
What Happens When Tracked Table is Changed?
DDL changes are not prevented for a CDC tracked table but a new column added will not be reflected and a dropped column will return null values for the column in the subsequent change entries. It means, CDC ignores any new columns that are not identified for capture when the source table was enabled for CDC for the current capture instance as it retains or preserves its shape when DDL changes are applied to its tracked table. However, it is possible to create a second capture instance for the table that reflects the new column structure. Please note, a tracked table can have a maximum of two capture instances.
Conclusion
In this article, I talked about Change Data Capture (CDC), which captures DML changes (insert, update and delete activities) on a tracked table and can be used to incrementally pull data from the tracked table. In my next article on this series, I am going to demonstrate how this new feature can be leveraged, in detail, with an example.