February 18, 2001
One of the constructs that I find myself building quite often is a process that periodically retrieves an incremental amount of data. So many reports that I build, whether for myself or for others, are looking for the new data since some x amount of time. It might be the sales for the last day, changes on a particular table, etc., but usually it is the change in some data over the last day.
Of course, this is usually a fairly easy report structure to build. I have routines that determine the previous day's date, resetting the time to midnight (00:00:00) and then select all data that is new since this date. Since most of the tables I am working with have some sort of datetime field to track changes, this results in a query that looks something like:
declare @dt select @dt = cast( substring( cast(dateadd( day, -1, getdate()) as char( 20)), 1, 12) as datetime) select * from sales where SaleDate > @dt
Where @dt is the previous day's date, calculated with a time set to some value.
This works great for the first week that it is implemented. The next Monday when someone comes into work and runs this report, they usually see a smaller report than they are expecting. Why?
Well, in a few e-commerce applications I have worked with, there is a bunch of activity on Friday and Saturday. I guess it's the end of the week and people do more surfing and purchasing. If we are calculating the previous day's date, then on Monday, we get all the sales for Sunday. This leaves out the sales from Friday and Saturday.
No problem, we can easily change our code that calculates the beginning date to account for weekends. With the following case statement, we can decide whether we need to get one day or three days worth of data.
declare @dt datetime, @d tinyint if datepart( weekday, getdate()) = 2 select @d = -3 else select @d = -1 select @dt = cast( substring( cast( dateadd( day, @d, getdate()) as char( 20)) , 1, 12) as datetime)
This will check for a Monday (datepart = 2) and then set the number of days to go back for the next statementl.
Since most of my processes are automated and I send an incremental amount of data to a user automatically on some schedule, this used to be my preferred solution. I have used this and it usually works until one of two things happens. Either the task fails for some reason and it takes a day or two to get fixed, or a holiday occurs. If the task breaks on Monday and it does not run again until Tuesday ( this usually happens when I go on vacation) the user gets a report on Tues that shows Mon sales, but is missing all the weekend sales. After a holiday, the user gets two (or more ) reports instead of a single consolidated report.
I came up with this solution while building an extract of data for a business partner. They wanted to receive an incremental load of mail information every two weeks rather than a complete load of data. Since our mail information was growing rapidly, this was a reasonable request and so I was tasked with building this.
I used one of the techniques above which worked for a month and then the task broke. By the time I had fixed it, a couple of days had gone by and I had to manually generate the report, otherwise my task would have missed a few days of data. It was then that I decided to build my process tracking table. Here is the code
Create table PrcssTrckr ( PrcssNm varchar( 80), LastXfr datetime, LastStat tinyint )
Nothing fancy, but the coumns are defined as follows:
How I Use This
This table allows me flexibility in a few ways. One, I can use the same data set, like mail information, for a number of different processes. Suppose I have snail mail and email extraction processes that run on the same customer information. For these two data feeds, I would add two rows to my PrcssTrckr table as follows:
PrcssNm LastXfr LastStat -------------- ------------------- ------------------------------ SnailMailFeed 01-01-1900 00:00:00 0 EmailFeed 01-01-1900 00:00:00 0
These would be the initial rows inserted. Now we start the snail mail export process which runs a query something like the following:
select name,address, city, state, country, zip from Customer c, PrcssTrckr p where c.created > p.lastxfr and p.PrcssNm = 'SnailMailFeed'
This returns a list of all members (since non were created before the beginning of the last century. After this data is extracted, the LastXfr date is set to GETDATE() and the table looks like the following:
PrcssNm LastXfr LastStat -------------- ------------------- ------------------------------ SnailMailFeed 02-10-2001 08:15:00 0 EmailFeed 01-01-1900 00:00:00 0
Now we know that this process last completed the time stamped on the first row. If we now run the same process tomorrow, the query returns an incremental feed of data containing customers created between 2-10-01 and 2-11-01. If I were to wait and run this every other day, then the incremental feed would include two days of customers.
The process now tracks its own progress and always returns an incremental feed of data. This is independent of the email process which could be running on a completely different schedule.
This also allows the process to fail and when it is fixed, still return the correct incremental amount of data to the user.
Over time, I find myself using this technique for many different processes, from data extractions to customized reports. Anyplace the incremental data feeds are needed, this has proven to be a valuable technique. It has even ensured correct reports when my server was down for a few days after our company moved to a new office. As with most techniques, I am sure that many of you will find even more and better uses for this than I have.
As always, I welcome feedback.Steve Jones