Divide and Conquer - Batching Part 1
December 10, 2000
There are many queries and processes that a DBA may perform on an SQL Server that run quickly and efficiently with small amounts of data, but do not scale up when run on large sets of data. The power behind a RDBMS and SQL Server come from the extremely efficient and rapid processing of sets of data. Most other programming languages operate in a row-by-row or procedural fashion working with one piece or line of data at a time. The ability to write SQL code that operates on a batch of data allows the alteration of a large amount of data with a small amount of code. It also allows the very quick and headache creating destruction of data with even smaller amounts of code, but that is another subject and another article.
I always preach about the writing of tight code that operates on sets of data rather than a row by row approach. I avoid cursors and temp tables when possible to ensure that ever operation is a batch-oriented statement rather than an individual row operation. When I see another DBA send me a procedure with a set rowcount xx in it, I am often returning this for a rewrite to work on a batch of data. As I write this, however, I am getting ready to eat my words.
There are times where the use of SET ROWCOUNT can be a great time saver and maximize the use of resources. As with most of my rules, there are always exceptions and the use of the best tool for the job is something that should always be kept in mind. Batches are something that all SQL programmers use to operate on a set of data, and there are times when it may sometimes be necessary to sub-divide your batches into sub-batches to more efficiently complete the process.
An example I will give may best illustrate the reason why batching a process is sometimes better involves the following objects:
--Transaction table Create table MyTrans ( AccountID int, TransID int identity (1, 1), TranDt datetime, Amt float, Balance float )With these items defined as follows: MyTrans table
Update transaction Set balance = sum( t.amount) from inserted I, transaction t Where i.accountID = t.accountID And i.accountID = transaction.accountid And t.date <= i.date)This table contains about 5 million rows of transactions spanning 15,000 or so accounts over about a decade. Recently I needed to correct a large number of rows historically in this transaction table. The corrections involved changing amounts of previous transactions that had been rounded incorrectly. I had received a file of data that contained the ids, dates, and corrected amounts. This data was loaded into a temporary table called TempSteve. It would have taken only one line of SQL code to perform this update:
Update transaction Set amount = t.amount >From tempsteve t Where t.id = transaction.id
This code would update each existing transaction with the new amount and fire the trigger to reload the balance with the sum of the previous transaction. In theory this looks like a relatively simple process that I could run from my SQL window and leave early for a light run before dinner.
It Didn't Work
In actuality there is a problem with this code. Not a bug, since it will work and perform the updates, the problem lies in the actual execution of this code on the server. Performing a single update, for any transaction, whether the most recent or one five years ago takes a split second on my SQL Server. Running the ten thousand corrections on my decent sized SQL Server (a 4-way PP200, 2048MB RAM) was still running nearly twelve hours later. Why the delay?
Normally I would not have allowed this query to run so long, but it was started on a Friday evening and let run until I found it Saturday morning. That first cup of coffee was not very enjoyable when I found this still running on my server. Let's examine what has to occur for this code to complete:
Transaction 1 is altered with a new inserted table that the trigger evaluates for the sum of prior transactions (0), and updates the ledger. Before this occurs, however, all the remaining transactions in the statement must be completed. The next transaction being evaluated is for the same account and its amount is updated (in memory again) and its trigger fires, but the balance update query requires the amount update from the previous update be evaluated in order to get the accurate sum. Therefore the query processor must search its current altered table in memory to get the sums along with the rows of the physical disk. Carry this forward a few hundred transaction and you find a recursive effect where each new update for an account requires results from a pending update of the same account in this same batch.
I suspect my SQL Server had problems with the recursive requirements of this query because of the trigger. With my morning coffee I reluctantly canceled the query and set about examining the process and trying to devise a better solution before my kids woke up and came down to help.
The Solution - Batching!
Initial testing of single row updates, showed each one completing in less than a second. After a few of these queries completing and a second cup of coffee, I realized the recursive problems in this update. At that point I added a tinyint column to the TempSteve table for remaining rows, set the value to zero and ran the following code:
Declare @t int, @i int Set @i = 5000 Set rowcount 1 While @I > 0 Begin Select @t = transid from TempSteve Update transaction Set amount = t.amount From tempsteve t Where t.id = transaction.id And t.transactionid= @t Update tempsteve set status = 1 Where transid = @t Select @I = @I - 1 End
This ran for the remaining 9000+ rows in a couple minutes. This was timed perfectly as I heard the kids starting to move around upstairs. By batching this very simple SQL statement and removing the requirement the trigger implement recursion the server could perform the update with a minimal use of resources.
So where else does batching come in handy? Preventing locks and blocks is another area I use this. When I have a process that requires a significant amount of time, like inserting new rows into a table from a large batch or deleting old archived information, I use this. To prevent issues with other users, I usually use some type of flag to indicate processed or unprocessed, then set a batch size and process x number of rows. I often also include a short waitfor in my loop to slow this down for a second or two and let other users get in a query on the table.
I hope nobody was laughing about my "decent" sized SQL Server. This process actually occurred a couple years ago. Now even my desktop is bigger than that server. The next part of batching will look at an alternative method of batching data that does not use a cursor.
As always, please rate this article and feel free to send some feedback.Steve Jones