Tame Those Strings!
I guess I could have called it “String Manipulation”, but that was nowhere near as exciting as the other title. Often I find developers performing string manipulation in stored procedures and queries and chastise them for it. Using T-SQL is, IMHO, in general, a poor use of resources. Often the middle-tier or front-end can perform the manipulation where I think it is a better use of resources. After all, I can much more easily add another web server or app server box and cluster them than I can with another database server. Of course, SQL 2000 with the distributed partitioned views makes this easier, but still not easy.
Enough of that. It’s time to talk about string manipulation. What is it? Why do it? When should we do it? All good questions and I will get provide some answers. Like everything else I write, I do not have all the answers, but I hate reading general solutions to a problem. I like to talk about and provide specific solutions that you can use or adapt to your own situation. I will describe a recent problem I had where I needed to perform some string manipulation in the database and I think I had a good reason for doing it. I will also provide some guidelines at the end for when I think string manipulation is a good idea.
My Problem
So why did I choose to do string manipulation after I just told you I chastise developers for doing it? Well, the company where I work collects phone numbers of new users for the sales staff. We then assign a user to a sales rep based on area codes. I do not know if this is a good system, but I don’t care. I had to ensure that the database can figure out who the rep is based on an area code. There were two problems in implementing this: the legacy data was stored in one large phone number field, and the new data passed in from the front end was not always parsed into an area code.
Now the routine for both cases is the same, one was just a one time execution and one was a daily batch. Why did I do this in the database? Well, I could choose a time to schedule this and SQLAgent makes this easy. Also, if I had to get a middle tier developer to do it, they would have to write some component, test it and be sure it works. The component would have to be scheduled and then get the data from the server, parse it, and send it back. Lots of work and I think the CPUs on the database during off-hours are less expensive than the time to build the component, maintenance down the road, and the network hassles of moving data in and out of the server.
The Code
The routine for the string manipulation was written as a stored procedure (ALWAYS USE STORED PROCEDURES) that was then scheduled using SQL Agent. The code involves taking the phone number from a table and parsing this to get an area code. This area code is then used to update the same table. The objects involved are given below. Since this was corporate code, I have altered names and schemas slightly to protect my company. Other than name changes and some missing fields for other data, this is the same code in our database.
- UserDetail – Table that holds the data
- dbspScrubAreaCodes – stored procedure that is used to parse out the area codes from the UserDetail table.
The stored procedure runs a series of UPDATE commands against the table. The substring function is used to gather information from the phone number field. There are a few different cases that came up in the analysis of this and the quickest and easiest method for developing this solution was to use a series of statements to handle each case.
Notes From the Analysis
OK, for those that are interested, how did I arrive at this procedure? First, someone asked that we assign reps to users based on area codes. Easy one, we added an areacode field to the table to ensure we could easily matchup with the sales rep details without using a function in the where clause. This ensures we have SARG (search arguement) in the where clause for the index which is built on areacode.
The next step for me was to physcially look at the data in the phone number field. Here is a sample:
phone ---------- 605-555-2862 (561)555-2700 904-555-5680 N/A 580-555-5371 2815558368 (254)555-8430 336-555-2797 3365557233 592-555-3181/4951 96615551222 Ext. 249 +44 7930 555271
The results are abbreviated (and no these are not real phone numbers) but they show a number of cases for me to work with. When I am building an algorithm, I always look for patterns. If you can find the key to a pattern, then you have your algorithm. When I look at the first phone numer, I see the following pattern:
- 3 digit area code
- dash
- 3 digit exchange
- dash
- 4 digit number
This is easily handled in the second update statement in the proc (Obviously this was not the data I looked at). Now if I move to the second number, I see:
- open paren
- 3 digit area code
- close paren
- 3 digit exchange
- dash
- 4 digit number
This is handled in the third update statement. And so on. If you keep checking the data, I have handled most of the cases in the update statements. There are a number of areacodes, however, that I do not parse. While trying to handle all cases, I checked the result set of each statement by converting it to a select query before running the update. An example of this is here for the first update.
update userdetail set areacode = substring( phone, 1, 3) where areacode Is Null and len( phone) = 10
becomes
select substring( phone, 1, 3), phone from userdetail where areacode Is Null and len( phone) = 10
After running this query, I check the number (well actually eyeball the number) of correct to incorrect conversions. This gives me an estimate of whether I am helping or hurting the sales reps. If I am parsing out more incorrect than correct areacodes, I am not really helping and thus drop this parsing routine from the stored procedure.
After each statement I track the number of changed rows (@@Rowcount) so I have an idea of how much incorrect data is being sent into the database. After all, the front end should be able to parse all of the same area codes that I do prior to inserting into the database. At the end I log this number in a table that I keep around for this purpose.
Once I complete the stored procedure, I then compiled it on a test server and ran it against some data to be sure that it worked as expected. When it moved to production, it was scheduled to run every night to try and catch errors that the front end let slip through.
Conclusion
I do not advocate this code in a heavily used production environment. Why? A few reasons. the biggest is that when you include a function in the where clause, the query engine cannot use an index to resolve this query. A big cause of performance issues in many queries. Second, as I mentioned at the beginning, many string manipulations can be better performed in the front end. If you need to concatenate first name and last name, the front end can do this just as easy as the back. If you are going to spend CPU time doing it, spend it in the front end. Many data scrubbing routines, however, can be better performed in the back end and string manipulation is a big part of this. I often do import data into some temp table and then scrub it in SQL Server because it is easier and quicker and a (relatively) rare process compared to other database accesses.
I do find many times when I need string manipulation, especially in adhoc queries I write. Like any other tool in SQL Server, it has a time and a place where it is helpful and effective in its application. Spend some time experimenting and learning about the SUBSTRING() function. In another article, I will go into more detail on some other types and uses of string manipulation. As always, send comments, critiques, and suggestions to me
Steve Jones
)2000 dkranch