This is a continuation of string manipulation techniques. If you are interested, you can read the other articles,
though you do not need to read it before this one. These are mostly beginning
programming articles, but advanced T-SQL programmers may still find something useful here.
Part 1 deals with SUBSTRING and how it can be used to extract some information
from a field of data.
from a field of data.
It’s time to talk about string manipulation. What is it? Why do it? When should we do it?
All good questions and some of the answers are in Part 1. This article will work on some advanced
string manipulation techniques that I used to solve another problem. If you are new, I write about
specific problems, not general ones, in general :), so I hope you enjoy this and learn something.
I hope that few of you have run into this, but given the experiences I have had, it’s quite likely more than a
few of you have had
this problem. In one table, the developers had decided that they needed to store answers to questions asked on the
website. Some of the questions that were asked contained multiple possible answers. An example of this type of questions
is shown below:
1. What types of computers do you have in your company? A. Servers B. Workstations C. Laptops D. Mainframes
From this you can see that the user may choose to select 1, 2, 3, or all 4 of the answers. Not a problem, but the
developers decided that the easiest way to store this information was in a table like this:
UserID Answer ------ ----------- 1 A,B,C 2 B,C 3 A,C,D 4 D
So why is this a problem? Well, how would you find all the people that answered “A”? This one is easy, you would select
all users like this:
select UserID, answer from Answer where Answer like ‘%A%’
No problem. Well, what about all people that answered “B” and “C”? This one is tougher; you could try
select UserID, answer from Answer where Answer like ‘%B%’ and ‘%C%’
It does not work. I want to get this result set:
UserID Answer ------ ------- 1 A,B,C 2 B,C 3 A,C,D
What you end up with is a result that looks like this:
UserID Answer ------ ------- 1 A,B,C
In order to get the result set I want, what I would have to do is union multiple queries.
This really does not scale and as the number of answers grows
( our biggest had 12 answers) the queries become increasingly complex.
The solution in this case was to “bend” that data so that it is in a more normal form. What I mean by this is to take the table
above and change it to look like this:
UserID Answer ------ ------ 1 A 1 B 1 C 2 B 2 C 3 A 3 C 3 D 4 D
With the data in this form, we can more easily build a scalable and self-maintaining solution that will grow and handle
new questions and answers with minimal work from the development staff and (more importantly) the DBA.
The objects involved are given below. Since this was corporate code, I have
altered names and schemas slightly to protect my company. Other than name changes and some missing fields for other data, this is
the same code in our database.
- Answer – Table that holds the original data
- NewAnswer – Table that holds the new data
- dbspBendAnswers – stored procedure that “bends”
the data in the Answer table.
In order to extract each answer out and place it in a separate row, I decided to row process
this data one row at a time rather than develop a set oriented solution. I normally frown upon
this type of code, but this needed to be done quickly and was a one-time maintenance
operation with a piece of code that I would have thrown away (if I was not writing this article).
To process this data, I built a cursor of the primary key from this table which was
the userID field and the questionID field. I then looped through this cursor and processed each row’s
answer data. The processing of each row was where the string manipulation came in.
Each row can be formatted as a pattern that looks like this:
where x is a single answer and the number of answers in this field is unknown. However, each field
can also be written in the following manner.
"x" + ",y" + ",z" + ",..." + ",n"
In this format, an algorithm starts to become apparent. If I can delimit each answer by looking at
the commas, then what I need to do is parse this string by finding the location of the
commas. Once I do this, then the SUBSTRING function can easily be used to extract the answers
that are between the commas. If there is no comma in the string, then the assumption is that
this is a single answer, so we insert the entire string as a new answer and loop to the next row
In my haste to build this, I decided that I would use the charindex function to find commas and
then process the string based on the commas. charindex allows you to find the numeric location
in a string of another string. In other words, it returns the position in any string of
some character that you are looking for (in my case, commas). So if I had the following string:
I could run the following code:
select charindex( ',', 'A,B,C')
and get these results:
Where 2 is the position (2nd character) in the string of the comma. One thing to be
aware of is that only the first occurance of the string is found. So now I have the
location of the first comma. If I include this in a SUBSTRING statement that starts
at the beginning of the string, I can extract the first answer.
answer ------ A,B,C select @c = charindex(',', answer) select @a = substring( answer, 1, @comma - 1) select @a ------- A
Now I have a single answer, how does this help me get the rest of the answers? In
my algorithm I have stord the location of the first comma off. What I now do is remove both
the first answer and the first comma from the string. This would result in this code and result:
select @str = substring( @str, @comma + 1, len( @str)) select @str ------ B,C
Now I repeat the process and find the next comma. Again, if I get a zero (0) back
for the location of the comma, I know I have a single answer left, insert it and end
the loop. I check with an IF statement if there is a comma since passing a -1 to the
SUBSTRING function returns an error.
This is probably not the most elegant or efficient solution, but it does work. And it was very quick
to develop. In fact, it took three or four times as long to write this article as it did to develop the
algorithm, test it, and execute it on my data. I do think, however, that this technique will be useful in
some situations and is a worthwhile addition to the DBA toolbox. I find many times when I need string manipulation, especially in adhoc queries I write. Like any other tool in
SQL Server, it has a time and a place where it is helpful and effective in its application. Spend some time experimenting
and learning about the CHARINDEX() function. It will probably come in handy at some point.
Here is the online reference for
CHARINDEX for those of you interested.
As always, I welcome feedback and comments and hope that you learned something from this article.