DBAs often have a need to identify why a problem has occurred, or is occurring in their SQL Server database. This article covers some of the tools you can use to look for clues, and the steps you might go through to help troubleshoot a SQL Server problem.
We all have problems that occur from time to time, where we need to work
through some problem solving steps to identify why the problem occurred, or is
occurring. In these situations, you need to act like a Crime Scene Investigator
(CSI) to uncover the root cause of the problem. No, we do not put yellow
caution tape around our servers, our network cables, and our desktop machines.
However, sometimes we do unplug the network cable to prevent the machine from
being further contaminated, in rare cases. Sometimes it is obvious what caused
the problem, but not always. Regardless of the problem, you will need to do
some forensic analysis to determine the cause of the problem. In this article,
I will cover some of the tools you can use to look for clues, and the steps you
might go through to help troubleshoot a problem.
Problem Solving Tools
There are a number of different tools you can use for troubleshooting. I
cannot cover, nor do I know of all the tools you could possibly use for
troubleshooting, therefore I will cover the most common tools that are
available within SQL Server and the Windows OS. In most cases, you should be
able to find enough information using these tools to provide you with enough
clues to determine the cause of a particular problem.
Here is a list of those commonly used tools:
- Profiler
- Notepad
- Event viewer
When SQL Server starts up, it starts a default trace event, provided the
default trace is enabled. Profiler can be used to review the information
captured from the default trace event. It is amazing what you can find by
exploring the default trace information. Additionally you might find it useful
to create your own traces while troubleshooting a problem.
Notepad does not seem like much of a tool; however, it can be used to open
different log files. Notepad allows you to do string searches within large log
files to quickly locate information. . If your log files are too big you might
have to use WordPad as an alternative.
Sometime when your system is having problems, event records will be written
to the Windows event logs. You can browse through the events one at a time
using the Event viewer. Events in the event log may provide a quick answer to
why your SQL Server instance is not behaving as it should, provided there are
event records associated with the problem you are trying to solve.
Information Gathering Phase
In order to diagnose a problem you first need to gather some information
about the problem. You also need to review log files to determine what kind of
system error messages and log records exist that might help you to diagnose the
problem. Below are a set of steps you should consider when going through the
information gathering phase of your forensic analysis.
Step 1: Gathering the Facts
The first step in any problem solving exercise is to gather the facts. You
need to know what kind of problem is happening. This is where you need to
interview the customer, or programmer to understand how and when the problem
occurs. You need to determine if it is system wide, or is it more localized to
a particular application, or component of an application. You also need to know
the timeframe around when the problem occurred, and whether or not it is still
a problem. In addition to this, you need to know the last time the system was
working correctly. You need to determine if any new system or application
changes were introduced that might have caused the problem. Armed with some
facts about the problem you can start to look for clues that might help
identify the root cause of the problem.
Step 2: Test in Different
environments and Machines
It is worth testing in different environments, if you have them. This is a
fact gathering exercise, but I spelled it out as a separate step because lots
of times seasoned staff do not think about performing tests in separate
environments when they gather facts.
You might find only one environment is affected, a set of environments or
all environments. If only one environment is affected, the problem might be a
configuration issue with that environment, or the other working environments.
Alternatively, it might be the data in the environment that is causing the
problem.
Additionally you might want to try different client machines, or application
servers. Occasional, you might find that a different configuration or a set up
is causing the application to work, or not work. You need to explore all the
different setup and configuration options and then document those that work,
and those that do not work.
Step 3: Review the SQL
Server Error Log
SQL Server creates a log file called "ERRORLOG". A new ERRORLOG
file is created every time SQL Server starts up. SQL Server by default keeps six
old errors log files, where each one has a sequential number associated. The
ERRORLOG file by default is stored in the "Log" folder within the
standard "…Program FilesMicrosoft SQL Server…" folder
structure.
Find the log file that is associated with the timeframe for when the problem
first occurred. Look to see if there are any anomalies in the messages being
outputted by SQL Server. Sometime if SQL Server detects a change, or encounters
a problem, it will be logged in the ERRORLOG file.
Step 4: Review the Event
Log
You should use the Event Viewer to look at the different event log records.
The event log contains both informational warnings and error events. You should
look at all the events that occurred shortly before, during and after the
timeframe of the identified problem. You need to make sure you review both the
"Application" and "System" events, as well as the
"Security" events.
Step 5: Review the Default
Trace
The default trace, as stated earlier, is a trace that SQL Server starts
automatically when it starts up, if the default trace option is enabled. It is
similar to the flight recorder in a modern jet. This trace captures all
configuration changes to an instance. By reviewing the default trace
information you can identify what kind of database changes might have been made
during the period of time when the problem was identified.
The default trace files are stored in the same log folder as the ERRORLOG.
They are named like "log_xxx.trc", where xxx is a sequential number.
You can open these files with profiler to see the recorded events.
Alternatively, you can use the "fn_trace_gettable" function to
process the file using T-SQL, like so:
SELECT * FROM fn_trace_gettable ('C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLLOGlog_155.trc', default);
Step 6: Review the Change
Log
Review your organization’s change log. I hope your organization has one. A
change log is some centralized location that identifies all changes the have
been introduced. If your organization has one, this helps identify any changes
that have recently occurred. This log might provide you with some clues as to
why a particular problem is occurring, especially if the application that is
having the problem is the one that has been modified recently. If your
organization does not have a change log, then during step 1 you might ask a
programmer when the last application change was made.
Analysis Phase
Now that you have gathered some information, you need to analyze the data
that you have gathered. Review the information collected in each step. Look for
anomalies that would support the problem identified by the customer or programmer.
Take the situation identified in step 1 above and try to determine how each
log or the trace file might help you identify why the problem is occurring.
Review the information available in each step to see if there are any clues
that will allow you to gaining a better understanding of what is causing the
problem.
After you have done this analysis, you might be lucky and identify the cause
of the problem. However, there will be times when the steps above do not yield
a solution to the problem. In this case, you will need to move on and do some
additional testing and information gathering.
Additional Testing and Information Gathering
If you can’t find the problem by reviewing the different system logs, the
change log, or the default trace then you will need to resort to analyzing the
actual process that caused the problem. This means you might have to look at
code and even runs some different tests. The rest of the steps identified here
is only a starting point. They should help you organize your thoughts on how
you might go about performing additional testing and gathering more information
to help you resolve the problem at hand.
Step 7: Develop a testing
plan
Sit down with the customer and the application programmer and document the
steps they are going through, which is causing the problem. Much of this
information might have already been collected in step 1, but it is at least
worth going over again. Identify if the problem is repeatable. If it cannot be
repeated then it might be difficult to determine what caused the problem. The
point here is to identify how the application is connecting to SQL Server and
the T-SQL code that is being executed. Based on the problem being investigated
you will need to develop a set of tests to run and what information you might
want to capture along the way to identify what is going on. Prior to doing any
testing, I would suggest you go to the next step.
Step 8: Backup Database
Before moving forward with doing additional testing, analysis and
troubleshooting, it might be prudent to do a backup of the problem database.
This backup can be a full, differential or log backup depending on your current
database backup strategy, and the status of your last backup. This backup will
provide you a recovery point should you want to start tweaking SQL Server as
part of your diagnostics troubleshooting steps identified in step 6.
Step 9: Perform Additional tests and logging
Break up your test into small logical pieces if possible. For those steps of
the tests that connect to SQL Server, you might consider turning on SQL Server
Profiler so you can monitor what kind of T-SQL statements and batches are being
executed. Profiler will allow you to capture the code that is running, which
sometimes is different than what the programmer expects, Sometimes Profiler and
the additional steps are all it takes to narrow down what is causing the issues
at hand.
CSI approach to Resolving Problems
One of the most rewarding thing as a DBA is to help a programmer and
customer resolve a problem. The harder the problem is to resolve the greater
the reward. Above, I have identified a stepwise approach to troubleshooting a
problem. The approach identified some logs to look at and some tools to use, to
help you identify and troubleshoot problems. You may find that these steps do
not necessarily meet your needs and that is ok. The important point is to
understand that you need to develop and follow a troubleshooting process for
your environment. Having a set of questions, steps and tools will allow you to
be proactive in developing an approach that you can use in resolving problems
quickly, and methodically.