Nine Steps to Troubleshooting SQL Server problems
August 31, 2010
DBAs often have a need to identify why a problem has occurred, or is occurring in their SQL Server database. This article covers some of the tools you can use to look for clues, and the steps you might go through to help troubleshoot a SQL Server problem.
We all have problems that occur from time to time, where we need to work through some problem solving steps to identify why the problem occurred, or is occurring. In these situations, you need to act like a Crime Scene Investigator (CSI) to uncover the root cause of the problem. No, we do not put yellow caution tape around our servers, our network cables, and our desktop machines. However, sometimes we do unplug the network cable to prevent the machine from being further contaminated, in rare cases. Sometimes it is obvious what caused the problem, but not always. Regardless of the problem, you will need to do some forensic analysis to determine the cause of the problem. In this article, I will cover some of the tools you can use to look for clues, and the steps you might go through to help troubleshoot a problem.
Problem Solving Tools
There are a number of different tools you can use for troubleshooting. I cannot cover, nor do I know of all the tools you could possibly use for troubleshooting, therefore I will cover the most common tools that are available within SQL Server and the Windows OS. In most cases, you should be able to find enough information using these tools to provide you with enough clues to determine the cause of a particular problem.
Here is a list of those commonly used tools:
When SQL Server starts up, it starts a default trace event, provided the default trace is enabled. Profiler can be used to review the information captured from the default trace event. It is amazing what you can find by exploring the default trace information. Additionally you might find it useful to create your own traces while troubleshooting a problem.
Notepad does not seem like much of a tool; however, it can be used to open different log files. Notepad allows you to do string searches within large log files to quickly locate information. . If your log files are too big you might have to use WordPad as an alternative.
Sometime when your system is having problems, event records will be written to the Windows event logs. You can browse through the events one at a time using the Event viewer. Events in the event log may provide a quick answer to why your SQL Server instance is not behaving as it should, provided there are event records associated with the problem you are trying to solve.
Information Gathering Phase
In order to diagnose a problem you first need to gather some information about the problem. You also need to review log files to determine what kind of system error messages and log records exist that might help you to diagnose the problem. Below are a set of steps you should consider when going through the information gathering phase of your forensic analysis.
Step 1: Gathering the Facts
The first step in any problem solving exercise is to gather the facts. You need to know what kind of problem is happening. This is where you need to interview the customer, or programmer to understand how and when the problem occurs. You need to determine if it is system wide, or is it more localized to a particular application, or component of an application. You also need to know the timeframe around when the problem occurred, and whether or not it is still a problem. In addition to this, you need to know the last time the system was working correctly. You need to determine if any new system or application changes were introduced that might have caused the problem. Armed with some facts about the problem you can start to look for clues that might help identify the root cause of the problem.
Step 2: Test in Different environments and Machines
It is worth testing in different environments, if you have them. This is a fact gathering exercise, but I spelled it out as a separate step because lots of times seasoned staff do not think about performing tests in separate environments when they gather facts.
You might find only one environment is affected, a set of environments or all environments. If only one environment is affected, the problem might be a configuration issue with that environment, or the other working environments. Alternatively, it might be the data in the environment that is causing the problem.
Additionally you might want to try different client machines, or application servers. Occasional, you might find that a different configuration or a set up is causing the application to work, or not work. You need to explore all the different setup and configuration options and then document those that work, and those that do not work.
Step 3: Review the SQL Server Error Log
SQL Server creates a log file called "ERRORLOG". A new ERRORLOG file is created every time SQL Server starts up. SQL Server by default keeps six old errors log files, where each one has a sequential number associated. The ERRORLOG file by default is stored in the "Log" folder within the standard "…Program FilesMicrosoft SQL Server..." folder structure.
Find the log file that is associated with the timeframe for when the problem first occurred. Look to see if there are any anomalies in the messages being outputted by SQL Server. Sometime if SQL Server detects a change, or encounters a problem, it will be logged in the ERRORLOG file.
Step 4: Review the Event Log
You should use the Event Viewer to look at the different event log records. The event log contains both informational warnings and error events. You should look at all the events that occurred shortly before, during and after the timeframe of the identified problem. You need to make sure you review both the "Application" and "System" events, as well as the "Security" events.
Step 5: Review the Default Trace
The default trace, as stated earlier, is a trace that SQL Server starts automatically when it starts up, if the default trace option is enabled. It is similar to the flight recorder in a modern jet. This trace captures all configuration changes to an instance. By reviewing the default trace information you can identify what kind of database changes might have been made during the period of time when the problem was identified.
The default trace files are stored in the same log folder as the ERRORLOG. They are named like "log_xxx.trc", where xxx is a sequential number. You can open these files with profiler to see the recorded events. Alternatively, you can use the "fn_trace_gettable" function to process the file using T-SQL, like so:
SELECT * FROM fn_trace_gettable ('C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLLOGlog_155.trc', default);
Step 6: Review the Change Log
Review your organization's change log. I hope your organization has one. A change log is some centralized location that identifies all changes the have been introduced. If your organization has one, this helps identify any changes that have recently occurred. This log might provide you with some clues as to why a particular problem is occurring, especially if the application that is having the problem is the one that has been modified recently. If your organization does not have a change log, then during step 1 you might ask a programmer when the last application change was made.
Now that you have gathered some information, you need to analyze the data that you have gathered. Review the information collected in each step. Look for anomalies that would support the problem identified by the customer or programmer.
Take the situation identified in step 1 above and try to determine how each log or the trace file might help you identify why the problem is occurring. Review the information available in each step to see if there are any clues that will allow you to gaining a better understanding of what is causing the problem.
After you have done this analysis, you might be lucky and identify the cause of the problem. However, there will be times when the steps above do not yield a solution to the problem. In this case, you will need to move on and do some additional testing and information gathering.
Additional Testing and Information Gathering
If you can't find the problem by reviewing the different system logs, the change log, or the default trace then you will need to resort to analyzing the actual process that caused the problem. This means you might have to look at code and even runs some different tests. The rest of the steps identified here is only a starting point. They should help you organize your thoughts on how you might go about performing additional testing and gathering more information to help you resolve the problem at hand.
Step 7: Develop a testing plan
Sit down with the customer and the application programmer and document the steps they are going through, which is causing the problem. Much of this information might have already been collected in step 1, but it is at least worth going over again. Identify if the problem is repeatable. If it cannot be repeated then it might be difficult to determine what caused the problem. The point here is to identify how the application is connecting to SQL Server and the T-SQL code that is being executed. Based on the problem being investigated you will need to develop a set of tests to run and what information you might want to capture along the way to identify what is going on. Prior to doing any testing, I would suggest you go to the next step.
Step 8: Backup Database
Before moving forward with doing additional testing, analysis and troubleshooting, it might be prudent to do a backup of the problem database. This backup can be a full, differential or log backup depending on your current database backup strategy, and the status of your last backup. This backup will provide you a recovery point should you want to start tweaking SQL Server as part of your diagnostics troubleshooting steps identified in step 6.
Step 9: Perform Additional tests and logging
Break up your test into small logical pieces if possible. For those steps of the tests that connect to SQL Server, you might consider turning on SQL Server Profiler so you can monitor what kind of T-SQL statements and batches are being executed. Profiler will allow you to capture the code that is running, which sometimes is different than what the programmer expects, Sometimes Profiler and the additional steps are all it takes to narrow down what is causing the issues at hand.
CSI approach to Resolving Problems
One of the most rewarding thing as a DBA is to help a programmer and customer resolve a problem. The harder the problem is to resolve the greater the reward. Above, I have identified a stepwise approach to troubleshooting a problem. The approach identified some logs to look at and some tools to use, to help you identify and troubleshoot problems. You may find that these steps do not necessarily meet your needs and that is ok. The important point is to understand that you need to develop and follow a troubleshooting process for your environment. Having a set of questions, steps and tools will allow you to be proactive in developing an approach that you can use in resolving problems quickly, and methodically.