How many SQL Server
instances are you running? Do you ever have customers calling you stating that
their application is not working, then when you research the problem you find
that the instance that supports their application is unavailable? Have you
ever had someone mistakenly shutdown the SQL Agent service and forget to
restart it, causing a number of scheduled jobs to not be run? One of the tasks
of a DBA is to monitor the availability of all SQL Server instances and
services. If you have a large number of instances this monitoring task could
take you a fair amount of time. This article will discuss why a DBA should
monitor instance availability, and provides an approach to automating this process.
Why Monitor Server Availability
A DBA needs to monitor the
availability of the SQL Server machines they support. Without monitoring, you
can never be sure whether all your SQL Server machines are available and
performing as expected. If you don’t monitor then your first indication of a problem
is when someone reports they cannot connect to one of your servers. No DBA
wants to hear that an instance of SQL Server is not available.
If you are going to monitor,
you need to do it routinely. The goal of monitoring is to identify a problem
soon after it occurs. By monitoring routinely and frequently, you can be more
pro-active in resolving server availability issues. The sooner you know about
a problem, the sooner you can resolve it. Ideally, you want to resolve
problems before they are discovered by your customers. Monitoring server and
service availability can be a time consuming task. Because of this, it is best
if the monitoring process can be automated.
What to Monitor
One of the first things you
need to determine is what should you monitor. If you think about what needs to
be available to make your customer applications work then you can easily
develop a list of things to monitor. In my shop, I wanted to make sure that an
application/customer could connect to SQL Server. If an application/customer
could not connect then either the MSSQLSERVER service was not running, the box
was shutdown, powered off, or there was some kind of network connectivity
issue. I also wanted to verify that SQL Agent was running. If SQL Agent is not
running then routine scheduled jobs might not be started as scheduled.
The Automated process
The automated process to monitor
my servers consisted of a table, 4 different stored procedures, and two
different monitoring servers. Let me discuss each one of these pieces.
The two different monitoring
servers are just two different SQL Server machines, on two different networks
with different power supplies. One server is defined as the primary monitoring
server, and the other one as the backup monitoring server. The reason for two
monitoring servers is to allow the backup monitoring server to take over the
duties of monitoring should the primary monitoring server become unavailable.
Each SQL Server machine has connectivity to all servers that are being
monitored. Linked server definitions are defined on the two monitoring servers
for every machine that is being monitored. These linked server definitions are
used to submit commands to verify that the monitoring server can connect to the
server being monitored, and that all necessary services are running.
On each monitoring server, I
have created the following table:
CREATE TABLE MonitorServerServices (
Server nvarchar (100),
Service nvarchar (100))
This table contains a list
of servers and services that will be monitored. Each monitoring server has an
identical list of servers and services. Here might be a typical set of records
for this table for a shop that only has three different servers to monitor:
Server Service
————————— ——————————–
SERVERA MSSQLSERVER
SERVERA SQLSERVERAGENT
SERVERB MSSQLSERVER
SERVERB SQLSERVERAGENT
SERVERC MSSQLSERVER
SERVERC SQLSERVERAGENT
Here you can see I am only
monitoring the “MSSQLSERVER” and “SQLSERVERAGENT” services. If in your shop,
you want to monitor additional services on your machines this process can
handle that by just adding additional table entries for the additional services
you wish to monitor.
The 4 different stored
procedures to support this process are: “sp_send_cdosysmail”, “usp_check_connection”,
“usp_services_monitor”, and “usp_service_monitor_backup”. I have placed each
of these stored procedures on both monitoring servers. Although the “usp_service_monitor_backup”
stored procedure (SP) really only needs to be placed on the backup monitoring
server. The first two stored procedures play a supporting role in the
monitoring process. The third SP, in the list above, does the actual
monitoring, while the forth SP is used to detect whether the primary monitoring
server is working.
The “sp_send_cdosysmail” SP provides
a mechanism to send SMTP mail. I obtained this SP from Microsoft; therefore,
I have not included the code for this SP. I like using SMTP mail, instead of
using “xp_sendmail”. If you want to use SMTP mail and do not already have this
SP, you can find the code at the link below. Or, you can replace my executions
of this SP with however your shop supports mail from T-SQL. Here is the link
to a Microsoft page that describes using SMTP mail and the “sp_send_cdosysmail”
SP:
http://support.microsoft.com/default.aspx?scid=kb;en-us;312839&sd=tech
The next supporting SP is “usp_check_connection”.
This SP is used to verify that the monitoring SP can connect to the servers
being monitored. This SP is critical to this process since it keeps the
process from failing when it cannot connect to a server being monitored.
Testing connectivity is accomplished using OLE automation. This SP tests by connecting
to the server being monitored via a linked server definition from the primary
or backup monitoring servers. This allows for this process to not to have to
store passwords, since the passwords are stored in the linked server security
definitions. Here is the code for this SP:
CREATE proc usp_check_connection (
@server varchar(100),
@linkedserver varchar(100))
as
—
— Written by: Gregory A. Larsen Date: 12/29/2004
— This procedures is used to determine if a server (@server),
— Can connect to a linked server (@linkedserver). If
— the server cannot connect to the linked server then it
— sets the return code to 99.
—
— Parameters:
— @server – This variable identifies the server that
— this sp will try to connect to, to verify the
— linked server connection. It is either the
— primary or backup monitoring server.
— @linkedserver – This variable identifies the linked
— server that the sp tries to connect to.
—
———————
— Begin Section A —
———————
— Declare Local Variables
DECLARE @object int
DECLARE @hr int
DECLARE @src varchar(255), @desc varchar(255)
DECLARE @location varchar(100)
DECLARE @CMD nvarchar(1000)
DECLARE @objresults int
— Set the location in the code for error processing
set @location = ‘Create SQLServer Object’
— Create a SQL Server object
EXEC @hr = sp_OACreate ‘SQLDMO.SQLServer’, @object OUT
— check to see if object creation had an error
IF @hr <> 0 goto ERROR
— set the Connection properties to use Window Authentication
EXEC @hr = sp_OASetProperty @object, ‘LoginSecure’, ‘True’
— Set the location in the code for error processing
set @location = ‘Connect to Server – ‘ + rtrim(@server)
— Connect to server
EXEC @hr = sp_OAMethod @object, ‘Connect’,Null,@server
— Check to see if connection to server failed
IF @hr <> 0 goto ERROR
——————-
— End Section A —
——————-
———————
— Begin Section B —
———————
— Set the location in the code for error processing
set @location = ‘Execute TSQL to Linked Server – ‘ + rtrim(@linkedserver)
— Set the command to be executed on the linked server
set @cmd = ‘ExecuteWithResults(“select top 1 name From ‘ +
rtrim(@linkedserver) + ‘.master.dbo.sysobjects”)’
— execute command on linked server
EXEC @hr = sp_OAMethod @object,@cmd, @objResults OUT— check to see if process could connect to linked server
IF @hr <> 0 goto ERROR
— Successfully connected, return without error code
RETURN
——————-
— End Section B —
——————-
———————
— Begin Section C —
———————
— Error Handling Routine
ERROR:
— Print the location of the error
Print @location
— Print SQL-DMO error
EXEC sp_OAGetErrorInfo @object, @src OUT, @desc OUT
SELECT hr=convert(varbinary(4),@hr), Source=@src, Description=@desc
— Connection Failure, return with a 99 return code
RETURN 99
——————-
— End Section C —
——————-
As you can see this SP creates
a SQLServer object, sets the “LoginSecure” property to “True” and then makes a
connection to the monitoring server using Windows Authentication. Once
connected to the monitoring server this SP builds a simple dynamic query using
a linked server “SELECT” statement to determine if the server being monitored
is reachable from the monitoring server. If it cannot connect to either the
monitoring server or the server being monitored then an error is returned to
the calling SP.
The reason I go through all
this logic just to verify that a server is up and running is because a normal
inline T-SQL linked server “SELECT” command would cause my SP to stop executing
if the “SELECT” command failed because a server was down. By using OLE
automation to submit my T-SQL command, this SP is able to detect the outage,
and continue running, so it can monitor the rest of the servers. Now that you
understand the supporting SPs, let’s look at the two different monitoring
stored procedures.
Prior to reviewing the code
in the monitoring SP, let me first explain the method I use to determine if a
particular service is running. Microsoft provides an undocumented extended
stored procedure called “xp_servicecontrol”. This undocumented stored
procedure can perform a number of service related functions. One of these
functions, “QueryState”, returns the status of a particular service. A status
for a service like “SQLSERVERAGENT” can have 3 different possible values: “Running.”,
“Stopped.”, or “Starting…”. I use the “QueryState” function of “xp_servicecontrol”
to verify that the service I want to monitor has a status of “Running”.
As mentioned earlier, I have
a stored procedure “usp_service_monitor_backup” that verifies that the primary
monitor appears to be functioning. The “usp_service_monitor_backup” SP is scheduled
to run every 10 minutes, on the backup server, via a SQL Agent job. If this SP
cannot connect to the primary server, it assumes it needs to take over
monitoring duties and executes “usp_service_monitor” the primary monitoring
SP. Here is the code for this backup monitoring SP:
CREATE proc usp_service_monitor_backup (
@masterserver varchar(100),
@email_dist varchar(1000))
as
— Written by: Gregory A. Larsen
—
— Date: January 3, 2005
—
— Description:
—
— This SP is used as a backup monitor for the availability of windows services of specific SQL Server machine.
— This backup monitor should be run on a different server, different network, and different power source
— then the primary monitor. This is done some in case the server running the primary monitor is down
— this backup monitor will be able to perform the monitoring. This backup monitor only monitors if
— if cannot contact the server doing the primary monitoring.
—
— Parameters:
—
— @masterserver – name of server doing primary monitoring.
— @email_dist – This parameter identifies the distribution list for any emails sent by this process.
set nocount on
——————–
— Begin Section 1 –
——————–
— declare local variables
declare @rc int
declare @cmd nvarchar(1000)
declare @cmd2 nvarchar(1000)
declare @state1 varchar(100)
— determine if backup monitor can connect to primary monitor
exec @rc=dba.dbo.usp_check_connection @@servername,@masterserver
— Is the server doing primary monitoring unreachable?
if @rc = 99
begin
— Display message that primary monitor is down
print ‘Master server “‘ + rtrim(@masterserver) + ‘” for monitoring is not available.’
— Perform monitoring
exec dba.dbo.usp_service_monitor @email_dist
end
——————
— End Section 1 –
——————
——————–
— Begin Section 2 –
——————–
else begin
— Build command to determine state of SQLSERVERAGENT service on Master Server
SET @CMD = ‘create table #state (state varchar(2000))’ + char(10) +
‘declare @cmdx varchar(1000)’ + char(10) +
‘insert into #state EXEC master..xp_servicecontrol ””QueryState””, ””SQLSERVERAGENT””’ +
+ char(10) + ‘select @state=state from #state’ + char(10) +
‘drop table #state’
— Build command to execute command that determines state of service being monitored
set @cmd2 = ‘declare @state varchar(100)’ + char(10) +
‘exec ‘ + rtrim(@masterserver) + ‘.master.dbo.sp_executesql N”’ + @CMD + ”’,’ +
‘N”@state varchar(100) out”,’ +
‘@state out’ + char(10) +
‘set @state1 = @state’
— Execute command and return state of service being monitored
exec master.dbo.sp_executesql @cmd2,N’@state1 varchar(100) out’,@state1 out
— Is the service that was monitored not running
if @state1 <1 ‘Running.’
begin
— Display message that primary monitor is down
print ‘Master server “‘ + rtrim(@masterserver) + ‘” for monitoring is not available.’— Perform monitoring
exec dba.dbo.usp_service_monitor @email_dist
end
else
print ‘Master Server is running’
end
——————
— End Section 2 –
——————
Here you can see I use the “usp_check_connection”
SP to determine if I can connect to the primary monitoring machine from the
backup monitoring machine. If I cannot then this SP assumes it needs to
perform the monitoring and executes the “usp_service_monitor” SP, which is the
main monitoring SP. If this SP can connect to the primary monitor, it then
checks to verify that the SQL Server Agent is up and running. If SQL Server
Agent is not running then this SP takes over the primary monitoring duties by
executing “usp_service_monitor”. If this SP finds that SQL Server Agent is
running then it assumes that the primary monitor is functioning normally and
therefore the backup machine does not need to take over primary monitoring
duties.
I have already mentioned
that the “usp_service_monitor” SP is the primary SP for monitoring SQL Server services.
This SP needs to be placed on both the primary and backup monitoring machines.
Here is the code for this SP:
CREATE proc usp_service_monitor@email_dist varchar(1000)
as
— Written by: Gregory A. Larsen
—
— Date: Dec 22, 2004
—
— Description:
—
— This SP is used to monitor the availability of windows services of specific SQL Server machine.
— If a service is not available then an email is send to the DBAs. The SQL Server machines to check
— and the services to monitor are contained in table “MonitorServerService” in the DBA database.
—
— Parameters:
—
— @email_dist – This parameter identifies the distribution list for any emails sent by this process.
set nocount on
——————–
— Begin Section 1 –
——————–
— declare local variables needed
declare @state1 varchar(100)
declare @CMD nvarchar(4000)
declare @CMD2 nvarchar(4000)
declare @server varchar(100)
declare @subject varchar(100)
declare @body varchar(1000)
declare @save_server varchar(100)
declare @service varchar(100)
declare @save_service varchar(100)
declare @rc int
— Initialize the variables used to control the while loop and reading of servers and services
— from table MonitorServerServices
set @save_server = ”
set @save_service = ”
— get the first server and service to monitor
select top 1 @server=Server,@service=Service from dba.dbo.MonitorServerServices
order by Server, Service
— Process all servers and services
while @save_server + @save_service < @server + @service
begin
——————
— End Section 1 –
——————
——————–
— Begin Section 2 –
——————–
— Are we processing the same server
if @save_server <> @server
— Display server being processed.
print ‘Processing Server ‘ + rtrim(@server) + ‘———————————–‘
— Display service being processed.
print ‘Checking Service ‘ + rtrim(@service)
— Save the current server and service being monitored
set @save_server = @server
set @save_service = @service
— Determine if server being process is reachable from monitoring server
exec @rc=usp_check_connection @@servername,@server
——————
— End Section 2 –
——————
——————–
— Begin Section 3 –
——————–
— Is server reachable
if @rc <> 99
Begin
— Initialize state of service
set @state1 = ”
— Build command to determine state of service being monitored.
SET @CMD = ‘create table #state (state varchar(2000))’ + char(10) +
‘declare @cmdx varchar(1000)’ + char(10) +
‘insert into #state EXEC master..xp_servicecontrol ””QueryState””, ””’ +
rtrim(@service) + ””” + char(10) +
‘select @state=state from #state’ + char(10) +
‘drop table #state’
— Build command to execute command that determines state of service being monitored
set @cmd2 = ‘declare @state varchar(100)’ + char(10) +
‘exec ‘ + rtrim(@server) + ‘.master.dbo.sp_executesql N”’ + @CMD + ”’,’ +
‘N”@state varchar(100) out”,’ +
‘@state out’ + char(10) +
‘set @state1 = @state’
— Execute command and return state of service being monitored
exec master.dbo.sp_executesql @cmd2,N’@state1 varchar(100) out’,@state1 out
——————
— End Section 3 –
——————
——————–
— Begin Section 4 –
——————–
— Is the service that was monitored not running
if @state1 <> ‘Running.’
begin
— Build and send email for service not running
set @subject = rtrim(@server) + ‘- ‘ + rtrim(@service) + ‘ service Not Running’
set @body = ‘The ‘ + rtrim(@service) + ‘ service is not running on machine ‘ + rtrim(@server) + ‘.’ + char(10) +
‘The ‘ + rtrim(@service) + ‘ has the current state – ‘ + @state1
exec master.dbo.sp_send_cdosysmail
@From=’ServiceMonitor@dbazine.com’,
@To=@email_dist,
@Subject=@subject,
@Body=@body
— Display message about service not running
Print @body
end
else
— Print state of service
Print ‘The ‘ + rtrim(@service) + ‘ is ‘ + rtrim(@state1) + ‘.’
End
——————
— End Section 4 –
——————
——————–
— Begin Section 5 –
——————–
else
begin
— build and send mail message if the server can’t be reached
set @subject = rtrim(@server) + ‘- Can not connect to Server’
set @body = ‘Can not seem to connect to server ‘ + rtrim(@server) +
‘. Either the MSSQLSERVER services is not running, the ‘ +
‘ machine is powered off, or the machine can”t be reached’ +
‘ from machine ‘ + rtrim(@@servername) + ‘.’
exec master.dbo.sp_send_cdosysmail
@From=’SQLServerMonitor@dbazine.com’,
@To=@email_dist,
@Subject=@subject,
@Body=@body
Print @body
end
— get next server and services to monitor
select top 1 @server=Server,@service=Service from dba.dbo.MonitorServerServices
where @save_server + @save_service < server + service
order by Server, Service
end
——————
— End Section 5 –
——————
This code processes through
a list of servers and services found in table “MonitorServerServices” one record
at a time. For each record, this code calls “usp_check_connection” to
determine if the server to be monitored is reachable. If this SP cannot
connect to the server being monitored then an error message is sent to the
email distribution address passed as a parameter to this SP using the “sp_send_cdosysmail”
SP. If the server is reachable, then this SP uses the “xp_servicecontrol” SP
to determine the STATE of the service being monitored. If the service being
monitored is found to not be running then this SP sends an email to the email
distribution list. This email notifies the individuals on the distribution
list of the status of the service, so they can quickly respond to SQL Server
service problems. If the current service is running then the next server and
service to be monitored is obtained from the “MonitorServerServices” table and the
while loop is repeated. This process continues until all servers and services
have been monitored
Conclusion
I have found this monitor
process provides an early indication on whether MSSQLSERVER or SQLSERVERAGENT
services are not running. Every so often, our System Administrators will take
down both SQL Agent and SQL Server, but only restart SQL Server. When this
happens, this process lets the DBAs know that SQL Server Agent needs to be
started to prevent possible loss of work due to SQL Server Agent being down.
If you have experienced outages of services on your critical SQL Server
machines then you already know how nice it would be to have a service
monitoring process.