Adding and removing nodes in SQL Server Clusters is not so difficult, and instructions on how to do so abound on the internet. However, mismanagement when adding/removing nodes can quickly become a ‘gotcha’ that wastes time. Bo Chen offers insight into some of those scenarios that are not normally covered in the standard online documents.
Background
There seem to be plenty of articles either from Microsoft’s
tech sites or other forums regarding how to manage (add/remove) nodes in SQL Server
clusters, however, none of them (at least I didn’t find any) touch this topic
in detail. Mismanagement, when adding/removing nodes in SQL Server clusters, can quickly
become GOTCHAs that take more time than necessary.
Suppose everyone knows about the basics of a SQL cluster,
which is nothing but a shared storage failover cluster consisting of one or
more than one nodes, just like VCS (Veritas Cluster Service), Sun Cluster or
IBM HACMP. At any time, only one node can have full access (via the same path,
and AP, MP and power path are out of scope here), to the shared storage, hence
the cluster has no load balance capabilities.
Below, I’m going to give several scenarios which, again, are
normally not covered by any of the standard documents you find online, but you
might run into in the future if you’re a SQL geek or geek to be, so please
don’t ask me how they could ever happen or be possible.
A scenario that’s not very typical but typical enough
A DBA tried to upgrade SQL 2005 Standard Edition SP3 to SQL
2005 Enterprise Edition SP3, but when the upgrade happened, he forgot to check
the cluster services in both nodes. When the upgrade was done, the cluster
couldn’t fail over to the other node, which failed to upgrade because cluster
service was down in that node during the upgrade. After a few failed
duct-taping attempts (it could have been fixed had those duct-taping attempts
not happened), it was decided the failed node needed to be evicted and added
back.
1.1 Open cluster
administrator from the active node by either “start -> Run -> cluadmin” or by clicking
“Start -> Administrative Tools -> Cluster Administrator”.
1.2 Enter “.” as the
“Cluster or server name” if cluadmin didn’t open the default cluster
automatically.
1.3 Right click on the
failed/problematic node and choose “stop cluster service”, if the cluster
service is still up and running on it.
1.4 Right click on the
failed/problematic node, choose “Evict node”, and click “OK” to confirm it.
Note: What
we really need to do first is to remove the node from the SQL Server definition
before evicting the node in the “Cluster Administrator”. The steps are, open
“Add or Remove Programs” in “control panel”, select “Microsoft SQL Server
2005”, and click “Change”. The Setup will launch, and select “Maintain the
Virtual Server” from the “Change or Remove Instance” page. Follow Microsoft’s
white paper/tech doc to remove a node from the cluster instance. However, when
you really need to evict a node, it means you’ve already messed up and these
normal steps won’t work. In this case, it won’t work as the registry, edition
level and version numbers have all been messed up due to the incorrect upgrade
steps and duct taping. Still, I highly recommend you to try the normal steps
first.
1.5 If you see some
message resembling “hostname xxx has been evicted” without further error
messages, congratulations, you’ve evicted the node and the cluadmin has
done a successful cleanup job for you, which means you can easily add the node
back without too much hassle.
1.6 If you see a message
after you have evicted the node, which says, “the node was evicted but cleanup
was not done”, congratulations are still in order, but you need to read on.
1.7 Someone will recommend
that you run “cluster /force” on the evicted node to do a manual cleanup, but
I’ll tell you, that won’t work either as it will only try to clean up some
cluster related info in the registry, and won’t do any other cleanup (SQL,
Exchange, etc.) that normally “Add or Remove Programs” would do.
1.8 Now, you have only two
options:
1.8.1
Manually uninstall SQL Server in the evicted node and do a manual
cleanup of the registries and binaries. I had to do that once with SQL 2000 cluster
and I really hated it. It’s much more tedious and takes much longer.
1.8.2
Rebuild Windows server on the evicted node (if you can do it by
yourself, great, otherwise you’ll have to bug a Windows admin). I highly recommend
this method, as it’s much cleaner and takes less time.
1.9 When you rebuild the
node, make sure you do everything the same as you would do for a brand new
cluster installation (except you don’t have to rerun the fiber to the SAN or
external disk array, crossover line/reconfigure VLAN). Test ping the heart
beat/private NIC/IP, and use disk management to verify that you can see the
shared storage (ignore the red x as it’s not the active node and has no access
to it anyway), etc.
1.10
Once the node has been rebuilt, you can add the node back to the Windows
cluster, which is easy. Open the “cluster administrator” from the active node,
right click on the cluster name, select “new -> node” and enter the hostname
of the rebuilt node. The “cluadmin” will do a verification and install cluster
service on the new node. You can ignore two warnings during this course as long
as you’re able to keep clicking “next” till “finish”. Then you want to apply
all hot fixes related to cluster service to this new node; at least bring it to
the same level as other nodes.
1.11
Congrats, now you’ve added the node to the cluster, and need to add it
to the SQL cluster instance. Open “Add or Remove Programs” from “Control Panel”,
select “Microsoft SQL Server 2005”, and click “Change”. The setup will launch
again, and again just follow the steps Microsoft provides to add the node to
the cluster.
1.12
OK, now you can see both nodes (or all nodes including the one you’ve
just added) on SQL resource group (and MSDTC group, if you’re one of those people
who insist on taking the hassle of creating a separate resource group including
disk, IP and network name for MSDTC, or for nothing) and you can do failover
testing successfully. You think you’re done.
1.13
But wait, what if we run the following query on all nodes:
select convert(varchar(20),serverproperty('computernamephysicalnetbios')) ,convert(varchar(20),serverproperty('productlevel')), convert(varchar(20),serverproperty('productversion')), convert(varchar(30),serverproperty('edition'))
You’ll find that the newly added node is at “RTM” level and
doesn’t have the service pack level (say SP3) and hot fix you wanted, which are
on other nodes already. Note, if you fail over on the old nodes, you’ll see the
right version/edition info, as they have the right binary. So we know the “Add
or Remove Program” or SQL Setup won’t automatically patch for you. OK, let’s do
it ourselves then.
1.14
You try to apply the service packages and hot fixes from the good node(s),
and it doesn’t work. You’re told, “The binary to be installed has an older
version”.
1.15
You fail over to the new node and try to apply SP and hot fixes there,
and the setup will give you some goofy error that it can’t connect to other
nodes (or some access error).
1.16
If you Google about this, you’ll find out you need to close all RDP (remote
desktop session) connections to all other inactive nodes. OK, now you’re seeing
a different error and the setup just won’t go through.
1.17
You can cancel the setup, which is going to roll back all of the the
changes done during the previous steps, which I do give the Soft a big thumbs
up as I don’t have to end up rebuilding the node again.
1.18
Here are the right steps, once you add the new node back to the SQL
cluster (step 1.12).
Fail the cluster
over to this new node, and stop cluster services in all other nodes. You can do
this with “cluadmin” or if you like, you can RDP to each and every node and use
“services.msc” to stop the cluster service.
1.19
Anyway, ensure that only the new node owns all of the active resources
and the cluster service in other nodes are all down, and there are no RDP
sessions to any other nodes. Now, try to apply the service package and hot
fixes on only this node again, and it all works fine!
1.20
Bring all other nodes online (start up cluster service from cluadmin),
test the fail over and back and run aforementioned query again; you should see that
you have the same edition/version/SP level information.
And that’s all I have to say about this topic, enjoy!
Additional Resources
MSDN: SQL Server 2005 Books Online How to: Recover from Failover Cluster Failure in Scenario 1
MSDN:
SQL Server 2005 Books Online How to: Add or Remove Nodes in a SQL Server 2005 Failover Cluster (Setup)