Adding and removing nodes in SQL Server Clusters is not so difficult, and instructions on how to do so abound on the internet. However, mismanagement when adding/removing nodes can quickly become a 'gotcha' that wastes time. Bo Chen offers insight into some of those scenarios that are not normally covered in the standard online documents.
There seem to be plenty of articles either from Microsofts
tech sites or other forums regarding how to manage (add/remove) nodes in SQL Server
clusters, however, none of them (at least I didnt find any) touch this topic
in detail. Mismanagement, when adding/removing nodes in SQL Server clusters, can quickly
become GOTCHAs that take more time than necessary.
Suppose everyone knows about the basics of a SQL cluster,
which is nothing but a shared storage failover cluster consisting of one or
more than one nodes, just like VCS (Veritas Cluster Service), Sun Cluster or
IBM HACMP. At any time, only one node can have full access (via the same path,
and AP, MP and power path are out of scope here), to the shared storage, hence
the cluster has no load balance capabilities.
Below, Im going to give several scenarios which, again, are
normally not covered by any of the standard documents you find online, but you
might run into in the future if youre a SQL geek or geek to be, so please
dont ask me how they could ever happen or be possible.
A scenario thats not very typical but typical enough
A DBA tried to upgrade SQL 2005 Standard Edition SP3 to SQL
2005 Enterprise Edition SP3, but when the upgrade happened, he forgot to check
the cluster services in both nodes. When the upgrade was done, the cluster
couldnt fail over to the other node, which failed to upgrade because cluster
service was down in that node during the upgrade. After a few failed
duct-taping attempts (it could have been fixed had those duct-taping attempts
not happened), it was decided the failed node needed to be evicted and added
1.1 Open cluster
administrator from the active node by either start -> Run -> cluadmin or by clicking
Start -> Administrative Tools -> Cluster Administrator.
1.2 Enter . as the
Cluster or server name if cluadmin didnt open the default cluster
1.3 Right click on the
failed/problematic node and choose stop cluster service, if the cluster
service is still up and running on it.
1.4 Right click on the
failed/problematic node, choose Evict node, and click OK to confirm it.
we really need to do first is to remove the node from the SQL Server definition
before evicting the node in the Cluster Administrator. The steps are, open
Add or Remove Programs in control panel, select Microsoft SQL Server
2005, and click Change. The Setup will launch, and select Maintain the
Virtual Server from the Change or Remove Instance page. Follow Microsofts
white paper/tech doc to remove a node from the cluster instance. However, when
you really need to evict a node, it means youve already messed up and these
normal steps wont work. In this case, it wont work as the registry, edition
level and version numbers have all been messed up due to the incorrect upgrade
steps and duct taping. Still, I highly recommend you to try the normal steps
1.5 If you see some
message resembling hostname xxx has been evicted without further error
messages, congratulations, youve evicted the node and the cluadmin has
done a successful cleanup job for you, which means you can easily add the node
back without too much hassle.
1.6 If you see a message
after you have evicted the node, which says, the node was evicted but cleanup
was not done, congratulations are still in order, but you need to read on.
1.7 Someone will recommend
that you run cluster /force on the evicted node to do a manual cleanup, but
Ill tell you, that wont work either as it will only try to clean up some
cluster related info in the registry, and wont do any other cleanup (SQL,
Exchange, etc.) that normally Add or Remove Programs would do.
1.8 Now, you have only two
Manually uninstall SQL Server in the evicted node and do a manual
cleanup of the registries and binaries. I had to do that once with SQL 2000 cluster
and I really hated it. Its much more tedious and takes much longer.
Rebuild Windows server on the evicted node (if you can do it by
yourself, great, otherwise you'll have to bug a Windows admin). I highly recommend
this method, as its much cleaner and takes less time.
1.9 When you rebuild the
node, make sure you do everything the same as you would do for a brand new
cluster installation (except you dont have to rerun the fiber to the SAN or
external disk array, crossover line/reconfigure VLAN). Test ping the heart
beat/private NIC/IP, and use disk management to verify that you can see the
shared storage (ignore the red x as its not the active node and has no access
to it anyway), etc.
Once the node has been rebuilt, you can add the node back to the Windows
cluster, which is easy. Open the cluster administrator from the active node,
right click on the cluster name, select new -> node and enter the hostname
of the rebuilt node. The cluadmin will do a verification and install cluster
service on the new node. You can ignore two warnings during this course as long
as youre able to keep clicking next till finish. Then you want to apply
all hot fixes related to cluster service to this new node; at least bring it to
the same level as other nodes.
Congrats, now youve added the node to the cluster, and need to add it
to the SQL cluster instance. Open Add or Remove Programs from Control Panel,
select Microsoft SQL Server 2005, and click Change. The setup will launch
again, and again just follow the steps Microsoft provides to add the node to
OK, now you can see both nodes (or all nodes including the one you've
just added) on SQL resource group (and MSDTC group, if youre one of those people
who insist on taking the hassle of creating a separate resource group including
disk, IP and network name for MSDTC, or for nothing) and you can do failover
testing successfully. You think youre done.
But wait, what if we run the following query on all nodes:
You'll find that the newly added node is at RTM level and
doesnt have the service pack level (say SP3) and hot fix you wanted, which are
on other nodes already. Note, if you fail over on the old nodes, youll see the
right version/edition info, as they have the right binary. So we know the Add
or Remove Program or SQL Setup wont automatically patch for you. OK, lets do
it ourselves then.
You try to apply the service packages and hot fixes from the good node(s),
and it doesn't work. You're told, The binary to be installed has an older
You fail over to the new node and try to apply SP and hot fixes there,
and the setup will give you some goofy error that it cant connect to other
nodes (or some access error).
If you Google about this, youll find out you need to close all RDP (remote
desktop session) connections to all other inactive nodes. OK, now youre seeing
a different error and the setup just wont go through.
You can cancel the setup, which is going to roll back all of the the
changes done during the previous steps, which I do give the Soft a big thumbs
up as I dont have to end up rebuilding the node again.
Here are the right steps, once you add the new node back to the SQL
cluster (step 1.12).
Fail the cluster
over to this new node, and stop cluster services in all other nodes. You can do
this with cluadmin or if you like, you can RDP to each and every node and use
services.msc to stop the cluster service.
Anyway, ensure that only the new node owns all of the active resources
and the cluster service in other nodes are all down, and there are no RDP
sessions to any other nodes. Now, try to apply the service package and hot
fixes on only this node again, and it all works fine!
Bring all other nodes online (start up cluster service from cluadmin),
test the fail over and back and run aforementioned query again; you should see that
you have the same edition/version/SP level information.
And thats all I have to say about this topic, enjoy!
MSDN: SQL Server 2005 Books Online How to: Recover from Failover Cluster Failure in Scenario 1
SQL Server 2005 Books Online How to: Add or Remove Nodes in a SQL Server 2005 Failover Cluster (Setup)
See All Articles by Columnist