How to Evict a Failed Node and Add it Back to SQL Server 2005 Cluster

April 14, 2010

Adding and removing nodes in SQL Server Clusters is not so difficult, and instructions on how to do so abound on the internet. However, mismanagement when adding/removing nodes can quickly become a 'gotcha' that wastes time. Bo Chen offers insight into some of those scenarios that are not normally covered in the standard online documents.

Background

There seem to be plenty of articles either from Microsoft’s tech sites or other forums regarding how to manage (add/remove) nodes in SQL Server clusters, however, none of them (at least I didn’t find any) touch this topic in detail. Mismanagement, when adding/removing nodes in SQL Server clusters, can quickly become GOTCHAs that take more time than necessary.

Suppose everyone knows about the basics of a SQL cluster, which is nothing but a shared storage failover cluster consisting of one or more than one nodes, just like VCS (Veritas Cluster Service), Sun Cluster or IBM HACMP. At any time, only one node can have full access (via the same path, and AP, MP and power path are out of scope here), to the shared storage, hence the cluster has no load balance capabilities.

Below, I’m going to give several scenarios which, again, are normally not covered by any of the standard documents you find online, but you might run into in the future if you’re a SQL geek or geek to be, so please don’t ask me how they could ever happen or be possible.

A scenario that’s not very typical but typical enough

A DBA tried to upgrade SQL 2005 Standard Edition SP3 to SQL 2005 Enterprise Edition SP3, but when the upgrade happened, he forgot to check the cluster services in both nodes. When the upgrade was done, the cluster couldn’t fail over to the other node, which failed to upgrade because cluster service was down in that node during the upgrade. After a few failed duct-taping attempts (it could have been fixed had those duct-taping attempts not happened), it was decided the failed node needed to be evicted and added back.

1.1  Open cluster administrator from the active node by either “start -> Run -> cluadmin” or by clicking “Start -> Administrative Tools -> Cluster Administrator”.

1.2  Enter “.” as the “Cluster or server name” if cluadmin didn’t open the default cluster automatically.

1.3  Right click on the failed/problematic node and choose “stop cluster service”, if the cluster service is still up and running on it.

1.4  Right click on the failed/problematic node, choose “Evict node”, and click “OK” to confirm it.

Note: What we really need to do first is to remove the node from the SQL Server definition before evicting the node in the “Cluster Administrator”. The steps are, open “Add or Remove Programs” in “control panel”, select “Microsoft SQL Server 2005”, and click “Change”. The Setup will launch, and select “Maintain the Virtual Server” from the “Change or Remove Instance” page. Follow Microsoft’s white paper/tech doc to remove a node from the cluster instance. However, when you really need to evict a node, it means you’ve already messed up and these normal steps won’t work. In this case, it won’t work as the registry, edition level and version numbers have all been messed up due to the incorrect upgrade steps and duct taping. Still, I highly recommend you to try the normal steps first.

1.5  If you see some message resembling “hostname xxx has been evicted” without further error messages, congratulations, you’ve evicted the node and the cluadmin has done a successful cleanup job for you, which means you can easily add the node back without too much hassle.

1.6  If you see a message after you have evicted the node, which says, “the node was evicted but cleanup was not done”, congratulations are still in order, but you need to read on.

1.7  Someone will recommend that you run “cluster /force” on the evicted node to do a manual cleanup, but I’ll tell you, that won’t work either as it will only try to clean up some cluster related info in the registry, and won’t do any other cleanup (SQL, Exchange, etc.) that normally “Add or Remove Programs” would do.

1.8  Now, you have only two options:

1.8.1  Manually uninstall SQL Server in the evicted node and do a manual cleanup of the registries and binaries. I had to do that once with SQL 2000 cluster and I really hated it. It’s much more tedious and takes much longer.

1.8.2  Rebuild Windows server on the evicted node (if you can do it by yourself, great, otherwise you'll have to bug a Windows admin). I highly recommend this method, as it’s much cleaner and takes less time.

1.9  When you rebuild the node, make sure you do everything the same as you would do for a brand new cluster installation (except you don’t have to rerun the fiber to the SAN or external disk array, crossover line/reconfigure VLAN). Test ping the heart beat/private NIC/IP, and use disk management to verify that you can see the shared storage (ignore the red x as it’s not the active node and has no access to it anyway), etc.

1.10  Once the node has been rebuilt, you can add the node back to the Windows cluster, which is easy. Open the “cluster administrator” from the active node, right click on the cluster name, select “new -> node” and enter the hostname of the rebuilt node. The “cluadmin” will do a verification and install cluster service on the new node. You can ignore two warnings during this course as long as you’re able to keep clicking “next” till “finish”. Then you want to apply all hot fixes related to cluster service to this new node; at least bring it to the same level as other nodes.

1.11  Congrats, now you’ve added the node to the cluster, and need to add it to the SQL cluster instance. Open “Add or Remove Programs” from “Control Panel”, select “Microsoft SQL Server 2005”, and click “Change”. The setup will launch again, and again just follow the steps Microsoft provides to add the node to the cluster.

1.12  OK, now you can see both nodes (or all nodes including the one you've just added) on SQL resource group (and MSDTC group, if you’re one of those people who insist on taking the hassle of creating a separate resource group including disk, IP and network name for MSDTC, or for nothing) and you can do failover testing successfully. You think you’re done.

1.13  But wait, what if we run the following query on all nodes:

	select convert(varchar(20),serverproperty('computernamephysicalnetbios'))
	,convert(varchar(20),serverproperty('productlevel')),
	convert(varchar(20),serverproperty('productversion')),
	convert(varchar(30),serverproperty('edition'))
 

You'll find that the newly added node is at “RTM” level and doesn’t have the service pack level (say SP3) and hot fix you wanted, which are on other nodes already. Note, if you fail over on the old nodes, you’ll see the right version/edition info, as they have the right binary. So we know the “Add or Remove Program” or SQL Setup won’t automatically patch for you. OK, let’s do it ourselves then.

1.14  You try to apply the service packages and hot fixes from the good node(s), and it doesn't work. You're told, “The binary to be installed has an older version”.

1.15  You fail over to the new node and try to apply SP and hot fixes there, and the setup will give you some goofy error that it can’t connect to other nodes (or some access error).

1.16  If you Google about this, you’ll find out you need to close all RDP (remote desktop session) connections to all other inactive nodes. OK, now you’re seeing a different error and the setup just won’t go through.

1.17  You can cancel the setup, which is going to roll back all of the the changes done during the previous steps, which I do give the Soft a big thumbs up as I don’t have to end up rebuilding the node again.

1.18  Here are the right steps, once you add the new node back to the SQL cluster (step 1.12).

Fail the cluster over to this new node, and stop cluster services in all other nodes. You can do this with “cluadmin” or if you like, you can RDP to each and every node and use “services.msc” to stop the cluster service.

1.19  Anyway, ensure that only the new node owns all of the active resources and the cluster service in other nodes are all down, and there are no RDP sessions to any other nodes. Now, try to apply the service package and hot fixes on only this node again, and it all works fine!

1.20  Bring all other nodes online (start up cluster service from cluadmin), test the fail over and back and run aforementioned query again; you should see that you have the same edition/version/SP level information.

And that’s all I have to say about this topic, enjoy!

Additional Resources

MSDN: SQL Server 2005 Books Online How to: Recover from Failover Cluster Failure in Scenario 1
MSDN: SQL Server 2005 Books Online How to: Add or Remove Nodes in a SQL Server 2005 Failover Cluster (Setup)

» See All Articles by Columnist Bo Chen








The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers