How to Evict a Failed Node and Add it Back to SQL Server 2005 Cluster
April 14, 2010
Adding and removing nodes in SQL Server Clusters is not so difficult, and instructions on how to do so abound on the internet. However, mismanagement when adding/removing nodes can quickly become a 'gotcha' that wastes time. Bo Chen offers insight into some of those scenarios that are not normally covered in the standard online documents.
There seem to be plenty of articles, on Microsoft's tech sites and on other forums, about how to manage (add/remove) nodes in SQL Server clusters; however, none of them (at least none I could find) cover this topic in detail. Mismanagement when adding or removing nodes in a SQL Server cluster can quickly become a gotcha that takes more time than necessary.
I assume everyone knows the basics of a SQL cluster: it is nothing but a shared-storage failover cluster consisting of one or more nodes, just like VCS (Veritas Cluster Server), Sun Cluster or IBM HACMP. At any time, only one node has full access to the shared storage via the same path (active/passive paths, multipathing software such as EMC PowerPath, and so on are out of scope here), hence the cluster has no load-balancing capability.
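If you ever want to confirm which node currently owns the shared resources, cluster.exe will tell you from any node. This is a minimal sketch; the group name "SQL Group" is a placeholder, as your actual group name will differ:

REM List every resource group and the node that currently owns it
cluster group

REM Or check one specific group ("SQL Group" is a placeholder name)
cluster group "SQL Group"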
Below, I'm going to give several scenarios which, again, are normally not covered by the standard documents you find online, but which you might run into if you're a SQL geek or geek-to-be, so please don't ask me how they could ever happen or be possible.
A scenario that's not very typical, but typical enough
A DBA tried to upgrade SQL 2005 Standard Edition SP3 to SQL 2005 Enterprise Edition SP3, but before the upgrade he forgot to check that the cluster service was running on both nodes. When the upgrade was done, the cluster couldn't fail over to the other node, which had failed to upgrade because its cluster service was down during the upgrade. After a few failed duct-taping attempts (it could have been fixed had those duct-taping attempts not happened), it was decided that the failed node needed to be evicted and added back.
1.1 Open Cluster Administrator from the active node, either via Start -> Run -> cluadmin or by clicking Start -> Administrative Tools -> Cluster Administrator.
1.2 Enter . (a single period) as the cluster or server name if cluadmin didn't open the default cluster automatically.
1.3 Right-click the failed/problematic node and choose Stop Cluster Service, if the cluster service is still up and running on it.
1.4 Right-click the failed/problematic node, choose Evict Node, and click OK to confirm.
Note: What we really ought to do first is remove the node from the SQL Server definition before evicting it in Cluster Administrator. The steps are: open Add or Remove Programs in Control Panel, select Microsoft SQL Server 2005, and click Change. Setup will launch; select Maintain the Virtual Server on the Change or Remove Instance page, and follow Microsoft's white paper/tech doc to remove a node from the clustered instance. However, when you really need to evict a node, it usually means you've already messed up, and these normal steps won't work. In this case they didn't work, because the registry, edition level and version numbers had all been messed up by the incorrect upgrade steps and the duct taping. Still, I highly recommend that you try the normal steps first. Before you touch anything, it is also worth recording which nodes the virtual server definition currently knows about, as sketched below.
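A minimal sketch using sqlcmd and the fn_virtualservernodes() function; the virtual server name SQLVS1 is a placeholder for your own:

REM List the nodes defined for this SQL Server 2005 virtual server
REM (SQLVS1 is a placeholder for your virtual server/instance name)
sqlcmd -S SQLVS1 -E -Q "SELECT * FROM ::fn_virtualservernodes()"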
1.5 If you see a message resembling hostname xxx has been evicted, with no further error messages, congratulations: you've evicted the node and cluadmin has done a successful cleanup job for you, which means you can easily add the node back without too much hassle.
1.6 If, after you have evicted the node, you see a message that says the node was evicted but cleanup was not done, congratulations are still in order, but you need to read on.
1.7 Someone will recommend that you run cluster node with the /forcecleanup switch on the evicted node to do a manual cleanup (see below), but I'll tell you that won't work either, as it only tries to clean up some cluster-related information in the registry, and won't do any of the other cleanup (SQL, Exchange, etc.) that Add or Remove Programs would normally do.
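For reference, this is the command usually suggested (Windows Server 2003's cluster.exe); as noted above, it only resets cluster-related registry state:

REM Restore the evicted node's cluster configuration to a clean state
REM (run on the evicted node; NODE2 is a placeholder for its hostname)
cluster node NODE2 /forcecleanup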
1.8 Now, you have only two options:
1.8.1 Manually uninstall SQL Server on the evicted node and do a manual cleanup of the registry entries and binaries. I had to do that once with a SQL 2000 cluster and I really hated it. It's much more tedious and takes much longer.
1.8.2 Rebuild Windows Server on the evicted node (if you can do it yourself, great; otherwise you'll have to bug a Windows admin). I highly recommend this method, as it's much cleaner and takes less time.
1.9 When you rebuild the node, make sure you do everything the same as you would for a brand-new cluster installation (except that you don't have to rerun the fiber to the SAN or external disk array, or redo the crossover cable/VLAN configuration). Test-ping the heartbeat/private NIC/IP (see the sketch below), and use Disk Management to verify that you can see the shared storage (ignore the red X, as this is not the active node and has no access to it anyway), etc.
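A minimal check from the rebuilt node might look like this; the IP address is a placeholder for the other node's heartbeat IP on your own private network:

REM Verify the private/heartbeat network from the rebuilt node
REM (192.168.10.1 is a placeholder for the other node's heartbeat IP)
ping -n 4 192.168.10.1

REM Confirm the local NIC configuration while you are at it
ipconfig /all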
1.10 Once the node has been rebuilt, you can add it back to the Windows cluster, which is easy. Open Cluster Administrator from the active node, right-click the cluster name, select New -> Node, and enter the hostname of the rebuilt node. Cluadmin will run a verification and install the cluster service on the new node. You can ignore the two warnings along the way, as long as you're able to keep clicking Next until Finish. Then you'll want to apply all hotfixes related to the cluster service to this new node; at least bring it to the same level as the other nodes. You can verify the join from the command line, as shown below.
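A quick way to confirm that the rebuilt node has joined the Windows cluster, run from any node:

REM Show the status of every node in the cluster;
REM the rebuilt node should now be listed as Up
cluster node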
1.11 Congrats, now you've added the node to the cluster, and need to add it to the SQL cluster instance. Open Add or Remove Programs from Control Panel, select Microsoft SQL Server 2005, and click Change. Setup will launch again, and again just follow the steps Microsoft provides to add the node to the clustered instance.
1.12 OK, now you can see both nodes (or all nodes, including the one you've just added) in the SQL resource group (and in the MSDTC group, if you're one of those people who insist on taking on the hassle of creating a separate resource group with a disk, IP and network name for MSDTC, or for nothing), and you can run failover tests successfully. You think you're done.
1.13 But wait, what if we run the following query on all nodes:
select convert(varchar(20), serverproperty('ComputerNamePhysicalNetBIOS')),
       convert(varchar(20), serverproperty('ProductLevel')),
       convert(varchar(20), serverproperty('ProductVersion')),
       convert(varchar(30), serverproperty('Edition'))
You'll find that the newly added node is at RTM level and doesn't have the service pack (say SP3) and hotfixes you wanted, which are already on the other nodes. Note that if you fail over to the old nodes, you'll see the right version/edition information, as they have the right binaries. So we know that Add or Remove Programs / SQL Setup won't automatically patch the node for you. OK, let's do it ourselves then.
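From the command line, the mismatch might look something like this. This is a hypothetical illustration: the server and node names are placeholders, and the builds shown assume SP3 (9.00.4035) on the old node versus RTM (9.00.1399) on the new one:

REM Run the version check against the virtual server while each node is active
REM (SQLVS1 is a placeholder for your virtual server/instance name)
sqlcmd -S SQLVS1 -E -Q "SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS'), SERVERPROPERTY('ProductLevel'), SERVERPROPERTY('ProductVersion')"

REM Hypothetical results:
REM   NODE1 (old node, active)   SP3   9.00.4035.00
REM   NODE2 (new node, active)   RTM   9.00.1399.06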
1.14 You try to apply the service pack and hotfixes from the good node(s), and it doesn't work. You're told: The binary to be installed has an older version.
1.15 You fail over to the new node and try to apply the SP and hotfixes there, and the setup gives you some goofy error that it can't connect to the other nodes (or some access error).
1.16 If you Google this, you'll find out that you need to close all RDP (Remote Desktop) sessions to all the other, inactive nodes. OK, now you're seeing a different error, and the setup just won't go through.
1.17 You can cancel the setup, which will roll back all of the changes made during the previous steps; I do give Microsoft a big thumbs-up here, as I didn't have to end up rebuilding the node again.
1.18 Here are the right steps, once you have added the new node back to the SQL cluster (step 1.12).
Fail the cluster over to this new node, and stop the cluster service on all other nodes. You can do this with cluadmin or, if you like, you can RDP to each and every node and use services.msc to stop the cluster service; a command-line equivalent is sketched below.
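Something like this, run from the new node, would do the same job; the group and node names are placeholders:

REM Move the SQL Server group to the new node
REM ("SQL Group" and NODE2 are placeholder names)
cluster group "SQL Group" /moveto:NODE2

REM Stop the cluster service on each of the other nodes remotely
sc \\NODE1 stop clussvc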
1.19 Anyway, ensure that only the new node owns all of the active resources, that the cluster service on every other node is down, and that there are no RDP sessions to any other node. Now, try to apply the service pack and hotfixes on only this node again, and it all works fine!
1.20 Bring all of the other nodes back online (start the cluster service from cluadmin), test failing over and back, and run the aforementioned query again; you should see the same edition/version/SP-level information on every node, as in the sketch below.
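The final verification might look like this; again, the names are placeholders and the build shown assumes SP3:

REM Restart the cluster service on each of the other nodes
sc \\NODE1 start clussvc

REM Fail over to each node in turn and re-check the build
cluster group "SQL Group" /moveto:NODE1
sqlcmd -S SQLVS1 -E -Q "SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS'), SERVERPROPERTY('ProductLevel'), SERVERPROPERTY('ProductVersion')"

REM Every node should now report the same level, e.g. SP3 / 9.00.4035.00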
And that's all I have to say about this topic. Enjoy!
MSDN: SQL Server 2005 Books Online, "How to: Recover from Failover Cluster Failure in Scenario 1"