Hey everyone, here is another Failover Cluster issue I came across lately. I wanted to share this one because I could not find any good resources online for it. So here we go.
A client contacted me a few weeks back about a node that would not come back online in the cluster after a reboot. It simply refused to rejoin the cluster. The only useful error messages I could find under Failover Clustering were these.
What we found
The error messages say there is an authentication failure between the node and the cluster, and event ID 5401 references Schannel. Looking at Schannel event ID 36869, it showed that the certificate was missing its private key information.
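If you want to pull the same events on your own node, something like the following should do it. This is only a sketch: it assumes the clustering and Schannel events are written to the System log, and the event IDs are simply the ones that show up in this post.

```powershell
# Assumption: the Failover Clustering and Schannel events land in the System log.
# The IDs are the ones mentioned in this post.
$ids = 1653, 5398, 5401, 36869

# -ErrorAction SilentlyContinue avoids an error if no matching events exist.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = $ids } -MaxEvents 50 -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-List TimeCreated, Id, ProviderName, Message
```

Run it in an elevated PowerShell session on the node that refuses to join.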
So my first thought was: OK, let’s find the cert, delete it and reboot the node, as Failover Clustering will get the cert back from the other nodes when the node tries to join the cluster. Well, that did not work. After some more troubleshooting and googling without finding an answer, a Premier Support ticket was created. Unfortunately, MS support has been overwhelmed with support cases and backlogged for a while, so the ticket took some time to resolve.
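For anyone wanting to check whether a certificate has lost its private key, here is a rough sketch. Note the assumption: I am listing the Local Computer "Personal" store here, which may or may not be where the certificate in your case lives, so adjust the path as needed.

```powershell
# Assumption: the certificate is in the Local Computer "Personal" store (Cert:\LocalMachine\My).
# HasPrivateKey is False for a certificate without an associated private key.
Get-ChildItem -Path Cert:\LocalMachine\My |
    Select-Object Subject, Thumbprint, NotAfter, HasPrivateKey |
    Format-Table -AutoSize
```

This only lists the certificates and changes nothing.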
MS wanted us to run the tss_tool found at https://github.com/walter-1/TSS. Download the zip file, extract it to a folder and run the command
It will create a folder named MS_Data under C:\
Inside one of the logs we found this little error event:
11/17/2021 1:58:25 AM Error node1.contoso.local 1653 Microsoft-Windows-FailoverClustering Node-to-Node Communications NT AUTHORITY\SYSTEM Cluster node ‘node1’ failed to join the cluster because it could not communicate over the network with any other node in the cluster. Verify network connectivity and configuration of any network firewalls.
11/17/2021 1:58:25 AM Error node1.contoso.local 5398 Microsoft-Windows-FailoverClustering Startup/Shutdown NT AUTHORITY\SYSTEM Cluster failed to start. The latest copy of cluster configuration data was not available within the set of nodes attempting to start the cluster. Changes to the cluster occurred while the set of nodes were not in membership and as a result were not able to receive configuration data updates. . Votes required to start cluster: 3 Votes available: 1 Nodes with votes: Quorum Disk node2 node3 node4 Guidance: Attempt to start the cluster service on all nodes in the cluster so that nodes with the latest copy of the cluster configuration data can first form the cluster. The cluster will be able to start and the nodes will automatically obtain the updated cluster configuration data. If there are no nodes available with the latest copy of the cluster configuration data, run the ‘Start-ClusterNode -FQ’ Windows PowerShell cmdlet. Using the ForceQuorum (FQ) parameter will start the cluster service and mark this node’s copy of the cluster configuration data to be authoritative. Forcing quorum on a node with an outdated copy of the cluster database may result in cluster configuration changes that occurred while the node was not participating in the cluster to be lost.
And then this error message:
11/17/2021 1:58:38 AM Error node1.contoso.local 36869 Schannel N/A NT AUTHORITY\SYSTEM The TLS client credential’s certificate does not have a private key information property attached to it. This most often occurs when a certificate is backed up incorrectly and then later restored. This message can also indicate a certificate enrollment failure
We tried Clear-ClusterNode -Name node1 -Force on the faulty node and on one of the other nodes in the cluster to try to clear the node information, but it did not work.
The last suggestion we had was to remove the node from the cluster. So we ran the following:
# On node1 run
Remove-ClusterNode -Name node1 -Force
# Once the node is out, run this command on node1
Clear-ClusterNode -Name node1 -Force
# Then run the same command on any of the other nodes in the cluster
Clear-ClusterNode -Name node1 -Force
Once this was done, the node was outside the cluster, and we could try to add it back, which worked flawlessly.
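If you prefer PowerShell for the re-add, a minimal sketch could look like this. It assumes you run it on a node that is still a healthy member of the cluster, and node1 is just the node name used in this post.

```powershell
# Assumption: run on a node that is still a healthy member of the cluster.
# Add the cleaned node back into the cluster.
Add-ClusterNode -Name node1

# Check that every node, including node1, reports a state of Up.
Get-ClusterNode | Format-Table Name, State
```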
So to sum this up, something happened to the Schannel certificate on the node that was rebooted, and it could no longer authenticate securely. The only way to get it back in was to remove the node and re-add it to the cluster. Hopefully this will help someone in the future.