Lately we have seen a lot of Event ID 5120 entries with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when rebooting a node.
Here is a statement from Microsoft about the issue and what to do when rebooting a node.
In the May cumulative update we introduced SMB Resilient Handles for the S2D intra-cluster network to improve resiliency to transient network failures. This has had some side effects in increased timeouts when a node is rebooted, which can affect a system under stress. Symptoms include event ID 5120’s with a status code of STATUS_IO_TIMEOUT or STATUS_CONNECTION_DISCONNECTED when a node is rebooted.
Until a fix is available, a workaround that addresses the issue is to invoke Storage Maintenance Mode prior to rebooting a node in a Storage Spaces Direct cluster, for example when patching.
So, first drain the node, then invoke Storage Maintenance Mode, then reboot.
Here’s the syntax:
Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Enable-StorageMaintenanceMode
Once the node is back online, disable Storage Maintenance Mode with this syntax:
Get-StorageFaultDomain -Type StorageScaleUnit | Where-Object {$_.FriendlyName -eq "<NodeName>"} | Disable-StorageMaintenanceMode
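The whole drain, maintenance mode, reboot sequence could be put together roughly like this. It is only a sketch: the node name Node01 is a made-up placeholder, and it assumes you run it from another node in the same cluster with the FailoverClusters and Storage modules available and PowerShell remoting enabled. Adjust it to your own environment before using it.

```powershell
# Hypothetical node name; replace with the node you are about to reboot.
$NodeName = 'Node01'

# 1. Drain the roles off the node and wait until the drain has finished.
Suspend-ClusterNode -Name $NodeName -Drain -Wait

# 2. Put the node's storage into Storage Maintenance Mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Enable-StorageMaintenanceMode

# 3. Reboot the node and wait until PowerShell remoting answers again.
Restart-Computer -ComputerName $NodeName -Force -Wait -For PowerShell

# 4. Take the storage out of maintenance mode once the node is back online.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object { $_.FriendlyName -eq $NodeName } |
    Disable-StorageMaintenanceMode

# 5. Resume the node so it can host roles again.
Resume-ClusterNode -Name $NodeName
```

The scale unit is looked up again in step 4 instead of reusing the object from step 2, so the command works on the state of the node after the reboot.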
Please note that Cluster Aware Updating does not put your nodes in Storage Maintenance Mode.
S2D cluster doesn’t seem to like putting node into maintenance mode…
It depends on whether anything in the cluster has an issue. It will not allow you to put a node into maintenance mode if there is a warning in the cluster, for example an unhealthy disk or an unhealthy virtual disk. So make sure that nothing in your setup is unhealthy first.
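A quick way to look for such problems before enabling maintenance mode might be something like the following. It is a minimal sketch that only checks virtual disks and physical disks, and it assumes you run it on one of the cluster nodes. Other warnings in the cluster would need their own checks.

```powershell
# List virtual disks that are not reported as healthy.
Get-VirtualDisk |
    Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, HealthStatus, OperationalStatus

# List physical disks that are not reported as healthy.
Get-PhysicalDisk |
    Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Select-Object FriendlyName, SerialNumber, HealthStatus, OperationalStatus
```

If both commands return nothing, no disk or virtual disk is reporting an unhealthy state.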
Regards
Jan-Tore Pedersen