Over the last few weeks we have been having some issues with our Storage Spaces Direct test/dev cluster. To start off, I will explain what happened and what went wrong.
In this guide I will explain what you can do to fix a failed virtual disk in a Failover Cluster. In S2D, ReFS writes some metadata to the volume when it mounts it. If it can't do this for some reason, the cluster moves the virtual disk from node to node until it has tried to mount it on the last host. Then the mount fails, you get this state in the event log, and the virtual disk is marked as failed.
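A minimal sketch of how you could check for this state from any cluster node. It only reads status and changes nothing. The number of events to pull (50) is an arbitrary choice for illustration, and the sketch assumes the virtual disks are used as Cluster Shared Volumes.

```powershell
# Health of every virtual disk, including why a disk is detached (if it is)
Get-VirtualDisk |
    Sort-Object FriendlyName |
    Format-Table FriendlyName, OperationalStatus, HealthStatus, DetachedReason -AutoSize

# State and current owner node of the Cluster Shared Volumes
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode -AutoSize

# Latest failover clustering events (50 is an arbitrary number)
Get-WinEvent -LogName 'Microsoft-Windows-FailoverClustering/Operational' -MaxEvents 50 |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
```

A disk that is bouncing between nodes typically shows a changing OwnerNode if you run the second command a few times in a row.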
In this guide I will give you a quick overview of how to troubleshoot Storage Spaces, covering ordinary storage pools with and without tiering as well as Storage Spaces Direct. The troubleshooting is based on an issue we had with our test Storage Spaces Direct cluster.
What happened was that we were starting to experience really bad performance on the VMs. Response times were going through the roof: we saw 8000 ms on normal OS operations. We traced it down to faulty SSD drives. These were Kingston V310 consumer SSDs without power loss protection, and that is a problem because S2D, like Windows storage in general, wants to write to a safe place. The caching on these Kingston drives worked for a while, but after too much writing it failed. You can read all about SSD and power loss protection here.
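One way to spot drives like this is to look at the per-disk reliability counters. This is only a sketch of where to start looking, not the exact steps we followed, and not every drive reports every counter, so expect empty columns on some hardware.

```powershell
# Inventory of the physical disks: media type, role and health
Get-PhysicalDisk |
    Sort-Object FriendlyName |
    Format-Table FriendlyName, SerialNumber, MediaType, Usage, HealthStatus, OperationalStatus -AutoSize

# Latency and wear counters per disk; very high max latency on a
# cache SSD is a hint that the drive cannot keep up with the writes
Get-PhysicalDisk |
    Get-StorageReliabilityCounter |
    Format-Table DeviceId, ReadLatencyMax, WriteLatencyMax, Wear, Temperature -AutoSize
```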
During this summer I decided I wanted to test out Storage Spaces Direct. TP5 was out and I was quite eager to try it. It has now been upgraded to RTM with a cluster rolling upgrade. Remember to run Update-ClusterFunctionalLevel afterwards.
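A short sketch of that last step, run on one node once every node is on the new build. Updating the storage pool version afterwards is an extra step I am assuming you want as well. Neither update can be rolled back, so check the current level first.

```powershell
# Check the current functional level before changing anything
Get-Cluster | Select-Object Name, ClusterFunctionalLevel

# Raise the cluster functional level (cannot be undone)
Update-ClusterFunctionalLevel

# Assumption: you also want the pool on the new version (cannot be undone)
Get-StoragePool -IsPrimordial $false | Update-StoragePool
```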
So I looked around on eBay for the servers and other items I needed to buy. I ended up with the list below, all in 4x:
- HP DL380 G6 16 bay 128GB mem, 2x4core Intel CPU HP P420
- HP H220
- Mellanox ConnectX-3 MCX312A-XCBT
- Intel 750 NVMe PCIe
- 2x Kingston SSDNow V310 for caching (replacing with Samsung SM863)
- 6x WD Red NAS 1TB 2.5″
- Dell Force10 S4810P (already had)
A week ago I replaced an NVMe card in our development Storage Spaces Direct cluster. This did not go as gracefully as I had hoped. Normally it should work in the following way:
- Pause the node and drain all resources from the server
- Shut down server
- Replace NVMe card
- Reboot server
- Resume server
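The steps above can be sketched in PowerShell roughly like this. The node name `S2D-Node01` is hypothetical, and the hardware swap itself is of course manual. Treat it as an outline of the normal procedure, not a script to run unattended.

```powershell
# Hypothetical node name, replace with your own
$node = 'S2D-Node01'

# Make sure all virtual disks are healthy before taking a node down
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize

# Pause the node and drain all roles off it, waiting until the drain is done
Suspend-ClusterNode -Name $node -Drain -Wait

# Shut down the server
Stop-Computer -ComputerName $node -Force

# ... replace the NVMe card and power the server back on ...

# Resume the node and move its roles back
Resume-ClusterNode -Name $node -Failback Immediate

# Watch the repair jobs; wait until they finish before touching the next node
Get-StorageJob
Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize
```

Waiting for `Get-StorageJob` to come back empty matters: the disks on the paused node miss writes while it is down and have to resync before the cluster is fully resilient again.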
This did not go as planned, and I ended up with quite a lot of issues. This was late on a Saturday evening. I ended up with disks that looked like this.