Troubleshooting failed VirtualDisk on a Storage Spaces Direct Cluster

In this guide I will explain what you can do to fix a failed virtual disk in a Failover Cluster. In S2D, ReFS writes some metadata to the volume when it mounts it. If it can't do this for some reason, the cluster moves the virtual disk from node to node until it has tried to mount it on the last host. Then the mount fails, you will see this state in the event log, and the virtual disk will show as failed.

Updated April 18th 2018

[Screenshot: cluster event log]

If you also look in your ReFS event log, you will see entries like this:

[Screenshot: ReFS event log]

Now let's run a PowerShell command on one of the nodes to look at the virtual disk.

[Screenshot: Get-VirtualDisk output]
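A minimal sketch of such a check. The selected columns are just one reasonable choice, and any node in the cluster should be able to run it.

```powershell
# List every virtual disk with its health and operational status.
# A failed disk typically shows up as unhealthy and detached.
Get-VirtualDisk |
    Select-Object FriendlyName, HealthStatus, OperationalStatus, DetachedReason |
    Format-Table -AutoSize
```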

Updated section

Microsoft has recently changed what you should do when a ReFS volume goes offline on a CSV, and has given us another parameter to use. Start by running these commands.
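As a sketch, the first set of commands looks roughly like the sequence below. It follows the steps in Microsoft's S2D troubleshooting guidance for a detached virtual disk. "Cluster Disk 1" is a placeholder for the name of your failed cluster disk resource, and DiskRecoveryAction is assumed to be the new parameter in question.

```powershell
# Placeholder: replace with the name of the failed cluster disk resource.
$diskName = "Cluster Disk 1"

# Take the disk out of Cluster Shared Volumes so it becomes a plain cluster resource.
Remove-ClusterSharedVolume -Name $diskName

# Set the override that lets the volume attach without the usual checks.
Get-ClusterResource -Name $diskName | Set-ClusterParameter -Name DiskRecoveryAction -Value 1

# Bring the resource online.
Start-ClusterResource -Name $diskName

# On the node that owns the disk: start the integrity scan.
Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
```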

FYI, Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask needs to be run on the node that owns the failed disk. Also run just Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" to see when it finishes. It will say Running until it's done.

You will need to wait for the Data Integrity Scan to finish before continuing. The scan processes about 1 TB of used space per hour, so if you have used 8 TB it will take about 8 hours.
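Since the task gives no progress feedback, a small polling loop can save you from re-running the command by hand. This is only a sketch: the 5-minute interval is arbitrary, and it has to run on the node that owns the disk.

```powershell
$taskName = "Data Integrity Scan for Crash Recovery"

# Check the task state every 5 minutes until it is no longer Running.
while ((Get-ScheduledTask -TaskName $taskName).State -eq 'Running') {
    Write-Host "$(Get-Date -Format 'HH:mm') scan still running..."
    Start-Sleep -Seconds 300
}

# Report the final state once the loop exits.
Write-Host "Scan finished, task state: $((Get-ScheduledTask -TaskName $taskName).State)"
```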

Now the virtual disk should look like this in Failover Cluster Manager.

[Screenshot: virtual disk in Failover Cluster Manager]

Wait for any running storage jobs to finish, since a repair job may have started. Run Get-StorageJob until it returns nothing. Once it's empty, we can add the virtual disk back as a Cluster Shared Volume.
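A sketch of this second part, under the same assumptions as before: "Cluster Disk 1" is a placeholder resource name, and DiskRecoveryAction is the parameter that was set during the first part and now gets reset.

```powershell
# Placeholder: same resource name as in the first part.
$diskName = "Cluster Disk 1"

# Should return nothing before you continue.
Get-StorageJob

# Take the resource offline, reset the override, and bring it back online.
Stop-ClusterResource -Name $diskName
Get-ClusterResource -Name $diskName | Set-ClusterParameter -Name DiskRecoveryAction -Value 0
Start-ClusterResource -Name $diskName

# Add the disk back as a Cluster Shared Volume.
Add-ClusterSharedVolume -Name $diskName
```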

Now it should be ok. You can run
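For example, either of these checks would do. This is a sketch, and which one you prefer is up to you.

```powershell
# Cluster view: the State column of the CSV should read Online.
Get-ClusterSharedVolume

# Storage view: the virtual disk should no longer be detached.
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
```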

And it should show as online.

If the volume does not come online after starting the cluster resource, run the first part again with the data integrity scan and let it sit for a while. Then do the second part with the commands. One time I had to repeat this process 10 times before it came online.

17 thoughts on “Troubleshooting failed VirtualDisk on a Storage Spaces Direct Cluster”

  • June 15, 2018 at 4:09 pm

    I can confirm that you can see a storage job running after the Start-ScheduledTask part. It repairs until it is 99% complete and then moves to a suspended state. When it reaches that point, run the reverse commands and you should be good to go.

    • June 18, 2018 at 12:11 pm

      Yes, and also run the Get-ScheduledTask command to see if the task is still running.

  • May 14, 2018 at 12:31 pm

    Hi Jan-Tore,

    A huge thank you for this fantastic article! I had 3 active/passive SQL clusters unable to attach the virtual disk after a restart late on a Friday night. After hours of troubleshooting I decided to google the issue and found this page, and could not have asked for more 🙂

    Best regards,
    NoobToClustering

  • May 10, 2018 at 12:03 am

    Hi Jan-Tore,

    Many thanks from me as well for this life-saving hint! It happened to us twice that a CSV volume was in a detached state and we were unable to bring it online again. After the first occurrence (in March 2018) we even opened an MS support call, with no usable results. I finally had to delete and recreate the volume, accepting data loss. On the second occurrence we luckily stumbled upon your post here, and it helped! The recovery took about 22 hours in our case for a 25 TB volume… Painfully long, but better than nothing.

    Many regards,

    Norbert

  • May 4, 2018 at 11:03 am

    Just back online today after a recovery using this procedure.
    Firstly, a huge thank you to the author, who was at one stage online with me during the recovery.
    And a bigger thank you to Dave Kawula, a Microsoft MVP (look him up if you are ever in trouble), who was insightful and helpful to the nth degree
    while it was happening.

    OK, enough of the thank yous.
    Be prepared for the step:
    Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask
    to take a *long* time with *zero* feedback about progress.
    Repeatedly running Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery"
    will result in Running… until it's not. No feedback whatsoever. You know it's ended when it says Ready.
    *Just* *wait* you have to.
    YMMV, but our 25 TB cluster volume with 15 TB of data took 16 hours to complete. I know, we should break it into smaller volumes.
    Obviously different hardware configs will change this time.
    But this works if you are stuck, and we were online again without data loss, albeit with a long(ish) downtime.

    • May 4, 2018 at 11:10 pm

      Glad to be of help. This is the reason I share my findings with the rest of the world, so that others can find a solution by simply going to Google or Bing. I'm happy all is back to normal 🙂

      JT

  • May 2, 2018 at 9:00 pm

    Great work!

    I really appreciate you taking the time to share this info.

    With that said, I have been struggling with this exact issue for 3 weeks. I am a small MSP and will lose clients if I don't fix it soon.

    Do you think you can give me a hand? I will pay for your time if need be.

    If you can help, I can give you access, or just tell me what logs and screenshots you need.

    • May 5, 2018 at 9:49 am

      Sorry for not seeing this before. Just sent you an email.

      Regards
      Jan-Tore Pedersen

  • May 1, 2018 at 11:13 pm

    Good evening, sir. I hope this message finds you well. I will try not to waste too much of your time, so just two things.

    1. I really enjoy your blog. I appreciate you taking the time to put this info out there.

    2. I have the exact issue described in this post. I have followed these steps many times.
    My S2D cluster has been down for days, and as an MSP that's not good!
    I am hoping you can give me a hand.
    If I have to pay you a consulting fee, I will, good sir.
    You are very knowledgeable, and I'd like to learn a few things if possible.

    If you can help, let me know what data you need from me to help see what's going on. I can even arrange for you to have direct access.

  • April 26, 2018 at 2:16 am

    Works! Saved me. I had multiple volumes offline, and the repair job would either throw an exception or suspend until I got the first part going for all offline drives in the pool at the same time.

  • March 26, 2018 at 7:33 pm

    When I try to bring the virtual disk up following your solution, I still have no luck. It's still offline. I had failed disks in the pool, which I replaced with brand-new disks, but they don't show up as available. Instead, Get-PhysicalDisk shows the old disk serial numbers as removing from pool and communication lost, even after a reboot.

    • April 3, 2018 at 8:37 am

      Hello

      Without getting more info I can't help you out. If you send me your email address I will invite you to the Slack channel we use for troubleshooting S2D. https://storagespacesdirect.slack.com

      • April 20, 2018 at 4:21 pm

        How do I private message you? I’d like to join the slack…

        • April 21, 2018 at 10:13 pm

          Send me a message on Twitter: jantorep

  • January 26, 2018 at 4:47 pm

    Thanks! This is the only reference I have found to this problem.
    I wish the events had been in text, not pictures. I found this a bit late because searching for the errors on Google does not bring up this article, I guess because they are pictures.

    Nevertheless, THANKS!
    At least I got a bit of my trust in S2D back after losing complete access to a volume with no explanation.

  • January 1, 2018 at 4:18 pm

    I did not check whether any storage jobs were running. If a repair job is running you can probably wait if you want to, but it works without waiting as well.

    Regards
    Jan-Tore

  • December 31, 2017 at 1:12 am

    Many thanks for this post! It is very "simple" but very good!

    Just before re-adding the virtual disk to the cluster, it's possible to wait for the repair storage job.

    Best regards,
    Philippe G.
