Troubleshooting performance issues on your Windows storage. Storage Spaces Direct

In this guide I will give you a quick overview of how to troubleshoot Storage Spaces, both ordinary storage pools (with and without tiering) and Storage Spaces Direct. The troubleshooting is based on an issue we had with our test Storage Spaces Direct cluster.

What happened was that we started to experience really bad performance on the VMs. Response times were going through the roof: we saw response times of 8000 ms on normal OS operations. We traced it down to faulty SSD drives. These were Kingston V310 consumer SSDs without power loss protection, and that is a problem because S2D, like Windows storage in general, wants to write to a safe place. The caching on these Kingston drives worked for a while, but after too much writing it failed. You can read all about SSD and power loss protection here.

So what I will cover here is how to identify slow disks.

First you will notice performance issues on your VMs or physical server. Open Resource Monitor and look at the response times.

[Screenshot: response times in Resource Monitor]

First, rule out RDMA if this is a Scale-Out File Server or S2D.

Download this PowerShell script and run it on your server.

https://github.com/Microsoft/diskspd/blob/master/Frameworks/VMFleet/test-clusterhealth.ps1

You should get output like this. Look for the SMB Connectivity Error Check and the errors in the last 5 minutes. Run the test a few times and check whether the numbers increase, as that is an indication of RDMA problems.

[Screenshot: test-clusterhealth.ps1 output]

As you can see, there were some issues with RDMA here, but nothing serious. Node 4 is being looked into. I will come back with a guide for that.
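Alongside the script, a few built-in cmdlets give a quick view of whether RDMA is enabled and whether SMB sees it. This is only a sketch and was not part of the steps above. It assumes the standard NetAdapter and SMB cmdlets that ship with Windows Server, and it only reports state; it does not count errors the way test-clusterhealth.ps1 does.

```powershell
# Sketch only: standard Windows Server cmdlets, not part of the original steps.

# Which network adapters have RDMA enabled?
Get-NetAdapterRdma | Where-Object Enabled | Format-Table Name, InterfaceDescription, Enabled

# Which interfaces does the SMB client consider RDMA capable?
Get-SmbClientNetworkInterface | Where-Object RdmaCapable

# Active SMB Multichannel connections. Run this while there is storage traffic,
# otherwise the list may be empty.
Get-SmbMultichannelConnection
```

If an adapter you expect to carry storage traffic is missing from the first two lists, that is worth fixing before looking at the disks.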

Now let’s have a look at the disks on the physical machine.

If you have S2D, you should check that all drives in the storage pool are attached to the caching device. I have NVMe as the caching device, and SSD and HDD as performance and capacity. The caching device will be the :0 and the disks will be 1:0, 2:0 and so on, depending on how many drives you have.

So what you do is run this command.
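The command itself did not survive in this copy of the post. A likely candidate, assuming the cache bindings are exposed through a performance counter set named 'Cluster Storage Hybrid Disks' (that name is an assumption here, so the first line lists the candidates), would be something like:

```powershell
# Assumption: the S2D cache bindings show up in a counter set called
# 'Cluster Storage Hybrid Disks'. List the cluster storage sets first if unsure.
Get-Counter -ListSet 'Cluster Storage*' | Select-Object CounterSetName

# Print every counter path together with its instance name.
# The instance names are expected to look like "1:0", "2:0" and so on.
(Get-Counter -ListSet 'Cluster Storage Hybrid Disks').PathsWithInstances
```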

 

And you will get this list. It’s pretty long. I have 6 HDDs and 2 SSDs, totaling 8 drives, and all 8 are attached to the caching device. This is important.

[Screenshot: Get-Counter output]

Now let’s jump to Performance Monitor and add a counter there.

[Screenshot: Performance Monitor]

Now, to know which disk is SSD and which is HDD, you will need to run this next command. Disk 0 is the caching device.

 

[Screenshot: cluster disk info output]

For a single node the command is

Now we can go back to Performance Monitor. We can see on node 1 that the SSDs are disk 4 and 5. In the screenshot I have here, the SSDs were 1 and 2.
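The commands are missing from this copy of the post. A minimal sketch with the standard Storage module cmdlets follows; it is an assumption and not necessarily what was originally used. It also assumes that DeviceId lines up with the disk number shown in Performance Monitor, which is worth verifying on your own hardware.

```powershell
# Single node: list each physical disk with its media type and usage.
# Assumption: DeviceId matches the disk number shown in Performance Monitor.
Get-PhysicalDisk |
    Sort-Object DeviceId |
    Format-Table DeviceId, FriendlyName, MediaType, BusType, Usage, Size -AutoSize

# Cluster variant: ask the clustered storage subsystem for its disks instead.
Get-StorageSubSystem -FriendlyName 'Cluster*' |
    Get-PhysicalDisk |
    Sort-Object DeviceId |
    Format-Table DeviceId, FriendlyName, MediaType, BusType, Usage, Size -AutoSize
```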

[Screenshot: Performance Monitor disk latency counters]

If you look closely at Avg. Disk sec/Read, Avg. Disk sec/Transfer and Avg. Disk sec/Write, you will see high numbers. They should be in the range of 0.001 to 0.003 seconds (1 to 3 ms), or as low as possible. We saw numbers in the range of 1.870 seconds. And note that these are average numbers, not max, so you can imagine how high the latency was. This was caused by Windows Server telling the disk to write to a safe place. When the cache on the SSD is not working, the safe place is directly on the NAND cells, which manage about 200 IOPS. The SSD should do 25k IOPS when writing.

To replace failed drives in a storage pool I used this guide. I will paste it in here for reference.
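The same counters can be sampled from PowerShell, which makes it easier to compare all disks at once. The sketch below is an assumption-laden example, not the method used above: it takes ten one-second samples of Avg. Disk sec/Write, averages them per disk, and flags anything above the 0.003 second mark mentioned in the text. The counter path uses the English counter names, so it will need adjusting on a localized Windows installation.

```powershell
# Threshold taken from the text above: healthy disks stay around 0.001-0.003 s.
$threshold = 0.003

# Ten samples, one second apart, for every physical disk instance.
$samples = Get-Counter -Counter '\PhysicalDisk(*)\Avg. Disk sec/Write' -SampleInterval 1 -MaxSamples 10

$samples.CounterSamples |
    Where-Object InstanceName -ne '_total' |     # skip the aggregate instance
    Group-Object InstanceName |
    ForEach-Object {
        # Average the sampled latency for this disk
        $avg = ($_.Group | Measure-Object -Property CookedValue -Average).Average
        [pscustomobject]@{
            Disk        = $_.Name
            AvgWriteSec = [math]::Round($avg, 4)
            Slow        = $avg -gt $threshold
        }
    } |
    Sort-Object AvgWriteSec -Descending |
    Format-Table -AutoSize
```

Run it while the system is under load. An idle disk reports close to zero no matter how unhealthy it is.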
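The guide itself is not included in this copy. As a rough sketch of the general shape of a drive replacement with the standard Storage module cmdlets, and not a reproduction of that guide, the steps could look like this. The serial number and pool name are hypothetical placeholders.

```powershell
# Hypothetical values: replace with your own serial number and pool name.
$serial   = 'EXAMPLE-SERIAL'
$poolName = 'ExamplePool'

# 1. Find the failing disk and mark it retired so no new data lands on it.
$disk = Get-PhysicalDisk | Where-Object SerialNumber -eq $serial
$disk | Set-PhysicalDisk -Usage Retired

# 2. Start rebuilding the virtual disks onto the remaining drives.
Get-VirtualDisk | Repair-VirtualDisk -AsJob

# 3. Check the repair jobs; repeat until none are running before continuing.
Get-StorageJob

# 4. Remove the retired disk from the pool (asks for confirmation),
#    then physically swap the drive.
Remove-PhysicalDisk -PhysicalDisks $disk -StoragePoolFriendlyName $poolName
```

Replacing one disk at a time and letting each rebuild finish keeps the pool from losing more redundancy than it can tolerate.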

 

 

3 thoughts on “Troubleshooting performance issues on your Windows storage. Storage Spaces Direct”

  • July 20, 2017 at 10:29 am

    Thanks for the reply.

    In the end we discovered that S2D was not working with SES and as such the HDD’s weren’t even in the enclosure – just the SSD’s! Change of HDD’s got two enclosures with the SSD’s and HDD’s as expected and all working fine now.

    Our own fault for not using qualified hardware!

    Thanks – Steve

  • June 22, 2017 at 6:37 pm

    Hi,

    We have come across exactly the same situation (were using Intel 520 SSD’s without power protection). Have swapped the SSD’s out with Intel Data Centre models (which DO have power protection) however we are not seeing any difference…. The counters are not increasing even with heavy write operations.

    Once you swapped your SSD’s out did you need to do anything else? E.g. reboot the nodes, repair the CSV’s?

    Thanks,

    Steve

    • June 26, 2017 at 9:01 pm

      Hello Steve, how are your virtual disks? Did you swap one disk at a time and let it rebuild? Can you run Optimize-StoragePool to rebalance the disks? You could also check out the Show-PrettyPool.ps1 script from Cosmos Darwin; it will show you the balance of the disks.

      Also give me a picture of your perfmon and physical disks when you have load on them.

      Plus the output of

      get-virtualdisk
      get-physicaldisk
      get-storagejob

      JT
