Updated January 29th with new Priority Flow Control recommendations, adding Cluster Heartbeat to Priority ID 7 for Windows Server and Dell switches.
Updated May 26th 2018 with HPE FlexFabric config
You have probably heard these acronyms somewhere, so what are they, and are they the same? In short: yes and no.
RoCE stands for RDMA over Converged Ethernet; the RDMA part is Remote Direct Memory Access.
RDMA allows network data (TCP packets) to be offloaded to the network card and placed directly into memory, bypassing the host's CPU and leaving the CPU free for the host's workloads. With normal TCP offload all network traffic goes through the CPU, and higher speeds consume more CPU; on a 10 Gbit network it would take roughly 100% CPU on a 12-core Intel Xeon v4 CPU.
Mellanox has a good explanation for RDMA here.
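Before touching any of the configuration further down, you can quickly check whether the NICs in a host report RDMA capability at all. A minimal sketch on Windows Server (adapter names are whatever your system shows):
# List adapters and whether RDMA is enabled on them
Get-NetAdapterRdma | Format-Table Name, InterfaceDescription, Enabled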
DCB stands for Data Center Bridging
What it contains are enhancements to the Ethernet communication protocol. Ethernet is a best-effort network that may experience packet loss when network devices are busy, causing retransmissions. DCB allows selected traffic to have zero packet loss: it eliminates loss due to queue overflow and makes it possible to allocate bandwidth on links. DCB allows different priorities for packets being sent over the network.
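Once the Data-Center-Bridging feature is installed (covered further down), Windows exposes read-only cmdlets to inspect the current PFC and ETS state. A small sketch:
# Which of the eight 802.1p priorities currently have Priority Flow Control enabled
Get-NetQosFlowControl
# ETS traffic classes and their bandwidth reservations
Get-NetQosTrafficClass
# Whether the host accepts DCB settings from the switch (DCBX willing mode)
Get-NetQosDcbxSetting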
In this post I will cover how to enable RDMA and DCB in Windows for SMB and on different switches. I will update with more switches as I read through different vendors' configurations, as the setup varies a lot from vendor to vendor.
In the last year Microsoft has started to recommend iWARP as the default RDMA solution for S2D. This is because iWARP does not need DCB, PFC and ETS to work. In general RoCE does not need it either, but since RoCE communicates over UDP, flow control is needed if there are packet drops.
RoCE is coming with a DCB-free solution in the future, but for any high-IOPS RDMA configuration today, DCB and PFC are needed, even for iWARP. Configuring DCB/PFC for iWARP is identical to RoCE, so the same configuration applies to both.
Switches and vendors that are covered in this post:
Lenovo
NE2572 (CNOS)
Dell
N4000 series
Force 10 S4810p, S6000, S6000-ON (FTOS)
Cisco
Nexus NX-OS
Mellanox
SN2100
HPE
FlexFabric 5700/5900
Quanta
LB8
How to configure Windows Server 2012, 2012R2, 2016 and 2019 with RDMA and DCB
For SMB you will need to install the Windows feature Data-Center-Bridging:
Install-WindowsFeature -Name Data-Center-Bridging
Reboot the server and let's configure the DCB settings. SMB always uses Priority 3; you can use another priority, but best practice is 3. Cluster Heartbeat uses Priority 7.
# Create QoS policies for SMB and cluster heartbeat traffic
New-NetQosPolicy "SMB" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
New-NetQosPolicy "Cluster" -Cluster -PriorityValue8021Action 7
# Turn on Flow Control for SMB and Cluster
Enable-NetQosFlowControl -Priority 3,7
# Make sure flow control is off for other traffic
Disable-NetQosFlowControl -Priority 0,1,2,4,5,6
# Disable DCBx
Set-NetQosDcbxSetting -Willing $false -Confirm:$false
# Apply a Quality of Service (QoS) policy to the target adapters
Enable-NetAdapterQos -InterfaceAlias "NIC1","NIC2"
# Give SMB Direct a minimum bandwidth of 50%
New-NetQosTrafficClass "SMB" -Priority 3 -BandwidthPercentage 50 -Algorithm ETS
# Give Cluster a minimum bandwidth of 1%
New-NetQosTrafficClass "Cluster" -Priority 7 -BandwidthPercentage 1 -Algorithm ETS
# Disable flow control on the physical NICs
Set-NetAdapterAdvancedProperty -Name "NIC1" -RegistryKeyword "*FlowControl" -RegistryValue 0
Set-NetAdapterAdvancedProperty -Name "NIC2" -RegistryKeyword "*FlowControl" -RegistryValue 0
# Enable QoS and RDMA on the NICs
Get-NetAdapterQos -Name "NIC1","NIC2" | Enable-NetAdapterQos
Get-NetAdapterRDMA -Name "NIC1","NIC2" | Enable-NetAdapterRDMA
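After running the block above, a quick sanity check can look like this (a sketch; output details vary with NIC driver and OS version):
# Verify the QoS policies, PFC priorities and ETS traffic classes that were just created
Get-NetQosPolicy
Get-NetQosFlowControl | Where-Object Enabled
Get-NetQosTrafficClass
# Verify QoS is enabled on the physical adapters
Get-NetAdapterQos -Name "NIC1","NIC2"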
After the QoS part is done, let's configure a network team or a virtual switch. For S2D one uses a SET switch with Switch Embedded Teaming.
New-VMSwitch -Name S2DSwitch -NetAdapterName "NIC1","NIC2" -EnableEmbeddedTeaming $true -AllowManagementOS $false
Let's create some virtual network adapters and enable RDMA on them. Once RDMA is enabled, DCB will also be enabled for SMB.
Add-VMNetworkAdapter -SwitchName S2DSwitch -Name Management -ManagementOS
Add-VMNetworkAdapter -SwitchName S2DSwitch -Name SMB1 -ManagementOS
Add-VMNetworkAdapter -SwitchName S2DSwitch -Name SMB2 -ManagementOS
# Enable RDMA on the virtual network adapters just created
$smbNICs = Get-NetAdapter -Name *SMB* | Sort-Object
$smbNICs | Enable-NetAdapterRDMA
# Let's find the physical NICs in the team
$physicaladapters = (Get-VMSwitch | Where-Object { $_.SwitchType -Eq "External" }).NetAdapterInterfaceDescriptions | ForEach-Object { Get-NetAdapter -InterfaceDescription $_ | Where-Object { $_.Status -ne "Disconnected" } }
# Map the SMB interfaces to the physical NICs
Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName $smbNICs[0].Name -ManagementOS -PhysicalNetAdapterName (Get-NetAdapter -InterfaceDescription $physicaladapters[0].InterfaceDescription).Name
Set-VMNetworkAdapterTeamMapping -VMNetworkAdapterName $smbNICs[1].Name -ManagementOS -PhysicalNetAdapterName (Get-NetAdapter -InterfaceDescription $physicaladapters[1].InterfaceDescription).Name
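To verify that each SMB vNIC ended up pinned to its own physical NIC, a read-only check like this should be enough (a sketch, assuming the ManagementOS vNICs created above):
# Show the vNIC-to-physical-NIC affinity for the host vNICs
Get-VMNetworkAdapterTeamMapping -ManagementOS
# And list the host vNICs themselves
Get-VMNetworkAdapter -ManagementOS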
To check if RDMA is enabled you can run this command
Get-SmbClientNetworkInterface | where RdmaCapable -EQ $true | ft FriendlyName
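To go one step further and confirm that SMB traffic actually flows over RDMA between two configured hosts, something like this can be used (a sketch; the counter names assume the RDMA Activity counter set that RDMA-capable NIC drivers expose on Windows Server):
# Copy a file to another RDMA-capable host over SMB first, then check the connections
Get-SmbMultichannelConnection
# RDMA traffic should also show up in the RDMA Activity performance counters
Get-Counter -Counter "\RDMA Activity(*)\RDMA Inbound Bytes/sec","\RDMA Activity(*)\RDMA Outbound Bytes/sec"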
Now that DCB and RDMA are configured in Windows, let's move on to the switch setup.
This is where the hard part is, figuring out the correct setup for your switch. Most switch vendors support this.
Lenovo NE2572
Use the default port settings, and enable DCB on the switch in global configuration mode.
cee enable
cee ets priority-group pgid 3 priority 3
cee ets priority-group pgid 3 description "RoCEv2"
cee pfc priority 3 enable
cee pfc priority 3 description "RoCEv2"
cee ets priority-group pgid 7 priority 7
cee ets priority-group pgid 7 description "Cluster"
cee pfc priority 7 enable
cee pfc priority 7 description "Cluster"
cee ets priority-group pgid 0 description "Default"
cee ets priority-group pgid 0 priority 4 5 6
cee ets bandwidth-percentage 0 49 3 50 7 1
Dell N4000 Series
Turn off flowcontrol on all interfaces.
conf t
interface range tengigabitethernet 1/0/13,ten1/0/14,ten1/0/15,ten1/0/16,ten2/0/13,ten2/0/14,ten2/0/15,ten2/0/16
classofservice traffic-class-group 0 1
classofservice traffic-class-group 1 1
classofservice traffic-class-group 2 1
classofservice traffic-class-group 3 0
classofservice traffic-class-group 4 1
classofservice traffic-class-group 5 1
classofservice traffic-class-group 6 1
classofservice traffic-class-group 7 2
traffic-class-group max-bandwidth 49 50 1
traffic-class-group min-bandwidth 49 50 1
traffic-class-group weight 49 50 1
datacenter-bridging
priority-flow-control mode on
priority-flow-control priority 3 no-drop
priority-flow-control priority 7 no-drop
exit
exit
What we set here is that traffic class 3 is mapped into group 0 and traffic class 7 into group 2, with everything else in group 1, and max and min bandwidth is set per group. Group 0 and group 1 get roughly 50% each (49 and 50), and group 2 gets 1% for the cluster heartbeat. Then we enable the DCB config on the interfaces with mode on, and with priority 3 and priority 7 no-drop we enable no packet drop for those traffic classes.
Dell Force 10 S4810p
Turn off flowcontrol on all interfaces.
dcb enable
dcb-map SMBDIRECT
priority-group 0 bandwidth 50 pfc on
priority-group 1 bandwidth 49 pfc off
priority-group 2 bandwidth 1 pfc on
priority-pgid 1 1 1 0 1 1 1 2
exit
interface TenGigabitEthernet 1/46
description
no ip address
mtu 12000
switchport
spanning-tree pvst edge-port
dcb-map SMBDIRECT
no shutdown
exit
Dell Force 10 S6000, S6000-ON (FTOS)
Turn off flowcontrol on all interfaces.
conf t
protocol lldp
advertise management-tlv system-capabilities system-description system-name
advertise interface-port-desc
dcb enable
dcb-map RDMA-dcb-map-profile
priority-group 0 bandwidth 50 pfc on
priority-group 1 bandwidth 49 pfc off
priority-group 2 bandwidth 1 pfc on
priority-pgid 1 1 1 0 1 1 1 2
exit
interface fortyGigE 1/5
description
no ip address
mtu 9216
portmode hybrid
switchport
dcb-map RDMA-dcb-map-profile
no shutdown
exit
Cisco Nexus NX-OS
By default PFC (Priority Flow Control) is enabled on Cisco Nexus switches. To explicitly enable it, do the following.
No Priority 7 for cluster
configure terminal
interface ethernet 5/5
priority-flow-control mode on
class-map type qos c1
match cos 3
exit
policy-map type qos p1
class type qos c1
set qos-group 3
exit
exit
class-map type network-qos match-any c1
match qos-group 3
exit
policy-map type network-qos p1
class type network-qos c1
pause buffer-size 20000 pause-threshold 100 resume-threshold 1000 pfc-cos 3
exit
exit
system qos
service-policy type network-qos p1
exit
Cisco Nexus 3132 NX-OS 6.0(2)U6(1)
By default PFC (Priority Flow Control) is enabled on Cisco Nexus switches. To explicitly enable it, do the following.
No Priority 7 for cluster
#Global settings
class-map type qos match-all RDMA
match cos 3
class-map type queuing RDMA
match qos-group 3
policy-map type qos QOS_MARKING
class RDMA
set qos-group 3
class class-default
policy-map type queuing QOS_QUEUEING
class type queuing RDMA
bandwidth percent 50
class type queuing class-default
bandwidth percent 50
class-map type network-qos RDMA
match qos-group 3
policy-map type network-qos QOS_NETWORK
class type network-qos RDMA
mtu 2240
pause no-drop
class type network-qos class-default
mtu 9216
system qos
service-policy type qos input QOS_MARKING
service-policy type queuing output QOS_QUEUEING
service-policy type network-qos QOS_NETWORK
#Port specific settings
switchport mode trunk
#Set your VLANs on the next lines
switchport trunk native vlan 99
switchport trunk allowed vlan 99,2000,2050
spanning-tree port type edge
flowcontrol receive off
flowcontrol send off
no shutdown
priority-flow-control mode on
Mellanox SN2100
No Priority 7 for cluster
configure terminal
dcb priority-flow-control enable
dcb priority-flow-control priority 3 enable
interface ethernet 1/1
dcb priority-flow-control mode on
dcb ets tc bandwidth 10 50 40 0
HPE FlexFabric 5700/5900 series
No Priority 7 for cluster
#Setting the ETS priority 3 to group 1
qos map-table dot1p-lp
import 0 export 0
import 1 export 0
import 2 export 0
import 3 export 1
import 4 export 0
import 5 export 0
import 6 export 0
import 7 export 0
exit
#ETS configuration for 50% dropless on group 1 priority 3, which is the default for SMB RDMA
interface ten-gigabitethernet 1/0/1
qos trust dot1p
qos wrr be group 1 byte-count 15
qos wrr af1 group 1 byte-count 15
qos wrr af2 group sp
qos wrr af3 group sp
qos wrr af4 group sp
qos wrr ef group sp
qos wrr cs6 group sp
qos wrr cs7 group sp
#Turning on PFC on the interfaces
interface ten-gigabitethernet 1/0/1
priority-flow-control auto
priority-flow-control no-drop dot1p 3
qos trust dot1p
#The next lines are not really needed unless you are really pushing your config and maxing out speeds.
#RoCEv1 QCN congestion config
qcn enable
qcn priority 3 auto
exit
interface Ten-GigabitEthernet1/0/10
lldp tlv-enable dot1-tlv congestion-notification
#RoCEv2 ECN congestion config
qos wred queue table ROCEv2
queue 0 drop-level 0 low-limit 1000 high-limit 18000 discard-probability 25
queue 0 drop-level 1 low-limit 1000 high-limit 18000 discard-probability 50
queue 0 drop-level 2 low-limit 1000 high-limit 18000 discard-probability 75
queue 1 drop-level 0 low-limit 18000 high-limit 37000 discard-probability 1
queue 1 drop-level 1 low-limit 18000 high-limit 37000 discard-probability 5
queue 1 drop-level 2 low-limit 18000 high-limit 37000 discard-probability 10
queue 1 ecn
exit
interface Ten-GigabitEthernet1/0/10
qos wred apply ROCEv2
Quanta
This is the basics of how to enable it; I have not had the chance to test this out myself yet, so this will be updated, as the manual is not straightforward.
No Priority 7 for cluster
#To make sure DCB is enabled we can run this command
priority-flow-control mode ON/Auto (Default is Auto and it is enabled)
#Now we need to set priority no-drop for priority 3. Standard is no-drop on 3,4,5,6
#First we clear all priorities
no priority-flow-control priority
#Then we set no-drop only on priority 3
priority-flow-control priority 3 no-drop
#Now let's set the ETS queue bandwidth
#To enable queue ets
#To set the bandwidth between san/lan to 50/50 run
no queue ets weight
#Let's map the san bandwidth to priority 3
queue ets pg-mapping lan 0 1 2 4 5 6 7
queue ets pg-mapping san 3
#Let's configure pfc for the interface
interface 1/1
storm-control flowcontrol pfc
Jan, I have been trying to reach you directly. Some of my team are unconvinced about Tip #10 (30 tips in 30 min). My team says DCB is useless for iWARP, since iWARP is for TCP. Can you speak to why DCB would be recommended for iWARP?
I presume it's as a fallback, but if you could shed any light on it, I would be glad to be part of that conversation with my team. Email me please if possible. I work for an S2D support team.
Thank you, Louis
Hello Louis
It comes down to how TCP works with lost packets: it retransmits. The cause of a lost packet is that the link is full, basically trying to send more data than the network can handle. TCP handles this with retransmits, but at some point the lost packets just keep increasing until the system can recover. There is no flow control to limit how much data is being transported. With Microsoft also recommending that you reserve 1-2% for cluster traffic, DCB will help in a high-performance solution by telling the system to lower the bandwidth being sent, instead of losing packets and getting retransmits that could cause congestion in the network and pull performance down to a halt.
For low-performance systems it's fine. But let's say you have a cluster averaging 600k to 1 million IOPS and you reboot a host: there will be a lot of normal IO and a lot of rebuild activity that could cause congestion on 25 Gbit links. So we do recommend it. Even MS says that for high-performance iWARP clusters it's recommended.
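If you want to see the retransmission behaviour described above on a running host, the standard Windows TCP counters give a rough picture. Note that iWARP RDMA traffic itself is offloaded to the NIC and will not show up here, but congestion on the link will still push retransmits up for the regular TCP traffic sharing it. A sketch:
# Sample TCP retransmissions and the overall segment rate every 5 seconds for one minute
Get-Counter -Counter "\TCPv4\Segments Retransmitted/sec","\TCPv4\Segments/sec" -SampleInterval 5 -MaxSamples 12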
Great article! It’s super helpful. I was looking at your powershell commands where you set up the priorities, and I noticed that you said that Cluster traffic uses priority 5, but in your PowerShell commands, you set it for priority 7. Is there a reason for this, or is it just a typo?
Bryan
New recommendation from the Azure Stack HCI team is to use priority 7 and not 5. Will update. Thanks.
Thanks for your article, it's a very great article! I have a question: do you have some config for the Brocade VDX 6740? Many, many thanks!!
Sorry, I do not, I'm afraid 🙁
Thanks for your article. Helped me a lot.
My Dell N4000 has no traffic-class-group 7:
“Value is out of range. The valid range is 0 to 6”
As well as when configuring flow-control priority 7
“Valid no-drop priorities are between 0 and 6”
I probably have an outdated firmware but is this a major problem?
No, you can use priority 5 if 7 is not available.
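For reference, if you move the cluster heartbeat class down to priority 5, the Windows side of the QoS block earlier in the post would change along these lines (a sketch, not tested against the N4000 here; the switch no-drop priority has to be changed to 5 to match):
New-NetQosPolicy "Cluster" -Cluster -PriorityValue8021Action 5
Enable-NetQosFlowControl -Priority 3,5
Disable-NetQosFlowControl -Priority 0,1,2,4,6,7
New-NetQosTrafficClass "Cluster" -Priority 5 -BandwidthPercentage 1 -Algorithm ETS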
I think you made a mistake with the priority-group 7 bandwidth 1 pfc on command, because you are now actually pausing the frames of the cluster heartbeats.
Sorry for not responding; the config has been updated with the correct settings now. They are verified to work as well.
I'm a little bit confused. I have a Dell S4048-ON switch. I configured DCB on the switch with:
dcb enable
dcb-map SMBDIRECT
priority-group 0 bandwidth 50 pfc on
priority-group 1 bandwidth 49 pfc off
priority-group 2 bandwidth 1 pfc on
priority-pgid 1 1 1 0 1 1 1 2
exit
However I found another article about ROCE configuration over here: https://www.fredericstefani.com/configure-dell-s4048-switches-for-storage-spaces-direct/
Indicating that you need to have the opposite configuration:
dcb enable
dcb-buffer-threshold RDMA
priority 3 buffer-size 100 pause-threshold 50 resume-offset 35
exit
dcb-map RDMA
priority-group 0 bandwidth 50 pfc off
priority-group 3 bandwidth 50 pfc on
priority-pgid 0 0 0 3 0 0 0 0
exit
Under the S2D node interface configure:
dcb-policy buffer-threshold RDMA
dcb-map RDMA
I think in his configuration he forgot the PFC for the cluster heartbeats.
Thank you for this post. Do you have a working configuration for Cisco Nexus 3132 with NX-OS 9.x with SMB and CLUSTER QoS? Best regards
Sorry, I do not. You will need to speak with Cisco for that.
Hi JT, what great information shared here. Thanks!
But I have a question with regards to my setup. I have 4 x 10 GbE Mellanox ConnectX-3 Pro ports. 2 of the ports are teamed together using SET. I have enabled PFC for group 3 for use by my Live Migration traffic.
The other 2 ports are not teamed and are used for SMB traffic. I have also created PFC priority 3 for this. On Windows 2016, I have also enabled the same priority 3 / 99% weight, as both ports are solely used for SMB traffic. The problem I'm facing is that when I run Test-RDMA.ps1, it keeps showing me an error that the physical switch needs to be configured for PFC. I am lost and confused. Can you guide me on what I did wrong here? I have disabled VLAN tagging for those ports as well.
Also, can I have 2 different PFC configurations on the same priority group 3 on my switches?
Thanks in advance for sharing your knowledge.
What switches are you using? And are they supported for DCB and PFC?
Jan-Tore
Hi JT, it's a Dell S4048-ON and it supports DCB and PFC. Thanks
Have you configured the Dell switches for DCB and PFC?
Regards
Jan-Tore
Hello,
Does the HPE 5700 support RDMA/RoCE?
I don't see IEEE 802.1Qaz Enhanced Transmission Selection (ETS) available on this switch.
However, ECN, DCB and PFC are available.
Thanks
I am helping someone with a 5700 right now. Will update once I have been able to look into it. They are having RDMA issues, so I will let you know if it's working or not.
Unless you already have them, get something else. They are not too easy to configure. Some HPE Aruba or Dell S/Z series, or Lenovo NE series.
Regards
Jan-Tore
Hello Jan,
Did you have a chance to check the HPE 5700 with RoCE?
Thanks
The basic config should be similar. But there is very little info out there on the FlexFabric DCB/PFC config, and I don't have access to the HPE support site to check for more docs, so I have not gotten the complete config.
But from what I could see, the specific config on the switch for DCB is this.
priority-flow-control auto
priority-flow-control no-drop dot1p 3
qos trust dot1p
But I would say that there might be some config missing; I can't confirm, as I don't have a full config example for the 5700. I did find one for the 5940, but I'm still a bit unsure about the HPE setup. I will go through this guide and see what I figure out.
http://manualzz.com/doc/32098665/rdma-over-converged-ethernet–roce–design-guide
Thanks mate, that’s very useful.
Are you trying to turn on flow-control on these switches? The wording seems backwards from the configurations you’re showing.
Depends on which switch you are talking about. But yes, with RoCE you want PFC to be on 🙂
Jan-Tore
To Patrick’s comment, in regards to the Dell Force 10 S4810p you said “turn off flowcontrol for all interfaces”. We have other servers plugged into other interfaces of the switches with flow control enabled (flowcontrol rx on tx on). Will the configuration affect/conflict with these interfaces? DCB maps are not applied to those non-RDMA interfaces.
-ken
Hello JT.
I wanted to let you know that we found this blog post extremely helpful. I do have a question if you have a moment. We have Cisco Nexus 9000 series switches with NX-OS 7. My network admins said that the values you provided were not allowed. You posted this: pause buffer-size 20000 pause-threshold 100 resume-threshold 1000 pfc-cos 3, but they said the minimum values they could set were these: pause buffer-size 27456 pause-threshold 12480 resume-threshold 12480 pfc-cos 3. Can you help me out here? I want to make sure I have it right. We are setting up a Storage Spaces Direct cluster.
Thanks
-Matthew
Thanks for the feedback.
My guide is a baseline for how to set it up. The OS might change as new versions come out. There is a guide for the Nexus 3132 in the official MS docs and it does not have these settings, as it's a different NX-OS version I believe. What I recommend is to use my baseline, do a Google search on the latest NX-OS, and see what they put in there. I will update my post with the guidelines for the NX3132 switch.
But if the minimum thresholds have changed, I do not see any reason not to use the new values. Always refer to the latest CLI guide for the OS you are running. If you get it to work, let me know and I will update the blog post.
JT
Great post JT!
Do you think RoCE/RDMA/DCB will work on Juniper EX4550s? If so, have you tried it and can you share the configs?
KL
To be honest I do not know. I have no experience with Juniper; they do say it supports DCB and PFC, but there is no mention of RoCE, only FCoE. I think you will need to dig deep to find the correct info/config. You could of course ask Juniper, but you would need to do a lot of googling 🙂
Let me know if you figure it out.
Oh, and remember to turn off DCBX, as it's not supported with S2D.
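On the Windows side that is the same cmdlet used earlier in the post:
Set-NetQosDcbxSetting -Willing $false -Confirm:$false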
JT
Hi,
We have configured a Juniper QFX5100 to work with DCB and PFC; it only supports RoCEv1.
To support RoCEv2 you need the 17.4 release and at least a QFX5110 model.
I am working with JTAC on some PFC issues and I will ask them if EX4550 is supported for RoCEv1.
NS
Thanks for this 🙂
Do you know how to change the settings in the OS to work with RoCEv1? If it's Mellanox cards it's a registry key.
Regards
Jan-Tore Pedersen
Yes we are using Mellanox cards.
# The following RoCE modes are supported:
# •RoCE V1 MAC based (legacy) : 1
# •RoCE V2 IP based (routable) : 2
# Check status on Mellanox NIC
Get-MlnxDriverCoreSetting
# Set RoceMode
Set-MlnxDriverCoreSetting -RoceMode 1
Good day, I need help configuring PFC for RoCEv1 on a Juniper QFX5100. Is it possible?
Hello
I'm sorry, but I have no experience with Juniper, and it's an OS I have glanced at and found that I'm not touching it 🙂
As far as I can see it does not support RoCE and RDMA. It supports DCB and DCBX over FCoE and iSCSI, but not RoCE.
I would recommend contacting Juniper for this.
Regards
Jan-Tore Pedersen
Good day, I need help configuring PFC on a QFX5100 with JUNOS 14.1X53-D45.3. Is it possible to share an example of your configuration or check my configuration?
Just a note, I believe your Dell Force 10 S4810 config is slightly off. You are marking Priority 4 not Priority 3 with the current command of “priority-pgid 1 1 1 1 0 1 1 1”
Thank you for pointing that out, you are absolutely correct 🙂 A writing mistake on my side.
Thanks
JT