Last year I had an issue where nodes would suddenly drop out of the Azure pane and no longer show up under the cluster or as an Azure resource. Perhaps your cluster also says it has not connected recently?
So let's go through the scenarios and how to fix them.
The Issue
Here you can see that the cluster is no longer reporting to Azure.
In this scenario both nodes are still reporting under the cluster, but the cluster has not connected in a while. If a node is missing here, it will also be missing in the next picture.
You can also see here that both Arc resources representing the nodes are still there if you go into the resource group where the node resources are supposed to be.
Since both nodes are here, we need to do some troubleshooting.
Issue 1: Cluster not reporting to Azure and all nodes are still visible in Azure
For the above scenario, where both nodes are still visible and connected to Azure but not syncing, we can try the following.
We can run the following commands:
cd "C:\Program Files\AzureConnectedMachineAgent"
.\azcmagent.exe show
Get-AzureStackHCI
Get-AzureStackHCIArcIntegration
Based on the output, azcmagent says the node is connected, while Get-AzureStackHCI says that it is unable to connect to the service.
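If you want a quicker read of the same information, something like the following can help. This is only a minimal sketch: it assumes the object returned by Get-AzureStackHCI exposes the ClusterStatus, RegistrationStatus, ConnectionStatus and LastConnected properties that the cmdlet normally prints, and that 'Connected' is the healthy value.

```powershell
# Sketch: summarize registration and connection state on the local node.
# Assumption: these property names match what Get-AzureStackHCI prints on your build.
$hci = Get-AzureStackHCI
$hci | Format-List ClusterStatus, RegistrationStatus, ConnectionStatus, LastConnected

# Warn if the cluster is not reporting as connected (assumed healthy value: 'Connected').
if ($hci.ConnectionStatus -ne 'Connected') {
    Write-Warning "Connection status is '$($hci.ConnectionStatus)', last connected: $($hci.LastConnected)"
}
```

Run it on each node and compare the results with what azcmagent reports.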
Now, if you know you are behind a strict firewall, or suspect the problem could be network related, run the connectivity validation tool.
First you need to install it. Run the following PowerShell commands:
Install-Module PowerShellGet -AllowClobber -Force
Set-PSRepository -Name PSGallery -InstallationPolicy Trusted
Install-Module -Name AzStackHci.EnvironmentChecker
Then you can run the Invoke-AzStackHciConnectivityValidation PowerShell command.
Invoke-AzStackHciConnectivityValidation
In this case the output shows that every test is OK and we have full connectivity.
Now if I try to run the command Sync-AzureStackHCI, I get the "Unable to connect to service" error.
If I run the same command on node 2, it says it can't sync to Azure because node01 is the owner of the cluster.
There are two services that are important to Azure Stack HCI. In my case, the Azure Stack HCI Cluster agent was stopped on the owner node. If I try to start it, it gets stuck in Start Pending.
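A quick way to check the service state on a node is sketched below. The display-name wildcard is my assumption, based on the "Azure Stack HCI Cluster agent" name mentioned above. Adjust the filter if the services are named differently on your build.

```powershell
# Sketch: list services whose display name starts with "Azure Stack HCI".
# Assumption: both relevant services share this display-name prefix.
Get-Service -DisplayName 'Azure Stack HCI*' |
    Format-Table Name, DisplayName, Status, StartType -AutoSize
```

A service sitting in StartPending or Stopped on the owner node matches what I saw here.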
If I then move the core cluster role to the other node, the Sync-AzureStackHCI command works. This tells me that something is up with node01, and a reboot is in order for that one. The reason the owner matters is that the owner node is the one syncing all billing information to Azure.
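Moving the core cluster role can be done from PowerShell as well. The sketch below assumes the core group still has its default name, 'Cluster Group', and uses se350n02 as a placeholder for the healthy node. Swap in your own names.

```powershell
# Sketch: show which node currently owns the core cluster group.
# Assumption: the core group has the default name 'Cluster Group'.
Get-ClusterGroup -Name 'Cluster Group' | Format-Table Name, OwnerNode, State -AutoSize

# Move the core group to the healthy node (placeholder node name).
Move-ClusterGroup -Name 'Cluster Group' -Node 'se350n02'

# Retry the sync, this time from the new owner node.
Sync-AzureStackHCI
```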
If I now look at the Azure portal, node01 is gone from the cluster pane and the cluster shows as connected.
To fix this, node01 needs to be rebooted. This is my fault, as I have not patched the nodes in a while. I recommend that you patch the nodes on a regular basis, preferably once a month with one month deferred. That means if you are in March, you install the February patches, unless there is a very critical patch that needs to go in sooner.
Once the node is back up and back in the cluster, run the following commands in sequence.
Get-AzureStackHCIArcIntegration
# If it shows
# ClusterArcStatus = Enabled
# NodesArcStatus = {[se350n01, Disabled], [se350n02, Enabled]}
# then:
#1. Sign in to the Azure portal and delete the Azure Resource Manager resource representing the Arc server for this node (se350n01).
#2. Sign in to each disconnected node again and run Enable-AzureStackHCIArcIntegration.
Enable-AzureStackHCIArcIntegration
#3. Run Sync-AzureStackHCI on the 2 nodes.
Sync-AzureStackHCI
# If Get-AzureStackHCIArcIntegration instead shows NodesArcStatus = {[se350n01, Enabled], [se350n02, Enabled]}
#1. Sign in to the disconnected node or nodes (se350n01).
#2. Run Disable-AzureStackHCIArcIntegration.
Disable-AzureStackHCIArcIntegration
#3. Sign in to the Azure portal and delete the Azure Resource Manager resource representing the Arc server for this node (se350n01).
#4. Sign in to each disconnected node again and run Enable-AzureStackHCIArcIntegration.
Enable-AzureStackHCIArcIntegration
#5. Run Sync-AzureStackHCI on the node.
Sync-AzureStackHCI
This will now sync the node back to Azure and it will show up again as an ARM resource. One thing you need to know: if you don't use Azure Policy to enable Azure Monitor and Insights, the node won't be onboarded to those automatically. You will also need to attach the Data Collection Rule to the node again to get data synced.
Issue 2: Cluster not reporting to Azure and one node is not visible in Azure
This is the same procedure as the one above, so follow the guide there. The steps will be:
1. Pause the node that is not visible in Azure
2. Reboot the node
3. Run the script above
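The pause and reboot steps can be sketched with the standard failover clustering cmdlets. This is a minimal sketch, not a full maintenance procedure: se350n01 is a placeholder for the affected node, and it assumes you run it from the other node or a management machine, since Restart-Computer cannot use -Wait against the local computer.

```powershell
# Sketch: pause, reboot and resume the affected node (placeholder name).
$node = 'se350n01'

# 1. Pause the node and drain its roles to the other node.
Suspend-ClusterNode -Name $node -Drain -Wait

# 2. Reboot it and wait until PowerShell remoting answers again.
Restart-Computer -ComputerName $node -Wait -For PowerShell -Force

# Bring the node back into the cluster and move its roles back.
Resume-ClusterNode -Name $node -Failback Immediate

# Confirm both nodes report as Up before continuing with step 3.
Get-ClusterNode | Format-Table Name, State -AutoSize
```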
Log Collection
If you need to create a support ticket, or want to read the logs yourself, you need to collect them from all nodes. There are certain logs you will not be able to read, because you lack the files needed to decode them correctly, so this is more for Microsoft Support. I like to get the collection going early so that the logs can be uploaded to support as soon as the ticket is created.
#Enable HCI Debug Logs:
Get-ClusterNode | % { Invoke-Command -ComputerName $_ -ScriptBlock { Wevtutil.exe sl /q /e:true Microsoft-AzureStack-HCI/Debug } }
#Alternatively: Run this on each HCI node:
Wevtutil.exe sl /q /e:true Microsoft-AzureStack-HCI/Debug
#If you get this message, it means the debug log is already enabled: Failed to save configuration or activate log Microsoft-AzureStack-HCI/Debug. The requested operation cannot be performed over an enabled direct channel. The channel must first be disabled.
#1. Please ensure Microsoft-AzureStack-HCI/Debug is enabled and collected from all nodes
#2. Since this may be cluster related, please also include, at a minimum, the System and Microsoft-Windows-FailoverClustering logs
#This snippet will retrieve those logs and find/collect the registration log files.
# Get Logs From All Nodes. Run from one of the nodes or modify Get-ClusterNode to use a management node
$AdditionalLogs = "Microsoft-Windows-Kernel-Boot/Operational", "System","Application", "Microsoft-Windows-FailoverClustering/Operational"
$logNames = @("Microsoft-AzureStack-Hci/Admin", "Microsoft-AzureStack-Hci/Debug") + $AdditionalLogs
$date = Get-Date
$datestring = "{0}{1:d2}{2:d2}-{3:d2}{4:d2}" -f $date.year,$date.month,$date.day,$date.hour,$date.minute
$logDir = Join-Path -Path (Get-Location) -ChildPath "AzsHci-Logs_$datestring"
New-Item -ItemType Directory -Force -Path $logDir
$s = {
    $path = Join-Path -Path $env:temp -ChildPath $env:computername
    Remove-Item $path -Recurse -Force -ErrorAction SilentlyContinue
    New-Item -Path $path -ItemType Directory
    $using:logNames | % {
        $basefilename = Join-Path -Path $path -ChildPath ($_.Replace("/", "-"))
        $filename = $basefilename + ".evtx"
        $evt = New-Object System.Diagnostics.Eventing.Reader.EventLogSession
        $evt.ExportLogAndMessages($_, 'LogName', "*", $filename)
        Get-WinEvent -LogName $_ -Oldest -ErrorAction SilentlyContinue | Format-Table -AutoSize -Wrap | Out-File "$basefilename.txt"
    }
    # Registration script logs
    $logFiles = Get-ChildItem -Path C:\ -Include *RegisterHCI* -Recurse -ErrorAction SilentlyContinue
    if ($null -ne $logFiles -and $logFiles.Count -ne 0)
    {
        foreach ($logFile in $logFiles)
        {
            $logFilePath = $logFile.FullName
            $savePath = Join-Path -Path $path -ChildPath "$($logFile.Name)"
            Copy-Item $logFilePath -Destination $savePath -Force
        }
    }
    return $path
}
Get-ClusterNode | % {
    ++$i
    $node = $_
    [array]$jobs += Start-Job -ScriptBlock {
        Write-Progress -Id $using:i -Activity "Collecting AzsHCI Logs" -Status "Node $using:node"
        # Needed so that $using:logNames inside the remote script block resolves in this job
        $lognames = $using:lognames
        $session = New-PSSession -ComputerName $using:node
        $remotepath = Invoke-Command -Session $session -ScriptBlock ([scriptblock]::Create($using:s))
        Copy-Item -Path $remotepath -Destination $using:logDir -Recurse -FromSession $session -Force
    }
}
Receive-Job $jobs -Wait -AutoRemove -ErrorAction SilentlyContinue
Compress-Archive $logDir -DestinationPath "$logDir.zip"