DPM Azure Recovery Services Agent Crashing

Update: We did start having dependency issues after updating the MARS agent. It appears that the agent now depends on the management service. Not getting errors anymore though so we reset things back to normal. Stuff below is just for posterity.

DPM 2016 deployments have been filling up my error logs with crash reports for the Microsoft Azure Recovery Services Management Agent. Turns out that’s the statistics agent for the Azure dashboards that don’t work on the LTSC releases of DPM (http://blog.teknikcs.ca/2019/02/21/dpm-protected-items-dont-appear-in-azure-vault/).

System Event ID: 7031 
The Microsoft Azure Recovery Services Management Agent service terminated unexpectedly
Application Event ID: 1000
Faulting application name: OBRecoveryServicesManagementAgent.exe
Application Event ID: 1026
Application: OBRecoveryServicesManagementAgent.exe
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AccessViolationException
at .CTraceProvider.TraceToErrorFile(CTraceProvider, DLS_TRACE_EVENT)

Disable it if you’re on DPM 2016 or DPM 2012. No impact that we’ve seen.
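A minimal sketch of doing that from PowerShell, assuming the service's display name matches the one in the 7031 event above. Check what Get-Service returns before changing anything, and keep the update at the top in mind: newer MARS agent versions appear to depend on this service.

```powershell
# Look the service up by the display name shown in the System event log entry.
$svc = Get-Service -DisplayName 'Microsoft Azure Recovery Services Management Agent'

# Stop it now and keep it from starting again at boot.
Stop-Service -InputObject $svc
Set-Service -Name $svc.Name -StartupType Disabled

# To undo later (Manual is an assumption; note the original startup type first):
# Set-Service -Name $svc.Name -StartupType Manual
```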

Shoretel Users Can’t Change Call Handling Mode or Agent Status

TL;WR Probably the SG90 acting up again. Those things are weird. Rebooting the SG90 and the Director server fixed it for me. YMMV.

While moving some VMs around, we had a Shoretel Director server running without a network connection for 4 hours during a maintenance window. Afterwards users couldn’t change their Call Handling Modes or their agent logged in/out status. It failed from both the phones and Communicator. At first I thought it was a CAS problem; however, the phone directory, history, options, and speed dial features were all working correctly.

I popped up the IPDSCASCfgTool (see bottom) and set the CAS log levels to include all the DB and CAS flags for a start. After that I used PowerShell to stream the logs with Get-Content. I ran Measure-Object first to grab the line count of the file so that the filter could skip the roughly 393,700 existing lines and go straight to the live output. That works like tail -f in Linux and continuously streams the log to the console.

Note: You SHOULD be able to use Get-Content -Wait -Tail <Number of Lines> to skip to the end, but that wasn’t working on this particular server. Gremlins…
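For reference, a sketch of what the -Tail form would look like on a server where it behaves. The log file name is taken from the session in this post, and the SetUserCHM filter is just an illustration of narrowing the stream; the session that follows is what I actually ran.

```powershell
# Log file name from the session in this post; adjust to the current day's file.
$log = '.\ipds-190225.000000.Log'

# Start from the last 20 lines, keep streaming new ones (Ctrl+C to stop),
# and only show the lines that mention SetUserCHM.
Get-Content -Path $log -Tail 20 -Wait | Select-String -Pattern 'SetUserCHM'
```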

PS C:\Shoreline Data\Logs> Get-Content .\ipds-190225.000000.Log | measure
Count : 393693
Average :
Sum :
Maximum :
Minimum :
Property :
PS C:\Shoreline Data\Logs> Get-Content .\ipds-190225.000000.Log -Wait | where -Property ReadCount -gt 393700
17:49:28.837 ( 3264: 3512) >SetUserCHM. User: 123. CHM: 2
17:49:28.888 ( 3264: 3512)
15:52:01.574 ( 7508: 5168) >CDBWriter::SetUserCHM::CDBUpdateTable::Update() failed. Error: 0xc1200db5.

SetUserCHM was me (unsuccessfully) changing from CHM 1 (Standard) to CHM 2 (In a Meeting) from Communicator (testing from a phone will also log here, but it’s noisier). That error sent me off looking for database issues, communications problems, etc. No dice. The evt log showed some interesting output though:

15:55:17.013 ( 4080: 4476) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C
15:55:17.029 ( 7508: 4036) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C
15:55:17.183 ( 4080: 4436) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C
15:55:17.183 ( 2992: 6552) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C
15:55:17.183 ( 1764: 1856) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C
15:55:17.187 ( 4972: 5232) [evtl] (Error) CEventLibImpl::sendReceiveIPC failed - 0xC126100C

After digging around for named pipe issues and doing traces, I tried the same on the voice switch and didn’t see any interesting errors. Theoretically the phone’s button and control traffic hits the voice switch and makes it to the Director, but I couldn’t verify that because I couldn’t get the switch to start a packet capture. Supposedly that’s because of a cipher mismatch: the Director server tries to SSH into the voice switch to start the capture, but it fails to log in using its certificate.

Anyway, I ran out of ideas and just waited until I could restart the switch, and that fixed it. Everything was fine. Once again the SG90 was behaving unexpectedly and sent me one step closer to trying the vswitch.

Another one of those 1/1 Google searches:
IPDSCASCfgTool (https://oneview.mitel.com/s/article/How-to-set-the-Log-Levels-for-IPDS-CAS-on-Shoreware-Servers)

DPM Protected Items Don’t Appear in Azure Vault

TL;WR: None of the current Long Term (i.e. 2012 R2/2016) DPM releases actually send updates to Azure. So unless you’re on the bleeding edge DPM stream you don’t get to see this stuff. Suggested Pairing: Two drawn-out eye rolls.

Simple enough problem: I was looking at the Recovery Services vault page for a DPM installation and noticed all these nice dashboards for alerts, items, jobs, etc. All of them were empty.

  • 0 Backup Items
  • 0 Backup Jobs
  • 1 Azure Backup Agent… with item count 0

After a blazing fast support call I learned that this feature currently doesn’t work with any released version of DPM (Microsoft’s flagship backup software in case you forgot). According to support the feature won’t be implemented until DPM 2019 is released (due March 2019).

Semi-Related: On our last deployment it really annoyed me to find out that no 2016 version of DPM supported Server 2019. Again… Msft’s Flagship Backup Software… and it doesn’t support the latest version of the server OS AND their snazzy dashboards don’t integrate. I get that a 5-month delay between GA launch and System Center integration doesn’t seem that crazy, but it certainly was annoying to find out that the brownfield deployment simply wouldn’t be able to use Server 2019 without buying a new set of DPM licenses.

Yeah yeah, “DPM 2016 is old so obviously it won’t protect new versions of Windows.” I’d argue that sentiment isn’t appropriate in the context of a backup application with non-trivial behaviour meant for long term use.

Cannot Disconnect Windows Server iSCSI Sessions When You Ignore Your Own Advice

TL;WR: If you can’t eject a disk and you have apps open, try closing them! Duh. Suggested Pairing: A third of your remaining tea/coffee vessel.

Wasted 15 minutes of my life today trying to disconnect two iSCSI sessions on a development Windows Server 2012 R2 Hyper-V host. Kept getting “This session cannot be logged out since a device on that session is currently being used”. Pulled up Process Explorer looking for handles on the disks (searching MPIO in my case because we were using MPIO). Lo and behold, Task Manager had open handles on the disks.

I immediately realized that this particular test server had disk perf monitoring running (DISKPERF -Y). That puts basic disk performance counters into Task Manager, and of course I had Task Manager open in the background. While that’s handy for a test server, it’s not recommended in production, both for performance reasons and because of things like this. Solution was to follow my own advice and close background apps when troubleshooting access problems. Do as I say etc.

PowerShell Inline Status Updates Without Write-Progress

Preface: Don’t use this. It’s really bad practice for PowerShell code to work by manipulating the host output stream. It’ll work, but it’ll mess with logging, transcripts, text pipeline (ew, gross, use objects), and other things I’m sure. For a one-off that you’re writing in a busy remote shell… maybe...

Here’s a lazy way of having a progress indicator that doesn’t fill the whole screen history and doesn’t require Write-Progress. Use -NoNewline to avoid the implicit newline/carriage return, then manually add a carriage return (`r) at the start of your line to overwrite existing text.

The magic “$([char]27)[2K`r” (Escape[2K) is an ANSI control sequence that tells the terminal to preemptively erase the entire line so that you don’t end up with leftover characters if your next value is shorter than the previous one. https://www.real-world-systems.com/docs/ANSIcode.html#CSI

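# Simple version: carriage return first, then overwrite the line in place.
# ${...} is needed because "$TimestampOrWhatever:" would be parsed as a scoped variable reference.
Write-Host "`r${TimestampOrWhatever}: $ValueOrWhatever" -NoNewline

# Cleaner version: erase the whole line first so shorter values leave no leftovers.
Write-Host "$([char]27)[2K`r" -NoNewline # Esc[2K, then carriage return
Write-Host "${TimestampOrWhatever}: $ValueOrWhatever" -NoNewline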
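Put together in a loop, the same trick might look like the sketch below. Format-StatusLine is a hypothetical helper name, the 20 steps and the sleep are placeholders for real work, and it assumes a console that honours ANSI escape sequences.

```powershell
# Hypothetical helper: builds one status line that erases and rewrites the current line.
function Format-StatusLine {
    param([string]$Text)
    $esc = [char]27
    # Esc[2K erases the whole line, `r moves the cursor back to column 0.
    return "${esc}[2K`r$Text"
}

foreach ($i in 1..20) {
    $stamp = Get-Date -Format 'HH:mm:ss'
    Write-Host (Format-StatusLine "${stamp}: step $i of 20") -NoNewline
    Start-Sleep -Milliseconds 200   # placeholder for the real work
}

# Finish with a newline so the prompt starts on its own line.
Write-Host ''
```

Keeping the string building in a function means the escape sequence lives in one place, and the same preface applies: this still writes to the host, so keep it out of anything that gets logged or piped.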

S2D Cluster Creation From VMM 2016

TL;WR: If you’re trying to set up an S2D Hyper-Converged Cluster before adding it to your VMM Fabric: Don’t. It will work, but you won’t be able to see or manage the storage from VMM. Let VMM build it all. Suggested Pairing: Half a dozen hours of diagnostics and ███████ [REDACTED].

Quick Lessons:

  1. I couldn’t for the life of me get VMM 2016 to manage the S2D portions of a cluster created outside of VMM. I built the cluster manually, everything worked fine, validated RDMA etc. Then I realized that VMM wouldn’t let me manage the S2D tiers when creating volumes. Finally, after some googling, I read in a couple of places that VMM won’t fully cooperate with an S2D Cluster created externally. Womp Womp.
  2. Most online documentation regarding externally created clusters just says (paraphrasing Microsoft docs): “Just add the cluster to VMM! Magic!”. That’s not how that works. Adding an existing cluster to VMM is easy; adding a hyper-converged S2D cluster to VMM was definitely not easy.
  3. The fairly unhelpful message “Error (25325): The cluster creation failed because of the following error: An error occurred while performing the operation.” was, for me, caused by permissions, somewhere. Even though I applied all of the CNO OU rights, local admin rights, DNS rights, and AD rights, and every cluster validation passed without warning, I kept getting this error. Eventually I gave up and used a highly privileged AD account and voilà.

In the end, after creating several S2D clusters trying to get VMM to cooperate, I finally got it to show me the magic “Configure advanced storage and tiering settings” box.