SRM: vSphere Replicated VMs stuck in a “Sync” status
Recently I've noticed that the VMs I replicate with vSphere Replication occasionally get stuck in a "Sync" status for an overly long time.
After pulling the logs, I was able to figure out what was happening: timeouts, and lots of them. The vmware-dr.log file pulled from the remote site was full of lines like the following (local is the SRM server, peer is the vCenter server):
2012-04-02T07:35:04.077-04:00 [02784 verbose 'Default'] Timed out reading between HTTP requests. : Read timeout after approximately 50000ms. Closing stream TCPStreamWin32(socket=TCP(fd=2596) local=10.xx.xx.xxx:9085, peer=10.xx.xx.xxx:55039)
2012-04-02T11:54:34.159-04:00 [02744 verbose 'Licensing'] Asset in sync.
2012-04-02T11:58:12.527-04:00 [02868 info 'LocalVC' opID=ac2d1cb] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
2012-04-02T11:58:12.723-04:00 [02860 info 'LocalVC' opID=596971f7] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
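To get a feel for how often these timeouts occur, the log can be scanned for the two messages shown above and tallied per hour. The following is only a rough sketch in Python: the match strings are copied from the excerpts above, while the function names and the log path in the usage note are my own placeholders, not anything SRM provides.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpts above.
PATTERNS = {
    "http_read_timeout": re.compile(r"Timed out reading between HTTP requests"),
    "property_collector_timeout": re.compile(r"WaitForUpdatesEx due to timeout"),
}

# Leading timestamp such as 2012-04-02T07:35:04.077-04:00; keep date and hour only.
HOUR = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2})")


def count_timeouts(lines):
    """Return a Counter keyed by (hour, kind) for every matching log line."""
    counts = Counter()
    for line in lines:
        stamp = HOUR.match(line)
        if not stamp:
            continue  # skip continuation lines that carry no timestamp
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(stamp.group(1), kind)] += 1
    return counts


def summarize(path):
    """Print one line per (hour, kind) bucket found in the log at `path`."""
    # errors="replace" so stray bytes in the log do not abort the scan
    with open(path, encoding="utf-8", errors="replace") as handle:
        counts = count_timeouts(handle)
    for (hour, kind), n in sorted(counts.items()):
        print(f"{hour}:00  {kind:28s} {n}")
```

Calling summarize("vmware-dr.log") against a copy of the log (the path here is just an example) prints one row per hour and timeout type, which makes it easier to see whether the timeouts cluster around a particular window.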
After a brief discussion, our network engineers believed there was no problem with the connection between the local and remote sites. So I took a "when in doubt, reboot" approach. I restarted the SRM service on the remote SRM server. No luck. After that, I did a "Restart Guest" on the VRS system at the remote site. After about 5 minutes, the systems started to connect and replicate again.
I've seen this a lot, and I've heard from other people who also manage their own SRM deployments that a reboot is a pretty good first step in troubleshooting. So keep that in mind as issues arise and troubleshooting is required.




December 14th, 2012 - 07:22
Hi. We had the same problem with the same resolution/workaround. We found out that ANY disruption to the network links, FC fibers, GBICs, etc. would cause this. We had a failed GBIC on the switch; we replaced it, and no more reboots for us.