Recently I've noticed that the VMs I replicate with vSphere Replication occasionally get stuck in a "Sync" status for an unusually long time.
After pulling the logs, I was able to figure out what was happening: timeouts, lots of them. The vmware-dr.log file pulled from the remote site was full of lines like the following (local is the SRM server, peer is the vCenter server):
2012-04-02T07:35:04.077-04:00 [02784 verbose 'Default'] Timed out reading between HTTP requests. : Read timeout after approximately 50000ms. Closing stream TCPStreamWin32(socket=TCP(fd=2596) local=10.xx.xx.xxx:9085, peer=10.xx.xx.xxx:55039)
2012-04-02T11:54:34.159-04:00 [02744 verbose 'Licensing'] Asset in sync.
2012-04-02T11:58:12.527-04:00 [02868 info 'LocalVC' opID=ac2d1cb] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
2012-04-02T11:58:12.723-04:00 [02860 info 'LocalVC' opID=596971f7] [PCM] Received NULL results from PropertyCollector::WaitForUpdatesEx due to timeout of 900 seconds
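To get a feel for how widespread the timeouts are, it can help to tally them per log component instead of scrolling through the file. Here is a minimal Python sketch, assuming the log lines follow the format quoted above; the name summarize_timeouts is my own and not part of any VMware tooling.

```python
import re
from collections import Counter

# Pattern derived from the vmware-dr.log lines quoted above.
# Other SRM builds may format their lines differently (assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+\S*)\s+"
    r"\[(?P<thread>\S+)\s+(?P<level>\w+)\s+'(?P<component>[^']+)'[^\]]*\]\s+"
    r"(?P<message>.*)$"
)


def summarize_timeouts(lines):
    """Count log messages mentioning 'timeout', grouped by component."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and "timeout" in match.group("message").lower():
            counts[match.group("component")] += 1
    return counts
```

Passing an open file handle works, for example summarize_timeouts(open("vmware-dr.log")). A high count for one component is only a hint about where to look, not a diagnosis.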
After a brief discussion with our network engineers, it was believed that there was no problem with the connection between the local and remote site. So I took a "when in doubt, reboot" approach. I restarted the SRM service on the remote SRM server. No luck. After that, I did a "Restart Guest" on the VRS system at the remote site. After about 5 minutes, the systems started to connect and replicate again.
I've noticed it a lot, and I've heard from other people who also manage their own SRM deployments that a reboot is a pretty good first step in troubleshooting. So keep that in mind as issues arise and troubleshooting is required.
We finally decided to get some real disaster recovery and business continuity plans set in place, and after deliberating between a couple of options, we decided to go with Site Recovery Manager 5 with hypervisor-based replication.
There's been plenty of fun in setting it all up and starting the replications.
Default Instance is absolutely required.
Mixed Mode Authentication is also absolutely required.
Create a login, a database, and a schema within the database, all with exactly the same name.
As a precaution, the created login is also a sysadmin and db_owner for the created database.
Connect to the SQL Server by IP address; the FQDN wouldn't work.
vCenter connection should be listed the same as in the site connection (i.e., if you connected the sites via FQDN, use FQDN for the VRMS configuration).
Only after those steps could I get VRMS to connect to the SQL server (which, if you notice in the screenshot, is the same SQL server as the SRM connection).
Initial Replication Error
Now that I had green check marks across the board, I tried to set up replication on the VMs, but every one of them returned this error: Call "HmsGroup.OnlineSync" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
Going through the logs, I saw lots of SSL Handshake errors and general connectivity problems, so I had to go back to our networking people and have them alter the hardware firewalls to allow connectivity across the board to all the systems involved (ESXi host, vCenter, SRM server, VRMS, VRS). Once that was done, it would successfully configure the virtual machine for replication.
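A quick TCP reachability check run from each system can narrow down which paths the firewalls are blocking before going back to the networking people. This is a minimal sketch using Python's standard socket module; check_tcp and check_all are my own helper names, and any hosts and ports you feed it are your own inputs, not an official VMware list.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_all(targets, timeout=3.0):
    """Map each (label, host, port) target to a reachability flag."""
    return {label: check_tcp(host, port, timeout) for label, host, port in targets}
```

For example, check_all([("vcenter", "vcenter.example.local", 443), ("srm", "srm.example.local", 9085)]) returns a dictionary of True/False per label. Those host names are placeholders, and the ports are only illustrative (9085 is the SRM-side port that shows up in my timeout log lines); take the real list from the VMware TCP/UDP KB for your versions. A successful connect only proves the TCP path is open, so it won't catch SSL handshake problems on its own.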
I have yet to go back and firm up all of these port rules; I'll report back once I have it done.
Side note: I have no reason to doubt the VMware TCP/UDP KB; I just know that I was still having some connectivity problems after following it.
Replication Locking Failure
Now that the connections are all good, I have a couple of VMs replicating. I go to another VM, right click, vSphere Replication, add in a schedule, and then I receive this error: Configuring Virtual Machine for Replication... failed.
VRM Server generic error. Please Check the documentation for any troubleshooting information. The detailed exception is 'Optimistic locking failure'.
Searching the documentation turns up an error from the synchronize storage step; however, that does not fit this situation, as I have not yet synchronized anything.
I check the system and it's up, it's running, the VMware Tools are installed and functioning properly, everything looks good. So I go back and try and run the vSphere Replication wizard again and I am instantly hit with an error of: Call "HmsGroup.CurrentSpec" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
So I started by rebooting the system. Once the VMware Tools were running, I tried again, only to see the same error. This time, I powered off the system, removed it from inventory, added it back by browsing the datastore for the vmx file, and powered it on. Once the VMware Tools were running, I tried it again and it worked perfectly! That was a little painful, and I wish I had made notes of the timestamps so I could go through the logs, but it was a success nonetheless.
Large VMDK Replication Problems
With that figured out, it was time to get the VMs replicated and SRM with VRMS worked wonderfully from that point... until we got to the file servers.
I know the "2TB - 512" disk sizing rule; we found that out the hard way after upgrading from 3.5 to 4 with some RDMs. It was not a fun experience. So all of the vmdks of our file servers are 2040GB in size.
The initial replication is successful, however the re-sync is not. It gives an event of: Replication operation error: Virtual machine is in an invalid state.
As before, the system is up, it's running, the VMware Tools are installed and functioning properly, everything looks good. So I go into the SRM plugin and tell it to "Synchronize Now" and I receive another error: Call "HmsGroup.OnlineSync" for object "GID-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" on Server "x.x.x.x" failed. An unknown error has occurred.
So I pull down the logs by going to the Sites button, Summary tab, and looking in the commands for "Export System Logs". The important part here is to get the logs for the site giving you the error, i.e., if a server at your remote site is the one failing in the message, that's the log you'll want.
Unfortunately the event logs contain items such as:
2012-01-04T18:58:34.533Z [F5993B70 error 'Main' opID=hsl-a0edc478]  ExcError: exception N3Vim5Fault12FileTooLarge9ExceptionE: vim.fault.FileTooLarge
2012-01-04T18:58:34.533Z [F5993B70 error 'Main' opID=hsl-a0edc478]  Code set to: Generic storage error.
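Exported log bundles get big, so pulling out just the fault names can save some scrolling. Here is a minimal Python sketch, assuming faults appear as vim.fault.<Name> and operations are tagged with opID=... as in the lines above; faults_by_opid is my own helper name, not a VMware tool.

```python
import re
from collections import defaultdict

# Both patterns are based only on the two log lines quoted above.
FAULT_RE = re.compile(r"\bvim\.fault\.(\w+)")
OPID_RE = re.compile(r"\bopID=([\w-]+)")


def faults_by_opid(lines):
    """Group vim.fault names found in log lines by the opID on the same line."""
    found = defaultdict(set)
    for line in lines:
        fault = FAULT_RE.search(line)
        if not fault:
            continue
        opid = OPID_RE.search(line)
        found[opid.group(1) if opid else "unknown"].add(fault.group(1))
    return dict(found)
```

Run against the two lines above, this groups FileTooLarge under opID hsl-a0edc478, which is handy to have ready when opening a support case.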
I'm currently working with a VMware Support Engineer to fix this problem. A bug has been filed, so hopefully there will be some news soon. I'll update when I know something.
Large VMDK Replication Problems - Resolved
Received some unfortunate news from VMware Support yesterday. vSphere Replication uses snapshot technology to forward the deltas to the remote site. Well, unbeknownst to me, there is actually some overhead that is applied to the VMDK file itself! So really the 2TB minus 512B should be more like 2TB minus 512B minus 16GB, for a total maximum of around 2030GB.
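The arithmetic is easy to check. The sketch below is plain Python and assumes binary units (1GB = 1024^3 bytes); the 16GB overhead is simply the figure support gave me, not a constant I have verified for every VMFS block size, and the function names are my own.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

VMDK_LIMIT_BYTES = 2 * TIB - 512       # the "2TB - 512" rule
SNAPSHOT_OVERHEAD_BYTES = 16 * GIB     # overhead figure quoted by support (assumption)


def max_replicable_gb(limit=VMDK_LIMIT_BYTES, overhead=SNAPSHOT_OVERHEAD_BYTES):
    """Largest VMDK size in GB that still leaves room for snapshot overhead."""
    return (limit - overhead) / GIB


def fits(size_gb, limit=VMDK_LIMIT_BYTES, overhead=SNAPSHOT_OVERHEAD_BYTES):
    """True if a VMDK of size_gb stays under the limit once overhead is added."""
    return size_gb * GIB + overhead <= limit
```

Here max_replicable_gb() comes out just under 2032, so fits(2040) is False and fits(2030) is True, which matches what I saw with the file servers.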
So once I reduced the VMDK size down to 2030GB, it replicates and updates just fine. Now the problem is how do I shrink the VMDK files...
If you want to read more, check out the VMware Knowledge Base article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1012384 and more specifically the "Calculating the overhead required by snapshot files" section towards the bottom.
Disable Replication of VMDK = Delete VMDK
This was one of the things I learned the hard way through the testing of the larger VMDK files. I happened to go through and set one of the disks that had already been replicated to be disabled from replication. From the local site vCenter, it looks like the replication was turned off, right?
Unfortunately, that was not the case. I pulled up the remote vCenter and was greeted with an event that says the virtual disk was deleted!
That was certainly a surprise. I guess I understand why it does that, but the first time I did it, the VMDK that was deleted was 2040GB in size. It took me almost 2 days to copy all of that information down!