What Is Zerto Replication

Zerto is a fairly new replication product for virtual environments that is quickly growing in popularity after winning Best of Show at VMworld 2011. At work we have recently started using it for some of our DR processes, and it seems to provide several advantages over both traditional storage replication and hypervisor replication.

Zerto is a purely software-based replication product, which means it can be completely storage array agnostic, unlike most storage replication products. Because it works through the vSphere APIs, it can also replicate at a more granular level: per VM rather than per LUN/volume. And unlike traditional hypervisor-based replication products, it does not use snapshots to perform the replication, which allows it to support significantly lower RPOs and much greater scalability.

How Does It Replicate

As I mentioned earlier, Zerto does not use the traditional method of snapshot-based replication. This allows it to be considerably more efficient and to co-exist better with backup products that do use snapshots.

When Zerto is set up you will have a replication appliance, or VRA, sitting on each host that VMs are being replicated from or to. These VRAs use the VMware APIs (mainly the vSphere API), which lets them see all the data coming through the IO stack on that host, then compress and replicate any changes to the secondary site. Because the data is only being read from the IO stack through the APIs, the VRA does not interfere with the data being written to disk, so it will not affect the performance of the VMs. This also means that if the VRA were to fail, the replicated VM will continue to work as normal.
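To make the idea concrete, here is a minimal sketch of a write path where the replication appliance only receives a copy of each write. This is purely illustrative and is not Zerto's code or API: `Disk`, `ReplicationTap` and `guest_write` are hypothetical names, and the only real library call is Python's standard `zlib.compress`. The point it shows is that the write to disk completes whether or not the tap is healthy.

```python
import zlib
from collections import deque


class Disk:
    """Hypothetical stand-in for the VM's virtual disk."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data


class ReplicationTap:
    """Hypothetical stand-in for a VRA: it is handed a copy of each write,
    compresses it and queues it for sending to the secondary site."""

    def __init__(self):
        self.healthy = True
        self.outbound = deque()

    def observe(self, block, data):
        if not self.healthy:
            raise RuntimeError("replication tap unavailable")
        self.outbound.append((block, zlib.compress(data)))


def guest_write(disk, tap, block, data):
    # The write to disk always goes ahead.
    disk.write(block, data)
    # The copy for replication is best effort: a failed tap is ignored,
    # so the VM carries on working as normal.
    try:
        tap.observe(block, data)
    except RuntimeError:
        pass
```

If `tap.healthy` is set to `False`, later calls to `guest_write` still update the disk; they simply stop adding entries to the outbound queue.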

What If Replication Fails

Replication can fail to the point where one or more components cannot keep up with the standard method of replication. Usually this happens when there is unusually high contention on the WAN or the target storage, or sometimes when a large amount of new data is added to a replication. When it does, the replication will fall back to a “Bitmap Sync”. In this mode the VRA still uses the VMware API to read the IO stack, but it stores less detail and keeps the data in memory until the issue preventing replication is resolved.

Take the example of a failed WAN link preventing replication:

Under normal operation the VRA stores all changes at block level. If blocks 7, 9, 10, 47, 49 and 51 are changed, then those blocks are written into memory and removed once they have been replicated.

When the link fails and the number of changes held in memory increases, the amount of detail being stored decreases. Initially the VRA may only store the information that blocks 7-10 and 47-51 have changed.

If the link stays down for longer, the VRA will store the changes in even less detail. In this example it could end up storing just the information that blocks 7-51 have changed.

This enables the replication process to handle fairly large amounts of downtime, but the longer the link is down and the less detail is stored, the larger the replication will be once the link is restored.
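The trade-off in the example can be sketched in a few lines of Python. This is not how Zerto implements it internally; `coarsen`, `blocks_to_resync` and the `max_gap` parameter are hypothetical names I have made up to show how merging nearby changed blocks into ranges shrinks what has to be tracked, while growing what has to be re-sent.

```python
def coarsen(blocks, max_gap):
    """Merge changed block numbers into inclusive (start, end) ranges.

    A block joins the previous range when it is at most max_gap blocks
    past that range's end. A larger max_gap means less detail is kept.
    """
    ranges = []
    for block in sorted(set(blocks)):
        if ranges and block - ranges[-1][1] <= max_gap:
            ranges[-1] = (ranges[-1][0], block)
        else:
            ranges.append((block, block))
    return ranges


def blocks_to_resync(ranges):
    """Count how many blocks must be sent once the link is back."""
    return sum(end - start + 1 for start, end in ranges)


changed = [7, 9, 10, 47, 49, 51]

full_detail = coarsen(changed, 0)    # one entry per changed block
less_detail = coarsen(changed, 2)    # [(7, 10), (47, 51)]
least_detail = coarsen(changed, 40)  # [(7, 51)]

for ranges in (full_detail, less_detail, least_detail):
    print(len(ranges), "entries tracked,", blocks_to_resync(ranges), "blocks to resync")
```

With the blocks from the example, the tracked entries drop from 6 to 2 to 1, while the blocks that need to be replicated afterwards grow from 6 to 9 to 45.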