Replication

We have created a replication (synchronisation) engine which serves four independent functions:

  1. Migration of data between Cyrus backends for load balancing (it also allows us to migrate all data off a live system for maintenance).
  2. Efficient data transfer to a tape spooling system which has a large amount of cheap IDE disk storage and an LTO tape stacker.
  3. Initial upload of mailboxes from the old UW world to Cyrus.
  4. Rolling replication of updates in real time to independent hot spare systems (eventually at a remote site) for disaster recovery.

This is the largest single customisation that we have made to the Cyrus server.

Postscript: This code has now been merged into Cyrus 2.3 CVS by Ken Murchison. Ken put a lot of work in to merge the code properly into Cyrus (my version deliberately kept the replication code to one side to make updates easy) and also to add support for shared mailboxes and other things that we don't use yet here in Cambridge. However, the underlying logic hasn't changed all that much from the code that we have been running happily since about July 2003. Rolling replication has saved my bacon on a number of occasions. It is really very satisfying to be able to move users off a system which has some kind of hardware or OS problem with a single command.

Design

Our replication engine is a transaction-based system sitting on top of the Cyrus libimap mailbox access library. This provides better flexibility and robustness than a replication scheme which transfers data by raw disk block or file without any understanding of the underlying content. It also means that the replica system is in a self-consistent state at the end of each transaction. This is true even if the source system suffers from file system corruption and is trying to replicate a corrupt database. (The lack of checksums in index files and mboxlists at the source end means we aren't 100% safe: it is just conceivable that the system would replicate "delete all mailboxes" as an action, despite various safeguards. In practice, sanity checks in libimap mean that the replication engine bails out rapidly in the event of corrupt input.)

Transaction-based replication is probably more bandwidth-efficient than block-based replication, as you don't need to replicate entire data blocks or all the traffic generated by heavy fsync() activity on the master system. In addition, transactions can be merged to create a small set of transactions needed to bring a replica in line with the master copy. This makes the replication engine particularly effective when it has to catch up from a backlog of updates. As a concrete example: a pair of Intel servers, each with a six-disk RAID 5 set, 100 Mbit Ethernet and 1,000 active users, takes about three minutes to catch up on several hours' activity (around 10,000 messages).

The protocol used for replication is a simple line-based text protocol, similar in style to IMAP or POP. The reason that we cannot use IMAP itself is that we need to preserve a number of message and mailbox attributes (e.g: mailbox UIDvalidity and UIDlast, message UIDs, timestamps). The protocol attempts to minimise the number of round trips between client and server to improve efficiency and throughput. The replication protocol is designed to be robust: it can recover from the errors and inconsistencies which arise if the master and replica lose synchronisation because someone has made unexpected changes at the replica end. It relies on cluster-wide globally unique message identifiers (UUIDs) in order to resolve conflicts and maintain the single instance store when copying messages between systems. It is possible to run several instances of the replication system concurrently: this is particularly useful for checking the integrity of a replica system while rolling replication is also in operation.
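
The UUID bookkeeping can be pictured with a small sketch (Python, purely illustrative: the real code is C inside Cyrus, and the helper names here are invented). When copying a folder, a message whose UUID the replica already holds can simply be linked into place rather than transferred again:

    def sync_messages(master_msgs, replica_uuids, upload_message, link_message):
        """Copy the messages of one folder to the replica while preserving
        the single instance store.

        master_msgs   -- list of (uid, uuid) pairs present in the master folder
        replica_uuids -- set of message UUIDs the replica already holds
        """
        for uid, uuid in master_msgs:
            if uuid in replica_uuids:
                # The replica already has this message body somewhere:
                # link it into place rather than sending the data again.
                link_message(uid, uuid)
            else:
                # Genuinely new message: upload the data and record its UUID.
                upload_message(uid, uuid)
                replica_uuids.add(uuid)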

The replication engine depends to a limited degree on the two-phase expunge system that we have created. This allows it to create snapshots of individual mail folders for the few seconds that it typically takes to synchronise a single folder: by locking out the asynchronous expire job we are guaranteed that no messages will disappear from the master end while we are updating a specific mailbox, which makes error recovery easier. The replication engine also has the ability to lock out all updates to the mboxlist for a specific user for a few seconds, so that we have a guaranteed consistent set of mail folders to work with if any ambiguity arises there (more about this later).
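
The locking pattern is simple enough to sketch (again purely illustrative: the lock and sync helpers are placeholders passed in by the caller, not real Cyrus interfaces):

    def replicate_folder(mailbox, lock_expire, unlock_expire, sync_folder):
        """Synchronise one folder while the expire job is held off.

        With two-phase expunge, messages only disappear when the expire job
        runs, so holding its lock for the few seconds the update takes gives
        us a stable snapshot of the folder and much simpler error recovery.
        """
        lock_expire(mailbox)
        try:
            sync_folder(mailbox)      # compare master and replica, push changes
        finally:
            unlock_expire(mailbox)    # always release, even after an error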

Components

The main replication system consists of three components:

sync_server

A fairly simple server which runs on the replica system and just follows instructions from the client, reporting errors and problems as needed.

sync_client

Does all the hard work: it asks the server for the current status of all the objects of interest in a given account, works out a set of transactions that will transform the server into a clone of the client system, and executes them with assorted forms of error recovery.

replicate

Runs on the client to set up sync_client and sync_server, typically using an SSH link to access the remote system and start up the server. Passes instructions down to the client and server and then just lets them get on with their job until they have finished.

Three additional components help with the initial transfer of data into the Cyrus world:

sync_upload

Interfaces to sync_server when uploading data from the UW world to Cyrus, preserving a number of message and folder attributes (folder UIDvalidity and UIDlast, message UIDs, timestamps).

cyrus_upload

Used to establish a tunnel between sync_upload running as the user and sync_server running as the Cyrus user on the relevant backend system. Equivalent to replicate above.

upload.pl

Wrapper script used to check whether accounts are active, lock out users for the duration of the transfer and then run cyrus_upload.

Making rolling replication efficient

One approach to replication is to synchronise all of the data in an account every time something changes. This is very useful for data migration and nightly backups, but it is far too expensive to do for every change during rolling replication. The solution is to define a set of events/actions which are recorded by the IMAP, POP and LMTP servers during their normal operation. This list of actions is picked up by an asynchronous queue runner which is constantly pushing changes to the replica system (resolving any conflicts in the process). The actions which are currently defined, and their consequences, are as follows (a sketch of the queue runner appears after the list):

APPEND <mailbox>

Upload all messages for the named mailbox which exist on the client and have UIDs greater than the UIDlast value last recorded at the server end. In case of problems (e.g: a lack of any messages to APPEND, most likely caused by MAILBOX updates since the APPEND operation was logged) we fall back to MAILBOX.

SEEN <mailbox> <user>

Update the seen database for the given (mailbox, user) pair. In case of problems, we fall back to MAILBOX.

MAILBOX <mailbox>

Initiate full update for given set of mailboxes: upload missing messages, resolve UUID conflicts, expunge messages on server which have been expunged on client, update flags and ACLs as required. The replication engine uses message UUID values to track messages which have already been uploaded to the server and which can be moved or linked rather than uploaded again, improving efficiency and maintaining the single instance store.

MAILBOX actions are also used to replicate IMAP CREATE, RENAME and DELETE commands. The IMAP server records the names of all the mailboxes involved, and the replication engine works out the minimum number of transactions required to synchronise the server by tracking the Cyrus uniqueid values of mailboxes (we don't normally have to upload messages a second time). In case of problems, we fall back to USER.

META <user>

Update account meta-information: subscriptions, sieve filters and quota limits. In case of problems, we fall back to USER.

USER <user>

Synchronise everything: all mailboxes and account meta-information. In practice this normally only happens if the list of mailboxes changes under our feet.

If the first attempt at USER synchronisation fails, we lock the user out of the mboxlist for a few seconds and try again: this deals with most cases where the user is renaming or deleting large numbers of folders in one go. If the given account doesn't exist on the replica system, it is created.

If the UIDvalidity of a user's inbox on the server doesn't match the UIDvalidity on the client, we are probably looking at an obsolete account. The system can either clear the replica account and replicate everything, or bail out and demand intervention from the system administrator (the safer course of action).
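
To make the action list above concrete, here is a minimal sketch of the queue runner (Python, purely illustrative: the log format and the do_* helpers are assumptions for the example; the real queue runner is sync_client itself). Each logged action is applied, escalating along the fallback chain when an update fails:

    def run_queue(log_lines, do_append, do_seen, do_meta, do_mailbox, do_user,
                  owner_of):
        """Apply one batch of logged actions to the replica.

        Each do_* helper returns True on success and False if the master and
        replica have drifted apart, in which case we escalate along the
        APPEND/SEEN -> MAILBOX -> USER chain described above.
        """
        for line in log_lines:
            if not line.split():
                continue
            action, *args = line.split()

            if action == "APPEND" and do_append(args[0]):
                continue                                # new messages uploaded
            if action == "SEEN" and do_seen(args[0], args[1]):
                continue                                # seen database updated
            if action == "META" and do_meta(args[0]):
                continue                                # subs/sieve/quota pushed

            # A failed APPEND or SEEN degenerates into a full MAILBOX update,
            # and a failed MAILBOX update degenerates into a full USER update.
            if action in ("APPEND", "SEEN", "MAILBOX") and do_mailbox(args[0]):
                continue
            user = args[0] if action in ("META", "USER") else owner_of(args[0])
            do_user(user)                               # synchronise everything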

Merging actions

The replication engine combines and promotes actions as required: if it saw 6 APPEND events, 3 SEEN events and 2 MAILBOX events for a single mailbox in a single pass, these are converted into a single MAILBOX action. Similarly, the replication engine groups MAILBOX actions together to give it a better chance of tracking messages which move between folders.
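
A much simplified sketch of the merging step (illustrative only: the mailbox name is made up, the real code also knows that SEEN events carry a user name, and it additionally groups MAILBOX actions so that moves between folders can be tracked):

    def merge_actions(actions):
        """actions is a list of (action, mailbox) pairs from one pass of the log.

        Duplicate events for a mailbox collapse into one; a mix of different
        events for the same mailbox is promoted to a single MAILBOX update.
        """
        per_mailbox = {}
        for action, mailbox in actions:
            previous = per_mailbox.get(mailbox)
            if previous is None or previous == action:
                per_mailbox[mailbox] = action       # first event, or a duplicate
            else:
                per_mailbox[mailbox] = "MAILBOX"    # mixed events: promote
        return [(action, mailbox) for mailbox, action in per_mailbox.items()]

    # The example from the text: 6 APPENDs, 3 SEENs and 2 MAILBOX events on
    # one mailbox come out as a single MAILBOX action.
    log = ([("APPEND", "user.example")] * 6 + [("SEEN", "user.example")] * 3 +
           [("MAILBOX", "user.example")] * 2)
    assert merge_actions(log) == [("MAILBOX", "user.example")]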

Initial replication

One amusing case is the way in which accounts are first replicated in the absence of any other instructions. Typically a message arrives, causing an APPEND action; this is promoted to a MAILBOX action, which in turn is promoted to a USER action, and the entire account gets replicated. The upload.pl job which uploads our users from the UW world to the Cyrus world sends them a mail message to tell them that they have just been moved. Hey Presto: you've just been replicated...

Restart

In rolling replication mode, the synchronisation engine typically runs across a single, permanent SSH link. sync_client and sync_server at the two ends fork() off client processes, and sync_client negotiates a restart every few minutes. The restart quietly resolves any memory leak issues and clears out messages which have been staged while replicating the single instance store, or which have been RESERVED en route from one folder to another at the server end.

The problem with asynchronous replication

Our rolling replication system implements asynchronous updates: the replica system should always be internally self-consistent, but may be a few seconds out of date compared to the master copy (particularly if there is a sudden surge in activity: the replication system may choose to idle for a few seconds to prevent overload, but will build up a small backlog of actions to clear in the process). This means that messages delivered to the master system may be lost if the master dies before the data in question has been replicated. This is the major failing of this approach compared to entirely synchronous block-based replication, which only acknowledges an IMAP APPEND or LMTP mail delivery when the data is safely recorded on both the master and replica systems, at the cost of extra latency and higher bandwidth requirements.

Additional safeguards are possible: we can configure the external Mail Transfer Agent to buffer and then replay delivery of recently delivered messages (and the duplicate suppression system can be used to eliminate messages which have already been replicated). Messages uploaded via APPEND commands sent to the IMAP server (e.g: sent-mail) are much more of a challenge. One approach would be to get the IMAP proxy in a Murder configuration to record messages which have been successfully appended and replay that log on failover. This is all academic unless the front end LMTP and IMAP proxies also provide high availability failover and their own replication system.
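
As a rough sketch of the replay idea (entirely hypothetical: nothing like this exists in the current system, and the helper names are invented):

    def replay_recent(recent_deliveries, already_delivered, deliver):
        """recent_deliveries -- (message_id, mailbox, raw_message) tuples
                                buffered by the MTA before the failure
        already_delivered   -- duplicate suppression check against the replica
        deliver             -- redeliver one message on the replica
        """
        for message_id, mailbox, raw_message in recent_deliveries:
            if already_delivered(message_id, mailbox):
                continue                      # replicated before the failure
            deliver(mailbox, raw_message)     # lost in flight: deliver again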

Despite these caveats, this replication engine is still a rather useful tool for disaster recovery.


David Carter <dpc22@cam.ac.uk>