######################################################################## # # Presentation to UK Unix User Group Winter Conference # (High-Availability and Reliability) # Bournemouth, Thursday 2004-02-26 # # $Cambridge: hermes/doc/talks/2004-02-ukuug/talk.mgp,v 1.8 2004/03/05 17:08:33 fanf2 Exp $ # ######################################################################## # %deffont "standard" tfont "standard.ttf", size 5 %deffont "thick" tfont "thick.ttf" %deffont "typewriter" tfont "typewriter.ttf" # %default 1 area 90 90, leftfill, size 2, fore "black", back "white", font "thick" %default 2 size 7, vgap 10, prefix " " %default 3 size 2, bar "gray70", vgap 10 %default 4 fore "black", vgap 30, prefix " ", font "standard", size 6 # %tab 1 prefix " ", icon box "gray30" 50 # ######################################################################## %page %nodefault %center, fore "black", back "white", font "thick", size 8 Scaling up Cambridge University's email service %size 6 UKUUG Winter Conference Thursday 2004-02-26 %font "standard", size 5 Tony Finch Mail Support Computing Service University of Cambridge # # http://www.ukuug.org/events/winter2004/ # ######################################################################## %page About me %size 5 1993-1994 Gap year, Bristol 1994-1997 Comp. Sci., Trinity College 1997-2000 Running web servers 2000-2001 Apache developer, San Francisco 2002-present again Computing Service 1996-present 1997-present 1999-present 2002-present # # CV expressed as a list of email addresses # # Ran a Linux box in my college room. Wanted to do email on it but # couldn't because of Cambridge's port 25 block, so just had a web server. # Wanted to be in the Mail + News team at Demon but was put in the web # team because of my experience, and ended up being a web person for # five years. Now I finally get to do email! # # My work for the CS has mostly been spam & virus filtering # using MailScanner which Julian Field talked about yesterday. 
# I'm not going to talk about that, nor about Exim. # This talk is about work done by my colleague David Carter. # ######################################################################## %page About this talk %area 94 90 03 10 Old Hermes stats, overview, webmail New Hermes goals, base software, architecture New software proxy, two-phase expunge, replication performance, compatibility, migration Future directions # # Hermes is the name of our central email service # # First some background information about the old system # -- including brief plug for our webmail software # # Then overview of our new system # # Bulk of talk about modifications to Cyrus # -- improvements and hacks # # Finally a bit about some future possibilities # ######################################################################## %page Hermes statistics %center %newimage -xscrzoom 90 "stats-talk.eps" # # An idea of the scale of the system. # Growth was exponential from until recently. # Now fairly complete coverage of the people in the University # (though about 7,000 don't use Hermes). # Growth now very slow, except in terms of data size. # ######################################################################## %page Old Hermes - hardware %center %newimage -yscrzoom 90 "hermes-web.eps" # # More an illustration of the naming scheme than actual architecture. # Prism doesn't emit multicoloured rays of light and isn't triangular. # This dates back to 1999, with yellow + green added in 2000. # ######################################################################## %page Photograph of Old Hermes %center %newimage "hermes-old.jpg" # # NetApp F740s, 280 GB storage (18 GB disks) # Suns are E450s and E220s, 2GB RAM. # Optimised for telnet + pine, as well as IMAP.
# ######################################################################## %page Old Hermes - software %center %newimage -xscrzoom 90 "hermes-old-sw.eps" # # Based on UW-IMAP c-client library # Mailboxes in home directories for quota reasons. # # Unclear layering # -- mixing mailboxes and files # -- shared knowledge of mailbox layout Exim / c-client / misc (finger) # ######################################################################## %page Old Hermes problems %center %newimage -zoom 100 "crybaby.png" ######################################################################## %page Old Hermes problems ... NFS expensive locking problems poor performance single point of failure Berkeley mailboxes No concurrent access slow to list a mailbox slow to expunge # # NFS performance problems particularly with concurrent access # -- memory mapping optimisations don't work. # Must take down whole service if a NetApp has problems. # # Lack of concurrent access confuses users # -- accidentally log in twice and break first session # # Berkeley mailboxes also have minor corruption problems # ######################################################################## %page ... and consequences Strict quotas 10MB default 50MB maximum 10MB per mailbox 4MB per message Custom software indexed Berkeley mailboxes speed up folder SELECT and FETCH high performance webmail software "Prayer" # # The 10MB mailbox limit applies even if your overall quota is more. # # Not going to say much about the c-client patch # -- too many localisms to be publishable # -- obsolete; better to choose good mailbox format from the start # ######################################################################## %page Prayer %center %newimage -yscrzoom 95 "wing-study.jpg" # # Quick digression. # # Inspired by Oxford's WING # -- Web IMAP/NNTP Gateway # -- wasn't usable when we needed it in 1999 # # Main aim is to avoid thrashing the IMAP server to death # with millions of short-lived connections. 
# ######################################################################## %page Prayer features %size 5 Very lightweight written in C no scripting languages 4,000,000 hits/day on one PC mostly dynamic & encrypted no frames or javascript etc. Uses c-client to talk IMAP persistent connection to mail store full folder listing cache Custom web server front-end doesn't use Apache TLS session cache, persistent HTTP # # Lightweight on server and client. # Can be run on NetBSD on a Mac SE/30 without crippling it. # # Does use cookies, but they are optional. # Most pages less than 1KB compressed. # # hooks into account management daemon on Hermes admin box # -- password changes, quota, (filtering) # # Available from David's web page (URL at end). # ######################################################################## %page New Hermes goals %center %newimage -zoom 150 "goal.jpg" # # Back to the main topic... # ######################################################################## %page New Hermes goals Large quotas 100 MB default, 1 GB max 200 MB mailbox, 10 MB message Better data recovery users delete email accidentally Remove single points of failure we can use cheaper hardware # # mailbox quota mainly to stop clients from choking # # NetApp snapshots already provide nice data recovery # -- don't want to lose the functionality # -- gaps between snapshots are troublesome # # Not addressing high availability at this time # -- concentrating more on mean time to recovery # than mean time between failure # # The constraints are # -- manpower and budget already mentioned # -- compatibility # ######################################################################## %page CMU Cyrus %center %newimage "cyrus.jpg" # # Choice of software # # Cyrus was a Persian king in about 550BC, # and the first to set up a postal service. # # Software comes from Carnegie-Mellon University. 
# ######################################################################## %page CMU Cyrus features %size 5 Black-box IMAP server all runs under one Unix UID message delivery by LMTP & Sieve filters user agent access by POP3 and IMAP Supports shared folders access control lists concurrent mailbox access per-user "seen" database One file per message with indexes short header, long header, full text wire-format data # # Runs on top of Unix rather than integrating with it. # # does its own authentication and quota handling # Heavy use of databases # -- admin done with special tools and extended IMAP # # Concurrent access makes efficient IMAP-based biff-alike possible. # # Indexes similar to those we have added to UW-IMAP. # -- full text index new feature for us # -- 200,000 messages per second better than grep! # # Data in wire format on disk # -- heavy use of mmap for reduced copying # # Single-instance store # -- a link to the same file for each folder a message appears in # -- Message-ID:-based duplicate suppression # # courier doesn't have indexes # dovecot doesn't have shared mailboxes or maildir++ quotas # both perpetuate scruffy layering violations # ######################################################################## %page New Hermes architecture %center %newimage -yscrzoom 90 "hermes-new-fluffy.eps" # # Slightly simplified diagram # Much clearer layering # Hardware architecture closely matches software architecture # # Bottom-up description. 
# # Message store partitioned onto 16 PCs # -- paired machines have copy of each other's data # -- replication engine copies data between pairs and to backup server # ---- more about that later # # Central mail hub is called ppsw # Proxy directs IMAP and POP connections to correct Cyrus machine # Exim routes email to Cyrus via LMTP # # Historical jobs of ppsw: # -- route email within university # -- mailing lists, anti-spam, anti-virus # -- named after PP MTA # # Omitted webmail and menu system / pine boxes # -- menu system still on old Hermes at the moment # -- still includes file store for attachments, pine config etc. # ######################################################################## %page Photographs of New Hermes %area 45 70 05 20 %center %newimage -yscrzoom 100 "hermes-new.jpg" %area 45 70 50 20 %center %newimage -yscrzoom 100 "hermes-new2.jpg" # # Left picture: # # Top 6 machines are ppsw. # # Bottom 8 machines are first tranche of Cyrus message store servers. # -- 3 GB RAM, 350 GB storage each (72 GB disks) # -- Reiser FS for small file performance. # # Right picture: # # Second tranche of 8 Cyrus message store servers on the left. # # Backup server on the right. # 3 TB storage, 250GB disks. # robot has 17 tape cartridge magazine and 2 LTO drives # XFS for large file performance and backup tools. # ######################################################################## %page POP and IMAP proxy %area 40 50 10 25 %center %newimage -zoom 100 "pop.jpg" %area 40 50 50 30 %center %newimage -zoom 100 "imac.jpg" # # Sorry about the dreadful pun. Hard to find a good picture for IMAP. 
# ######################################################################## %page POP and IMAP proxy Locally developed similar functionality to Perdition not as clever as Cyrus Murder Tracks protocol until authentication then knows correct back-end server after that just a clear channel Handles TLS decryption same session cache code as Prayer # # Murder of IMAP servers -- "cool for crows" # -- allows us to use lots of small message stores # rather than one big monolithic one # # Cyrus proxy continues to track connection after authentication # -- knows which back-end has which mailbox # -- cross-cluster shared mailboxes # -- we don't use it because it was too immature at the time # # Also important for migration (explained later). # ######################################################################## %page Two-phase expunge %center %newimage -yscrzoom 75 "lazy.jpg" # # Our first Cyrus modification. # # Like lazy evaluation of expunge requests # except with extended semantics # -- sysadmins and users can recover "lost" email # # IMAP's two-stage delete/expunge isn't safe enough. 
# # Good way of using up spare disk # ######################################################################## %page Two-phase expunge (1) %size 5 Expunge "moves" messages to shadow folder file stays in same place; indexes changed folder -> .EXPUNGED/folder Delete moves folders to hidden namespace added timestamp to uniquify name .DELETED/folder-20040226-12:34:56 .EXPUNGED/.DELETED/xy-20040226-12:34:56 Users can recover deleted email themselves however hidden namespaces are confusing unexpunge and undelete admin tools # # Accessible via webmail and perhaps other IMAP clients # We also have a backdoor password to do things for users # ######################################################################## %page Two-phase expunge (2) %size 5 Extended quotas limit amount of deleted email greatest of 28 days and 100MB expire job runs overnight on demand if deletion quota exceeded Side-effect improvements better performance expunge is just a change to index contents heavyweight filesystem ops happen at night concurrent access works better, e.g. one user deletes message, other downloads it helpful for replication # # Reason it is called "two-phase" is expiry of old expunged email # Expire job similar to old-style news server. # The on-demand expire job is to prevent DOS attacks. 
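# Sketch of the naming and expiry rules on these two slides, in illustrative
# Python rather than the C of the actual Cyrus patch.
# Assumptions: "greatest of 28 days and 100MB" is read here as "expire a message
# only when it is both older than 28 days and outside the newest 100 MB of
# expunged mail"; that reading may not match the real expire job.
# Expunged, select_for_expiry, expunged_name and deleted_name are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime

RETAIN_DAYS = 28                    # from the slide
RETAIN_BYTES = 100 * 1024 * 1024    # from the slide (100MB)


@dataclass(frozen=True)
class Expunged:
    """One expunged message as seen by the (hypothetical) expire job."""
    uid: int
    size: int          # bytes
    age_days: float    # time since the message was expunged


def expunged_name(folder: str) -> str:
    """Shadow folder that receives expunged messages: folder -> .EXPUNGED/folder"""
    return ".EXPUNGED/" + folder


def deleted_name(folder: str, when: datetime) -> str:
    """Hidden name for a deleted folder; the timestamp keeps the name unique."""
    return ".DELETED/{}-{}".format(folder, when.strftime("%Y%m%d-%H:%M:%S"))


def select_for_expiry(messages, retain_days=RETAIN_DAYS, retain_bytes=RETAIN_BYTES):
    """Return the messages the expire job would remove (assumed policy).

    A message survives if it is young enough OR if it still fits inside the
    size allowance when counting from the most recently expunged message.
    """
    expired = []
    running_bytes = 0
    for msg in sorted(messages, key=lambda m: m.age_days):  # newest first
        running_bytes += msg.size
        within_size = running_bytes <= retain_bytes
        within_age = msg.age_days <= retain_days
        if not (within_size or within_age):
            expired.append(msg)
    return expired
```

# Under this reading a user who rarely deletes anything keeps old expunged mail
# indefinitely (it fits in 100 MB), while a heavy deleter is still guaranteed
# 28 days of recoverable mail.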
# # More data does not mean more work # -- users don't look at expunged data # ######################################################################## %page Data replication %area 40 40 10 20 %center %newimage -zoom 100 "storage.jpg" %area 40 40 50 20 %center %newimage -zoom 100 "storage.jpg" %area 40 40 50 55 %center %newimage -zoom 100 "storage.jpg" %area 40 40 10 55 %center %newimage -zoom 100 "storage.jpg" # # Another good way of using up spare disk # # Our largest extension to Cyrus # ######################################################################## %page Data replication (1) Multi-purpose paired servers for resilience copying data to backup server moving users to different machine Crucial part of new Hermes extra sanity checking a good idea replication stops when in trouble MD5 checksum databases consistency check each night # # Both servers of a pair have real users and replicated users # -- continuous load means failures get noticed # # Moving users for load balancing or to take a machine out of service # ######################################################################## %page New Hermes architecture %center %newimage -yscrzoom 90 "hermes-new-fluffy.eps" # # Reminder # ######################################################################## %page Data replication (2) Application-level replication transaction-based corruption on master not replicated nor is extraneous low-level activity Asynchronous w.r.t. changes on master does not increase IMAP latency can catch up with backlog slightly less safe than synchronous easier to implement # # Asynchronous means it replicates to the slave # after the master has completed a transaction. # # Alternative to asynchronous is to ensure # transaction completed on master and slave # before reporting completion to user # -- what about temporary breakage? # -- (network, sysadmin, etc.) 
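# Sketch of what "asynchronous" means here, in illustrative Python.
# Assumptions: Master, Replica, append and replicate are hypothetical names, and
# the in-memory backlog stands in for whatever log the real replication engine
# keeps; none of this is the actual protocol.

```python
from collections import deque


class Replica:
    """Slave side: applies operations when it is reachable."""

    def __init__(self):
        self.mailboxes = {}
        self.online = True

    def apply(self, op):
        if not self.online:
            raise ConnectionError("replica unreachable")
        mailbox, uid, body = op
        self.mailboxes.setdefault(mailbox, {})[uid] = body


class Master:
    """Master side: completes the transaction locally, replicates later."""

    def __init__(self, replica):
        self.mailboxes = {}
        self.replica = replica
        self.backlog = deque()

    def append(self, mailbox, uid, body):
        # The user-visible transaction finishes here, so a slow or broken
        # replica adds nothing to IMAP latency.
        self.mailboxes.setdefault(mailbox, {})[uid] = body
        self.backlog.append((mailbox, uid, body))
        return "OK"

    def replicate(self):
        """Drain as much of the backlog as the replica will accept."""
        sent = 0
        while self.backlog:
            try:
                self.replica.apply(self.backlog[0])
            except ConnectionError:
                break  # temporary breakage: keep the backlog, retry later
            self.backlog.popleft()
            sent += 1
        return sent
```

# The window between append() returning and replicate() succeeding is the
# "slightly less safe than synchronous" part: operations still in the backlog
# exist only on the master.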
# ######################################################################## %page Replication protocol %center %newimage -zoom 200 "ambassador.jpg" # # Replication system uses a special text-based protocol # ######################################################################## %page Replication protocol overview Special protocol IMAP cannot replicate metadata timestamps, UIDs, UIDvalidity relies on cluster-wide UUIDs Efficient few client-server round-trips multiple operations are merged benefits from parallel running # # UID == message unique ID # UIDvalidity == mailbox sequence number # UUID == universally unique ID # # message UUIDs used to avoid unnecessarily transferring data # # a few minutes to resynchronize paired servers # after hours with replication turned off # ######################################################################## %page Replication protocol operations APPEND add messages to mailbox SEEN update list of user's seen messages MAILBOX synchronize whole mailbox message flag changes, expunges META synchronize user metadata subscriptions, filters, quotas USER synchronize everything all mailboxes and metadata # # APPEND and SEEN fall back to MAILBOX # MAILBOX and META fall back to USER # # MAILBOX also used for CREATE/RENAME/DELETE operations # # Example: replication of new account # -- message arrives # -- APPEND operation # -- promoted to MAILBOX # -- promoted to USER # -- account replicated # triggered by welcome message we send user # after migration from the old Hermes system # ######################################################################## %page Performance %center %newimage -yscrzoom 75 "robin.jpg" # # There are a few lurking performance gotchas in Cyrus ... # ######################################################################## %page Performance %center %newimage -yscrzoom 75 "ferrari.jpg" # # ...
which we have addressed # ######################################################################## %page Performance improvements %size 5 Extended cache contents standard cache omits popular data cache misses are expensive we include most header data in cache Lazy cache updates standard Cyrus often rewrites cache instead we leave unused data behind garbage-collect overnight Fast folder renames standard is to copy then delete we use rename system call when possible # # Cache miss involves opening and parsing each message file. # Trading off disk space against work # -- we have lots of disk # # Lazy cache updates particularly important with big caches # requires change to cache file format # # Fast renames important for us on first day of month # -- thousands of MUAs that do folder rotation # ######################################################################## %page Namespaces %center %newimage -zoom 100 "old-possum.jpg" # # We have also made some changes to ease the transition # -- particularly because of difficulties with folder naming # ######################################################################## %page Incompatibility UW-IMAP: Unix-style namespace ~fanf2/inbox ~fanf2/mail/saved-messages distinct directories and mailboxes Cyrus: Usenet-style namespace user.fanf2 user.fanf2.saved-messages Folders are "dual-use" contain messages and subfolders # # Could cause horrible MUA reconfiguration difficulties # ######################################################################## %page Compatibility Standard Cyrus options unixhierarchysep, altnamespace '/' separator, no user.fanf2 prefix Local addition unixnamespace ~fanf2/ ~/mail/ etc. become aliases for top level Plus hack to remove dual-use feature and add support for directory stubs # # Pine behavioral problems with dual-use mailboxes. # # Plan is to gradually move towards a more standard Cyrus setup # without unixnamespace and with dual-use folders. 
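# Sketch of the name translation the Compatibility slide describes, in
# illustrative Python.
# Assumptions: the alias list below covers only the prefixes the slide names
# plus their obvious variants ("etc." on the slide is not filled in);
# to_cyrus_internal is a hypothetical name; the real unixnamespace code, and
# Cyrus's handling of "." inside folder names, are not modelled.

```python
def to_cyrus_internal(user: str, name: str) -> str:
    """Map a UW-IMAP style mailbox name to a Cyrus internal name.

    ~user/inbox               -> user.<user>
    ~user/mail/saved-messages -> user.<user>.saved-messages
    """
    # Aliases for the top level, longest first so "~user/mail/" wins
    # over "~user/".
    aliases = (
        "~{}/mail/".format(user),
        "~/mail/",
        "~{}/".format(user),
        "~/",
    )
    for prefix in aliases:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break

    # IMAP treats the name INBOX case-insensitively.
    if name.upper() == "INBOX":
        return "user.{}".format(user)

    # unixhierarchysep: clients see '/', the store uses '.'.
    return "user.{}.{}".format(user, name.replace("/", "."))
```

# With a mapping like this, a client still configured for the old
# ~fanf2/mail/ layout and one using bare top-level names end up at the same
# mailbox, which is what avoids reconfiguring MUAs during migration.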
# ######################################################################## %page Migration %center %newimage -zoom 100 "wildebeest.jpg" # # Finally, we have to move the users from old to new # -- still in progress as we speak # ######################################################################## %page Migration ppsw knows which users have moved IMAP/POP proxy directs connections and Exim routes email appropriately Replication engine copies saved email Exim filters translated to Sieve (except when manually written) Process is by-and-large transparent except for misconfigured users # # Proxy was first part of new Hermes # # Users must have IMAP and POP clients pointed at # imap.hermes.cam.ac.uk and pop.hermes.cam.ac.uk # -- hermes.cam.ac.uk used to work # -- no longer points to IMAP and POP servers # # Special version of replication client (master) software # -- linked with c-client # # Users with hand-written filters must re-write them # -- most users set up filtering with the menu system # # There are a few trivial user-visible differences # -- menu system less useful # -- separate file store and message store # ######################################################################## %page Future work %center %newimage -zoom 150 "enterprise.jpg" # # Sysadmin's work is never done... # ######################################################################## %page Future work Better unexpunge user-interface too confusing, needs admin help Data replication for incoming messages via Exim & proxy on ppsw Shared mailboxes need extra cleverness Cyrus Murder, replication fixes Second machine room? pairwise-replicate between them # # More development work on Prayer is needed # -- e.g. internationalization # # Replication on ppsw would help for outgoing messages # and messages going to departments # # seen database handling in replication not good enough # work-around for mupdate server SPOF?
# # second machine room for protection against # power problems and fire alarms etc. # # may need to add some HA technology to automate fail-over # ######################################################################## %page That's all, folks %size 5 %center %newimage -zoom 100 "bugs.jpg" http://www-uxsup.csx.cam.ac.uk/~dpc22/ http://www-uxsup.csx.cam.ac.uk/~fanf2/ # # My paper, slides, and notes are available on the web. # (including errata!) # ########################################################################