% $Cambridge: hermes/doc/talks/2008-06-google/talk.tex,v 1.18 2008/06/03 14:50:10 fanf2 Exp $
% This file is included by notes.tex and slides.tex

\usepackage[english,british]{babel}
\usepackage{pgf,pgfarrows,pgfshade}
\usepackage{url}

\newcommand\image[2][]{\mode{\includegraphics[#1]{#2}}}

\DeclareUrlCommand\email{%
  \def\UrlLeft{<}%
  \def\UrlRight{>}%
  \urlstyle{tt}%
}

\author[Tony~Finch]{
  Tony~Finch\\
  \email{fanf2@cam.ac.uk}
  \email{dot@dotat.at}
}
\institute[Cambridge UCS]{%
  University of Cambridge\\
  Computing Service\\
  Mail Support%
}
\titlegraphic{%
  \raisebox{1pt}{%
    \includegraphics[height=20pt]{univ.jpg}%
  }%
  \hspace{6pt}%
  \includegraphics[height=22pt]{ucs.jpg}%
}

% end of preamble

\title{ Some thoughts on MTA architecture }
\subtitle{ \url{http://dotat.at/writing/mta-arch} }
\date{Monday 2 June 2008}

\begin{document}

\frame{\maketitle}

\begin{abstract}
The currently popular MTAs were all designed before the spam problem
became severe. Because of this, some of their basic architectural
decisions are unhelpful in the face of current email loads. What would
a 21st century spam-conscious MTA design look like? This talk discusses
some possible approaches.
\end{abstract}

% \section{Introduction}

\begin{frame}
\frametitle{About me}

\mode
{ My CV expressed as a list of email addresses: }

\begin{tabular}{ l l l }
1994 -- 1997 & \email{fanf2@cam.ac.uk} & computer science \\
1997 -- 2000 & \email{fanf@demon.net} & web server admin \\
2000 -- 2001 & \email{fanf@covalent.net} & Apache httpd coder \\
2002 -- now & \email{fanf2@cam.ac.uk} & postmaster \\
 & & \\
1997 \ldots & \email{dot@dotat.at} & \\
1999 \ldots & \email{fanf@apache.org} & httpd \\
2002 \ldots & \email{fanf@FreeBSD.org} & unifdef \\
2004 \ldots & \email{fanf@exim.org} & \\
2006 \ldots & \email{fanf@apache.org} & SpamAssassin \\
\end{tabular}
\end{frame}

Tony Finch is a Unix system developer. He has contributed to a number
of open source projects, including Exim, FreeBSD, and the Apache httpd.
He works for the University of Cambridge Computing Service where he
runs the university's central SMTP relay.

\begin{frame}
\frametitle{``Wouldn't it be nice if\ldots ?''}
\begin{columns}
\column{5cm}
\begin{itemize}
\item theoretical musings\\ on MTA architecture
\item originally a series of\\ postings on my blog,\\ Feb 2006 -- March 2007
\item there is no code\\ and no likelihood of code
\end{itemize}
\column{5cm}
\image[width=5cm]{einstein.jpg}
\end{columns}
\end{frame}

I need to emphasize up-front that this is \emph{not} an implementation
report! The original posts on my LJ were titled ``How not to design an
MTA''. This talk aims to be more positive.

The most popular MTAs were designed 10 or more years ago, before the
spam problem became overwhelmingly serious.
\begin{frame}
\frametitle{A snapshot of the problem}
\centering\image[width=6cm]{junkmail.png}
\begin{columns}[t]
\column{5cm}
Average email traffic \\ (legitimate and spam):
\begin{tabular}{ l l }
Mar 2005 & 15 \\
Mar 2006 & 20 \\
Mar 2007 & 35 \\
Mar 2008 & 80 \\
\end{tabular}
\begin{itemize}
\item all numbers in messages (or rejections) per second
\end{itemize}
\column{5cm}
Current traffic classification:
\begin{tabular}{ l c r }
relay attempts & 0.5 -- 1.5 \\
known malware & 2 -- 4 \\
blacklisted & 60 -- 75 \\
invalid recipient & 1.5 \\
invalid sender & 1.2 \\
SpamAssassin & 2 \\
legitimate email & 3 \\
internal email & 2.5 \\
\end{tabular}
\end{columns}
\end{frame}

These numbers come from the University of Cambridge's central email
relay which has about 30,000 users. The point is to notice which
numbers are bigger and therefore understand where we need to
concentrate effort on performance.

Note for example that the total number of address verifications for
rejected messages is about 5/sec which is nearly as much as the volume
of desirable email.

In this table, ``legitimate email'' covers email from outside the
University and does not include ``internal email'', so the volume of
desirable email is about 5.5 per second. ``Known malware'' is
identified from HELO hostname signatures.

Legitimate external email is about 4\% of the total, and internal email
is about 3\% of the total. About 86\% is rejected before address
verification.

% \section{Concurrency}

\begin{frame}
\frametitle{Concurrency}
\begin{columns}
\column{5cm}
\begin{itemize}
\item concurrency requirements \\ grow with spam volumes
\item most MTAs use an OS process per connection
\item really inefficient!
\end{itemize}
\column{5cm}
\image[width=5cm]{motorway.jpg}
\end{columns}
\end{frame}

Let's start off with the uncontroversial observation that most MTAs
still have really old-school inefficient concurrency architectures.
This is aggravated by some anti-spam techniques that slow down SMTP
conversations, such as ``greet-pause'' (the server delays its greeting
in order to spot abusive clients that don't wait their turn to speak)
and ``teergrubing'' (an attempt to waste spammers' resources by
delaying all SMTP responses --- which is fairly pointless when the
spammers' botnets have more resources than the good guys).

\begin{frame}
\frametitle{Waste vs efficiency}
\begin{itemize}
\item event-driven connection multiplexing
\item high-level languages with lightweight threads
\end{itemize}
\centering
\vspace{24pt}
better software performance $\Longrightarrow$ better hardware efficiency
\vspace{24pt}
\image[width=10cm]{hummer-prius.jpg}
\end{frame}

Instead of throwing hardware at the problem because it maxes out at
about 1000 concurrent connections, use \texttt{epoll} or
\texttt{kqueue} to multiplex 10,000+ connections\footnote{see
\url{http://www.kegel.com/c10k.html}}. Even better, write the program
in a concurrency-oriented language that has the infrastructure
built-in, such as Erlang.

\begin{frame}
\frametitle{Waste vs efficiency}
\begin{itemize}
\item best use of the available resources \ldots
\end{itemize}
\vspace{12pt}
\centering\image[width=10cm]{bikes.jpg}
\end{frame}

Bikes are perhaps a better analogy to lightweight concurrency than
comparing a Hummer with a Prius.

\begin{frame}
\frametitle{Some partial solutions}
\begin{itemize}
\item \texttt{SAUCE} -- software against UCE \\
  \url{http://www.chiark.greenend.org.uk/~ian/sauce/} \\
  (written in Tcl)
\item \texttt{qpsmtpd-async} -- anti-spam smtpd for qmail \\
  \url{http://smtpd.develooper.com/} \\
  (written in Perl)
\item MailChannels Traffic Control\texttrademark \\
  {\footnotesize\url{http://www.mailchannels.com/products/traffic-control.html}}
\end{itemize}
\end{frame}

(It so happens that these examples support my recommendation to write
in a higher-level language if you want better concurrency.)
These are ``partial solutions'' because they only deal with incoming
connections. SAUCE and MailChannels work as SMTP proxies in front of an
existing SMTP server, whereas qpsmtpd uses lower-level (non-SMTP) ways
to inject the message into the back-end MTA queue. They do not address
poor concurrency architecture in message routeing and delivery.

% \section{Verification}

\begin{frame}
\frametitle{Address verification}
\begin{itemize}
\item most verifications are for messages that will be rejected
\item email address routeing can be arbitrarily complicated \\
  so verification can be too!
\item concurrency useful for multi-recipient messages \\
  as well as multiple messages
\end{itemize}
\raggedleft
\image[width=5cm]{address.jpg}
\end{frame}

Address verification is the first difficult anti-spam check that an MTA
must perform --- DNS blacklists and protocol checks are relatively
cheap. I'll expand on these points in the next few slides, but first a
bit about the importance of thorough verification.

\begin{frame}
\frametitle{Avoid bouncing}
\begin{itemize}
\item reject unwanted email as early as possible
\item try hard not to accept and bounce
\item reduce spam backscatter \& forwarded spam
\item avoid wasting your MTA's resources
\end{itemize}
\centering
\image[width=6cm]{damper.jpg}
\end{frame}

If you accept email and later discover you can't deliver it, RFC 2821
says you are supposed to send a bounce message. But what if it's
undeliverable because it's spam? The sender address is forged and
you'll be sending an unwanted bounce to an innocent third party. You
don't want to drop the message in case of a legitimate
misconfiguration.

So you must reject incorrectly-addressed email, instead of accepting it
then bouncing it. Spam senders drop messages that are rejected.
Legitimate senders will generate their own delivery failure reports.
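
As a rough illustration of rejecting at RCPT time and of verifying the
recipients of one message concurrently, here is a minimal Python
sketch. Everything in it is an assumption made for the example: the
\texttt{VALID} set stands in for a real routeing engine, and
\texttt{verify} and \texttt{rcpt\_replies} are hypothetical names, not
part of any MTA.

```python
import asyncio

# Hypothetical table of deliverable addresses; a real MTA would run its
# full routeing engine here instead of a set lookup.
VALID = {"fanf2@cam.ac.uk", "postmaster@cam.ac.uk"}


async def verify(address):
    """Stand-in for routeing-based verification (assumed, not a real API)."""
    await asyncio.sleep(0)  # placeholder for DNS, LDAP or callout latency
    return address.lower() in VALID


async def rcpt_replies(recipients):
    """Verify all recipients concurrently; return one SMTP reply per RCPT."""
    results = await asyncio.gather(*(verify(r) for r in recipients))
    return ["250 OK" if ok else "550 Unknown user" for ok in results]


replies = asyncio.run(rcpt_replies(["fanf2@cam.ac.uk", "nobody@cam.ac.uk"]))
```

The invalid recipient gets a 550 during the SMTP conversation, so the
sending side is responsible for any failure report and no bounce is
generated locally.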
% \subsection{Routeing is verification}

\begin{frame}
\frametitle{How email addresses are routed}
\begin{columns}
\column{5cm}
\begin{itemize}
\item DNS --- MX/A/AAAA
\item flat files --- text or cdb
  \begin{itemize}
  \ttfamily
  \item aliases
  \item mailertable
  \item virtusertable
  \end{itemize}
\item LDAP --- ``laser'' schema
\item SQL databases
\end{itemize}
\column{5cm}
\image[width=5cm]{switchboard.jpg}
\end{columns}
\end{frame}

A lot of the complexity in MTAs exists to support the variety of ways
that postmasters want to configure the redirection and routeing of
email.

I've made particular mention of sendmail's routeing and redirection
tables because they are typical of the baroque lookup semantics that
MTAs support. In the virtusertable you can match all or part of an
address, or you can do wildcard matching, and the destination can be
another address or a custom error message \ldots

A few notes on performance. DNS lookups can be very high latency but
don't individually require many resources. An MTA's routeing engine
should support highly concurrent DNS resolution --- not one process per
message. On the other hand, LDAP and SQL connections are relatively
heavyweight, so an MTA should be able to limit the number of concurrent
database connections and schedule lookups across the connection pool.

There are a couple of common email routeing features that make
verification particularly tricky.

\begin{frame}
\frametitle{User-defined filtering}
\begin{itemize}
\item Sieve --- RFC 5228
\item address validity can be conditional on the sender's address
\item selective sub-address validity, e.g.
  \texttt{fanf9+subaddress@hermes.cam.ac.uk}
\end{itemize}
\raggedleft
\image[width=5cm]{sieve.jpg}
\end{frame}

First, a practical example. User-defined filtering means that email
address validity can change very dynamically, and the more so the
larger your user base is.
\begin{frame}
\frametitle{Routeing with regular expressions}
\begin{columns}
\column{5cm}
\begin{itemize}
\item try to match address \\ against a series of \\ regular expressions
\item when one matches, \\ replace address with \\ corresponding result
\item interpolate captured subexpressions
\item route resulting address, \\ repeating \texttt{regsub} \\ if necessary
\end{itemize}
\column{5cm}
\image[width=5cm]{turing.jpg}
\end{columns}
\end{frame}

Second, one for the computer scientists. It's well known that
Sendmail's rewriting language is Turing-complete. I showed in 2002 that
Exim is also Turing-complete by writing a configuration that does SK
combinator reduction. But any MTA that supports iterated regular
expression match and substitution is Turing-complete. This includes
Postfix --- which implies that its \texttt{trivial-rewrite(8)} daemon
is in fact non-trivial.

\begin{frame}
\frametitle{Verification: you're doing it wrong!}
\centering
Postfix \texttt{local\_recipients\_map}
\vspace{24pt}
\image[width=8cm]{wet-cat-1.jpg}
\end{frame}

Postfix and qmail are designed to have heavy separation between their
SMTP front-end and their address routeing code. This means that they
cannot conveniently use the routeing engine to verify addresses.
Consequently, qmail doesn't bother, and Postfix has the
\texttt{local\_recipients\_map} option to configure verification
performed by the SMTP front-end independently of the routeing engine.

This violates the Don't Repeat Yourself (``DRY'') principle and is
instead more like Write Everything Twice (``WET''). Features like
user-defined filters mean that a table defined by the postmaster can
only be an approximate list of valid addresses, which means some
undeliverable email will be accepted then bounced.
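
The iterated match-and-substitute routeing described in this section
can be sketched in a few lines of Python. This is only an illustration
under stated assumptions: the two rules in \texttt{RULES} are invented
for the example and are not taken from any real MTA configuration, and
the \texttt{max\_steps} limit is one arbitrary way of coping with the
fact that such rewriting need not terminate.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement) pairs, tried in order.
RULES = [
    # Strip a +subaddress from the local part.
    (re.compile(r"^(.+)\+[^@]*@(.+)$"), r"\1@\2"),
    # Redirect the bare domain to an assumed message-store host.
    (re.compile(r"^(.+)@cam\.ac\.uk$"), r"\1@hermes.cam.ac.uk"),
]


def route(address, max_steps=10):
    """Apply the first matching rule repeatedly until no rule matches."""
    for _ in range(max_steps):
        for pattern, replacement in RULES:
            if pattern.match(address):
                # Captured subexpressions are interpolated into the result.
                address = pattern.sub(replacement, address)
                break
        else:
            return address  # fixed point: no rule matched
    raise RuntimeError("rewrite did not terminate within max_steps")
```

Because the same \texttt{route} function would be used both to deliver
and to verify, there is no second table of valid addresses to keep in
step with it.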
% \subsection{Callout verification}

\begin{frame}
\frametitle{Verifying relayed addresses}
\begin{columns}
\column{5cm}
\image[width=5cm]{relay-race.jpg}
\column{5cm}
\mode{
\begin{pgfpicture}{0cm}{0cm}{5cm}{6cm}
% bottom to top
\pgfsetendarrow{\pgfarrowtriangle{4pt}}
\pgfrect[stroke]{\pgfxy(1,0)}{\pgfxy(3,1)}
\pgfputat{\pgfxy(2.5,0.5)}{\pgfbox[center,center]{department}}
\pgfxyline(2.5,3)(2.5,1)
\pgfputat{\pgfxy(2.5,2)}{\pgfbox[left,center]{~SMTP}}
\pgfrect[stroke]{\pgfxy(1,3)}{\pgfxy(3,1)}
\pgfputat{\pgfxy(2.5,3.5)}{\pgfbox[center,center]{incoming MX}}
\pgfxyline(2.5,6)(2.5,4)
\pgfputat{\pgfxy(2.5,5)}{\pgfbox[left,center]{~SMTP}}
\end{pgfpicture}
}
\end{columns}
\end{frame}

We have a number of departments who run their own email servers, but we
provide the MX service. When we route email to them we only look at the
domain part of the email address, which determines the destination
host. This means we don't know which local parts (usernames) are valid
and which are not. How can we avoid the accept-and-bounce problem in
this situation?

\begin{frame}
\frametitle{Verification: you're doing it wrong!}
\begin{columns}
\column{6cm}
\begin{itemize}
\item copy table of valid recipients from department to MX
\item configure MX to query department's LDAP directory
\end{itemize}
\column{4cm}
\image[height=6.4cm]{wet-cat-2.jpg}
\end{columns}
\end{frame}

A couple of ways you might solve the problem boil down to
re-implementing Postfix's \texttt{local\_recipients\_map}, but with
greater separation between the front and back parts of the system. You
have the same problems of over-estimating validity and writing
everything twice.
\begin{frame}
\frametitle{Call-forward recipient verification}
\ttfamily
\begin{tabular}{ l l }
220 mx.cam.ac.uk & \\
HELO dotat.at & \\
250 Hello & \\
MAIL FROM: & \\
250 OK & \\
RCPT TO: & 220 mta.cl.cam.ac.uk \\
 & HELO mx.cam.ac.uk \\
 & 250 Hello \\
 & MAIL FROM: \\
 & 250 OK \\
 & RCPT TO: \\
 & 550 Unknown user \\
550 Unknown user & QUIT \\
RSET & 221 Goodbye \\
\ldots & \\
\end{tabular}
\end{frame}

A slightly abbreviated example of how call-forward verification works.
I am sending email to an address at a department that runs its own mail
server. When my SMTP sender states the recipient address, the MX routes
the address to verify it, and finds that it is at a remote server. It
connects to that server and starts an SMTP conversation which it will
abort after getting the recipient verification result. It can then pass
the result back to my SMTP sender.

\paragraph{Advantages:}
No special arrangements between the MX and the departments: if the MX
can deliver email to the department, and if the department's server is
properly configured not to accept-then-bounce, then the MX can verify
addresses at the department's domain. No duplicated configuration.

\paragraph{Caveats:}
The MX needs to maintain a cache of verification results, so that if it
is being hammered it doesn't pass all of its load to the back end.

The MX also needs a way of dealing with callouts that take longer than
one SMTP timeout period. Ideally it will either tell the client to come
back later, or accept the message and hope for the best. In either case
it should continue the callout operation so that it can put a
definitive result in its cache. This should be easy to do if the
callout uses the MTA's normal delivery mechanisms, which are naturally
decoupled from the front end.

The MTA should be designed to make callouts as efficient as possible,
since most of them are going to be for messages that are rejected. In
particular it must not require disk transactions like Postfix's
implementation does.
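
The verification cache mentioned in the caveats might look something
like the following Python sketch. It is an in-memory illustration only:
the class name \texttt{CalloutCache}, the one-hour default lifetime and
the rule of caching only 2xx and 5xx replies are all assumptions made
for the example, and the \texttt{callout} argument stands in for the
real SMTP conversation with the department's server.

```python
import time


class CalloutCache:
    """Cache call-forward verification results for a limited time."""

    def __init__(self, callout, ttl=3600.0, clock=time.monotonic):
        self.callout = callout  # function: address -> SMTP reply string
        self.ttl = ttl          # seconds a cached reply stays valid (assumed)
        self.clock = clock      # injectable clock, which makes testing easy
        self.entries = {}       # address -> (expiry time, reply)

    def verify(self, address):
        now = self.clock()
        hit = self.entries.get(address)
        if hit is not None and hit[0] > now:
            return hit[1]               # served without touching the back end
        reply = self.callout(address)   # one SMTP conversation up to RCPT TO
        if reply[:1] in ("2", "5"):     # cache only definitive answers
            self.entries[address] = (now + self.ttl, reply)
        return reply
```

Temporary (4xx) replies are deliberately left out of the cache in this
sketch, so that a department's server which is briefly unavailable is
asked again on the next attempt. Nothing here touches the disk.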
Verification should be built in to the MTA. The \texttt{milter-ahead}
addon to Sendmail has to re-implement Sendmail's routeing features
(\texttt{virtusertable} and \texttt{mailertable}) in order to implement
call-forward verification. Yet another instance of ``write everything
twice''.

% \section{Content scanning}

\begin{frame}
\frametitle{Content scanning}
\begin{columns}
\column{6cm}
\image[width=6cm]{scanner.jpg}
\column{4cm}
\begin{itemize}
\item anti-spam
\item anti-phishing
\item anti-virus
\item lots of CPU
\item lots of memory
\end{itemize}
\end{columns}
\end{frame}

Programs like SpamAssassin and ClamAV.

\begin{frame}
\frametitle{Content scanning goals}
\begin{itemize}
\item decouple scanner from client concurrency \& speed
\item do not require entire message to be buffered in RAM
\item avoid temporary on-disk buffers
\item security boundary between content scanner(s) and MTA
\end{itemize}
\vspace{24pt}
\raggedleft
\image[width=6.2cm]{goals.png}
\end{frame}

The first requirement means that we should buffer the message while it
is being received --- for example, while it is being dribbled up a slow
DSL line from a compromised home computer. Once the entire message has
been received then we can run the scanner(s) over it as fast as they
can go. This minimizes the number of concurrent scanners we need to run
for a given number of messages per second.

\paragraph{Aside:}
The milter interface is designed to dribble the message from the MTA to
the scanner in blocks as it is received.

The second requirement means that we should write the message to a file
on disk, and rely on the operating system's buffer cache to keep the
message in RAM if there is space, or page it out if there is not.

The third requirement means that the file we use should be the
message's final resting place in the MTA's queue --- there must be no
unnecessary copying of the data or filesystem operations (such as
rename).
Content scanners are doing a difficult job with untrusted data so
ideally they would be run in a separate process with a different user
ID than the MTA -- which is in fact a fairly natural setup.

\begin{frame}
\frametitle{Data callout}
\begin{columns}
\column{5.5cm}
\begin{itemize}
\item use the normal \\ local delivery mechanism
\item efficiently transfer a file \\ from the queue to a program
\item cross security boundaries
\item control concurrency \\ and smooth load spikes
\end{itemize}
\column{4.5cm}
\image[width=4.5cm]{delivery.jpg}
\end{columns}
\end{frame}

Most of the goals are naturally fulfilled if the MTA has a ``data
callout'' mechanism. Like verification callouts, the front end asks the
back end to perform a task for it during the SMTP conversation. Data
callouts include the message data, instead of stopping after the
envelope.

There are some other interesting things that this mechanism makes much
easier. You can use a remote data callout to implement
application-level replication, so that the message is stored
redundantly in two local MTA queues before the sender gets confirmation
that it has been accepted. Or you can implement early delivery, in
which you see if the message can be delivered before returning
confirmation to the sender, so it only has to be \texttt{fsync}ed into
the queue if the delivery can't be completed within a short timeout.

% \section{Log-structured queue}

\begin{frame}
\frametitle{Queue layout}
\begin{itemize}
\item MTAs typically scatter messages all over the disk
\item often separate files for envelopes and contents
\item this makes queue runs particularly expensive
\end{itemize}
\vspace{12pt}
\image[width=6cm]{scatter.jpg}
\end{frame}

Two files per message (data and metadata) is twice as expensive as it
needs to be. Even one file per message (combined envelope and contents)
is too expensive: how do you find which message needs to be retried
next? Your disks are doing lots of seeks for random-access scanning of
the queue.
\begin{frame}
\frametitle{Log-structured queue}
\begin{itemize}
\item write all metadata sequentially to one file
\item queue runners read file sequentially
\item updated envelopes also appended to the file
\item queue runners act as garbage collectors
\item size of log bounded by retry interval
\end{itemize}
\vspace{12pt}
\raggedleft
\image[width=6cm]{logs.jpg}
\end{frame}

The idea here is that message contents are still written to one file
per message, so that we can let the kernel allocate space for it. We
only need to look at that file during delivery. Queue running and
routeing only need to look at the message envelope, and the log
structure makes it really easy to find which one to examine next
because the order in which envelopes are written is the order in which
messages need to be retried.

When a message is delivered to a recipient you need to update its
envelope to delete that recipient, and when there are none left mark
the envelope completed. Log-structured files are append-only, so when
you complete a delivery attempt you write a replacement envelope to the
end of the file. The original envelope then becomes garbage.

This has rather beautiful consequences.
\begin{frame}
\frametitle{Log-structured queue}
\centering
\begin{pgfpicture}{0cm}{0cm}{10cm}{3.25cm}
\pgfsetendarrow{\pgfarrowtriangle{4pt}}
\pgfdeclarehorizontalshading{log}{1cm}{gray(0cm)=(0.5);gray(8cm)=(1.0)}
\pgfrect[stroke]{\pgfxy(1,0)}{\pgfxy(8,1)}
\pgfputat{\pgfxy(5,0.5)}{\pgfbox[center,center]{\pgfuseshading{log}}}
\pgfputat{\pgfxy(5,0.5)}{\pgfbox[center,center]{envelope log}}
\pgfxyline(9,2.5)(9,1)
\pgfputat{\pgfxy(9,2.75)}{\pgfbox[center,bottom]{write}}
\pgfxyline(8,1)(8,2.5)
\pgfputat{\pgfxy(8,2.75)}{\pgfbox[center,bottom]{30m}}
\pgfxyline(7,1)(7,2.5)
\pgfputat{\pgfxy(7,2.75)}{\pgfbox[center,bottom]{1h}}
\pgfxyline(5,1)(5,2.5)
\pgfputat{\pgfxy(5,2.75)}{\pgfbox[center,bottom]{2h}}
\pgfxyline(1,1)(1,2.5)
\pgfputat{\pgfxy(1,2.75)}{\pgfbox[center,bottom]{4h}}
\end{pgfpicture}
\vspace{12pt}
\image[width=6cm]{queue.jpg}
\end{frame}

Most new envelopes written to the log immediately become garbage,
because most deliveries succeed first time. Each replacement envelope
has a retry time, which indicates which of the log readers will deal
with it when that reader gets to that point. When a log reader handles
an envelope it becomes garbage. The queue runner that processes the
oldest records in the log ensures that it leaves no live records
behind. This naturally bounds the size of the queue.

The traditional problem with log-structured filesystems and databases
is the cost of garbage-collecting old log records --- that is,
identifying which ones are still live. This isn't a problem for an MTA
queue because the queue runners need to scan old log records anyway, so
you get garbage collection for free.

The log structure also allows you to reduce \texttt{fsync} operations.
Deliveries must be synced as quickly as possible; however when you are
accepting a message then you can delay sending the confirmation to the
client in the hope that a delivery sync will do your job for you.
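
To make the garbage-collection argument concrete, here is a small
in-memory Python model of an append-only envelope log. It is a sketch
under heavy simplifying assumptions, not a design: there is no disk
I/O or \texttt{fsync}, a single queue runner replaces the several
readers at different retry ages, retry times are left out, and the
names \texttt{EnvelopeLog}, \texttt{append}, \texttt{complete} and
\texttt{run\_oldest} are invented for the example.

```python
from collections import deque


class EnvelopeLog:
    """In-memory model of an append-only envelope log (no real disk I/O)."""

    def __init__(self):
        self.log = deque()   # records in write order: (seq, msg_id, recipients)
        self.latest = {}     # msg_id -> seq of its newest (live) envelope
        self.seq = 0

    def append(self, msg_id, recipients):
        """Write a new or replacement envelope at the head of the log."""
        self.seq += 1
        self.log.append((self.seq, msg_id, tuple(recipients)))
        self.latest[msg_id] = self.seq   # any older envelope is now garbage

    def complete(self, msg_id):
        """All recipients delivered: every envelope for msg_id is garbage."""
        self.latest.pop(msg_id, None)

    def run_oldest(self, deliver):
        """Process the oldest record; garbage records are simply dropped."""
        seq, msg_id, recipients = self.log.popleft()
        if self.latest.get(msg_id) != seq:
            return                       # garbage: replaced or completed
        remaining = [r for r in recipients if not deliver(msg_id, r)]
        if remaining:
            self.append(msg_id, remaining)   # replacement envelope
        else:
            self.complete(msg_id)
```

The runner never searches for live records: it only ever looks at the
oldest one, and whatever it leaves behind is garbage, which is what
keeps the log bounded in this model.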
It probably also makes sense to put small messages in the log, to save
the cost of creating, syncing, and deleting a message contents file.

\paragraph{Aside:}
A fully-engineered log will probably be divided into a file for the
active head, and another that contains any envelopes that were not
completely delivered first time. A really busy machine might have
multiple head files on different disks.

% \section{Conclusion}

\begin{frame}
\frametitle{Architectural principles}
\begin{itemize}
\item lightweight concurrency throughout the system
\item load smoothing / scheduling of scarce resources
  \begin{itemize}
  \item database connections, content scanners
  \end{itemize}
\item address routeing is verification
\item content scanning is a data call-forward
\item a log-structured queue minimizes disk seeks
\end{itemize}
\centering
\image[width=76mm]{architecture.jpg}
\end{frame}

The same mechanism in the MTA is used to implement routeing and
verification, including all the concurrency and resource management.
The delivery mechanisms are used to implement call-forward recipient
verification and content scanning.

\begin{frame}
\frametitle{That's all, folks!}
\centering
\image[width=8cm]{beer.jpg}
\begin{itemize}
\item slides and notes available online:\\
  \url{http://dotat.at/writing/mta-arch}
\item any questions?
\end{itemize}
\end{frame}

\end{document}

% eof