Saturday, March 3, 2007

Characterizing the Latency of my Mail Server

Characterization fascinates me. Knowing in detail what you are trying to solve lets you focus on what is important. Sometimes what is important is everything that is possible, and sometimes it is everything that is likely. Either way, what is possible or likely for a given problem tends to be a lot smaller than an unbounded "anything at all". This is true for computers, business, planning your vacation, whatever. Characterization is key.

One of my favorite books of all time, Web Protocols and Practice, does a really great job of this for the web circa 2002. The web has changed in some ways since then and an update would be welcome, but many of the fundamentals still apply.

In 2002 I started working on XML-aware networking. That space was changing so fast it was very hard to characterize the workloads we were seeing. That made it harder to build really great products: when an average workload is 50 bytes one day and 50 megabytes the next, you want to focus on different things. The XML space still shows lots of variation, but it is maturing now in a way that makes it ripe for a treatment like WPaP.

Anyhow, I was thinking about this the other day when I was reading a paper about YeAH-TCP. It is popular nowadays to attack the problems TCP is well known for on paths with a high bandwidth-delay product. This used to be a research problem, but now it is thought to impact common desktop stacks too. That got me wondering a bit. I spend a lot of time in the datacenter working to fill high-speed, low-latency links. A few years back, when I was in the ISP world at AppliedTheory, I saw an awful lot of low-bandwidth, low-latency links (it was 1999 - the Internet core was great, but the last miles were still comparatively slow - 30Mbps was big bucks to your door). Now home users are seeing big bandwidths (FiOS, U-verse, etc.), but I have no idea what has happened to a typical desktop RTT in the past few years.

Being a do-it-yourself kind of guy, I run a mail server for a vanity domain over standard copper DSL. In order of frequency it receives: spam, linux-kernel mail, other mailing list mail, and an occasional note someone actually wrote with me in mind. I figured it would be easy enough to do a tcpdump capture of incoming SMTP connections and post-process that to figure out what RTTs looked like these days.

It turns out that figuring out the elapsed time of a TCP handshake from a packet trace is not particularly easy. I can usually cobble something together with tcpdump, or tcpflow, or maybe wireshark, but I couldn't figure out how to say "show me only the SYN-ACK and the ACK to that" for every stream. I ended up writing some very hackish C code. Anybody who knows me also knows I enjoyed doing that, but it was a chunk of very unportable work that should have been more scriptable.
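For illustration, here is a minimal Python sketch of that SYN-ACK/ACK pairing. It is not the C code mentioned above. It assumes the capture has already been decoded into per-packet records; the `Pkt` type and the `handshake_rtts_ms` function are hypothetical names made up for this example, and the choice to time from the first SYN-ACK of a connection is also an assumption.

```python
from typing import NamedTuple

class Pkt(NamedTuple):
    """One decoded TCP packet (hypothetical record layout for this sketch)."""
    ts: float     # capture timestamp, seconds
    src: str      # source IP
    sport: int    # source port
    dst: str      # destination IP
    dport: int    # destination port
    syn: bool     # SYN flag set
    ack: bool     # ACK flag set
    seq: int      # sequence number
    ackno: int    # acknowledgment number

def handshake_rtts_ms(packets):
    """Return one RTT in milliseconds per completed handshake.

    Measured on the server side: time from the server's SYN-ACK
    to the client's ACK of that SYN-ACK.
    """
    # (client ip, client port, server ip, server port) -> (SYN-ACK time, expected ACK number)
    pending = {}
    rtts = []
    for p in packets:
        if p.syn and p.ack:
            # SYN-ACK from the server. Keep the first one seen, so a
            # retransmitted SYN-ACK does not reset the clock.
            key = (p.dst, p.dport, p.src, p.sport)
            pending.setdefault(key, (p.ts, (p.seq + 1) % 2**32))
        elif p.ack and not p.syn:
            # Possible final ACK of the handshake, from the client.
            key = (p.src, p.sport, p.dst, p.dport)
            sent = pending.get(key)
            if sent is not None and p.ackno == sent[1]:
                rtts.append((p.ts - sent[0]) * 1000.0)
                del pending[key]  # later ACKs on this connection are ignored
    return rtts
```

Because the clock starts at the first SYN-ACK, a lost SYN-ACK shows up as one very large sample rather than being hidden. Handshakes that never complete simply produce no sample.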

Anyhow - on to the results. The capture covered about 24 hours. The server is not very busy. It received 1536 incoming connections over that time, and managed to complete the handshake on 99.1% (1523) of them. I have divided the results into "all connections", "all non-lkml connections", and "just lkml connections". Everything is measured in milliseconds.


               ALL   NON-LKML   LKML
samples       1523       1257    266
--
100th pct    61021      61021    161
90th pct       795        978    113
50th pct       184        233    100
10th pct        90         81     99
0th pct         27         27     98
--
mean           585        687    102
--
% > 100ms       77         86     50
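The rows of the table can be computed from a list of per-connection RTTs with a few lines of Python. This is only a sketch: the nearest-rank percentile definition is an assumption (the method actually used is not stated), and `percentile` and `summarize` are hypothetical names for this example.

```python
def percentile(sorted_ms, pct):
    """Nearest-rank percentile of an ascending list.

    pct is an integer from 0 to 100; 0 gives the minimum, 100 the maximum.
    """
    if not sorted_ms:
        raise ValueError("no samples")
    n = len(sorted_ms)
    rank = max(1, (pct * n + 99) // 100)  # integer ceiling of pct * n / 100
    return sorted_ms[rank - 1]

def summarize(rtts_ms, threshold_ms=100):
    """Summarize RTT samples in the same shape as the table's rows."""
    s = sorted(rtts_ms)
    over = sum(1 for r in s if r > threshold_ms)
    return {
        "samples": len(s),
        "percentiles": {p: percentile(s, p) for p in (100, 90, 50, 10, 0)},
        "mean": sum(s) / len(s),
        "pct_over_threshold": 100.0 * over / len(s),
    }
```

Running `summarize` once on every sample and once on each subset (for example, connections split by peer address) would give one column per call.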


So there you go - if you believe this is representative, then more than 3/4 of connections out there are floating around with RTTs over 100ms. Even communications with a significant high-volume server in my own timezone (vger.kernel.org) are likely to be in that neighborhood. The days of high bandwidth-delay products do anecdotally seem to have arrived on the desktop.
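To put a number on that, the bandwidth-delay product is just link rate times round-trip time: the amount of data that has to be in flight to keep the pipe full. The sketch below uses the 100ms figure from the table. The link rates are hypothetical round numbers, not measurements, and `bdp_bytes` is a name made up for this example. The 65535-byte limit is the largest window TCP can advertise without the window scale option.

```python
def bdp_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: data in flight needed to fill the link."""
    return bandwidth_bits_per_s * rtt_ms / 8000  # /8 bits per byte, /1000 ms per s

MAX_UNSCALED_WINDOW = 65535  # bytes; 16-bit TCP window field, no window scaling

RTT_MS = 100  # roughly the median-or-worse RTT seen in the capture

# Hypothetical link rates in Mbit/s, chosen only for illustration.
for mbps in (1.5, 20):
    bdp = bdp_bytes(mbps * 1_000_000, RTT_MS)
    needs_scaling = bdp > MAX_UNSCALED_WINDOW
    print(f"{mbps} Mbit/s at {RTT_MS} ms: {bdp:.0f} bytes in flight, "
          f"exceeds unscaled window: {needs_scaling}")
```

At the slower rate the product fits inside an unscaled window; at the faster one it does not, which is where the high bandwidth-delay work starts to matter for a desktop stack.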

Interesting questions:
  • Spam comes from bots - is that different from 'legit' traffic? If so, does that matter?
  • This is server side. Would it look different if I were measuring handshakes I initiated?
  • Is SMTP just like HTTP? Just like video?