Monday, July 20, 2009

What is eating those Google SYN-ACKs?

In this post, I mentioned google was seeing huge packet loss on syn-acks from their servers. At the time it looked like 2%. That sounded nuts.

It still sounds nuts.

Someone else on the mailing list posted about that, and Jerry Chu of Google confirmed it:

Our overall pkt retransmission rate often goes over 1%. I was
wondering if SYN/SYN-ACK pkts are less likely to be dropped
by some routers due to their smaller size so we collected traces
and computed SYN-ACK retransmissions rate on some servers.
We confirmed it to be consistent with the overall pkt drop rate,
i.e., > 1% often.


You could imagine why the overall retransmission rate might be higher than the real drop rate due to jitter and various fast retransmit algorithms that might retransmit things that just hadn't been acknowledged quite yet. Even SYNs might be dropped at the host (instead of the network) due to queue overflows and such.. but we're talking about SYN-ACKs from busy servers towards what one would expect would be pretty idle google-searching clients. And these SYN-ACKs have giant timeouts (3 seconds - which is why Jerry was writing in the first place) so it certainly isn't a matter of over-aggressive retransmit. The only explanation seems to be packet loss. At greater than 1%

wow.

This probably has more to do with the global nature of google's audience than anything else. But still, TCP can really suck at loss rates that high. It must be very different than the desktop Internet I know (which is a fair-to-middlin cable service, not a fancy Fiber-To-The-Home setup which is becoming more common.)

I wonder exactly where those losses happen.