Saturday, December 8, 2012

Managing Bandwidth Priorities In Userspace via TCP RWIN Manipulation

This post is a mixture of research and speculation. The speculative thinking is about how to manage priorities among a large number of parallel connections. The research is a survey of how different OS versions handle attempts by userspace to actively manage the SO_RCVBUF resource.

Let's consider the case where a receiver needs to manage multiple incoming TCP streams and the aggregate sending windows exceed the link capacity. How do they share the link?

There are lots of corner cases, but it basically falls into two categories: longer-term flows try to balance themselves more or less evenly with each other, while new flows (or flows that have been quiescent for a while) blindly inject data into the link based on their Initial Window configuration (3, 10, or even 40 [yes, 40] packets).

This scenario is a pretty common one for a web browser on a site that does lots of sharding. It is common these days to see 75, 100, or even 200 parallel connections used to download a multi-megabyte page. Those different connections have an internal priority to the web browser, but that information is more or less lost when HTTP and TCP take over - the streams compete with each other blind to the facts of what they carry or how large that content is. If you have 150 new streams sending IW=10, that means 1500 packets can be sent more or less simultaneously even if the content is low priority. That kind of data is going to dominate most links and cause losses on others. Of course, unknown to the browser, those streams might only have 1 packet worth of data to send (small icons, or 304s) - it would be a shame to dismiss parallelism in those cases out of fear of sending too fast. Capping parallelism at a lower number is a crude approach that will hurt many use cases and is very hard to tune in any event.

The receiver does have one knob to influence the relative weights of the incoming streams: the TCP Receive Window (rwin). The sender generally determines the rate a stream can transfer at (via the CWND calculation), but that rate is limited to no more than rwin allows.

I'm sure you see where this is going - if the browser wants to squelch a particular stream (because the others are more important) it can reduce the rwin of that stream to below the Bandwidth Delay Product of the link - effectively slowing just that one stream. Voila - crude priorities for the HTTP/TCP connections! (RTT is going to matter a lot at this point for how fast they run - but if this is just a problem about sharding then RTTs should generally be similar amongst the flows, so you can at least get their relative priorities straight.)
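
To make that concrete, here's a back-of-envelope sketch (plain C, with assumed illustrative numbers) of the rate cap a small rwin imposes on a single stream:

    /* A stream can move at most rwin bytes per RTT. With an assumed
     * 100ms RTT, clamping rwin to 4KB caps the stream around 330 kbit/s
     * no matter how large the sender's cwnd wants to grow. */
    #include <stdio.h>

    int main(void) {
        double rwin_bytes = 4 * 1024; /* squelched receive window */
        double rtt_sec = 0.100;       /* assumed round trip time */
        double max_bits_per_sec = (rwin_bytes / rtt_sec) * 8;
        printf("max rate: %.0f kbit/s\n", max_bits_per_sec / 1000.0);
        return 0;
    }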

setsockopt(SO_RCVBUF) is normally how userspace manipulates the buffers associated with the receive window. I set out to survey common OS platforms to see how they handle dynamic tuning of that parameter in order to manipulate the receive window.
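
As a point of reference, here is a minimal sketch of the knob itself (error handling mostly omitted): resize SO_RCVBUF on a live socket and read back what the kernel actually granted.

    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask the kernel to resize the receive buffer on a connected socket,
     * then report what it granted; the survey below is about how (and
     * whether) each OS maps this onto the advertised rwin. */
    void set_rwin_hint(int fd, int bytes) {
        socklen_t len = sizeof(bytes);
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, len) != 0)
            perror("setsockopt(SO_RCVBUF)");
        int granted = 0;
        len = sizeof(granted);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
        printf("asked for %d, kernel reports %d\n", bytes, granted);
    }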

WINDOWS 7
Win7 does the best job; it allows SO_RCVBUF to both dynamically increase and decrease rwin (decreases require the WsaReceiveBuffering socket ioctl). Increasing a window is instantaneous and a window update is generated for the peer right away. However, when decreasing a window (and this is true of all platforms that allow decreasing) the window is not slammed shut - it naturally degrades with data flow and is simply not recharged as the application consumes the data, which results in the smaller window.

For instance, suppose a stream has an rwin of 64KB and decreases it to 2KB. A window update is not sent on the wire even if the connection is quiescent. Only after 62KB of data has been sent does the window shrink naturally to 2KB, even if the application has consumed the data from the kernel - future reads will recharge rwin back to a max of 2KB. But this process is not accelerated in any way by the decrease of SO_RCVBUF, no matter how long it takes. Of course there would always be 1 RTT between the decrease of SO_RCVBUF and the time it takes effect (where the sender could send larger amounts of data), but by not "slamming" the value down with a window update (which I'm not even certain would be TCP compliant) that period of time is extended from 1 RTT to indefinite. Indeed, the application doesn't need SO_RCVBUF at all to achieve this limited definition of decreasing the window - it can simply stop consuming data from the kernel (or pretend to do so by using MSG_PEEK) and that would be no more or less effective.
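
As a sketch of that last point - emulating the same slow squelch without touching SO_RCVBUF at all (illustrative C, error handling omitted):

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Peeking leaves the data queued in the kernel, so the advertised
     * window stays consumed and no window update goes out to the peer. */
    ssize_t hold_window_closed(int fd, char *buf, size_t len) {
        return recv(fd, buf, len, MSG_PEEK);
    }

    /* A normal read drains the socket buffer and recharges rwin. */
    ssize_t release_window(int fd, char *buf, size_t len) {
        return recv(fd, buf, len, 0);
    }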

If we're trying to manage a lot of parallel busy streams this strategy for decreasing the window will work ok - but if we're trying to protect against a new/quiescent stream suddenly injecting a large amount of data it isn't very helpful. And that's really the use case I have in mind.

The other thing to note about Windows 7 is that the window scale option (which effectively controls how large the window can get, and is not adjustable after the handshake) is set to the smallest possible value for the SO_RCVBUF set before the connect. If we agree that starting with a large window on a squelched stream is problematic because decreases don't take effect quickly enough, that implies the squelched stream needs to start with a small window. Small windows do not need window scaling. This isn't a problem for the initial squelched state - but if we want to free the stream up to run at a higher speed (perhaps because it now has the highest relative priority of the active streams after some of the old high-priority ones completed), the maximum rwin is going to be 64KB - going higher than that requires window scaling. A 64KB window can support significant data transfer (e.g. 5 megabits/sec at 100ms RTT) but certainly doesn't cover all the use cases of today's Internet.

WINDOWS VISTA
When experimenting with Vista I found behavior similar to Windows 7. The only difference I noted was that it always used a window scale option in the handshake, even if initially an rwin < 64KB was going to be used. This allows connections with small initial windows to be raised to greater sizes than is possible on Windows 7 - which, for the purposes of this scheme, is a point in Vista's favor.

I speculate that the change was made in Windows 7 to increase interoperability - window scale is sometimes an interop problem with old NATs and firewalls, and the OS clearly doesn't anticipate the rwin range being actively manipulated while the connection is established. Therefore, if window scale isn't needed for the initial window size, you might as well omit it out of an abundance of caution.

WINDOWS XP
By default XP does not do window scaling at all - limiting us to 64KB windows, which is why multiple connections are required to use all the bandwidth found in many homes these days. It doesn't allow shrinking rwin at all (but as we saw above, the Vista/Win7 approach to shrinking the window isn't any more useful than what can be emulated through non-SO_RCVBUF approaches).

XP does allow raising the rwin. So a squelched connection could be set up with a low initial window and then raised up to 64KB when its relative ranking improved. The sticky wicket here is that attempts to set SO_RCVBUF below 16KB don't appear to work. 16KB maps to a new connection with IW=10 - having a large set of new squelched connections, each with the capacity to send 10 packets, probably doesn't meet the threshold of squelched.

OS X (10.5)
The latest version of OS X I have handy is dated. Nothing in a google search leads me to believe this has changed since 10.5, but I'm happy to take updates.

OS X, like XP, does not allow decreasing a window through SO_RCVBUF.

It does allow increasing one, if the window size was set before the connect - otherwise "auto tuning" is used and the window cannot be manipulated while the connection is established.

Like Windows 7, the initial window determines the scaling factor; assuming a small window on a squelched stream, that means window scaling is disabled and the window can only be increased to 64KB for the life of the connection.

LINUX/ANDROID
Linux can decrease rwin, but as with the other platforms it requires data transfer to do so rather than a 1-RTT slamming shut of the window. Linux does not allow increasing the window past the size it had when the connection was established. So you can start a window large, slowly decrease it, and then increase it back to where it started - but you can't start a window small and then increase it, as you would want to do with a squelched stream.

It pains me to say this, and it is so rarely true, but this makes Linux the least attractive development platform for this approach.

CONCLUSIONS
From this data it seems viable to attempt a strategy for {Windows >= Vista and OS X} that mixes three types of connections:
  1. Fully autotuned connections. These are unsquelched, can never be slowed down, generally use window scaling, and are capable of running at high speeds.
  2. Connections that begin with small windows and are currently squelched to limit the impact of new low-priority streams in highly parallel environments.
  3. Connections that were previously squelched but have now been upgraded to 64KB windows - "almost full" (1) if you will. A sketch of these classes follows the list.
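
Here's that sketch in code (hypothetical names, not actual browser code):

    /* The three connection classes the survey suggests. */
    enum conn_class {
        CONN_FULL,      /* autotuned, window scaled, never squelched */
        CONN_SQUELCHED, /* opened with a tiny fixed rwin to bound new-stream bursts */
        CONN_UPGRADED   /* formerly squelched; rwin raised to the 64KB no-scaling cap */
    };

    /* Assumed policy hook: promote a squelched connection when its relative
     * priority rises, accepting the 64KB ceiling for the connection's life. */
    enum conn_class promote(enum conn_class c) {
        return (c == CONN_SQUELCHED) ? CONN_UPGRADED : c;
    }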

Wednesday, December 5, 2012

Smarter Network Page Load for Firefox

I just landed some interesting code for bug 792438. It should be in the December 6th nightly, and if it sticks it will be part of Firefox 20.

The new feature keeps the network clear of media transfers while a page's basic html, css, and js are loaded. This results in a faster first paint, as the elements that block initial rendering are loaded without competition. Total page load is potentially slightly regressed, but the responsiveness more than makes up for it. Implementations of similar algorithms report first-paint improvements of 30% with page-load regressions around 5%. There are some cases where the effect is a lot more obvious than 30% - try pinterest.com, 163.com, www.skyrock.com, or www.tmz.com - all of these sites have large amounts of highly parallel media on the page that, in previous versions, competed with the required css.

Any claim of benefit is going to depend on bandwidth, latency, and the contents of your cache (e.g. if you've already got the js and css cached, this is moot; or if you have a localhost connection like talos tp5 uses, it is also moot because bandwidth and latency are essentially ideal there).

Before I get to the nitty gritty I think it's worth a paragraph of Inside-Mozilla-Baseball to mention what a fun project this was. I say that in particular because it involved a lot of cross-team participation on many different aspects (questions, advice, data, code in multiple modules, and reviews). I think we can all agree that when many different people are involved on the same effort, efficiency is typically the first casualty. Perhaps this project is the exception needed to prove the rule - it went from a poorly understood use case to committed code very quickly. Ehsan, Boris, Dave Mandelin, Josh Aas, Taras, Henri, Honza Bambas, Joe Drew, and Chromium's Will Chan helped one and all - it's not too often you get the rush of everyone rowing in sync, but it happened here and it was awesome to behold and good for the web.

In some ways this feature is not intuitive. Web performance is generally improved by adding parallelization, due to the large amounts of unused bandwidth left over by the single-session HTTP/1 model. In this case we are actually reducing parallelization, which you wouldn't think would be a good thing. Indeed, that is why it can regress total page load time for some use cases. However, parallelization is a double-edged sword.

When parallelization works well it is because it helps cover up moments of idle bandwidth in a single HTTP transaction (e.g. latency in the handshake, latency during the header phase, or even latency pauses involved in growing the TCP congestion window) and this is critically important to overall performance. Servers engage in hostname sharding just to opt into dozens of parallel connections for performance reasons.

On the other hand, when it works poorly, parallelization kills you in a couple of different ways. I'll be talking about the more spectacular failure mode in a different post (bug 813715), but briefly: the oversubscription of the link creates induced queues and TCP losses that interact badly with TCP congestion control and recovery. The issue at hand here is more pedestrian - if you share a link 50 ways and all 50 sessions are busy, then each one gets just 2% of the bandwidth. They are inherently fair with each other and without priority, even though that doesn't reflect their importance to your page.

If the required js and css is only getting 10% of the bandwidth while the images are getting 90%, then the first meaningful paint is woefully delayed. The reason you do the parallelism at all is that many of those connections will be going through one of the aforementioned idle moments and aren't all simultaneously busy - so it's a reasonable strategy as long as maximizing total bandwidth utilization is your goal. But in the case of an HTML page load some resources are more important than others, and it isn't worth sacrificing that ordering to perfectly optimize total page load. So this patch essentially breaks page load into two phases to sort out that problem.

The basic approach here is the same as used by Webkit's ResourceLoadScheduler. So Webkit browsers already do the basic idea (and have validated it). I decided we wanted to do this at the network layer instead of in content or layout to enable a couple extra bonuses that Webkit can't give because it operates on a higher layer:
  1. If we know a priori that a piece of work is highly sensitive to latency but very low in bandwidth, then we can avoid holding it back even if that work is just part of an HTTP transaction. As part of this patchset I have included the ability to preconnect a small quota of 6 media TCP sessions at a time while css/js is loading. More than 6 can complete; it's just limited to 6 outstanding at one instant to bound the bandwidth consumption (see the sketch after this list). This results in a nice little hot pool of connections all ready for use by the image transfers when they are cleared for takeoff. You could imagine this being slightly expanded to a small number of parallel HTTP media transfers that were bandwidth throttled in the future.
  2. The decision on holding back can be based on whether or not SPDY is in use, if you wait until you have a connection to make the decision - SPDY is a prioritized, muxed-on-one-TCP-session protocol that doesn't need this kind of workaround to do the right thing. In its case we should just send the requests as soon as possible, with appropriate priorities attached, and let the server do the right thing. The best of both worlds!
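
Here is the sketch of the preconnect quota from (1) - hypothetical counter names, not the actual patch:

    #define PRECONNECT_QUOTA 6  /* at most 6 speculative media handshakes at once */

    static int preconnects_in_flight = 0;

    /* Called when the loader sees a media resource while css/js still loads. */
    int try_media_preconnect(void) {
        if (preconnects_in_flight >= PRECONNECT_QUOTA)
            return 0;             /* over quota: don't start another handshake */
        preconnects_in_flight++;  /* kick off the async TCP handshake here */
        return 1;
    }

    /* Called when a speculative handshake completes or fails; more than 6
     * can complete over the page load, just never 6+ outstanding at once. */
    void media_preconnect_done(void) {
        preconnects_in_flight--;
    }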

This issue illustrates again how dependent HTTP/1 is on TCP behavior, and how that is a shortcoming of the protocol. Reasonable performance demands parallelism, but how much depends on the semantics of the content, the conditions of the network, and the size of the content. These things are essentially unknowable, and even characterizations of the network and typical content change very quickly. It's essential for the health of the Internet that we migrate away from this model onto HTTP/2.

Comments over here please: https://plus.google.com/100166083286297802191

Wednesday, November 14, 2012

A brief note on pipelines for firefox

A brief note on http pipeline status - you may have read on planet mozilla yesterday that Chrome has this technology enabled. That's not true; I can't say why the author wrote it, but Chrome has the same status (implemented but configured off by default) as Firefox, and largely for the same reasons.

This is a painful note for me to write, because I'm a fan of HTTP pipelines.

But, despite some obvious wins with pipelining, it remains disabled as a high-risk / high-maintenance item. I use it and test with it every day with success, but much of the risk is tied up in a user's specific topology - intermediaries (including virus checkers, an oft-ignored but very common intermediary) are generally the problem with 100% interop. I encourage readers of planet moz to test it too - but that's a world apart from turning it on by default; the test matrix of topologies is just very different.

See bugzilla 716840 and 791545 for some current problems caused by unknown forces.

Also see telemetry for TRANSACTION_WAIT_TIME_HTTP and TRANSACTION_WAIT_TIME_HTTP_PIPELINES - you'll see that pipelines do marginally reduce queuing time, but not by a heck of a lot in practice (~65% of transactions are sent within 50ms using straight HTTP, ~75% with pipelining enabled). If we don't see more potential than that, it isn't worth taking the pain of interop troubles. It is certainly possible that those numbers can be improved.

The real road forward though is HTTP/2 - a fully multiplexed and pipelined protocol without all the caveats and interop problems. It's better to invest in that. Check out TRANSACTION_WAIT_TIME_SPDY and you'll see that 93% of all transactions wait less than 1ms in the queue!

Monday, August 6, 2012

The Road to HTTP/2

I have been working on HTTP/2.0 standardization efforts in the IETF. Lots has been going on lately, lots of things are still in flux, and many things haven't even been started yet. I thought an update for Planet Mozilla might be a useful complement to what you will read about it in other parts of the Internet. The bar for changing a deeply entrenched architecture like HTTP/1 is very high - but the time has come to make that change.

Motivation For Change - More than page load time

HTTP has been a longtime abuser of TCP and the Internet. It uses mountains of different TCP sessions that are often only a few packets long. This triggers lots of overhead and results in common stalling due to bad interaction with the way TCP was envisioned to be deployed. The classic TCP model pits very large FTP flows against keystrokes of a telnet session - HTTP doesn't map to either of those very well. The situation is so bad that over 2/3rds of all TCP packet losses are repaired via the slowest possible mechanism (timer expiration), and more than 1 in 20 transactions experience a loss event. That's a sign that TCP interactions are the bottleneck in web scalability.

Indeed, after a certain modest amount of bandwidth is used additional bandwidth barely moves the needle on HTTP performance at all - the only thing that matters is connection latency. Latency isn't getting better - if anything it is getting worse with the transition to more and more mobile networks. This is the quagmire we are in - we can keep adding more bandwidth, clearer radios, and faster processors in everyone's pocket as we've been doing but that won't help much any more.

These problems have all been understood for almost 20 years and many parties have tried to address them over time. However, "good enough" has generally carried the day due to legitimate concerns over backward compatibility and the presence of other lower hanging fruit. We've been in the era of transport protocol stagnation for a while now. Only recently has the problem been severe enough to see real movement on a large scale in this space with the deployment of SPDY by google.com, Chrome, Firefox, Twitter, and F5 among others. Facebook has indicated they will be joining the party soon as well along with nginx. Many smaller sites also participate in the effort, and Opera has another browser implementation available for preview.

SPDY uses a set of time-proven techniques: primarily multiplexed transactions on the same socket, some compression, prioritization, and good integration with TLS. The results are strong: page load times are improved, TCP interaction improves (including fewer connections to manage, less dependency on RTT as a scaling factor, and improved "buffer bloat" properties), and incentives are created to give users the more secure browsing experience they deserve.

Next Steps

Based on that experience, the IETF HTTPbis working group has reached consensus to start working on HTTP/2.0 based on the SPDY work of Mike Belshe and Roberto Peon. Mike and Roberto initially did this work at Google; Mike has since left Google and is now the founder of Twist. While the standards effort will be SPDY based, that does not mean it will be backwards compatible (it almost certainly won't be), nor does it mean it will have the exact same feature set - but for it to be a success the basic properties will need to persevere, and we'll end up with a faster, more scalable, and more secure Web.

Effectively, taking this work into an open standardization forum means ceding control of the protocol revision process to the IETF and agreeing to implement the final result, as long as it doesn't get screwed up too badly. That is a best-effort process and we'll just have to participate and then wait to see what becomes of it. I am optimistic - all the right players are in the room and want to do the right things. Just as we implemented SPDY as an experiment in the past in order to get data to inform its evolution, you can expect Mozilla to vigorously fight for the people of the web to have the best protocol possible, and it seems likely we will experiment with implementations of some Internet-Draft level revisions of HTTP/2.0 along the way to validate those ideas before a full standard is reached. (We'll also be deprecating all of these interim versions along the way as they are replaced with newer ideas - we don't want to create de facto protocols by mistake.) The period of stagnant protocols is ending.

Servers should come along too - use (and update) mod_spdy and nginx with spdy support; or get hosting from an integrated spdy services provider like CloudFlare or Akamai.

Exciting times.


Wednesday, April 25, 2012

Making Firefox Search Snappier

The Firefox 15 development window just opened and I checked into inbound a cool feature that had been sitting in my queue for a little while. Let's see if it sticks!

The feature basically lets non-network code hint to the networking layer that it will probably send an http transaction to a particular site soon, but that it isn't ready to do so yet. The network code can take the hint and begin setting up a TCP and (if necessary) SSL handshake right away, because these are high-latency operations.
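
At the socket level the hint boils down to something like this sketch (plain C with a hypothetical helper; the real interface is the nsISpeculativeConnect IDL mentioned below):

    #include <fcntl.h>
    #include <netdb.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Start a TCP handshake in the background so it overlaps with the
     * user's typing; the fd is handed to the real transaction later. */
    int speculative_connect(const char *host, const char *port) {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd >= 0) {
            fcntl(fd, F_SETFL, O_NONBLOCK);             /* don't block the caller */
            connect(fd, res->ai_addr, res->ai_addrlen); /* handshake proceeds async */
        }
        freeaddrinfo(res);
        return fd;
    }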

The primary initial user of this is the search box in Firefox. When you focus on that box we will probably make a connection to the search provider right away. Simultaneously, you type your search terms - and when your search is ready (or partly ready, if you are using search suggestions) it can be submitted to the search service without any waiting for network setup.

This can be a significant win. The average Internet round trip time is about 100ms (there is a lot of variation in this). It takes 1 RTT to set up TCP and (likely) 2 more for SSL. 300ms is a notable pause, but overlapped in the background it essentially becomes free, resulting in snappier searches that still use a secured transport. If you're using a SPDY-enabled search provider such as google or twitter, this is done over SPDY, so the one TCP session now established will be able to carry all of your search results - no more setup overhead to worry about with additional parallel connections, etc.

The other user of this feature that got checked in as part of this merge is actually internal to the networking code, just before the cache I/O is done. The amount of time it takes to check the disk cache is extremely variable - afaict it's generally pretty fast, but the tail can be really awful depending on hardware, other system activity, OS, etc. So we overlap the network handshakes with the activity that figures out the values of the If-Modified-Since (etc.) request headers.

There is an IDL for providing the hint (nsISpeculativeConnect) - so if you can think of other areas ripe for this kind of optimization let's get to it!

[The best place for comments is probably here: https://plus.google.com/100166083286297802191/posts ]


Monday, April 23, 2012

Welcome Jiten - Building a Networking Dashboard

There are many awesome things about the Firefox networking layer. However, realtime visibility into what it's doing is not on that list.

Thanks to Google Summer of Code, Jiten Thakkar is going to work on that problem this summer. Jiten is already a card-carrying Mozillian with code commits in several parts of Gecko, and I'm excited to have him focused on necko and an add-on for the next few months.

We hope to learn what connections are being made, how fast they are running, what the DNS cache looks like, how often SPDY is used, and all manner of other information that will aid debugging and inform optimization choices. Maybe even icing on the cake such as in-browser diagnostic tools (e.g. can I tcp handshake with an arbitrary hostname? How about HTTP? How about SSL?) to round things out.

Good Stuff!

Thursday, March 8, 2012

Twitter, SPDY, and Firefox

Well - look at what this morning brings: some Twitter.com enabled SPDY goodness in my firefox nightly! (below)

spdy is currently enabled in firefox nightly and you can turn it on in FF 11 and FF 12 through network.http.spdy.enabled in about:config

There is not yet any spdy support for the images on the Akamai CDN that twitter uses, and that's obviously a big part of performance. But still, real deployed users of this include Twitter, Google Web, Firefox, Chrome, Silk, node, etc. - this really has momentum because it solves the right problems.

Big pieces still left are a popular CDN, open standardization, an http<>spdy gateway like nginx, a stable big standalone server like apache, and support in a load balancing appliance like F5 or citrix. And the wind is blowing the right way on all of those things. This is happening very fast.

https://twitter.com/account/available_features

GET /account/available_features HTTP/1.1
Host: twitter.com
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120308 Firefox/13.0a1
[..]
X-Requested-With: XMLHttpRequest
Referer: https://twitter.com/
[..]

HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate, pre-check=0, post-check=0
Content-Length: 3929
Content-Type: text/javascript; charset=utf-8
Date: Thu, 08 Mar 2012 15:20:56 GMT
Etag: "8f2ef94f3149553a2c68e98a8df04425"
Expires: Tue, 31 Mar 1981 05:00:00 GMT
Last-Modified: Thu, 08 Mar 2012 15:20:56 GMT
Pragma: no-cache
Server: tfe
[..]
x-revision: DEV
x-runtime: 0.01763
[..]
x-xss-protection: 1; mode=block
X-Firefox-Spdy: 1

Monday, January 23, 2012

HTTP-WG Proposal to tackle HTTP/2.0

Huzzah to Mark Nottingham, chair of the IETF HTTP Working Group. He proposes rechartering the group to "specify (sic) HTTP/2.0 an improved binding of HTTP's semantics to the underlying transport."

That's welcome news - the scalability, efficiency, and robustness issues with HTTP/1 are severe problems that deserve attention in an open standards body forum. The HTTP-WG is the right place.

SPDY will certainly be offered as an input to that process and in my opinion it touches on the right answers. But whatever the outcome it is great to see momentum around open standardization of solutions to the transport problems HTTP/1 suffers from.

Saturday, January 7, 2012

Using SPDY for more responsive interfaces

RST_STREAM turns out to be a feature of spdy that I underappreciated for a long time. The concept is simple enough - either end of the connection can cancel an individual stream without impacting the other streams that are multiplexed on the same connection.

This fills a conceptual gap left by HTTP/1.x - in HTTP, when you want to cancel a transaction, about all you can do is close the connection.

Consider the case of quickly clicking through a webmail mailbox - just scanning the contents and rapidly clicking 'next'. Typically the pages will be only partly loaded before you move on to the next one. Assuming you have used up all your parallelism in HTTP/1, the new click will either have to wait for the old transactions to complete (wasting time and bandwidth) or cancel the old ones by closing those connections and then open fresh connections for the new requests. New connections add 2 or 3 round trip times to reopen the SSL connection (you are reading email over SSL, right?) before they can be used to send the new requests. Either way - that is not a good experience.

An interactive map application has similar attributes - as you scan along the map, zooming in and out, lots of tiles are loaded and are often irrelevant before they are drawn. I'm sure you can think of other scenarios that have cancellations.

Spdy solves this simply - with its inherently much greater parallelism, the new requests can be sent immediately while, at the same time, cancel notifications go out for the old ones. That saves the bandwidth and gets the new requests going as fast as possible without interfering with either the established connection or any other transactions in progress.
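
The cancel itself is tiny on the wire. Here's a sketch of the 16-byte RST_STREAM control frame with status CANCEL, written against the SPDY/3 draft framing (SPDY/2 differs only in the version field); a real stack of course also tracks stream state:

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    /* Build a RST_STREAM frame: control bit + version, type 3,
     * flags 0, length 8, then the stream id and the CANCEL status. */
    size_t build_rst_stream(uint8_t out[16], uint32_t stream_id) {
        uint32_t words[4];
        words[0] = htonl(0x80030003u); /* control=1, version=3, type=3 */
        words[1] = htonl(8u);          /* flags=0, length=8 */
        words[2] = htonl(stream_id & 0x7fffffffu);
        words[3] = htonl(5u);          /* status code CANCEL */
        memcpy(out, words, sizeof(words));
        return sizeof(words);
    }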

A page load time metric won't really show this to you but the increased responsiveness is very obvious when working with these kinds of use cases - especially under high latency conditions that make connection establishment slower.


Sunday, January 1, 2012

A use case for SPDY header compression

A use case for SPDY header compression: http://pix04.revsci.net/F09828/a4/0/0/0.js

380 bytes of gzipped javascript (550 uncompressed), sent with 8.8KB of request cookies and 5.5KB of response cookies. That overhead is bad enough to mess with TCP CWND defaults - which means you are taking multiple round trips on the network just to move half a KB of js. For HTTP, that's a performance killer! Those cookies are repeated almost identically on every transaction with that host.

SPDY's dedicated header compression contexts and the repetitive nature of cookies mean those cookies can be reduced ~98% for all but the first transaction of the session. Essentially the cookies remain stateless for app developers, along with the nice properties of that, but the transport leverages the connection state to move them much more efficiently.
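
As a toy demonstration of why a stateful compression context crushes repeated cookies, push the same header block through one zlib deflate stream twice and compare the output sizes (illustrative only; SPDY additionally seeds the context with a preset dictionary):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void) {
        /* Stand-in for a large, repetitive cookie header. */
        static const char headers[] =
            "cookie: id=0123456789abcdef; prefs=...kilobytes of opaque state...\r\n";
        z_stream z;
        memset(&z, 0, sizeof(z));
        deflateInit(&z, Z_DEFAULT_COMPRESSION);

        unsigned char out[4096];
        for (int pass = 1; pass <= 2; pass++) {
            z.next_in = (unsigned char *)headers;
            z.avail_in = sizeof(headers) - 1;
            z.next_out = out;
            z.avail_out = sizeof(out);
            deflate(&z, Z_SYNC_FLUSH); /* flush per header block, keep context */
            printf("pass %d: %zu bytes in -> %zu bytes out\n", pass,
                   sizeof(headers) - 1, (size_t)(sizeof(out) - z.avail_out));
        }
        deflateEnd(&z); /* the second pass emits just a few bytes of
                           back-references into the shared context */
        return 0;
    }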