
Can't Create New Network Sockets? Maybe It Isn't User Limits...

03.01.2013

I’ve been doing a lot more programming in Go recently, mostly because it has awesome concurrency primitives but also because it is generally a pretty amazing language. Unlike other languages which have threads, fibres or event-driven frameworks to achieve good concurrency, Go manages to avoid all of these but still remain readable. You can also reason about its behaviour very effectively due to how easily understandable and straightforward concepts like channels and goroutines are.

But enough about Go (for the moment). Recently I found the need to quickly duplicate the contents of one Amazon S3 bucket to another. This would not be a problem, were it not for the fact that the bucket contained several million objects. Fortunately, two factors make this less daunting:

  1. S3 scales better than your application ever can, so you can throw as many requests at it as you like.
  2. You can copy objects between buckets very easily with a PUT request combined with a special header indicating the object you want copied (you don’t need to physically GET then PUT the data), as sketched just below.
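
A rough sketch of what that copy request looks like at the HTTP level. The x-amz-copy-source header is the real S3 mechanism, but authentication headers are omitted and the bucket and key names are placeholders, so treat this as an illustration rather than the code I actually ran; I’ll reuse this helper in later sketches:

    package main

    import (
        "fmt"
        "net/http"
    )

    // copyObject asks S3 to copy an object server-side: a PUT on the
    // destination key carrying an x-amz-copy-source header that names
    // the source object. No object data passes through the client.
    func copyObject(client *http.Client, srcBucket, dstBucket, key string) error {
        url := fmt.Sprintf("http://%s.s3.amazonaws.com/%s", dstBucket, key)
        req, err := http.NewRequest("PUT", url, nil)
        if err != nil {
            return err
        }
        req.Header.Set("x-amz-copy-source", "/"+srcBucket+"/"+key)
        resp, err := client.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("copy of %s failed: %s", key, resp.Status)
        }
        return nil
    }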

A perfect job for a Go program! The keys of the objects are in a consistent format, so we can split up the keyspace by prefixes and divide the workload amongst several goroutines. For example, if your objects are named 00000000 through to 99999999 using only numerical characters, you could quite easily split this into 10 segments of 10 million keys. Using the bucket GET method you can retrieve up to 1000 keys in a batch, filtered by prefix. Even if a segment notionally covers 10 million keys but contains far fewer actual objects, all that matters is that you start and finish in the right places (the beginning and end of the segment) and keep making batch requests until you have all of the keys in that part of the keyspace.
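
A sketch of one such fetcher, paging through a prefix segment with the bucket GET’s prefix and marker parameters. Again authentication is omitted and the structure is illustrative; the XML field names follow S3’s ListBucketResult:

    package main

    import (
        "encoding/xml"
        "fmt"
        "net/http"
        "net/url"
    )

    // listResult mirrors the parts of S3's ListBucketResult XML we need.
    type listResult struct {
        IsTruncated bool `xml:"IsTruncated"`
        Contents    []struct {
            Key string `xml:"Key"`
        } `xml:"Contents"`
    }

    // fetchKeys pages through one prefix segment of the keyspace, sending
    // every key it finds down the keys channel. Each bucket GET returns at
    // most 1000 keys; the marker parameter resumes from the last key seen.
    func fetchKeys(client *http.Client, bucket, prefix string, keys chan<- string) error {
        marker := ""
        for {
            reqURL := fmt.Sprintf("http://%s.s3.amazonaws.com/?prefix=%s&marker=%s",
                bucket, url.QueryEscape(prefix), url.QueryEscape(marker))
            resp, err := client.Get(reqURL)
            if err != nil {
                return err
            }
            var result listResult
            err = xml.NewDecoder(resp.Body).Decode(&result)
            resp.Body.Close()
            if err != nil {
                return err
            }
            for _, obj := range result.Contents {
                keys <- obj.Key
                marker = obj.Key
            }
            if !result.IsTruncated {
                return nil // reached the end of this segment
            }
        }
    }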

So now we have a mechanism for rapidly retrieving all of the keys. For millions of objects this will still take some time, but you have divided the work amongst several goroutines so it will be that much faster. For comparison, the Amazon Ruby SDK uses the same REST requests under the hood when using the bucket iterator bucket.each { |obj| … } but only serially – there is no division of work.

Now to copy all of our objects we just need to take each key returned by the bucket GET batches, and send off one PUT request for each. This introduces a much slower process: one GET request yields up to 1000 keys, but then we need to perform 1000 PUTs to copy them, and each PUT can take a while, since the S3 backend has to physically copy the data between buckets, which for large objects is not instantaneous.

Let’s use some more concurrency, and have a pool of 100 goroutines waiting to process the batch of 1000 keys just fetched. A recent discussion on the golang-nuts group drew some good suggestions from the Go community and resulted in this code:


https://gist.github.com/ohookins/5013420

It’s not a lot of code, which makes me think it is reasonably idiomatic and correct Go. Better yet, it has the possibility to scale out to truly tremendous numbers of workers. You may notice that each of the workers also uses the same http.Client, and this is intentional: internally the http.Client makes some optimisations around connection reuse so that you aren’t susceptible to the performance penalty of socket creation and TCP handshakes for every request. Generally this works pretty well.
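
I won’t reproduce the gist inline, but the overall shape of the wiring is roughly this. It is a sketch rather than the gist’s actual code, treating the two earlier helpers as files in the same package, with placeholder bucket names:

    package main

    import (
        "log"
        "net/http"
        "strconv"
        "sync"
    )

    func main() {
        // One shared client, so connections can be reused across workers.
        client := &http.Client{}

        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
            prefix := strconv.Itoa(i)
            keys := make(chan string, 1000)

            // Each fetcher gets its own pool of 100 copiers draining its
            // channel, so up to 10 x 100 = 1000 copiers can run at once.
            for j := 0; j < 100; j++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for key := range keys {
                        if err := copyObject(client, "srcbucket", "dstbucket", key); err != nil {
                            log.Println(err)
                        }
                    }
                }()
            }

            wg.Add(1)
            go func() {
                defer wg.Done()
                defer close(keys) // lets this segment's copiers finish
                if err := fetchKeys(client, "srcbucket", prefix, keys); err != nil {
                    log.Println(err)
                }
            }()
        }
        wg.Wait()
    }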

Let’s think about system limits now. Say we want to make our PUT copy operations really fast, and use 100 copier goroutines per fetcher. With 10 fetcher goroutines that means we now have 1000 goroutines vying for attention from the http.Client connection handling. Even if the fetchers are idle, if all of the copier workers are running at the same time we might require 1000 concurrent TCP connections. With a default user limit of 1024 open file handles (e.g. on Ubuntu 12.04), we are dangerously close to exceeding that limit.

Head http://mybucket.s3.amazonaws.com:80/: lookup mybucket.s3.amazonaws.com: no such host

When an error like the one above pops up in your program’s output, it seems almost certain that you have exceeded these limits… and you’d be right! For now, anyway. These were the errors I was initially getting, and while it was somewhat mysterious that I would see so many of them (literally one for each failed request), apparently name lookups require some additional sockets, even when the result is locally cached. I’m still looking for a reference for this, so if you know of one please let me know in the comments.

This resulted in a second snippet of Go code to check my user limits:

https://gist.github.com/ohookins/5030871
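
That gist isn’t inlined here either, but the core of such a check is small. Something like this (Linux-specific, and not necessarily the gist’s exact code):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // Ask the kernel for the soft and hard limits on open file
        // descriptors. A program could size its worker pools from
        // rlim.Cur rather than guessing how many sockets it can afford.
        var rlim syscall.Rlimit
        if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
            panic(err)
        }
        fmt.Printf("open files: soft=%d hard=%d\n", rlim.Cur, rlim.Max)
    }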

Using syscall.Getrusage in conjunction with syscall.Getrlimit would allow you to fairly dynamically scale your program to use just as much of the system resources as it has access to, but not overstep these boundaries. But remember what I said about using http.Client before? The net/http package documentation says that "Clients should be reused instead of created as needed" and that "Clients are safe for concurrent use by multiple goroutines", and both of these are indeed accurate. The unexpected side-effect is that, unfortunately, the usage of TCP connections is now fairly opaque to us. Our understanding of current system resource usage is thus fundamentally detached from how we use http.Client. This will become important in just a moment.

So, having raised my ulimits far beyond what I expected I actually needed (this was to be the only program running on my test EC2 instance anyway), I re-ran the program and faced another error:

Error: dial tcp 207.171.163.142:80: cannot assign requested address

What the… I thought I had dealt with user limits? I didn’t initially find the direct cause of this, thinking I hadn’t properly dealt with the user limits issue. I found a few group discussion threads dealing with http.Client connection reuse, socket lifetimes and related topics, and I first tried a few different versions of Go, suspecting it was a bug fixed in the source tip (more or less analogous to HEAD on origin/master in Git, if you mainly use that VCS). Unfortunately this yielded no fix and no additional insights.

I had been monitoring open file handles of the process during runtime and noticed it had never gone over about 150 concurrent connections. Using netstat, on the other hand, showed that there were a significant number of connections in the TIME_WAIT state. This socket state is used by the kernel to leave a trace of the connection around in case there are duplicate packets on the network waiting to arrive (among other things). In this state the socket is actually detached from the process that created it, but waiting for kernel cleanup; it therefore no longer counts as an open file handle, but that doesn’t mean it can’t cause problems!
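
If you want to watch this happen without netstat, the kernel’s own table tells the same story. A small Linux-specific sketch that counts TIME_WAIT sockets by reading /proc/net/tcp (the IPv4 table; roughly what netstat consults under the hood):

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        f, err := os.Open("/proc/net/tcp")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        count := 0
        scanner := bufio.NewScanner(f)
        scanner.Scan() // skip the header row
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            // The fourth column ("st") holds the socket state in hex;
            // 06 is TIME_WAIT.
            if len(fields) > 3 && fields[3] == "06" {
                count++
            }
        }
        fmt.Printf("%d sockets in TIME_WAIT\n", count)
    }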

In this case I was connecting to Amazon S3 from a single IP address – the only one configured on the EC2 instance. S3 itself has a number of IP addresses on both East and West coasts, rotated automatically through DNS-based load-balancing mechanisms. However, at any given moment you will resolve a single IP address and probably use that for a small period of time before querying DNS again and perhaps getting another IP. So we can basically say we have one IP contacting another IP – and this is where the problem lies.

When an IPv4 network socket is created, there are five basic elements the kernel uses to make it unique among all others on the system:

protocol; local IPv4 address : local IPv4 port <-> remote IPv4 address : remote IPv4 port

Given roughly 2^27 possibilities for local IP (class A,B,C), the same for remote IP and 2^16 for each of the local and remote ports (assuming we can use any privileged ports < 1024 if we use the root account), that gives us about 2^86 different combinations of numbers and thus number of theoretical IPv4 TCP sockets a single system could keep track of. That’s a whole lot! Now consider that we have a single local IP on the instance, we have (for some small amount of time) a single remote IP for Amazon S3, and we are reaching it only over port 80 – now three of our variables are reduced to a single possibility and we only have the local port range to make use of.

Worse still, the default setting (for my machine at least) of the local port range available to non-root users was only 32768-61000, which reduced my available local ports to less than half of the total range. After watching the output of netstat and grepping for TIME_WAIT sockets, it was evident that I was burning through these roughly 30,000 local ports within a matter of seconds. When there are no remaining local port numbers to be used, the kernel simply fails to create a network socket for the program and returns an error as in the above message: cannot assign requested address.

Armed with this knowledge, there are a couple of kernel tunings you can make. tcp_tw_reuse and tcp_tw_recycle both affect when the kernel will reclaim sockets in the TIME_WAIT state, but in practice they didn’t seem to have much effect. Another setting, tcp_max_tw_buckets, sets a limit on the total number of TIME_WAIT sockets and actively kills them off once the count exceeds this limit. All three of these parameters look and sound slightly dangerous, and since they had little effect anyway I was loath to use them and call the problem solved. After all, if the program was killing the connections and leaving them for the kernel to clean up, it didn’t sound like http.Client was doing a very good job of reusing connections automatically.

Incidentally, Go does support automatic reuse of connections in TIME_WAIT with the SO_REUSEADDR socket option, but this only applies to listening sockets (i.e. servers).

Unfortunately that brought me about to the end of my inspiration, but a co-worker pointed me in the direction of the http.Transport’s MaxIdleConnsPerHost parameter, which I was only vaguely aware of from having skimmed the source of that package in the last couple of days, desperately searching for clues. The default value here is two (2), which seems reasonable for most applications but is evidently terrible when your application has large bursts of requests rather than a constant flow. I believe that internally, the transport creates as many connections as required, the requests are processed and closed, and then all but two of those connections are terminated again, left in the TIME_WAIT state for the kernel to deal with. Only a few cycles of this need to repeat before you have built up tens of thousands of sockets in this state.
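
In sketch form, the fix amounts to constructing the shared client with an explicit Transport, e.g. replacing the plain http.Client in the earlier wiring sketch:

    // Keep up to 250 idle connections per host alive instead of the
    // default two, so bursts of requests reuse warm connections rather
    // than churning new sockets into TIME_WAIT.
    client := &http.Client{
        Transport: &http.Transport{
            MaxIdleConnsPerHost: 250,
        },
    }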

Altering the value of MaxIdleConnsPerHost to around 250 immediately removed the problem, and I didn’t see any sockets in TIME_WAIT state while I was monitoring the program. Shortly thereafter the program stopped functioning, I believe because my instance was blacklisted by AWS for sending too many requests to S3 in a short period of time – scalability achieved!

If there is a lesson in this, I guess it is that you often still need to be aware of what is happening at the lowest levels of the system, even when your programming language or application has abstracted enough of the details away for you not to have to worry about them. Even knowing that there was an idle connection limit of two would not have given away the whole picture of the forces at play here. Go is still my favourite language at the moment, and I was glad that the fix was relatively simple: I still have a very understandable codebase with excellent performance characteristics. However, whenever the network and remote services with variable performance characteristics are involved, even a simple problem can take on surprising complexity.

Published at DZone with permission of Oliver Hookins, author and DZone MVB. (source)
