
Asankha Perera is the Founder and CTO of AdroitLogic, which develops the UltraESB - the first and only open source ESB to introduce zero-copy proxying for extreme performance. Asankha is a member of the Apache Software Foundation, and previously contributed most of the code of the Apache Synapse ESB.

Does Tomcat Bite More Than It Can Chew?

11.02.2012

This is an interesting blog post for me, since it's about the root cause of an issue we saw with Tomcat back in May 2011, which has remained unresolved. Under high concurrency and load, Tomcat would reset (i.e. send a TCP-level RST on) client connections instead of refusing to accept them, as one would expect. I posted this again to the Tomcat user list a few days back, but then wanted to find the root cause for myself, since it would surely come up again in the future.

Background

This issue initially became evident when we ran high concurrency load tests at a customer location in Europe, where the customer had back-end services deployed on multiple Tomcat instances and wanted to use the UltraESB for routing messages with load balancing and fail-over. For the ESB Performance Benchmark, we had been using an EchoService written over the Apache HttpComponents/Core NIO library that scaled extremely well and behaved well at the TCP level, even under load. However, at the client site, they wanted the test run against real services deployed on Tomcat, to analyse a more realistic scenario. We used a Java based clone of ApacheBench called 'Java Bench', which is also part of the Apache HttpComponents project, to generate load. The client would go up to concurrency levels of 2560, pushing as many messages as possible through the ESB to back-end services deployed over Tomcat.

Under high load, the ESB would start to see errors while talking to Tomcat, the cause being I/O errors such as "Connection reset by peer". Now the problem for the ESB is that it has already started to send out an HTTP request / payload over an accepted TCP connection, so it does not know whether it can safely fail over to another node by default, since the back-end service might have performed some processing on the part of the request it had already received. Of course, the ESB could be configured to retry on such errors as well, but our default behaviour is to fail over only on the safer "connection refused" or connect timeout errors (i.e. a connection could not be established within the allocated time) - which ensures correct operation, even for non-idempotent services.
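The following is not UltraESB code, just a minimal sketch (the class and method names are mine) of the decision described above: a connect failure means nothing was sent and the next node can be tried safely, while an I/O error after the request has started going out is surfaced instead of retried.

import java.io.IOException;
import java.io.OutputStream;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class FailoverDecisionSketch {

    // Try each endpoint in turn, but fail over only when we are certain the
    // request never reached the previous endpoint, i.e. the TCP connection
    // could not be established at all (refused, or connect timeout).
    static void sendWithFailover(byte[] request, InetSocketAddress... endpoints)
            throws IOException {
        IOException lastConnectFailure = new ConnectException("no endpoint reachable");
        for (InetSocketAddress endpoint : endpoints) {
            try (Socket socket = new Socket()) {
                try {
                    socket.connect(endpoint, 5000);
                } catch (ConnectException | SocketTimeoutException e) {
                    lastConnectFailure = e;   // nothing was sent - safe to try the next node
                    continue;
                }
                OutputStream out = socket.getOutputStream();
                out.write(request);           // from here on the request is (partly) on the wire
                out.flush();
                socket.getInputStream().readAllBytes();   // drain the response (ignored here)
                return;
                // an IOException after the connect (e.g. "Connection reset by peer")
                // propagates instead of failing over: the back end may already have
                // processed part of the request, which is unsafe for non-idempotent services
            }
        }
        throw lastConnectFailure;
    }
}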

Recent Observations

We recently experienced the same issue with Tomcat when a customer wanted to perform a load test scenario where a back-end service would block for 1-5 seconds randomly, to simulate realistic behaviour. Here again we saw that Tomcat was resetting accepted TCP connections, and we were able to capture this with Wireshark as follows, using JavaBench directly against a Tomcat based servlet.


As can be seen in the trace, the client initiated a TCP connection from source port 9386, and Tomcat, running on port 9000, accepted the connection - note "1". The client kept sending packets of a 100K request, and Tomcat kept acknowledging them; the last such case is annotated with note "2". Note that the client had not yet finished sending the request payload at this point - note "3". Suddenly, Tomcat resets the connection - note "4".
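For reference, the kind of blocking back-end used in these tests can be approximated with a plain servlet like the sketch below. The actual test service is not shown in this article; the class name, URL mapping and exact delay handling here are illustrative only.

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Echoes the request body back after a random 1-5 second pause,
// to simulate a slow back-end service deployed on Tomcat.
@WebServlet("/echo")
public class SlowEchoServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        byte[] body = req.getInputStream().readAllBytes();
        try {
            Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 5001));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        resp.setContentType("application/octet-stream");
        resp.getOutputStream().write(body);
    }
}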

Understanding the root cause

After failing to locate any code in the Tomcat source code that resets established connections, I wanted to simulate the behaviour with a very simple Java program. Luckily the problem was easy to reproduce with a simple program as follows:

import java.net.ServerSocket;
import java.net.Socket;

public class TestAccept1 {

    public static void main(String[] args) throws Exception {
        // listen on port 8280 with a requested backlog of 0
        ServerSocket serverSocket = new ServerSocket(8280, 0);
        // accept exactly one connection, then do nothing
        Socket socket = serverSocket.accept();
        Thread.sleep(3000000);
    }
}

We just open a server socket on port 8280, with a backlog of 0, and start listening for connections. Since the backlog is 0, one could assume that only one client connection would be allowed - BUT I could open more than that via telnet as follows, and even send some data afterwards by typing it in and pressing the enter key:

telnet localhost 8280
hello world

A netstat command now confirms that more than one connection is open:
netstat -na | grep 8280 
tcp  0  0 127.0.0.1:34629  127.0.0.1:8280  ESTABLISHED
tcp  0  0 127.0.0.1:34630  127.0.0.1:8280  ESTABLISHED
tcp6  0  0 :::8280  :::*  LISTEN 
tcp6  13  0 127.0.0.1:8280  127.0.0.1:34630  ESTABLISHED
tcp6  13  0 127.0.0.1:8280  127.0.0.1:34629  ESTABLISHED

However, the Java program has only accepted ONE socket, although at the OS level two connections appear. The OS also seems to allow more than two connections to be opened, even when the backlog is specified as 0 (the ServerSocket Javadoc does note that a backlog of 0 or less makes the implementation fall back to a default value). On Ubuntu 12.04 x64, netstat would not show me the actual listen queue length, but I believe it was not 0. However, before and after this test I did not see a difference in the reported statistics for "listen queue" overflows, which I could check with the "netstat -sp tcp | fgrep listen" command.
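To take telnet out of the picture, the same observation can be made programmatically. The sketch below (the class name and the connection count are mine) opens several client sockets against the TestAccept1 server and keeps them open so netstat can be run; whether every handshake completes depends on the OS, but on Linux they typically do while the effective backlog has room.

import java.net.InetSocketAddress;
import java.net.Socket;

// Opens several client connections to the TestAccept1 server above.
// Even though the server only ever calls accept() once, the extra
// connections complete the TCP handshake and sit in the kernel's
// accept queue, showing up as ESTABLISHED in netstat.
public class TestConnect1 {

    public static void main(String[] args) throws Exception {
        Socket[] clients = new Socket[3];
        for (int i = 0; i < clients.length; i++) {
            clients[i] = new Socket();
            clients[i].connect(new InetSocketAddress("localhost", 8280), 2000);
            clients[i].getOutputStream().write("hello world\n".getBytes());
            System.out.println("connection " + (i + 1) + " established from local port "
                    + clients[i].getLocalPort());
        }
        Thread.sleep(60000);   // keep the sockets open while you run netstat
    }
}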

Next I used the JavaBench from the SOA ToolBox and issued a small payload at concurrency 1024, with a single iteration, against the same port 8280.


As expected, all requests failed, but my Wireshark trace on port 8280 did not detect any connection resets. Pushing the concurrency to 2560 and the iterations to 10 started to show TCP-level RSTs - similar to those seen with Tomcat, though not exactly the same.

Can Tomcat do better?

Yes, possibly. What an end user would expect from Tomcat is that it refuses to accept new connections when under load, rather than accepting connections and then resetting them halfway through. But is that achievable, especially considering the behaviour seen with the simple Java example we just discussed?

Well, the solution is to handle the low-level HTTP connections and sockets more carefully, and this is already done by the free and open source high performance Enterprise Service Bus UltraESB, which utilizes the excellent Apache HttpComponents project underneath.

How does the UltraESB behave?

One could easily test this by using the 'stopNewConnectionsAt' property of our NIO listener. If you set it to 2, you won't be able to open even a Telnet session to the socket beyond the second one. The first would work, and so would the second, but the third would see a "Connection refused", and the UltraESB would report the following in its logs:

  INFO HttpNIOListener HTTP Listener http-8280 paused 
  WARN HttpNIOListener$EventLogger Enter maintenance mode as open connections reached : 2

Although it refuses to accept new connections, already accepted connections execute to completion without any hindrance. Thus a hardware load balancer in front of an UltraESB cluster can safely route around a node that is loaded beyond its configured limits, without having to deal with any connection resets. Once a connection slot becomes free, the UltraESB starts accepting new connections again.
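This is not the UltraESB source, but the general technique - stop registering interest in new accepts once a configured number of connections is open, and resume when a slot frees up - can be sketched with plain Java NIO. The class name, port, backlog and limit below are illustrative only.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// Sketch of a listener that stops accepting new connections once a limit of
// open connections is reached, and resumes accepting when one of them closes.
// While accepts are paused, no connection is accepted by the application and
// then reset halfway through; excess clients wait in the small OS backlog or
// fail to connect, depending on OS settings.
public class PausingNioListener {

    static final int STOP_NEW_CONNECTIONS_AT = 2;

    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8280), 1);    // deliberately tiny backlog
        server.configureBlocking(false);
        SelectionKey serverKey = server.register(selector, SelectionKey.OP_ACCEPT);

        int open = 0;
        ByteBuffer buffer = ByteBuffer.allocate(8192);

        while (true) {
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                    open++;
                    if (open >= STOP_NEW_CONNECTIONS_AT) {
                        serverKey.interestOps(0);       // pause: stop accepting new connections
                        System.out.println("Pausing accepts, open connections: " + open);
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read;
                    try {
                        read = client.read(buffer);
                    } catch (IOException e) {
                        read = -1;                      // treat a reset like a close
                    }
                    if (read == -1) {                   // connection gone: free a slot
                        client.close();
                        open--;
                        if (open < STOP_NEW_CONNECTIONS_AT) {
                            serverKey.interestOps(SelectionKey.OP_ACCEPT);  // resume accepting
                        }
                    }
                }
            }
        }
    }
}

Note that this bare sketch only pauses accepts; turning the excess connection attempts into an immediate "Connection refused", as the UltraESB log above shows, additionally requires the listener (or the OS backlog settings) to actively reject them.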

Analysing a corresponding TCP dump

To analyse the corresponding behaviour, we wrote a simple echo proxy service on the UltraESB that also slept for 1 to 5 seconds before replying, and tested it with the same JavaBench under 2560 concurrent users, each trying to push 10 messages in iterations.

Out of the 25600 requests, 7 completed successfully, while 25593 failed, as expected. We also saw many TCP-level RSTs in the Wireshark dump - which must have been issued by the underlying operating system.


However, what's interesting to note is the difference: the RSTs occur immediately upon receiving the SYN packet from the client - so they do not tear down established HTTP or TCP connections, but show up to the client as clean "Connection refused" errors, which is what the client can expect. Thus the client can safely fail over to another node without any doubt, overhead or delay.

Appendix: Supporting high concurrency in general

During testing we also saw that the Linux OS could detect the opening of many concurrent connections at the same time as a SYN flood attack, and start using SYN cookies. You would see messages such as


Possible SYN flooding on port 9000. Sending cookies

displayed in the output of "sudo dmesg" if this happens. Hence, for a real load test, it would be better to disable SYN cookies by turning them off as follows, as the root user:


# echo 0 > /proc/sys/net/ipv4/tcp_syncookies


To make the change persist over reboots, add the following line to your /etc/sysctl.conf:


net.ipv4.tcp_syncookies = 0


To allow the Linux OS to accept more connections, it's also recommended that 'net.core.somaxconn' be increased, as it usually defaults to 128 or so. This can be done by the root user as follows:



# echo 1024 > /proc/sys/net/core/somaxconn


To persist the change, append the following to /etc/sysctl.conf:



net.core.somaxconn = 1024
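Relatedly, a Java server only benefits from this if it also asks for a larger backlog: on Linux, the value passed to listen() - and therefore the backlog argument of ServerSocket - is silently capped at net.core.somaxconn. A minimal sketch, with an illustrative port and backlog:

import java.net.ServerSocket;

public class LargeBacklogListener {

    public static void main(String[] args) throws Exception {
        // Ask for a large accept backlog; on Linux the kernel silently caps
        // the effective value at net.core.somaxconn, so raising somaxconn is
        // what actually lets this request take effect under a connection burst.
        ServerSocket serverSocket = new ServerSocket(9000, 1024);
        while (true) {
            serverSocket.accept().close();   // trivially accept and close connections
        }
    }
}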



Kudos!

The UltraESB could not have behaved gracefully without the underlying Apache HttpComponents library, and the help and support received from that project community, especially from Oleg Kalnichevski - whose code and help have always fascinated me!

Published at DZone with permission of Asankha Perera, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Eugen Zed replied on Wed, 2012/11/07 - 9:28am

Related to the beginning of the article:

1. How could you accept more than one connection? From the code you provided there is no listening loop.

2. The backlog is not about how many connections can be accepted, but how many connection requests can be queued without being dropped while a given connection request is being processed. So setting the backlog to 0 doesn't mean that you can't open more than one connection; it means that new connection requests will be dropped while you are dealing with one request in your accept() method. Of course, you must have a listening loop for that.
