Jeroen Borgers is an independent consultant who has worked with Xebia IT-Architects and Atos Origin in the past. Jeroen helps customers with enterprise Java performance issues and is an instructor of Accelerating Java applications courses. He has worked on various Java projects in several industries since 1996, as a developer, architect, team lead, quality officer, mentor, auditor, performance tester, tuner and trainer. He has specialized in Java performance since 2005.

Case Study: Performance Tuning a Web Shop (Part 1)

07.15.2008

My bookshelf is full of Java books bought over the years. I cannot remember buying a single one of them in a real bookstore; I bought them all on-line. And not just books: I also bought DVDs, a printer, a laptop, a washing machine, a baby bed for my youngest girl, and the list goes on. All these products were conveniently delivered to my home, which saved me travel and shopping time.

Occasionally, I visit a web shop where I have to wait 10 seconds or even more for a page. I usually give it another try with another page, and if that is just as slow, I take my browser to a competitor's shop. This slowness is a bad experience for me as a user, and the consequence is that I don’t buy in that shop.

I figure I am not the only one who reacts like this. It is therefore crucial for web shops and other revenue-generating web sites to have short page response times, and achieving them is usually a big challenge for developers and operators.

A Case Study

I’ve dealt with performance in many projects. And I’ve consistently seen that there were no adequate performance requirements stated, no representative load tests in place, no useful performance monitoring in production and no tuning based on facts.

What typically did occur was: stress in acceptance, working overtime to fix suspected pieces of code, stress in production, complaining users and one party blaming the other with growing hostility. So, nasty things happen. The question is: Are these nasty things necessary? My answer is: no, they are not.

A big web shop in my country, the Netherlands, is the subject of this case study. In this case study, we dealt with a performance challenge, succeeded and learned several lessons. I would like to share those lessons with you.

Overview of the TRC project

The figure below shows the basic architecture of the Total Relationship Customer (TRC) system, of which the web shop is the most important channel. Only the Internet and call center channels are shown. The web server tier is built on Microsoft technology: most of it is .NET by now, with some ASP pages still left. Each of the 20 web servers behind the load balancer has its own database on a separate machine for session data such as the shopping cart. The app server tier contains the Java services. These are about 50 stateless services like addCustomer, findCustomerById, loginCustomer, processOrder, calculateOrderDiscount, getFinancialData, etc. Most services access the Oracle database, and some access the mainframe through BEA Tuxedo middleware; processOrder, for example, does so for inventory in-stock data. Just a few Tuxedo calls are still made from the web tier. Most traffic, like browsing and searching the product catalogue, is handled completely in the web tier. So the Java services, which contain most of the business logic, handle just a small portion of the total load.

My Xebia colleagues had already introduced the Spring framework before I came in. Most remote EJBs had already been replaced by Spring beans, which not only reduced complexity but also improved performance significantly.

Speedup needs

I led a cross-functional team with representatives from MS development, Java development, database administration and Tuxedo/mainframe. Given the architecture of the system, there was no low-hanging fruit left and there were no obvious mistakes when I started. Still, several services needed a serious speedup, especially processOrder. This service took 7 seconds to execute when I first measured it. Luckily, the system was not in production yet, and the customer's management had the foresight to really fix this problem before going live. It took several months of teamwork to bring the 7 seconds down to an acceptable average app server response time of 1.5 seconds. Since then, we have managed to lower it even further, to just 700 ms.

Biggest gains

Our biggest gains all came from remote calls out of the Java virtual machine. We found that the time was divided as follows: 45% in database access, 45% in mainframe access and 10% in Java code. We applied the following actions to speed up the services.

Minimize the number of database queries
We applied various techniques to minimize the number of database calls, such as batch updates, asynchronous data writing, and fetching IDs from database sequences in batches rather than one at a time.
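As an illustration of fetching sequence IDs in batches, here is a minimal sketch. It is not the project's code: the class name `BatchIdAllocator` and the `blockStartFetcher` callback are hypothetical, and the sketch assumes the database sequence is defined to advance by a whole block per fetch (for example with `INCREMENT BY 50` in Oracle), so that every value in the block belongs to the caller. The demo replaces the database with an in-memory counter so that it runs on its own.

```java
import java.util.function.LongSupplier;

/**
 * Hands out IDs from a block reserved in a single round trip, so that only
 * one in every blockSize calls has to reach the database.
 */
class BatchIdAllocator {
    // Hypothetical callback: in a real system this would run the sequence query.
    private final LongSupplier blockStartFetcher;
    // Assumption: must equal the step by which the sequence advances per fetch.
    private final int blockSize;
    private long nextId;
    private long blockEnd; // exclusive; both fields start at 0, forcing a fetch on first use

    BatchIdAllocator(LongSupplier blockStartFetcher, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockStartFetcher = blockStartFetcher;
        this.blockSize = blockSize;
    }

    synchronized long nextId() {
        if (nextId >= blockEnd) {
            nextId = blockStartFetcher.getAsLong(); // the only remote call
            blockEnd = nextId + blockSize;
        }
        return nextId++;
    }
}

class BatchIdDemo {
    public static void main(String[] args) {
        long[] sequence = {1}; // in-memory stand-in for the database sequence
        int[] roundTrips = {0};
        BatchIdAllocator ids = new BatchIdAllocator(() -> {
            roundTrips[0]++;
            long start = sequence[0];
            sequence[0] += 50;
            return start;
        }, 50);

        for (int i = 0; i < 120; i++) {
            ids.nextId();
        }
        // 120 IDs are served with 3 simulated round trips instead of 120.
        System.out.println("round trips: " + roundTrips[0]);
    }
}
```

The trade-off is that unused IDs in a block are lost when the application restarts, so the generated IDs can contain gaps.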

Optimize the most expensive database queries
We improved database indexes, made more use of prepared statements instead of unprepared statements, used Oracle materialized views, and made use of business knowledge to optimize the most expensive database queries.

Minimize and optimize the mainframe calls
We minimized the number of calls to the mainframe by analyzing whether every call was really needed. It turned out that many calls could be combined, e.g. one call for a complete order instead of one per order line. For some other calls, we realized that the caller did not really need the result, so we could change them to asynchronous calls. We sped up the most expensive call by improving mainframe code.
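The effect of combining calls per order rather than per order line can be sketched as follows. Everything here is hypothetical: `FakeStockService` merely stands in for the mainframe stock check and counts round trips, and its method names and data are made up for illustration. It does not show the real Tuxedo interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a remote stock service; it only counts round trips. */
class FakeStockService {
    int remoteCalls = 0;
    private final Map<String, Integer> stock;

    FakeStockService(Map<String, Integer> stock) {
        this.stock = stock;
    }

    /** One round trip per order line. */
    boolean inStock(String productId, int quantity) {
        remoteCalls++;
        return stock.getOrDefault(productId, 0) >= quantity;
    }

    /** One round trip for the complete order. */
    Map<String, Boolean> inStock(Map<String, Integer> orderLines) {
        remoteCalls++;
        Map<String, Boolean> result = new LinkedHashMap<>();
        orderLines.forEach((id, qty) -> result.put(id, stock.getOrDefault(id, 0) >= qty));
        return result;
    }
}

class CombinedCallDemo {
    public static void main(String[] args) {
        Map<String, Integer> stock = new LinkedHashMap<>();
        stock.put("book-1", 5);
        stock.put("dvd-2", 0);
        stock.put("printer-3", 2);

        Map<String, Integer> order = new LinkedHashMap<>();
        order.put("book-1", 1);
        order.put("dvd-2", 1);
        order.put("printer-3", 1);

        FakeStockService perLine = new FakeStockService(stock);
        order.forEach(perLine::inStock); // one call for each order line
        System.out.println("per line: " + perLine.remoteCalls + " calls");

        FakeStockService perOrder = new FakeStockService(stock);
        perOrder.inStock(order); // one call for the whole order
        System.out.println("per order: " + perOrder.remoteCalls + " call");
    }
}
```

With a fixed network and middleware cost per round trip, the saving grows with the number of lines in the order.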

Minimize and optimize the remote calls from web tier to app tier
The proprietary bridge we used from ASP to Java was not efficient and actually made two remote calls per invocation. We sped this up by exposing the services as Hessian web services instead of using the bridge. A Hessian service executes only one remote call per invocation and uses an efficient binary protocol. It is very well supported by Spring, from both Java and .NET.
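The time breakdown reported at the start of this section (45% database, 45% mainframe, 10% Java code) explains why all four actions target remote calls. The sketch below applies Amdahl's law to that breakdown. The factor of 10 is an arbitrary assumption chosen for illustration, not a measurement from the project.

```java
import java.util.Locale;

class SpeedupBound {
    /**
     * Amdahl's law: overall speedup when a given fraction of the total time
     * is made `factor` times faster and the rest is left unchanged.
     */
    static double overallSpeedup(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        // Hypothetical: make the Java code (10% of the time) 10 times faster.
        System.out.println(String.format(Locale.ROOT,
                "Java code only: %.2fx", overallSpeedup(0.10, 10)));
        // Hypothetical: make the remote calls (45% + 45%) 10 times faster.
        System.out.println(String.format(Locale.ROOT,
                "remote calls:   %.2fx", overallSpeedup(0.90, 10)));
    }
}
```

Under these assumptions, tuning only the Java code yields about 1.10x overall, while the same improvement on the remote calls yields about 5.26x.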

Use of evidence to find bottlenecks

Applying the techniques mentioned above is not really rocket science. The difficulty is to apply them at the right place: at the bottleneck. So, the question is: how did we find those bottlenecks? My answer is: we found the evidence. We found the bottlenecks by measuring with tools, instead of doing what I mostly see happen: looking at the code and guessing where the bottleneck is.

We found three activities to be crucial for finding evidence, supported by three open source tools. These are load testing with JMeter, monitoring with JAMon and reporting with JARep.

Load testing with JMeter

Need for load testing

In the beginning, developers and testers executed the Java services from a test server page. While testing functionality, they often also observed long response times. These times varied from run to run, and it was unclear what would happen with more users. So, this was not the right way to determine whether a service was fast enough. We needed many samples to get better statistics, and we needed to be able to simulate many users and to repeat exactly the same tests over and over again. We realized that we needed a better performance testing approach and a tool to support it. Apache JMeter was the tool of our choice. It is currently the most used open source tool to load test functional behavior and measure performance.
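To illustrate why many samples give better statistics than a single observed run, here is a minimal sketch that summarizes a set of response times. The sample values are invented, and the nearest-rank percentile is just one common definition. JMeter's own reports provide comparable figures without any code.

```java
import java.util.Arrays;

class ResponseTimeStats {
    static double mean(long[] millis) {
        return Arrays.stream(millis).average().orElse(Double.NaN);
    }

    /** Nearest-rank percentile for p in (0, 100]. */
    static long percentile(long[] millis, int p) {
        if (millis.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = millis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Invented samples in milliseconds; one slow outlier among fast calls.
        long[] samples = {700, 800, 750, 7000, 720, 780, 760, 740, 730, 790};
        System.out.println("mean   = " + mean(samples) + " ms");
        System.out.println("median = " + percentile(samples, 50) + " ms");
        System.out.println("90th   = " + percentile(samples, 90) + " ms");
    }
}
```

A single run might have hit either the 700 ms case or the 7000 ms case. Only the full set shows that the typical call is fast and that one outlier dominates the mean.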

Basic working of JMeter

Basic work with JMeter starts with recording user actions, that is, my own actions in a browser. Then I replace hard-coded values in the HTTP request, like 12345678 for customerId, with a variable like ${customerId}. In turn, I need to make sure that this customerId value is available from a previous HTTP response, from a JDBC request or from a CSV file. It is crucial to vary parameters like customerId, because a second call like findCustomerById for the same customer will be faster due to caching effects. That is, the database will by then hold the needed data from its files in memory, or even have the results of the queries in its result cache. Therefore, it is very important to mimic the production situation as much as possible and to have a set of test use cases and a number of users that is representative of the actual usage in production.

Functional assertions

I also added a Response Assertion to check that the output of the response is as expected. A blazingly fast but incorrect service is, of course, of no use. It is important not to leave this to the functional testers, because some bugs only manifest themselves under high load. For us, this extra regression test on functionality has prevented costly functional bugs in production on several occasions.

Performance assertions

Additionally, I made sure each service request has a Duration Assertion, in which I record the required response time. JMeter reports can use this value to show which requests meet the response time requirement and which do not.
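The logic of the two kinds of assertions can be sketched in plain Java as follows. This is not JMeter's API: `SampleChecker` and the expected text `status=OK` are hypothetical. The 1500 ms limit is taken from the acceptable average response time mentioned earlier, used here as a per-request limit purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Checks one sampled request for correctness and for response time. */
class SampleChecker {
    private final String expectedText; // hypothetical marker of a correct response
    private final long maxMillis;      // required response time

    SampleChecker(String expectedText, long maxMillis) {
        this.expectedText = expectedText;
        this.maxMillis = maxMillis;
    }

    /** Returns the list of failed checks; an empty list means the sample passed. */
    List<String> check(String label, String responseBody, long elapsedMillis) {
        List<String> failures = new ArrayList<>();
        if (!responseBody.contains(expectedText)) {
            failures.add(label + ": response does not contain '" + expectedText + "'");
        }
        if (elapsedMillis > maxMillis) {
            failures.add(label + ": took " + elapsedMillis + " ms, limit is " + maxMillis + " ms");
        }
        return failures;
    }

    public static void main(String[] args) {
        SampleChecker checker = new SampleChecker("status=OK", 1500);
        System.out.println(checker.check("fast and correct", "status=OK", 700));
        System.out.println(checker.check("slow but correct", "status=OK", 7000));
        System.out.println(checker.check("fast but wrong", "status=ERROR", 300));
    }
}
```

Keeping the two checks separate matters: a request that is fast but wrong and a request that is correct but slow both fail, for different reasons.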

Below is a screenshot of our test in JMeter. The left-hand pane shows the definition of the test plan to execute, made up of visual control elements. The lines in the middle pane show the executed requests. A line shown in red means that not all assertions were met for that request. The response pane on the right shows the response of a particular request, in this case from the processOrder (verwerkenOrder in Dutch) test page, containing test input and output.

Some JMeter best practices

Continuous performance testing with JMeter Ant task
We use the JMeter Ant task to start tests at a scheduled time. This task sends an email with a result report to our stakeholders. This way, our project leader, for example, could quickly see the progress of our performance tuning efforts every morning. Integrating this into the daily build enabled us to quickly see the performance impact of the day's code changes in the test environment. This facilitated continuous performance testing: it helped us to verify that performance improved as expected, and it helped to prevent new functionality from degrading performance again.

Representative performance testing to predict accurately
On multiple occasions, however, we found that a speedup which showed up in the test environment after tuning did not carry over to the production environment as expected. We actually found this to be one of the most challenging tasks: accurately predicting the speedup that a change will bring in production. It proved essential to have a representative test database, that is, one that contains fully sized and up-to-date data. In addition, we found that a subtle difference in database statistics between test and production can cause bad predictions. The same holds for the cache size of the database, memory availability on the machine and other load put on the same machine during the test.

Use of random test data to prevent unrealistic caching
To fetch test data like product IDs, we first used JDBC queries from within the JMeter test plan itself to get random products to test with. This is better than static CSV files containing the product IDs, because products can quickly become out-of-date, in which case you need to re-create the CSV file. Also, when you repeat a test with a static file, you start with the products already used in the previous test, which will likely be unrealistically cached. However, the disadvantage of the JDBC queries is that they introduce artificial, unrealistic load on the database during the test. We found a solution in generating large CSV files with a script before testing and randomly picking IDs from such a file. We introduced a JMeter component, RandomCSVDataSetConfig, to facilitate this approach.
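The idea of picking random IDs from a pre-generated file can be sketched as follows. This is not the RandomCSVDataSetConfig component itself: `RandomIdPicker` is a hypothetical class that only illustrates the random selection, and it assumes the ID is the first column of each line. The demo uses in-memory lines so that it runs on its own. With a real file, the lines could be loaded with `Files.readAllLines`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Picks random IDs from CSV lines that were generated before the test run. */
class RandomIdPicker {
    private final List<String> ids = new ArrayList<>();
    private final Random random;

    RandomIdPicker(List<String> csvLines, Random random) {
        for (String line : csvLines) {
            String id = line.split(",", 2)[0].trim(); // assumption: ID is the first column
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("no IDs found");
        }
        this.random = random;
    }

    /** Returns a uniformly chosen ID; the same ID can come back more than once. */
    String next() {
        return ids.get(random.nextInt(ids.size()));
    }

    public static void main(String[] args) {
        // Invented product IDs; a real file would hold many more than a test run uses.
        List<String> lines = Arrays.asList("1001,Java book", "1002,DVD", "1003,Printer", "");
        RandomIdPicker picker = new RandomIdPicker(lines, new Random());
        for (int i = 0; i < 5; i++) {
            System.out.println("productId=" + picker.next());
        }
    }
}
```

The larger the file is relative to the number of requests in a run, the less likely it is that a repeated test hits rows that are still cached from the previous one.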

New useful features of JMeter
We also found that it paid off to keep an eye on new releases of JMeter. Apache JMeter’s developers have introduced several useful new features over the last few years, like comments and module controllers.

Summary

In the first part of this article, I’ve talked about my experiences as an on-line buyer and the importance of short response times for web shops. I’ve described the speedup challenge of the customer's web shop. We achieved the biggest gains by applying various techniques to minimize the number of remote calls and to optimize the most expensive remote calls. We found the bottlenecks by using evidence. We found the evidence by using tools, most importantly: JMeter for load testing, JAMon for performance monitoring and JARep for performance reporting. With JMeter we can simulate user behavior and put a realistic load on the system. The JMeter test should prevent unrealistic caching in order to be able to predict performance of the system in production. A representative test environment and database are equally important ingredients to cook this prediction adequately.

In part 2 of this article, I’ll discuss how we found bottlenecks with JAMon and JARep and I will wrap up this case study.

About Jeroen Borgers

Jeroen Borgers is a senior consultant with Xebia - IT Architects. Xebia is an international IT consultancy and project organization specialized in enterprise Java and agile development. Jeroen helps customers with enterprise Java performance issues and is an instructor of Java Performance Tuning courses. He has worked on various Java projects in several industries since 1996, as a developer, architect, team lead, quality officer, mentor, auditor, performance tester and tuner. He has worked exclusively on performance assignments since 2005. You can read more on Xebia's blog at http://blog.xebia.com.

Published at DZone with permission of its author, Jeroen Borgers.
