Performance problems are among the biggest challenges to expect when designing and implementing Java EE technologies. They can arise in both lightweight and large IT environments, which typically include several distributed systems, from Web portals and ordering applications to enterprise service bus (ESB), data warehouse, and legacy Mainframe storage systems.
It is very important for IT architects and Java EE developers to understand their client environments and ensure that the proposed solutions will not only meet growing business needs but also deliver a scalable and reliable production IT environment over the long term, at the lowest cost possible. Performance problems can disrupt your client's business, which can result in short- and long-term loss of revenue.
This article will consolidate and share the top 10 causes of Java EE performance problems I have encountered working with IT & Telecom clients over the last 10 years along with high level recommendations.
Please note that this article is in-depth but I'm confident that this substantial read will be worth your time.
#1 - Lack of proper capacity planning
I'm confident that many of you can identify episodes of performance problems following Java EE project deployments. Some of these performance problems could have a very specific and technical explanation but are often symptoms of gaps in the current capacity planning of the production environment.
Capacity planning can be defined as a comprehensive and evolving process of measuring and predicting the current and future capacity required by the IT environment. A properly implemented capacity planning process will not only track current IT production capacity and stability but also ensure that new projects can be deployed with minimal risk in the existing production environment. Such an exercise can also conclude that extra capacity (hardware, middleware, JVM, tuning, etc.) is required prior to project deployment.
In my experience, this is often the most common "process" problem that can lead to short- and long-term performance problems. The following are some examples.
Possible capacity planning gaps
A newly deployed application triggers an overload to the current Java Heap or Native Heap space (e.g., java.lang.OutOfMemoryError is observed).
- Lack of understanding of the current JVM Java Heap (YoungGen and OldGen spaces) utilization
- Lack of memory static and / or dynamic footprint calculation of the newly deployed application
- Lack of performance and load testing preventing detection of problems such as Java Heap memory leak
A newly deployed application triggers a significant increase of CPU utilization and performance degradation of the Java EE middleware JVM processes.
- Lack of understanding of the current CPU utilization (e.g., established baseline)
- Lack of understanding of the current JVM garbage collection health (new application / extra load can trigger increased GC and CPU)
- Lack of load and performance testing failing to predict the impact on existing CPU utilization
A new Java EE middleware system is deployed to production but unable to handle the anticipated volume.
- Missing or inadequate performance and load testing
- Data and test cases used in performance and load testing not reflecting the real world traffic and business processes
- Not enough bandwidth (or pages are much bigger than capacity planning anticipated)
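The first gap above (not knowing the current Java Heap utilization) can be narrowed with the standard `java.lang.management` API. The following is a minimal sketch, not a monitoring solution: the class name `HeapBaseline` and its method names are hypothetical, and the memory pool names it prints depend on the JVM vendor and GC policy in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes a snapshot of the Java Heap pools of the running JVM.
class HeapBaseline {

    /** Returns used bytes per heap memory pool (pool names vary by JVM and GC policy). */
    static Map<String, Long> heapPoolUsage() {
        Map<String, Long> usage = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage u = pool.getUsage(); // may be null if the pool is no longer valid
                if (u != null) {
                    usage.put(pool.getName(), u.getUsed());
                }
            }
        }
        return usage;
    }

    /** Percentage of the maximum Java Heap currently in use, or -1 if no maximum is defined. */
    static double heapUsedPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        return max <= 0 ? -1.0 : 100.0 * heap.getUsed() / max;
    }

    public static void main(String[] args) {
        heapPoolUsage().forEach((name, used) ->
                System.out.printf("%-30s %,d bytes used%n", name, used));
        System.out.printf("Total heap used: %.1f%%%n", heapUsedPercent());
    }
}
```

Sampling these values at regular intervals under normal production load gives a baseline against which the footprint of a newly deployed application can be compared.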
One key aspect of capacity planning is load and performance testing that everybody should be familiar with. This involves generating load against a production-like environment or the production environment itself in order to:
- Determine how many concurrent users / how much order volume your application(s) can support
- Expose your platform and Java EE application bottlenecks, allowing you to take corrective actions (middleware tuning, code change, infrastructure and capacity improvement, etc.)
There are several technologies out there allowing you to achieve these goals. Some load-testing products allow you to generate load from inside your network from a test lab while other emerging technologies allow you to generate load from the "Cloud".
Regardless of the load and performance testing tool that you decide to use, this exercise should be done on a regular basis for any dynamic Java EE environments and as part of a comprehensive and adaptive capacity planning process. When done properly, capacity planning will help increase the service availability of your client IT environment.
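To make the idea of generating concurrent load concrete, here is a deliberately small sketch in plain Java. It is not a replacement for a load-testing product: the class `MiniLoadTest`, its methods, and the simulated 5 ms request are all assumptions made for illustration, and a real test would call the system under test instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical mini load generator: N concurrent "users", each issuing M requests.
class MiniLoadTest {

    /** Runs the request on `users` threads, `requestsPerUser` times each; returns sorted latencies in ms. */
    static List<Long> run(int users, int requestsPerUser, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            Callable<List<Long>> user = () -> {
                List<Long> latencies = new ArrayList<>();
                for (int i = 0; i < requestsPerUser; i++) {
                    long start = System.nanoTime();
                    request.run();
                    latencies.add(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
                }
                return latencies;
            };
            List<Future<List<Long>>> futures = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                futures.add(pool.submit(user));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> f : futures) {
                all.addAll(f.get());
            }
            Collections.sort(all);
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load test interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("simulated request failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    /** Nearest-rank percentile of an ascending-sorted, non-empty list. */
    static long percentile(List<Long> sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) {
        // Stand-in for a real request (e.g., an HTTP call to the system under test).
        Runnable fakeRequest = () -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Long> latencies = run(10, 20, fakeRequest);
        System.out.printf("requests=%d median=%d ms p95=%d ms%n",
                latencies.size(), percentile(latencies, 50), percentile(latencies, 95));
    }
}
```

Raising the number of users step by step while watching the percentiles is one simple way to see where response time starts to degrade.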
#2 - Inadequate Java EE middleware environment specifications
The second most common cause of performance problems I have observed for Java EE enterprise systems is an inadequate Java EE middleware environment and / or infrastructure. Not making proper decisions at the beginning of a new platform can result in major stability problems and increased costs for your client in the long term. For that reason, it is important to spend enough time brainstorming on the required Java EE middleware specifications. This exercise should be combined with an initial capacity planning iteration since the business processes, expected traffic, and application(s) footprint will ultimately dictate the initial IT environment capacity requirements.
The following are typical examples of problems I have observed:
- Deployment of too many Java EE applications in a single 32-bit JVM
- Deployment of too many Java EE applications in a single middleware domain
- Lack of proper vertical scaling and under-utilized hardware (e.g., traffic driven by one or just a few JVM processes)
- Excessive vertical scaling and over-utilized hardware (e.g., too many JVM processes vs. available CPU cores and RAM)
- Lack of environment redundancy and fail-over capabilities
Trying to leverage a single middleware domain and / or JVM for many large Java EE applications can be quite attractive from a cost perspective. However, this can result in an operational nightmare and severe performance problems such as excessive JVM garbage collection and many domino effect scenarios (e.g., Stuck Threads) causing high business impact (e.g., App A causing App B, App C, and App D to go down because a full JVM restart is often required to resolve problems).
- Project team should spend enough time creating a proper operation model for the Java EE production environment.
- Attempt to find a good "balance" for your Java EE middleware specifications so that the business and operations teams have proper flexibility in the event of outage scenarios.
- Avoid deployment of too many Java EE applications in a single 32-bit JVM. The middleware is designed to handle many applications, but your JVM may suffer the most.
- Choose a 64-bit over a 32-bit JVM when it is required but combine with proper capacity planning and performance testing to ensure your hardware will support it.
#3 - Excessive Java VM garbage collections
Now let's jump to pure technical problems starting with excessive JVM garbage collection. Most of you are familiar with this famous (or infamous) Java error: java.lang.OutOfMemoryError. This is the result of JVM memory space depletion (Java Heap, Native Heap, etc.).
I'm sure middleware vendors such as Oracle and IBM could provide you with dozens and dozens of support cases involving JVM OutOfMemoryError problems on a regular basis, so no surprise that it made the #3 spot in our list.
Keep in mind that a garbage collection problem will not necessarily manifest itself as an OOM condition. Excessive garbage collection can be defined as an excessive number of minor and / or major collections performed by the JVM GC Threads (collectors) in a short amount of time leading to high JVM pause time and performance degradation. There are many possible causes:
- Java Heap size chosen is too small vs. JVM concurrent load and application(s) memory footprint.
- Inappropriate JVM GC policy used.
- Your application(s) static and / or dynamic memory footprint is too big to fit in a 32-bit JVM.
- The JVM OldGen space is leaking over time * quite common problem *; excessive GC (major collections) is observed after a few hours / days.
- The JVM PermGen space (HotSpot VM only) or Native Heap is leaking over time * quite common problem *; OOM errors are often observed over time following application dynamic redeployments.
- Ratio of YoungGen / OldGen space is not optimal for your application(s) (e.g., a bigger YoungGen space is required for applications generating a massive amount of short-lived objects). A bigger OldGen space is required for applications creating a lot of long-lived / cached objects.
- The Java Heap size used for a 32-bit VM is too big, leaving little room for the Native Heap. Problems can manifest as OOM when trying to deploy a new Java EE application, create new Java Threads, or perform any computing task that requires native memory allocations.
Before pointing a finger at the JVM, keep in mind that the actual "root" cause can be related to our #1 & #2 causes. An overloaded middleware environment will generate many symptoms, including excessive JVM garbage collection.
Proper analysis of your JVM related data (memory spaces, GC frequency, CPU correlation, etc.) will allow you to determine if you are facing a problem or not. A deeper level of analysis to understand your application memory footprint will require you to analyze JVM Heap Dumps and / or profile your application using a profiler tool of your choice (such as JProfiler).
- Ensure that you monitor and understand your JVM garbage collection very closely. There are several commercial and free tools available to do so. At the minimum, you should enable verbose GC, which will provide all the data that you need for your health assessment.
- Keep in mind that GC related problems are unlikely to be caught during development or functional testing. Proper garbage collection tuning will require you to perform load and performance testing with a high volume of simultaneous users. This exercise will allow you to fine-tune your Java Heap memory footprint as per your applications' behaviour and load level forecast.
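Alongside verbose GC logs, the JVM exposes basic collector statistics through `GarbageCollectorMXBean`. The sketch below is only a rough indicator under stated assumptions: the class name `GcOverhead` is hypothetical, the reported collection time is an approximate accumulated value whose meaning differs between collectors, and the overhead figure is simply accumulated GC time divided by elapsed time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: estimates how much of the JVM uptime was spent in garbage collection.
class GcOverhead {

    /** Total accumulated GC time in ms across all collectors since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Percentage of elapsed time spent in GC: 100 * gcMillis / elapsedMillis. */
    static double overheadPercent(long gcMillis, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : 100.0 * gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead since startup: %.2f%%%n",
                overheadPercent(totalGcMillis(), uptime));
    }
}
```

Computing the same ratio over a sliding window (the difference between two samples) rather than since startup makes a sudden increase in major collections easier to spot during a load test.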
#4 - Too many or poor integration with external systems
The next common cause of bad Java EE performance is mainly applicable to highly distributed systems, which are typical of Telecom IT environments. In such environments, a middleware domain (e.g., Service Bus) will rarely do all the work but rather "delegate" some of the business processes, such as product qualification, customer profile, and order management, to other Java EE middleware platforms or legacy systems such as Mainframe via various payload types and communication protocols.
Such external system calls mean that the client Java EE application will trigger creation or reuse of Socket Connections to write and read data to/from external systems across a private network. Some of these calls can be configured as synchronous or asynchronous depending on the implementation and the nature of the business process. It is important to note that the response time can change over time depending on the health of the external systems, so it is very important to shield your Java EE application and middleware via proper use of timeouts.
Major problems and performance slowdown can be observed in the following scenarios:
- Too many external system calls are performed in a synchronous and sequential manner. Such an implementation is also fully exposed to instability and slowdown of its external systems.
- Timeouts between Java EE client applications and external systems are missing or values are too high. This will cause client Threads to get Stuck, which can lead to a full domino effect.
- Timeouts are properly implemented but the middleware is not fine-tuned to handle the "non-happy" path. Any increase in response time (or outage) of an external system will lead to increased Thread utilization and Java Heap utilization (increased # of pending payload data). The middleware environment and JVM must be tuned to anticipate and handle both "happy" and "non-happy" paths to prevent a full domino effect.
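One way to keep a synchronous external call from holding a client Thread indefinitely is to bound the wait on the caller side. The following is a minimal sketch using `java.util.concurrent`; the class `BoundedCall`, the fallback value, and the simulated slow system are assumptions for illustration, not a description of how any particular middleware implements timeouts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper: bounds how long the caller waits for an external system.
class BoundedCall {

    /** Runs the call on the pool, waiting at most timeoutMillis; returns fallback on timeout or failure. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> call, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // request interruption so the worker does not stay stuck
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the external call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Simulated slow external system (2 s) called with a 200 ms budget.
            String slow = callWithTimeout(pool, () -> {
                Thread.sleep(2000);
                return "customer-profile";
            }, 200, "unavailable");
            String fast = callWithTimeout(pool, () -> "customer-profile", 200, "unavailable");
            System.out.println(slow + " / " + fast);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that cancelling the Future only requests interruption of the worker; a worker blocked in a socket read may not react to it, so a caller-side bound like this complements, rather than replaces, connect and read timeouts set on the connection itself.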
I also recommend that you spend adequate time performing negative testing. This means that problem conditions should be "artificially" introduced to the external systems in order to test how your application and middleware environment handle failures of those external systems. This exercise should also be performed under a high-volume situation, allowing you to fine-tune the different timeout values between your applications and external systems.
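A very small form of negative testing can be sketched with a stand-in "hung" external system: a local server socket that accepts the TCP connection but never replies. The class and method names below are hypothetical, and the timeout values are arbitrary; the point is only to show that a read timeout set on the client socket turns an indefinite hang into a bounded failure.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;

// Hypothetical negative test: the "external system" accepts connections but never answers.
class NegativeTimeoutTest {

    /** Connects and tries to read one byte; returns true if the read timed out. */
    static boolean readTimesOut(SocketAddress target, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(target, connectTimeoutMs);
            socket.setSoTimeout(readTimeoutMs); // bounds each blocking read
            try {
                socket.getInputStream().read();
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    /** Starts the silent server on a free loopback port and checks that the client read times out. */
    static boolean hungServerTriggersTimeout(int readTimeoutMs) {
        try (ServerSocket hung = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            SocketAddress target = new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort());
            return readTimesOut(target, 1000, readTimeoutMs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        boolean timedOut = hungServerTriggersTimeout(300);
        System.out.printf("timed out: %b after ~%d ms%n",
                timedOut, System.currentTimeMillis() - start);
    }
}
```

Running the same kind of check while the application is under load shows whether the chosen timeout values release Threads quickly enough to avoid a domino effect.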
#5 - Lack of proper database SQL tuning & capacity planning
The next common performance problem should not be a surprise for anybody: database issues. Most Java EE enterprise systems rely on relational databases for various business processes from portal content management to order provisioning systems. A solid database environment and foundation will ensure that your IT environment will scale properly to support your client's growing business.
In my production support experience, database-related performance problems are very common. Since most database transactions are typically executed via JDBC Datasources (including for relational persistence APIs such as Hibernate), performance problems will initially manifest as Stuck Threads from your Java EE container Thread manager. The following are common database-related problems I have seen over the last 10 years:
* Note that Oracle database is used as an example since it is a common product used by my IT clients.*
- Isolated, long-running SQLs. This problem will manifest as Stuck Threads and is usually a symptom of a lack of SQL tuning, missing indexes, a non-optimal execution plan, a returned dataset that is too large, etc.
- Table or row level data lock. This problem can manifest especially when dealing with a two-phase commit transactional model (e.g., the infamous Oracle In-Doubt Transactions). In this scenario, the Java EE container can leave some pending transactions waiting for final commit or rollback, leaving data locks that can trigger performance problems until such locks are removed. This can happen as a result of a trigger event such as a middleware outage or server crash.
- Sudden change of execution plan. I have seen this problem quite often, and it is usually the result of changes in data patterns, which can (for example) cause Oracle to update the query execution plan on the fly and trigger major performance degradation.
- Lack of proper management of the database facilities. For example, Oracle has several areas to look at such as REDO logs, database data files, etc. Problems such as lack of disk space and log files not rotating can trigger major performance problems and an outage situation.
- Proper capacity planning involving load and performance testing is critical here to fine-tune your database environment and detect any problems at the SQL level.
- If you are using Oracle databases, ensure that your DBA team is reviewing the AWR Report on a regular basis, especially in the context of an incident and root cause analysis process. The same analysis approach should also be applied for other database vendors.
- Take advantage of JVM Thread Dump and AWR Report to pinpoint the slow running SQLs and / or use a monitoring tool of your choice to do the same.
- Make sure to spend enough time to fortify the "Operation" side of your database environment (disk space, data files, REDO logs, table spaces, etc.) along with proper monitoring and alerting. Failure to do so can expose your client IT environment to major outage scenarios and many hours of downtime.
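As a complement to a full JVM Thread Dump, the standard `ThreadMXBean` can list the Threads currently executing inside a given package, which is a quick way to see how many Threads are waiting on the database. This is only a sketch: the class `StuckThreadFinder` is hypothetical, and the `oracle.jdbc` prefix used in `main` is an assumption that applies when the Oracle JDBC driver is in use; substitute the package of your own driver.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds live Threads that are currently executing inside a given package.
class StuckThreadFinder {

    /** Returns "threadName @ frame" for each live Thread whose stack has a class starting with classPrefix. */
    static List<String> threadsInside(String classPrefix) {
        List<String> matches = new ArrayList<>();
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : dump) {
            if (info == null) {
                continue; // defensive: skip Threads that ended during the dump
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classPrefix)) {
                    matches.add(info.getThreadName() + " @ " + frame);
                    break; // one match per Thread is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed prefix for the Oracle JDBC driver classes; adjust for your driver.
        List<String> inJdbc = threadsInside("oracle.jdbc");
        System.out.println(inJdbc.size() + " thread(s) currently inside the JDBC driver");
        inJdbc.forEach(System.out::println);
    }
}
```

Taking this snapshot a few times, several seconds apart, helps distinguish Threads that are genuinely stuck on a slow SQL from Threads that merely happened to be in the driver at that instant.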