Tuesday, 11 August 2009

How to find Java Memory Leaks - Part 2

In the previous post I looked at how to monitor your application in test or production to observe its heap behaviour and understand whether or not you have a memory leak. In this post I will start to look at the next step - how to gather data to help to trace the root cause of your leak.

In some simple cases it is possible to use inspection of the source code, a debugger or trace statements to figure out what is going wrong by spotting where the developer has made a mistake. If this works for you, then that's great. I want to focus on the more complex situations where you are dealing with lots of code and lots of heap objects that may not be yours and you can't just go straight to a piece of Java and see what the problem is.

To understand a memory leak we usually need to use tools which are very different from those used to diagnose other issues. For most issues we want to understand the dynamic behaviour of our code - the steps that it performs, which paths it takes, how long it takes, where exceptions are thrown and so on. For memory leaks we need to start with something completely different - namely one or more snapshots of the heap at points in time which can give an insight into the way it is being used up. Only once we have this picture can we start to work back to the code to understand why the heap is being used up.

Before going any further, a small warning - most of the techniques from here on are not well suited to production environments. They can hang the JVM for anything between a few seconds and a few minutes and write large files to disk. You really need to reproduce your issue in a test environment and use these tools there. In a crisis, you may be allowed to use them in production if your app is already under severe heap stress and is about to be killed or restarted anyway.

The simplest type of snapshot is a heap 'histogram'. This gives you a quick summary (by class) of which objects are consuming the most space in the heap including how many bytes they are using and how many instances are in the heap.

There are a couple of ways to get a histogram. The simplest is jmap - a tool included in some JDK distributions from Java 1.5 onwards. Alternatively, you can get a histogram using a JVM startup argument -XX:-PrintClassHistogram and sending your app a Ctrl-Break (Windows) or kill -3 (Unix) signal. This JVM option is also settable via JMX if your JVM is recent enough.

Here is an example (from a test app that deliberately leaks memory)...

andy@belstone:~> jmap -histo 2937
Attaching to process ID 2937, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_09-b03
Iterating over heap. This may take a while...
Object Histogram:

Size Count Class description
-------------------------------------------------------
65419568 319406 char[]
7472856 311369 java.lang.String
1966080 61440 MemStress
596000 205 java.lang.Object[]
64160 4010 java.lang.Long
34248 13 byte[]
32112 2007 java.lang.StringBuffer
10336 34 * ObjArrayKlassKlass
4512 47 java.lang.Class
4360 29 int[]
(snip)
8 1 java.lang.reflect.ReflectAccess
8 1 java.util.Collections$ReverseComparator
8 1 java.util.jar.JavaUtilJarAccessImpl
8 1 java.lang.System$2
Heap traversal took 42.243 seconds.
andy@belstone:~>

There are a few things to note here:-
  • Using this tool will hang the JVM until it has finished the histogram dump - in this case for 42 seconds. Don't try this if you can't afford to hang your JVM.
  • The objects highest in the list are java Strings and their associated character arrays. This is very common. You may also see other built in types (e.g. collections) near the top of the list.
  • What you should usually look for are the classes which appear highest in the list over which you actually have some direct control. This should give you the best clue about what types of object are involved in your memory leak.
In the (rather contrived) example above, the class 'MemStress' looks like it could be the real consumer of the heap space and indeed it is. Even though we've found it, most of the space it is using is being used indirectly - in this case by the Strings it uses to hold its instance variables. We were fortunate that MemStress objects on their own are big enough to push the class high enough up the list to be noticed. In some cases you may not be so lucky - instances of the main culprit may be quite small and it may therefore be hiding lower down the list. Try importing the list into a spreadsheet and sorting by instance count to see if that offers any additional clues.

It's usually best to have several heap histograms from the same run of your app at different points in time. Comparing these will give you a much better picture of how your memory leak is building up over time. Try to arrange to trigger them so that your app is in a similar state (usually idle) for each dump - then you will be comparing 'like for like'.

A heap histogram may be all you need to figure out what's going wrong, in which case you can don't need to worry about the next step.

What a histogram can't tell you is why each object is staying in the heap. This happens because at least one other object is still holding a reference to it. A full heap dump has the information needed to trace these references and also look in detail at the values of individual attributes in every object on the heap. It's also a big step up in size and complexity from a heap histogram. I'll look at how to take and analyse a full heap dump in the next installment.

Monday, 10 August 2009

How to find Java Memory Leaks - Part 1

Java heap consumption and memory leak diagnosis is something that (judging by other posts around the net) is often misunderstood - I've lost count of the number of searches that have turned up advice to simply increase the heap size and hope the problem goes away.

Let's address this one question straight away - if you really have a memory leak, increasing the heap size will at best only delay the appearance of an OutOfMemoryError. If your issue is just a temporary spell of high demand for heap memory then a bigger heap may help. If you take it too far you may see much worse performance because of swapping or even crash your JVM.

So what should you do? In essence, I would recommend three steps:-

1. Monitor your application in test and/or production to get an understanding of its heap behavior
2. If you really do have a problem, you need to gather data to help to track down the cause.
3. Finally having got the data, you need to figure out what it means - what is causing your problem and how to fix it.

Sadly these steps are not always easy to do - particularly step 3.

I'll be talking about the Sun JVM from here on, although many of the same principles and sometimes tools apply to other JVM implementations.

First let's look at monitoring...

To get a clear picture, you really need to look in detail at the heap stats from your JVM, but I quite often find that these are not available when I'm first asked to look at a problem, so what can we tell without this?

Clearly if you see this in your application logs then there is a problem...

java.lang.OutOfMemoryError: Java heap space

The usual response if you have real users is to restart the application straight away. You should then continue to monitor because you're probably going to need to know more before you get to the end of the diagnosis.

I often start by looking at CPU consumption using top, vmstat or whatever tools are available. What I'm looking for is periods where the Java app is using close to 100% of a single CPU. While it's not conclusive proof (being stuck in a loop could cause the same effect), this often means that the app is spending a lot of time doing full garbage collections - the most commonly used garbage collectors are single threaded and therefore tend to use 100% of a CPU.

Monitoring memory usage from the operating system is not very informative - Java will typically grow the heap to the maximum size long before there is a problem. If you have oversized your heap compared to physical memory you will probably see heavy swapping activity, but that's a different issue.

Some applications and app servers will make calls to java.lang.Runtime.freeMemory(). While this may be better than nothing, it's fairly uninformative and provides very little info about which generations have free space, how much time is being spent garbage collecting and so on.

The JVM can provide much more detailed info about what is going on. The JVM itself provides two main ways to get hold of this:-

1. JVM Startup arguments to write information to log files. Here are some suggestions, but check the docs for your specific JVM version (or try here) because the available options vary:-

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution

You may also want to add -Xloggc:myapp-gc.log to send your GC info to a dedicated log file

You can then analyse the log files to produce a graph using a tool such as gcviewer. This approach gives you detailed information about every GC run but be careful to check that you have enough disk space - these options can produce a lot of data. Typically you will also need to restart your app to enable these options, although they are also controllable via JMX for more recent JVM versions.

2. JDK tools - jvmstat in Java 1.4 and jstat in Java 1.5 and higher. These are tools that you run from the command line. They connect to your Java app via one of the JVM debugging APIs and will then report on heap and GC stats according to the options you specify. I usually use jstat -gcutil -t 60s which produces a line of output once per minute showing the percentage utilisation of each heap space, plus the number of young and full GC events and the cumulative GC times. You can then open the resulting text file in Excel for analysis. Note that jstat is capable of monitoring Java 1.4 VMs, which is handy if you have multiple versions. Alternatively (if your environment allows) you can try visualgc for an instant GUI view but I prefer using jstat to collect logs for offline analysis.

There are other commercial tools (e.g. Wily Introscope) which can provide the same data as the JDK tools. They may provide better visualisation, historical reporting and perhaps proactive alerting
but I don't intend to cover them in detail here because most readers probably don't have them.

So what should you look for to know whether you have a problem?
  • OutOfMemoryErrors in your application logs - clearly you have a problem if you are getting these.
  • A persistent upwards trend in the Old Generation usage after each full garbage collection.
  • Heavy GC activity. What 'heavy' means is rather dependant on your application. A rule of thumb for a non-interactive app might be 15 seconds during each minute - which means that your app only has 75% of its time left to do useful work. You might choose a lower figure or look at the duration of individual full GC runs, especially if your app has real users waiting for it.
Finally, some pictures. Here is a graph showing jstat data from an app on WebLogic 9.2 with a heap memory leak - you can see that the Old Generation usage is climbing, even after garbage collection.

Here is another graph from the same app. The original leak has been fixed and there is now no evidence of a cumulative memory leak, but there are some intermittent spikes in heap usage. If the spike is severe enough it can still cause an error. Tracking this problem down will need some knowledge about what unusual types of system activity are occurring at the time of the spike.


Here is a graph drawn by gcviewer after using JVM startup arguments to produce a log file. The graph is quite 'busy' and includes a lot of extra info - for example this app's arguments allow the heap to grow as more is needed. The graph shows the growth in overall heap size as well as the growth in consumption. This info is available using jstat too, but if you really need it all then using gcviewer will probably be more convenient.

In the next post I'll look at how to gather data to help trace the cause of your memory leak.