Friday, 27 April 2012

How to cause a Native OOM

Why would you want to do that? - a java.lang.OutOfMemoryError (OOM) is a bad thing. A native OOM error is even worse because it means that the JVM has no more memory for its own internal purposes and  results in the immediate death of the Virtual Machine, so in most situations you would not want to cause one deliberately.

In our case, we have a good reason to want to cause one - we have been seeing very occasional native OOM errors with part of our application. They are so rare that they are proving very difficult to diagnose. We would like to improve our monitoring to capture better data for diagnosis but to do so, we need to be able to test our monitoring.

Causing a Java heap OOM is pretty easy - just give your virtual machine a smallish heap size using the -Xmx switch and write some Java code that allocates lots of big Strings, populates them with random content (to defeat optimisations like interning) and stores them in a collection so that there is always a reference to them. As I've said previously, diagnosing these issues is a fairly well understood process, the usual solution being to get a heap dump using a switch such as -XX:+HeapDumpOnOutOfMemoryError and then using a tool such as Eclipse MAT to analyse the contents of the dump.

Causing a native OOM is less straightforward but there are ways to do it. A very comprehensive article on the subject of Native OOM errors can be found on the IBM site. This article also has some sample code which causes a native OOM by using some native code and JNI to allocate space in the native heap using malloc. It also suggests another way to chew up the native heap which doesn't need native code, so this is the way I did it. The java.nio.ByteBuffer class includes a method which allows the caller to allocate a 'direct' buffer which exists in the native heap. Here is a fragment of my code which does this...

private ByteBuffer createBuf() { 
    ByteBuffer bb = ByteBuffer.allocateDirect(BUFFSIZE); 
    int idx = 0; 

    while (idx < BUFFSIZE) { 
        Long l = new Long(r.nextLong());   // r is a java.util.Random
        bb.putLong(idx,l.longValue()); 
        idx += 1024;
    } 
    return bb; 

Call this enough times in a loop and keep the resulting references in a collection and you have created your own native memory leak. I found when testing this that I need to be careful to run it only in a 32 bit JVM on a machine with plenty of RAM, otherwise it won't hit the OOM error but it will use up all of your system's memory to the point where the system starts page thrashing. Don't try this on a production box!

Armed with this I was able to test questions like whether the -XX:+HeapDumpOnOutOfMemoryError switch will produce a heap dump for a native OOM. The answer is 'no', at least on Java 1.5.0_22.

Wednesday, 29 February 2012

Don't be Average

I've commented previously in these pages on the usefulness of the Oracle AWR and ASH reports but they are not the perfect answer to many questions. One of their limitations is that they present average figures over a period of time. If we are not very careful then we might start to believe that the average numbers are the whole story. Sometimes they are not.

One of my projects has had a long-standing performance issue with some Oracle code that has been very difficult to crack. One of the main challenges was that we did not have a detailed understanding of the problem. The information in AWR and ASH appeared to have a circumstantial connection to the problem but didn't really suggest how we could solve it. Our attempts to replicate it in a test environment were also unsuccessful - our tests always ran at least 10 times faster than the production system, even on a full-sized database.

This is a typical section of an AWR report from this issue:-



This shows the PL/SQL call (7q17xs6wcbra6) that was the subject of the problem plus another query (8x5z8zszym698) which always seemed to appear high in the AWR list whenever the PL/SQL call was busy. It was clear that there was a huge number of executions of the SQL, but was this the cause of the problem? As it turned out, the answer was 'no'.

The key to improving our understanding of the problem was to get away from looking at averages. The first step in this direction was to use CA Wily Introscope to look at the same PL/SQL calls from the WebLogic side. Looking at a quiet period in the early morning hours, this is what we saw:-



The graph shows that there is a repeating pattern consisting of a handful (usually 2-4) of calls which complete in less than one second followed by a single call that typically takes 30-40 seconds. This was a crucial piece of information which we had been missing. With this knowledge we were able to look at the business task that the procedure was doing and understand the pattern. For those interested, the system was working on a hierarchy of circuits. Deletion of the child circuits was very fast but deletion of the parent circuit required a full check of the hierarchy and was much slower.

Having identified the calls which took a long time we also extended the analysis by using SQL Trace to watch one of these and proved that the averages for the SQL operation were equally misleading for the same reason.

This understanding also allowed us to see what was missing from our test scenario and reproduce the issue in the test environment. Finally the understanding that we had gained allowed us to work with our supplier and find a fix for the issue.

Below is a graph from Wily for the past month showing the average response time of the PL/SQL call before and after we applied the fix. The improvement speaks for itself.