Monday 30 November 2009

Memory Leak Case Studies Part 1

I’ve been planning to post a little more on memory leak hunting for some time. For some reason (not by my own choice!) it has been a recurring theme of my professional life for the past year or so. What I had in mind was a few brief case studies of real memory leaks and similar Java heap stress conditions that I’ve seen myself, and how they were tracked down and fixed. I hope these might help someone faced with similar issues, so here goes...

I won’t describe the symptoms that started the investigation separately for each issue – they are usually the same: high CPU usage caused by excessive garbage collection activity, sometimes leading to java.lang.OutOfMemoryError.

If you need any more detail on the tools and techniques used, please see my earlier postings.

The case studies are roughly in chronological order. The first two are below. The rest will be in later postings. I've tried to stick to what actually happened, including any theories that turned out to be wrong. Where I would do things differently with the benefit of hindsight or better tools, I've tried to point this out too.

HSQLDB MemoryNode

A recently developed component was falling over during performance testing. In this case the heap histogram was pretty clear:-

Object Histogram:

     Size   Count  Class description
-------------------------------------------------------
853118944  644449  char[]
 44002832  361196  * ConstMethodKlass
 26010200  361196  * MethodKlass
 18582256   31995  * ConstantPoolKlass
 15542520  647605  java.lang.String
 14955048  395252  java.lang.Object[]
 13799984   31995  * InstanceKlassKlass
 13799360  431230  org.hsqldb.MemoryNode
 13701992  283429  * SymbolKlass
 10192768   24725  * ConstantPoolCacheKlass
  8900360   37575  byte[]
  7836032   85528  java.util.HashMap$Entry[]
  6703232   65651  int[]
  6520920  116445  org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry

The key here is org.hsqldb.MemoryNode. Our applications normally use Oracle, but one part of the new component used an HSQLDB database to keep a record of the transactions that had been received. This caused no problems in development or test, but during performance testing far more log records were created, all of which HSQLDB kept in memory, and the application fell over. To make matters worse, HSQLDB had been configured to save the database to a disk file and reload it on startup, so the problem could also build up across several restarts of the application.

In this case we were able to solve the problem by changing the application configuration to disable the logging feature and clearing out all of the accumulated HSQLDB rows.
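
As an aside, for anyone who hasn't met HSQLDB's table types, here is a minimal sketch of why the rows ended up on the heap in the first place. The table names, columns, JDBC URL and credentials below are invented for illustration; the point is that the default (MEMORY) table type holds all of a table's data in the Java heap, and in file mode that data is reloaded from disk on startup, whereas a CACHED table keeps its rows on disk instead.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HsqldbTableTypesSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");

        // File-mode database: contents are written to disk and reloaded on
        // startup, which is why the problem could survive restarts.
        Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:file:/var/data/txlog", "sa", "");
        Statement st = con.createStatement();

        // The default (MEMORY) table type keeps all table data in the Java
        // heap - the org.hsqldb.MemoryNode instances in the histogram above
        // come from these in-memory structures - so an ever-growing
        // transaction log eventually exhausts the heap.
        st.execute("CREATE MEMORY TABLE TX_LOG_MEM (ID BIGINT, PAYLOAD VARCHAR(4000))");

        // A CACHED table stores its rows on disk and keeps only a limited
        // cache in memory, avoiding this particular failure mode.
        st.execute("CREATE CACHED TABLE TX_LOG_CACHED (ID BIGINT, PAYLOAD VARCHAR(4000))");

        st.close();
        con.close();
    }
}

The same data held in a CACHED table would still grow, but on disk rather than in the heap.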

Oracle T4CVarcharAccessor

A complex EJB-based application was failing during performance testing at high volumes. The heap histogram showed several Oracle JDBC driver classes high up in the list:-

         Size    Count  Class name
1,171,400,136  265,576  char[]
  138,464,419   31,011  byte[]
   13,216,708  255,522  int[]
    8,717,148   47,897  oracle/jdbc/driver/T4CVarcharAccessor
    6,226,876  259,451  java/lang/String
    4,418,140   27,273  oracle/jdbc/driver/T4CNumberAccessor
    4,402,466   29,138  short[]
    2,548,884   17,748  [Ljava/lang/Object;
    2,502,556  104,271  java/util/HashMap$Entry
    2,075,776    2,221  oracle/jdbc/driver/T4CPreparedStatement
    1,982,236   23,840  [Ljava/util/HashMap$Entry;

This problem was more difficult to track down because our code wasn’t creating or running the JDBC statements – that was all happening inside an EJB server that we were calling, part of a commercial product for which we didn't have the source code.

Looking at some heap dumps with the IBM HeapAnalyzer appeared to show that the offending objects were being held in memory by the WebLogic JDBC statement cache. While that could have produced the symptoms of a small-scale memory leak, the explanation didn’t quite make sense because we were seeing far more leaked statements than the cache could ever hold. This is a good example of a tool presenting a summarised view of a complex world, which can sometimes lead us astray. Eventually we stopped blaming WebLogic and found a different way to look for clues...

The breakthrough came from using ‘strings’ and ‘sort’ to analyse the heap dumps. This allowed us to pinpoint a piece of SQL that was repeatedly appearing on the heap and not being garbage collected. The problem turned out to be a simple program logic bug in the EJB server – it worked just fine in the majority of cases, when it generated and ran just one JDBC statement for each client call. On some calls, however, it would generate and run two statements. When this happened it would only call close() on the most recent one, so the JDBC driver leaked the resources belonging to the earlier statement. We raised a bug with the product vendor, and after several calls to explain the issue and our diagnosis, they shipped us a fix.
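
To make the bug pattern concrete, here is a much-simplified sketch of the kind of code involved. The class, method and SQL names, and the needsSecondQuery flag, are all invented, since we never saw the vendor's source; the shape of the mistake is what matters.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StatementLeakSketch {

    // Buggy pattern: each new statement overwrites the reference to the
    // previous one, but only the most recent reference is ever closed. On
    // the calls that create two statements, the first is never closed and
    // the driver keeps its internal buffers (presumably the large char[]
    // seen in the histogram) reachable.
    void handleCallBuggy(Connection con, boolean needsSecondQuery) throws SQLException {
        PreparedStatement stmt = con.prepareStatement("SELECT 1 FROM DUAL");
        stmt.executeQuery();
        if (needsSecondQuery) {
            stmt = con.prepareStatement("SELECT 2 FROM DUAL"); // first statement is now orphaned
            stmt.executeQuery();
        }
        stmt.close(); // only closes the most recently created statement
    }

    // Safer pattern: close every statement in a finally block as soon as it
    // is finished with, however many are created during the call.
    void handleCallFixed(Connection con, boolean needsSecondQuery) throws SQLException {
        PreparedStatement first = con.prepareStatement("SELECT 1 FROM DUAL");
        try {
            first.executeQuery();
        } finally {
            first.close();
        }
        if (needsSecondQuery) {
            PreparedStatement second = con.prepareStatement("SELECT 2 FROM DUAL");
            try {
                second.executeQuery();
            } finally {
                second.close();
            }
        }
    }
}

The second method is just the standard defensive pattern: close each statement in a finally block as soon as you are done with it, so an extra statement created part-way through a call can never be silently dropped.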

Finding the root cause of this issue took several weeks. It would have been a lot easier with the Eclipse Memory Analyzer (MAT).