Monday 8 February 2010

Memory Leak Case Studies Part 2

The second in a series of case studies describing real memory leaks and similar Java heap stress conditions that I’ve seen myself, and how they were tracked down and fixed.

Oracle OC4J oracle.dms.spy.Metric

An application which I’d seen running quite happily on other app servers was giving problems when run on Oracle App Server (OC4J). It would run for a few days and then performance would gradually degrade. A quick check using JVM startup switches to log garbage collection activity showed that increasing GC overhead was the most likely culprit.
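
On a Sun HotSpot JVM, GC logging can be enabled with startup switches along the lines of the following (the exact switches and the log file name are just illustrative and vary by JVM vendor and version):

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log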

A heap histogram showed several oracle.dms classes near the top of the list. DMS is an instrumentation technology from Oracle that allows various aspects of the app server and application operation to be monitored.
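
A histogram like the one below can be captured from a running JVM with, for example, the JDK’s jmap tool (the process id is a placeholder):

  jmap -histo <pid>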

      Size   Count  Class description
-------------------------------------------------------
 115136200  813639  char[]
  19947648  831152  java.lang.String
  19887720  497193  oracle.dms.spy.Metric
  13815944  244485  java.lang.Object[]
   9190608   67578  oracle.dms.instrument.PhaseEvent
   8412640   50732  * MethodKlass
   5134584   91689  oracle.dms.spy.Metric[]
   3735840  155660  java.util.ArrayList
   3635952   15771  byte[]
   3601048   79404  * SymbolKlass
   3484080   43551  oracle.dms.instrument.Noun
   2890560   24088  oracle.dms.instrument.State
   2880624  120026  java.util.Hashtable$Entry
   2656648    4659  * ConstantPoolKlass
   2108424   87851  java.util.Vector
   1925072   26967  java.util.HashMap$Entry[]
   1643280    4659  * InstanceKlassKlass


We became convinced that this was an app server issue rather than an application issue, but it took a few days to track down the correct fix. Googling turned up some ways to switch off the reporting of DMS metrics, which we tried, but these did not solve the problem because the metrics were still being captured. Eventually I found the fix rather by chance: there is a JVM startup switch which disables the collection of DMS metrics for JMS. The switch is ‘-Doc4j.jms.noDms=true’. We tried this and found that the problem was solved.
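
For a standalone OC4J instance, which is launched directly with java, adding the switch to the startup command would look something like this (how the property is passed depends on how your particular installation starts the container):

  java -Doc4j.jms.noDms=true -jar oc4j.jar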

Annoyingly, we then found that this was a known issue with OC4J, documented in Metalink bug id 5462430 and fixed in OAS version 10.1.2.99. I say annoyingly because I’d searched Metalink several days earlier and had seen this bug, but had misread the OC4J version number and decided that it wasn’t the same as our problem.

AxisHttpSession

A GUI application using a complex Java servlet had been working just fine in production for over 12 months. Recently it had started to suffer from Java heap stress, causing it to fail if the server component was left running for several weeks.

Our immediate workaround was to schedule a restart of the server component each weekend. We also collected a heap dump file before each restart.
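
On a Sun JVM a heap dump can be taken from the running process with, for example, the jmap tool (the process id and output file name here are just placeholders):

  jmap -dump:format=b,file=heap-before-restart.hprof <pid>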

Analysis of the heap dump using Eclipse MAT didn’t immediately show an obvious culprit. This was because the server was now being restarted every week, so the problem was not building up to the point where it would be really obvious in the heap dump.

Once this point was understood, further analysis showed that the server appeared to be leaking Apache Axis objects. The classes org.apache.axis.transport.http.AxisHttpSession and org.apache.axis.handlers.soap.SOAPService were both retaining large chunks of heap space, even though they appeared only at positions 18 and 19 in the MAT histogram.
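
Instances of a suspect class can also be listed directly in MAT’s OQL console with a query along the lines of the following (class name taken from the histogram), which makes it easy to inspect what is keeping them alive:

  SELECT * FROM org.apache.axis.transport.http.AxisHttpSession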


A search of the Axis bug database highlighted a similar-looking problem: http://issues.apache.org/jira/browse/AXIS-2314. Checking our Axis JAR manifest revealed that we (or rather a bought-in product) were using Axis 1.3, which was too old to have the fix for this bug. We have asked the product vendor to provide a patch to use a later version of Axis.
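
The version information can be read straight out of the JAR’s manifest with something like the following (the jar file name is just illustrative):

  unzip -p axis.jar META-INF/MANIFEST.MF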

This still left the question of why the problem had started happening after the system had been stable for so long. The reason was traced to a recent implementation of SiteMinder single sign-on. This change had required the Java applet configuration to be changed so that, instead of using Java RMI to communicate with the server, it used SOAP and therefore Apache Axis.