Friday 27 April 2012

How to cause a Native OOM

Why would you want to do that? A java.lang.OutOfMemoryError (OOM) is a bad thing. A native OOM is even worse because it means that the JVM has no more memory for its own internal purposes, and it results in the immediate death of the Virtual Machine, so in most situations you would not want to cause one deliberately.

In our case, we have a good reason to want to cause one - we have been seeing very occasional native OOM errors with part of our application. They are so rare that they are proving very difficult to diagnose. We would like to improve our monitoring to capture better data for diagnosis but to do so, we need to be able to test our monitoring.

Causing a Java heap OOM is pretty easy - just give your virtual machine a smallish heap size using the -Xmx switch and write some Java code that allocates lots of big Strings, populates them with random content (to defeat optimisations like interning) and stores them in a collection so that there is always a reference to them. As I've said previously, diagnosing these issues is a fairly well understood process: the usual solution is to get a heap dump using a switch such as -XX:+HeapDumpOnOutOfMemoryError and then analyse the contents of the dump with a tool such as Eclipse MAT.
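
To make that concrete, here is a minimal sketch of that kind of heap filler. The class name and sizes are arbitrary choices of mine, not taken from any real project. Run it with a small heap (for example java -Xmx64m HeapFiller) and it should die with a heap OOM fairly quickly.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class HeapFiller {
    public static void main(String[] args) {
        Random r = new Random();
        List<String> hoard = new ArrayList<String>();    // keeps every String reachable
        while (true) {
            char[] chars = new char[1024 * 1024];        // ~2MB of char data per String
            for (int i = 0; i < chars.length; i++) {
                chars[i] = (char) ('a' + r.nextInt(26)); // random content defeats interning
            }
            hoard.add(new String(chars));
            System.out.println("Strings so far: " + hoard.size());
        }
    }
}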

Causing a native OOM is less straightforward but there are ways to do it. A very comprehensive article on the subject of native OOM errors can be found on the IBM site. That article includes sample code which causes a native OOM by using JNI and native code to allocate space on the native heap with malloc. It also suggests another way to chew up the native heap which doesn't need native code, so that is the approach I took. The java.nio.ByteBuffer class includes a method which allows the caller to allocate a 'direct' buffer that lives in the native heap. Here is a fragment of my code which does this...

private ByteBuffer createBuf() {
    // r is a java.util.Random; BUFFSIZE is the buffer size in bytes
    ByteBuffer bb = ByteBuffer.allocateDirect(BUFFSIZE);
    int idx = 0;

    // write into the buffer at intervals so that the native memory is really touched
    while (idx < BUFFSIZE) {
        Long l = new Long(r.nextLong());
        bb.putLong(idx, l.longValue());
        idx += 1024;
    }
    return bb;
}
Call this enough times in a loop, keep the resulting references in a collection, and you have created your own native memory leak. I found when testing this that I needed to be careful to run it only in a 32-bit JVM on a machine with plenty of RAM; otherwise it won't hit the OOM error, but it will use up all of your system's memory to the point where the system starts page thrashing. Don't try this on a production box!
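
Strung together, the whole thing looks something like the sketch below. BUFFSIZE, the use of an ArrayList and the endless loop are arbitrary choices of mine; the point is simply that every direct buffer stays reachable. Depending on the JVM, you may also need to set -XX:MaxDirectMemorySize to a large value so that the direct buffer limit doesn't get in the way first.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NativeHog {
    private static final int BUFFSIZE = 1024 * 1024;             // 1MB per direct buffer
    private final Random r = new Random();
    private final List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();

    private ByteBuffer createBuf() {
        ByteBuffer bb = ByteBuffer.allocateDirect(BUFFSIZE);     // lives in the native heap
        for (int idx = 0; idx < BUFFSIZE; idx += 1024) {
            bb.putLong(idx, r.nextLong());                       // touch the memory so it is really used
        }
        return bb;
    }

    public static void main(String[] args) {
        NativeHog hog = new NativeHog();
        while (true) {
            hog.buffers.add(hog.createBuf());                    // keep a reference so nothing is ever collected
            System.out.println("Direct buffers allocated: " + hog.buffers.size());
        }
    }
}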

Armed with this I was able to test questions like whether the -XX:+HeapDumpOnOutOfMemoryError switch will produce a heap dump for a native OOM. The answer is 'no', at least on Java 1.5.0_22.

Wednesday 29 February 2012

Don't be Average

I've commented previously in these pages on the usefulness of the Oracle AWR and ASH reports, but they are not the answer to every question. One of their limitations is that they present average figures over a period of time. If we are not very careful then we might start to believe that the average numbers are the whole story. Sometimes they are not.

One of my projects has had a long-standing performance issue with some Oracle code that has been very difficult to crack. One of the main challenges was that we did not have a detailed understanding of the problem. The information in AWR and ASH appeared to have a circumstantial connection to the problem but didn't really suggest how we could solve it. Our attempts to replicate it in a test environment were also unsuccessful - our tests always ran at least 10 times faster than the production system, even on a full-sized database.

This is a typical section of an AWR report from this issue:-

[AWR report excerpt showing the top SQL statements]

This shows the PL/SQL call (7q17xs6wcbra6) that was the subject of the problem plus another query (8x5z8zszym698) which always seemed to appear high in the AWR list whenever the PL/SQL call was busy. It was clear that there was a huge number of executions of the SQL, but was this the cause of the problem? As it turned out, the answer was 'no'.

The key to improving our understanding of the problem was to get away from looking at averages. The first step in this direction was to use CA Wily Introscope to look at the same PL/SQL calls from the WebLogic side. Looking at a quiet period in the early morning hours, this is what we saw:-

[Introscope graph of PL/SQL call response times during the quiet period]

The graph shows that there is a repeating pattern consisting of a handful (usually 2-4) of calls which complete in less than one second followed by a single call that typically takes 30-40 seconds. This was a crucial piece of information which we had been missing. With this knowledge we were able to look at the business task that the procedure was doing and understand the pattern. For those interested, the system was working on a hierarchy of circuits. Deletion of the child circuits was very fast but deletion of the parent circuit required a full check of the hierarchy and was much slower.

Having identified the calls which took a long time, we extended the analysis by using SQL Trace to watch one of them, and proved that the averages for the SQL operation were equally misleading for the same reason.

This understanding also allowed us to see what was missing from our test scenario and reproduce the issue in the test environment. Finally, it allowed us to work with our supplier to find a fix for the issue.

Below is a graph from Wily for the past month showing the average response time of the PL/SQL call before and after we applied the fix. The improvement speaks for itself.

[Wily graph of average PL/SQL response time before and after the fix]

Wednesday 30 November 2011

Using Jstat with the CMS Collector

Several months back I was investigating high CPU utilisation on a customer system. One factor that I noticed was that the Concurrent Mark Sweep garbage collector on one of the Java 5 JVMs seemed to be using 100% of a CPU for much of the day. Checking the GC log file, I found that during this period the CMS collector was running almost continuously.

We also use jstat to monitor the GC behaviour because its output is much easier to interpret than the GC log. Jstat showed total GC times much lower than those shown in the GC logs. I also noticed that the number of 'full GCs' reported by jstat frequently increased by two or a larger even number and that odd numbers appeared only rarely and for short periods of time.

So what does this mean? Why does jstat apparently disagree with the information from the GC logs and why does it seem to report that full collections occur in pairs?

Although I've never seen the answer documented anywhere (no doubt some kind soul will now point it out to me), I had a hunch that I knew what it was. My theory was that jstat was reporting just the application pause time, not the total execution time of the collector.

So I set out to try to prove or disprove my theory by running a simple test application to exercise the java heap and collecting both the GC log file (-verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log) and the jstat (-gcutil -t) output for comparison.
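
The test application was nothing clever - something along the lines of the sketch below (a reconstruction of the general idea rather than the original program; the class name, block size and window size are arbitrary). It keeps a sliding window of live objects so that the tenured generation stays fairly full and the CMS collector has to keep running. Run it with the CMS collector enabled, for example java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log MemStress, while pointing jstat at the process.

import java.util.LinkedList;
import java.util.Random;

public class MemStress {
    public static void main(String[] args) throws Exception {
        Random r = new Random();
        LinkedList<byte[]> window = new LinkedList<byte[]>();
        while (true) {
            byte[] block = new byte[64 * 1024];   // allocate 64KB at a time
            r.nextBytes(block);                   // touch the memory
            window.addLast(block);
            if (window.size() > 4000) {           // keep roughly 250MB live
                window.removeFirst();             // older blocks become garbage
            }
            Thread.sleep(1);                      // pace the allocation a little
        }
    }
}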

I also did a little reading about the CMS collector to help me to identify the application pause time in the logs. This told me that each CMS cycle causes two application pauses - one for the 'initial mark' phase and one for the 'remark' phase.

Here is the jstat output:-

E:\source\memstress>jstat -gcutil -t 3136 10s

Timestamp    S0     S1      E      O      P    YGC   YGCT   FGC   FGCT    GCT
     13.4  0.00   0.00  58.11  43.79   9.73      8  0.440     0  0.000  0.440
     23.5  0.00   0.00   3.35  72.78   9.77     15  0.848     6  0.078  0.926
     33.5  0.00   0.00  37.99  72.60   9.77     16  0.896    12  0.313  1.208

So between 13.4 seconds and 23.5 seconds of uptime, jstat reports 0.078s of full GC time.
The GC log is too verbose to reproduce here, even for this simple test, but the relevant messages for the pause phases were:-

15.410: [GC [1 CMS-initial-mark: 223173K(408576K)] 223672K(524224K), 0.0001739 secs]
[1 CMS-remark: 245213K(408576K)] 329218K(524224K), 0.0307485 secs]
19.074: [GC [1 CMS-initial-mark: 274783K(408576K)] 276925K(524224K), 0.0001132 secs]
[1 CMS-remark: 274783K(408576K)] 335949K(524224K), 0.0313580 secs]
21.232: [GC [1 CMS-initial-mark: 287928K(408576K)] 287928K(524224K), 0.0001128 secs]
[1 CMS-remark: 310245K(408576K)] 314122K(524224K), 0.0157990 secs]

So there were three CMS collection cycles in the tenured generation during the 10 second interval covered by the second line of jstat output. This corresponds to six pauses, which is what jstat reports as the 'Full GC count'.

Adding up the pause times, I reached a figure of 0.0779 seconds which agrees nicely with the 0.078 seconds reported by jstat.

So… when used with the CMS collector, jstat seems to report the number of application pauses for garbage collection and the amount of time for which the application was paused. The pause time will probably be much lower than the total amount of GC execution time.

Monday 31 October 2011

Things That go Dead in the Night

Happy Halloween.

This post will be brief. Many people will think that it's a statement of the obvious but it's amazing how much time can be wasted on issues like this, especially in large organisations.

Sometimes we see systems behave rather strangely during performance tests and the teams managing the tests can sometimes be very inventive in their explanations of the issue. This particular issue has crossed my desk a couple of times (too many) in recent weeks and has been presented as 'a strange unexplained issue', 'file corruption', 'network issue' or 'SAN failure'.

Without building up any further mystery or invoking the supernatural, the answer was simple - the file system was allowed to become full. This can cause all sorts of interesting and varied problems to your system and the execution of any tests. In some cases it will fail completely, in others it might continue almost without anything appearing to be wrong and there are any number of variations in between these extremes.

One common factor is that it is quite normal for there to be no helpful error message like 'disk full' visible because often the file system that becomes full is the one being used to write the log file. This is where the creativity comes into play - people don't see an obvious error, so their imagination gets to work.

Scheduled housekeeping and other such automated cleanup solutions can add to the fun because they can sometimes free up space before somebody notices the problem... so then people may tell you that they are certain that the disk wasn't full because it isn't full now.

So if your test goes dead in the night, either temporarily or permanently, please do check your disk space monitoring (you do monitor your disk space utilisation don't you?). Alternatively, scrutinise your log files very carefully - are there any strange 'silent' periods that should contain messages?
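
If it helps, even a trivial check inside your own tooling can catch this early - for example something along the lines of the sketch below, run periodically against the directory that holds your log files. The path and threshold here are arbitrary examples, and java.io.File.getUsableSpace needs Java 6 or later.

import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // default to a made-up path; pass your real log directory as an argument
        File logDir = new File(args.length > 0 ? args[0] : "/var/log/myapp");
        long freeMb = logDir.getUsableSpace() / (1024 * 1024);
        System.out.println("Usable space for " + logDir + ": " + freeMb + " MB");
        if (freeMb < 500) {
            System.err.println("WARNING: less than 500 MB free - logging may stop silently");
        }
    }
}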

Don't have Nightmares.

Wednesday 28 September 2011

And the Second Prize Goes To...

I've written here before about the virtues of Eclipse MAT for doing heap analysis but I recently encountered a situation in which I couldn't persuade it to give me the answers that I really needed.

I was analysing a set of heap dumps from an application and MAT's leak suspects report showed me that circa 300MB of heap space was being retained by a single object. The object in question was a metadata manager, so it was well known that this object would retain a big chunk of heap space. This overhead was of a fixed size and so not really of interest. The report categorised everything else in a single bucket as 'other', which wasn't much help.

I really needed to know about the most significant objects that were retaining space outside of the retained set of the metadata manager. While it's possible to explore around using MAT views like the histogram and come up with some answers, I didn't feel that this was a very methodical approach and it would be quite easy to miss something important.

I realised that the problem could be helped by applying a little simple set theory - what we are interested in is how a histogram or retained set looks after subtracting the retained set of the number one leak suspect. This is very simple to do, but very repetitive - an ideal job for a script, so that's what I wrote.

So at the bottom of the post is a perl script that will read two files saved as CSV from either the MAT histogram or retained set views and print the result of the subtraction in CSV format to its output. Some views don't automatically calculate the retained size, so you need to select the rows and ask MAT to do at least the minimal retained set calculation before saving the CSV. If you only speak Windows, then you can use the script by installing a suitable perl implementation - I usually use ActiveState's ActivePerl.

I managed to use this quite successfully. I started by subtracting the retained set of the metadata manager from the overall histogram. By continuing to subtract more 'suspects' from the result, I was able to identify the second, third and fourth placed consumers of heap space in my dump files, which I felt was going far enough.

Finally, the script... unfortunately Blogger doesn't seem to allow attachments.



# perl script to subtract one Java retained set from another.
# Retained sets or histograms can be saved in CSV format from Eclipse MAT.
# The retained size is required, so ask MAT to calculate it if it is missing.
#
# usage: perl subtract-retained.pl <base>.csv <subtract>.csv [<subtract>.csv]...
#
# Andy Carlson
# http://threeturves.blogspot.com/
# September 2011
#
use strict;

my ($basefile) = shift;
my (%count,%shallow,%retained,%loaders);
my ($lineno) = 0;

# read the baseline histogram
open (HIN,$basefile) || die "cannot open $basefile: $!";
while (<HIN>) {
    $lineno++;
    next if ($lineno == 1);   # skip the CSV header row
    chomp;
    next if (!$_);            # skip blank lines
    my ($class,$count,$shallow,$retained) = split (/,/);

    if (!$retained && ($retained ne '0')) {
        warn "warning: no retained size for $class at $basefile line $lineno - skipping remainder of file";
        last;
    }

    $loaders{$class}++;
    $count{$class} += $count;
    $shallow{$class} += $shallow;
    $retained{$class} += $retained;
}
close HIN;

# now read one or more retained sets and subtract from the base histogram
while (1) {
    my ($rsetfile) = shift;
    last if (!$rsetfile);
    open (HIN,$rsetfile) || die "cannot open $rsetfile: $!";
    $lineno = 0;
    while (<HIN>) {
        $lineno++;
        next if ($lineno == 1);   # skip the CSV header row
        chomp;
        next if (!$_);            # skip blank lines
        my ($class,$count,$shallow,$retained) = split (/,/);
        if (!$retained && ($retained ne '0')) {
            warn "warning: no retained size for $class at $rsetfile line $lineno - skipping remainder of file";
            last;
        }
        $count{$class} -= $count;
        $shallow{$class} -= $shallow;
        $retained{$class} -= $retained;
    }
    close HIN;
}

# produce the output
print "class,objects,shallow heap,retained heap,loaders\n";
my ($class);
foreach $class (sort keys (%loaders)) {
    print "$class,$count{$class},$shallow{$class},$retained{$class},$loaders{$class}\n";
}
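
For example (the file names here are made up), to subtract the metadata manager's retained set from the full histogram and save the remainder for further analysis:

perl subtract-retained.pl histogram.csv metadata-manager-retained.csv > remainder.csv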

Thursday 18 August 2011

Why does a query change plans when the stats have not changed?

I was recently asked to investigate a performance issue on one of our customer's production systems. Some comparative analysis of AWR reports between 'good' and 'bad' periods showed that the queries being run by the part of the application that had problems were circa 50% slower but what really caught my attention was an unrelated query that had jumped from about 8th place in the AWR response time 'offenders' list up to first place. It was also accounting for huge amounts of logical I/O and a fair bit of physical I/O.

Looking in more detail at this query we got AWR SQL reports that showed its performance during the 'good' and 'bad' periods. These showed that a different plan was in use during the 'bad' period. This plan had a table scan whereas the other plan used an index lookup. It has to be said though that neither plan can have been all that great - if the index plan was 'good' then the query shouldn't have been in the AWR top 10 offenders list at all.

So we knew that the query was changing plans. We've seen this problem before and the usual first step is to go and check the statistics on the table. We did this and found that the stats were just over a week old (i.e. not really 'stale') and that they had not changed at all between the 'good' and 'bad' periods of performance.

So, that brought us back to the frequently asked question 'why does a query change plans when the stats have not changed?'. Richard Foote has written a good article about this and suggests one explanation…

http://richardfoote.wordpress.com/2010/02/16/how-does-an-execution-plan-suddenly-change-when-the-statistics-remain-the-same-in-limbo/

We did some more checking and proved to ourselves that Richard's explanation (that the data being queried is newer than the histograms in the table stats so the optimizer can't get a good cost estimate) did not fit our situation.

Sometimes our data was older than the stats, sometimes newer, so in principle Richard's explanation might fit. We checked the AWR data for this query manually over the past several days. In each AWR snapshot period we found that there was just one plan, and that the plan seemed to be stable for more than 12 hours before switching again. This didn't fit Richard's explanation - the application would typically be hitting a mix of older and newer data, so we should have seen both plans being used during the same period.

We shifted over to a test system and found a couple of key values - one that would access data that was older than the stats and one newer. The shared pool was flushed before each test to force Oracle to 'forget' any previous decisions about the plan. We found that both the older and the newer key values would choose the index plan in spite of the lack of histogram data for the newer key value.
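
The tests boiled down to flushing the shared pool and then timing the query for a chosen key value. A JDBC sketch of that idea is below - this is an illustration only, not our actual test harness: the connection details, table, column and key values are made up, and flushing the shared pool needs suitable privileges and should only ever be done on a test instance.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PlanTest {
    public static void main(String[] args) throws Exception {
        String key = args[0];   // the key value to test with
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//testhost:1521/TESTDB", "testuser", "testpwd");
        try {
            // Force a hard parse so that the bind value is peeked on the next execution
            Statement flush = con.createStatement();
            flush.execute("ALTER SYSTEM FLUSH SHARED_POOL");
            flush.close();

            long start = System.currentTimeMillis();
            PreparedStatement ps = con.prepareStatement(
                    "SELECT child_id FROM circuit_rel WHERE parent_key = ?");
            ps.setString(1, key);
            ResultSet rs = ps.executeQuery();
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            rs.close();
            ps.close();
            System.out.println(key + ": " + rows + " rows in "
                    + (System.currentTimeMillis() - start) + " ms");
        } finally {
            con.close();
        }
    }
}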

We then looked in more detail at the query and how the application used it. We found that it was being used to query a relationship that had very variable cardinality. Sometimes it would find just 2 or 3 related objects, other times (by design) it would find tens of thousands or even hundreds of thousands. We then tried testing using key values that would find low cardinality against ones with high cardinality. When we picked the high cardinality key we found that it picked the table scan plan - so there was the initial answer to the question…

  • Q: 'why does a query change plans when the stats have not changed?'
  • A: 'because the optimizer uses bind variable peeking and can quite correctly decide on different plans because different key values result in large differences in cardinality'

The snag was that the results of our earlier AWR checks meant that this was not the full explanation of our problem - if bind variable peeking were the full explanation then we should have seen both plans being used during each snapshot.

We did some more tests and discovered that if we ran the high cardinality test first and then ran the low cardinality keys immediately afterwards (without flushing the shared pool), the low cardinality key would also use the table scan plan (and run slowly). We could also reverse the two tests and cause the high cardinality key to use the index lookup plan.

The Oracle documentation seems to provide the explanation for this…

http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm

"On subsequent invocations of the cursor, no peeking takes place, and the cursor is shared, based on the standard cursor-sharing criteria, even if subsequent invocations use different bind values"

So basically if the two queries pass the cursor sharing test (which in our case they would) then bind variable peeking is only done for the first run of the query and the optimizer assumes that the plan is good for any subsequent invocations.

So what was happening in our production system was that…
  • Usually the low cardinality query would run and cause the optimizer to choose the index lookup plan.
  • Sometimes the query would be evicted from the library cache by other activity
  • If the query is not in the library cache and the next run of the query has a high cardinality, the optimizer will choose the table scan plan
  • All subsequent invocations of the query would then use the table scan plan until the query was again evicted from the library cache. For low cardinality queries this resulted in the performance issue that we saw.

There are probably lots of potential solutions to this. SQL hints and Oracle 11 SQL profiles spring to mind. In our case upgrading to Oracle 11 or changing the application were not practical, so we created a new composite index that covered more of the 'where' clause and would persuade the optimizer to choose an index lookup regardless of whether the cardinality was high or low.


Friday 29 July 2011

Handy OQL Recipes

I've posted a few times previously on the subject of Java heap analysis and related tools. My current tool of choice is Eclipse MAT.

MAT provides a number of very useful features to allow users to quickly get an overview of what is happening in a heap dump and in many cases to identify the objects responsible for memory leaks. Sometimes, however, these are not enough. There are also situations where I need to ask the same question about multiple heap dumps without the need to go through multiple steps of navigation. This is where Object Query Language (OQL) can be very helpful.

For those who are unfamiliar with it, OQL is (as its name suggests) a language similar to SQL but intended to query collections of objects instead of rows in a database. The MAT project has chosen to implement OQL as a way to perform targeted queries on the objects in a heap dump.

You can find plenty of good documentation with MAT to help you to understand OQL syntax, but I found that there was rather a shortage of examples, so I thought it would be a good idea to post some here. These are all real queries that I have used to diagnose memory consumption issues (and leaks) on real projects. This does, of course, mean that the examples are biased towards the products I use most frequently - Oracle, WebLogic and some open source libraries.

After running these queries, it is often a good next step to ask MAT to calculate the retained size for each entry in the list. Then the results can be saved as a CSV file for further analysis using your favourite spreadsheet application.

JDBC object leaks are a regular cause for concern, so here are some queries to help...

SELECT toString(s.sqlObject.oracle_sql) FROM oracle.jdbc.driver.T4CCallableStatement s
SELECT toString(s.sqlObject.oracle_sql) FROM oracle.jdbc.driver.T4CPreparedStatement s
SELECT toString(poolname), lastSuccessfulConnectionUse, destroyed FROM weblogic.jdbc.common.internal.ConnectionEnv

The first two relate to JDBC statement leaks. Both are basically the same but they work on different types of statement. They may also be useful for result set leaks because the result set tends to hold a reference back to the statement. The third one provides a picture of JDBC connections in a WebLogic connection pool. In particular, it is useful to see which connections have not been used for a long time.

Another WebLogic JDBC Example...

SELECT stmtCache.list.size, toString(poolname) FROM weblogic.jdbc.common.internal.ConnectionEnv

This query also looks at WebLogic JDBC connections, but in this case it is examining the size of the WebLogic JDBC statement cache. Statement caching is normal behaviour (i.e. not a bug) but it can look a lot like a statement leak and can sometimes cause high levels of heap usage, so needs to be tuned appropriately.

Some non JDBC examples now...

select creationTime from weblogic.servlet.internal.session.MemorySessionData

This one looks at WebLogic sessions. In particular we're interested in any that have been hanging around for a long time.

SELECT toString(destination.name) FROM weblogic.jms.client.JMSProducer

This query was written to investigate a JMS connection leak. We used it to find some code that was not closing down its JMS API objects correctly.

Finally, an example that is not specific to WebLogic or Oracle...

SELECT toString(fDocumentURI), toString(docElement.name) FROM org.apache.xerces.dom.DeferredDocumentImpl

Many applications use XML, so this query looks at XML documents that are being retained in the heap.

I hope that this has given you a quick taste of OQL and that you will get some ideas for queries that will help you to solve other problems.