Monday, 30 November 2009

Memory Leak Case Studies Part 1

I’ve been planning to post a little more on memory leak hunting for some time. For some reason (not from my own choice!) it has been a recurring theme of my professional life for the past year or so. What I had in mind was a few brief case studies of real memory leaks and similar Java heap stress conditions that I’ve seen myself and how they were tracked down and fixed. I hope these might help someone faced with similar issues so here goes...

I won’t describe the symptoms that started the investigation for each issue separately – they are usually the same – high CPU usage caused by too much garbage collection activity, sometimes leading up to java.lang.OutOfMemoryError.

If you need any more detail on the tools and techniques used, please see my earlier postings.

The case studies are roughly in chronological order. The first two are below. The rest will be in later postings. I've tried to stick to what actually happened, including any theories that turned out to be wrong. Where I would do things differently with the benefit of hindsight or better tools, I've tried to point this out too.

HSQLDB MemoryNode

A recently developed component was falling over during performance testing. In this case the heap histogram was pretty clear:-

Object Histogram:

Size Count Class description
-------------------------------------------------------
853118944 644449 char[]
44002832 361196 * ConstMethodKlass
26010200 361196 * MethodKlass
18582256 31995 * ConstantPoolKlass
15542520 647605 java.lang.String
14955048 395252 java.lang.Object[]
13799984 31995 * InstanceKlassKlass
13799360 431230 org.hsqldb.MemoryNode
13701992 283429 * SymbolKlass
10192768 24725 * ConstantPoolCacheKlass
8900360 37575 byte[]
7836032 85528 java.util.HashMap$Entry[]
6703232 65651 int[]
6520920 116445 org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry

The key here is org.hsqldb.MemoryNode. Our applications normally use Oracle but one part of the new component made use of an HSQLDB database to keep a record of the transactions that had been received. This caused no problems in development or test, but caused the application to fall over during performance testing because far more log records were being created, all of which were kept in memory by HSQLDB. To make matters worse, HSQLDB had been configured to save the database to a disk file and reload it on startup, so the problem could also build up across several restarts of the application.

In this case we were able to solve the problem by changing the application configuration to disable the logging feature and clearing out all of the accumulated HSQLDB rows.

Oracle T4CVarcharAccessor

A complex EJB based application was failing during performance test at high volumes. The heap histogram showed several Oracle JDBC driver classes high up in the list.

1,171,400,136 265,576 char[]
138,464,419 31,011 byte[]
13,216,708 255,522 int[]
8,717,148 47,897 oracle/jdbc/driver/T4CVarcharAccessor
6,226,876 259,451 java/lang/String
4,418,140 27,273 oracle/jdbc/driver/T4CNumberAccessor
4,402,466 29,138 short[]
2,548,884 17,748 [Ljava/lang/Object;
2,502,556 104,271 java/util/HashMap$Entry
2,075,776 2,221 oracle/jdbc/driver/T4CPreparedStatement
1,982,236 23,840 [Ljava/util/HashMap$Entry;

This problem was more difficult to track down because our code wasn’t creating or running the JDBC statements – this was all happening in an EJB server that we were calling. This was part of a commercial product that we use and we didn't have the source code.

Looking at some heap dumps using the IBM HeapAnalyzer appeared to show that the offending objects were being held in memory by the WebLogic JDBC statement cache. While this could have created symptoms like a small scale memory leak, this explanation didn’t quite make sense because we were seeing far more leaking statements than the cache size. This is a good example of a tool presenting a summarised view of a complex world which can sometimes lead us astray. Eventually we stopped blaming WebLogic and found a different way to look for clues...

The breakthrough came with the use of ‘strings’ and ‘sort’ to analyse the heap dumps. This allowed us to pinpoint a piece of SQL that was repeatedly appearing on the heap and not being garbage collected. The problem turned out to be a simple program logic bug in the EJB server – it worked just fine in the majority of cases when it generated and ran just one JDBC statement for each client call. On some calls, however, it would generate and run two statements. When this happened it would only call close() on the most recent one, causing the JDBC driver to leak the resources belonging to the earlier statements. We raised a bug with the product vendor and after several calls to explain the issue and our diagnosis, they shipped us a fix.
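
To make the bug pattern concrete, here is a minimal sketch of its shape. We never saw the vendor's source, so this is my own reconstruction and not their code. TrackedStatement is a hypothetical stand-in for a JDBC statement so that the example runs without a database, and the try-with-resources form used in the fix needs Java 7 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a JDBC statement: it only counts open handles. */
final class TrackedStatement implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    private boolean closed;

    TrackedStatement(String sql) {
        OPEN.incrementAndGet(); // a real driver would allocate buffers and accessors here
    }

    @Override
    public void close() {
        if (!closed) {
            closed = true;
            OPEN.decrementAndGet();
        }
    }
}

final class StatementLeakDemo {
    /** Buggy shape: one variable is reused, so only the last statement gets closed. */
    static void leaky(List<String> sqls) {
        TrackedStatement current = null;
        try {
            for (String sql : sqls) {
                current = new TrackedStatement(sql); // the previous statement is dropped unclosed
            }
        } finally {
            if (current != null) {
                current.close();
            }
        }
    }

    /** Fixed shape: each statement is closed in the scope that created it. */
    static void fixed(List<String> sqls) {
        for (String sql : sqls) {
            try (TrackedStatement stmt = new TrackedStatement(sql)) {
                // run the statement and read its results here
            }
        }
    }

    public static void main(String[] args) {
        List<String> twoStatements = Arrays.asList("SELECT 1 FROM DUAL", "SELECT 2 FROM DUAL");
        leaky(twoStatements);
        System.out.println("open after leaky(): " + TrackedStatement.OPEN.get());
        TrackedStatement.OPEN.set(0);
        fixed(twoStatements);
        System.out.println("open after fixed(): " + TrackedStatement.OPEN.get());
    }
}
```

With a single statement per call both versions behave the same, which is why this kind of bug can survive normal testing.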

Finding the root cause for this issue took several weeks. It would have been a lot easier with Eclipse MAT.

Thursday, 1 October 2009

A Better Heap Analysis Tool

Sometimes I wonder how a really useful tool can be out there for over a year before I find it. I can't answer that question. I can tell my readers a little about it...

I recently found the Eclipse Memory Analyzer project (or 'MAT'). This is a tool which combines most of the best features of IBM's HeapAnalyzer, features from SAP and JHat (for example, OQL) plus a bunch of other useful things into a single tool. It also indexes your heap dump and saves the indexes so that when you reopen the same dump later it loads very quickly.


MAT uses the notion of a 'Dominator' which I suspect comes from the SAP contributions - this is an object which if removed would free up a bunch of other objects. I think this is slightly different from the IBM HeapAnalyzer's notion of 'owner'. Having used the IBM tool, it takes a little bit of effort to get used to MAT's Dominator concept, but once you do this it makes perfect sense. So far I've found that I miss the freedom with which you can navigate up and down the owner/parent/child relationships in HeapAnalyzer - MAT makes you use a few more clicks to do this. On the whole though, it is a much better tool than its predecessors.
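
A toy sketch may help with the idea of what an object 'would free up if removed'. This is not how MAT computes anything (I haven't looked at its algorithms); it is just the definition applied directly to a tiny hand-made object graph. The object names and the sampleGraph() helper are invented for the illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RetainedSetDemo {
    /** Objects reachable from the roots, pretending 'removed' does not exist (null removes nothing). */
    static Set<String> reachable(Map<String, List<String>> refs, Collection<String> roots, String removed) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(roots);
        while (!todo.isEmpty()) {
            String obj = todo.pop();
            if (obj.equals(removed) || !seen.add(obj)) {
                continue; // skip the removed object and anything already visited
            }
            todo.addAll(refs.getOrDefault(obj, Collections.<String>emptyList()));
        }
        return seen;
    }

    /** Everything that would become unreachable if 'obj' were removed, including 'obj' itself. */
    static Set<String> retainedSet(Map<String, List<String>> refs, Collection<String> roots, String obj) {
        Set<String> retained = new TreeSet<>(reachable(refs, roots, null));
        retained.removeAll(reachable(refs, roots, obj));
        return retained;
    }

    /** Invented example: 'b' is referenced from both 'cache' and 'session'. */
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("root", Arrays.asList("cache", "session"));
        refs.put("cache", Arrays.asList("a", "b"));
        refs.put("session", Arrays.asList("b"));
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(retainedSet(sampleGraph(), Arrays.asList("root"), "cache"));
    }
}
```

In the sample graph, removing 'cache' frees 'cache' and 'a' but not 'b', because 'session' still refers to 'b'.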

The thing I really like about MAT is its ability to do String frequency analysis (plus other useful stuff like finding sparsely populated collections) and to be able to focus this analysis onto children of a particular class or a particular part of the heap (or both!) - no need for separate analysis using 'strings' any more.

MAT is available either as an Eclipse plugin or a standalone tool. My own preference is for the standalone tool version - I have way too many plugins in my Eclipse workbench already and for heap analysis I prefer to have lean and mean tools so that memory is kept free for doing the real work.

MAT comes with excellent tutorials and also a blog containing good ideas about how to use it to track down those pesky memory leaks.

I think this will be my tool of choice from now on.

Friday, 18 September 2009

How to find Java Memory Leaks - Part 3

In the previous post I looked at how to get a heap histogram and how to use it to help to understand the cause of a memory leak. While a heap histogram will tell you what is leaking, it doesn't provide many clues about why. For the (probably) final installment, I will look at how to perform a more in-depth analysis using a heap dump. This is most definitely in the 'advanced' category of debugging.

A heap dump is a diagnostic file containing information about the entire contents of the heap. As a result, it can be a pretty big file - very roughly it will be about the same size as the space used on the heap. Before going any further you need to make sure that you have enough disk space to allow the heap dump to be written. If you are going to analyse it somewhere else (which is probably a good idea) then you need to check that you will be able to transfer such a big file.

Next, my usual warning - this level of debugging is very intrusive and will hang the JVM for several seconds so you should reproduce your issue on a test environment and take heap dumps there.

There are several ways to get a heap dump... and the options depend on which JVM version you have. Because I often need to work with 1.4 and 1.5, I usually use a JVM startup switch -XX:+HeapDumpOnCtrlBreak . Depending on your JVM version, you can also trigger a heap dump using jmap (e.g. jmap -heap:format=b <pid> on Java 1.5, or jmap -dump:format=b,file=<file> <pid> from Java 6), jconsole or via JMX. I've seen some very long pauses (several minutes) when using jmap like this, so be warned. There is another JVM startup switch -XX:+HeapDumpOnOutOfMemoryError which should be fairly self-explanatory (note the '+' - a '-' in that position switches the option off). I've never used this myself because I usually want to control when heap dumps are taken and am always worried that triggering a heap dump when the JVM is already in trouble might make things even worse.
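
For the JMX route, here is a rough sketch of asking the running JVM to dump its own heap. It assumes a HotSpot-based JVM that provides com.sun.management.HotSpotDiagnosticMXBean (Java 6 onwards, so not the 1.4 and 1.5 JVMs mentioned above), and it uses Java 8 library classes for brevity. The temp directory and the file name snapshot.hprof are my own choices for the example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumpSketch {
    /** Writes a heap dump of the current JVM into a fresh temp directory and returns its path. */
    static Path dumpToTempDir() {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            Path dir = Files.createTempDirectory("heapdump");
            Path file = dir.resolve("snapshot.hprof"); // the target file must not already exist
            bean.dumpHeap(file.toString(), true);      // true = dump only live (reachable) objects
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap dump written to " + dumpToTempDir());
    }
}
```

The same warnings apply as for the other methods: the JVM pauses while the dump is written and the file can be as big as the used heap.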

So having got your heap dump, how do you analyse it? The most frequently quoted tool is jhat from Sun. This will read your heap dump file and start a web server. You can then connect a browser to it to analyse your heap dump. I'm not going to say much more about jhat because I have not used it very often.

Naturally there are commercial tools but I won't cover them here. My own (free) tool of choice is called HeapAnalyzer from IBM. This is a Java Swing application which provides most of the capabilities available in jhat (with the notable exception of OQL) and also provides some useful power tools for quickly homing in on a memory leak - basically this tool has the feel of having been written by people who have actually spent some serious time tracking down memory leaks.

Whichever tool you choose, you will probably need to make sure that the tool itself has enough heap space by using the -Xmx option with a suitably high value when starting the tool.

Here is a screenshot of HeapAnalyzer just after opening a heap dump.



If you're lucky then the first view that HeapAnalyzer shows will highlight your leak in blue. Before we look at this in more detail, let's look at some other views that will help us to understand what HeapAnalyzer is telling us.

View/Root List - this shows all of the 'top level' objects that will stay on the heap without being referred to by another object. This view (and the others) is sorted by 'total size' which needs a little explanation. This number is not the same as the total size of the object on the heap. HeapAnalyzer tries to do something more useful than that...

HeapAnalyzer organizes all objects as 'parent' and 'child' based on their references (the target of a reference is called the 'child'). The total size is the sum of the sizes of the object itself and all of the children that it owns. This is useful and often works in a way that seems natural, but since the heap is just a bunch of objects that refer to each other, it doesn't always present things in the way that you might expect, especially in the presence of circular reference chains (which happen quite a lot). It also tracks which objects have already been visited when working this out, so if an object has more than one 'parent', its size will only be counted under one of them, which HeapAnalyzer nominates as the 'owner'. HeapAnalyzer has to make a fairly arbitrary choice of which parent to count as the owner which may or may not be what you would consider to be 'correct'. Keep this final point in mind when using HeapAnalyzer.
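
To show why the choice of owner matters, here is a toy sketch of a 'first parent to visit wins' calculation. I don't know HeapAnalyzer's real algorithm, so treat this purely as an illustration of the behaviour described above. The object names and sizes are invented, and the recursion would be far too naive for a real heap.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class OwnerTotalSizeDemo {
    /** Total size per object, counting each child only under the first parent that reaches it. */
    static Map<String, Long> totalSizes(String root, Map<String, List<String>> refs, Map<String, Long> sizes) {
        Map<String, Long> totals = new LinkedHashMap<>();
        visit(root, refs, sizes, new HashSet<String>(), totals);
        return totals;
    }

    private static long visit(String obj, Map<String, List<String>> refs, Map<String, Long> sizes,
                              Set<String> visited, Map<String, Long> totals) {
        visited.add(obj);
        long total = sizes.getOrDefault(obj, 0L);
        for (String child : refs.getOrDefault(obj, Collections.<String>emptyList())) {
            if (!visited.contains(child)) { // already owned by an earlier parent? then skip it
                total += visit(child, refs, sizes, visited, totals);
            }
        }
        totals.put(obj, total);
        return total;
    }

    /** Invented graph: 'a' and 'b' both refer to one large 'shared' object. */
    static Map<String, Long> demo(List<String> rootChildren) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("root", rootChildren);
        refs.put("a", Arrays.asList("shared"));
        refs.put("b", Arrays.asList("shared"));
        Map<String, Long> sizes = new HashMap<>();
        sizes.put("root", 10L);
        sizes.put("a", 10L);
        sizes.put("b", 10L);
        sizes.put("shared", 100L);
        return totalSizes("root", refs, sizes);
    }

    public static void main(String[] args) {
        System.out.println(demo(Arrays.asList("a", "b")));
        System.out.println(demo(Arrays.asList("b", "a")));
    }
}
```

Swapping the order in which the root's children are visited moves the 100 bytes of 'shared' from 'a' to 'b', even though the heap itself is identical.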

View/Type List - this is very similar to the info provided by the heap histogram which I described in part 2. HeapAnalyzer also adds its 'total size' column. This is a view that I use a lot - once I find a class that is of interest, I can right click on it and select 'Find Same Type' which takes me to the...

View/Object List - this shows a list of object instances, again ordered by 'total size'. Once you have found an object which is of interest, you can right click and select 'Find Object in Tree View' to jump to the...

View/Tree View - this is probably the most useful view of all. It allows you to see an object in the context of its parents (or to be precise, owner) and children. Children are ordered by total size, so the ones that HeapAnalyzer thinks are using the most space show up at the top of the list of children. The tree view also allows you to view (in the right hand pane) the values of each attribute of the object.

Be careful with the tree view - keep in mind that it is presenting a simple view of something which is actually more complex than it appears. Each object can have multiple parents but the tree view can only show one of them (the one HA picked as the 'owner'). You can see the other parents by right clicking the object and selecting 'List Parents'.

And finally the tools - by right clicking an object in the tree view we have the option to 'go to the largest drop in subtrees' - i.e. find the point in the tree of children that is accounting for the most heap space. At the top of the view we have the 'Subpoena Leak Suspects' tool which will jump to the objects that HA has decided are the most likely candidates as leaking objects. This brings us back to the initial view that I described above because this is where HA will go to when the heap dump is first opened.

HeapAnalyzer also comes with a reasonably comprehensive help page which describes most of the key features.

So what can HA tell us? Unfortunately it still can't tell us why we have a memory leak. What it can do is allow us to home in on the leaking objects and understand which other objects are holding references to them and keeping them in the heap. Working out why we have a leak is something that we have to use our own brains to do, for example by figuring out where the code should be resetting a reference and making the objects eligible for garbage collection. This may be both difficult and time consuming. You will also need to decide which 'leaks' are actually object caches which are working as intended and eliminate these from your list of suspects... maybe. Sometimes object caches may be incorrectly tuned or misbehaving so they may really be the cause of the leak.

Finally I'd like to mention one more (and rather simpler) technique. As I observed earlier, string data is usually the thing which takes up most heap space. Looking at the content of those strings may provide a better clue about your memory leak. So use the Unix 'strings' command on the heap dump, followed by 'sort'. This will give you a big text file which you can analyze to find out which are the most commonly occurring strings. I've used this in the past to track down a JDBC related leak by finding the most common SQL statements on the heap.
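
If you would rather not wade through the sorted file by eye, the counting step is easy to automate. Here is a small sketch that reads the output of 'strings' on standard input and prints the most frequent lines. The class name and the cut-off of 20 are arbitrary choices of mine, and it needs Java 8 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StringFrequency {
    /** Returns the 'limit' most frequent lines, most frequent first (ties in alphabetical order). */
    static List<Map.Entry<String, Integer>> top(Iterable<String> lines, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) throws IOException {
        // Intended use: pipe the output of 'strings' on the heap dump into this program.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            for (Map.Entry<String, Integer> e : top(in.lines()::iterator, 20)) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }
}
```

Only the distinct strings and their counts are held in memory, but for a very large dump you will still want to give this a generous -Xmx.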

There are a few pitfalls with the 'strings' approach:-
  • Double byte character sets - try using different flavours of the '-e' switch to do another run of 'strings' to pick up double byte strings.
  • Multi-line strings - what may be a single string object in Java will become multiple lines in your text file. Sorting the text file will then redistribute these lines so that parts of the same Java string are widely separated.
  • Relating the content to anything in HeapAnalyzer - I haven't found a good way to do this. It would be nice if HA had a string content search feature.
So that's it - you now have a kit bag of tools and techniques for tracking down a Java memory leak. It probably won't be easy, even with these tools, so I wish you the best of luck.

Tuesday, 11 August 2009

How to find Java Memory Leaks - Part 2

In the previous post I looked at how to monitor your application in test or production to observe its heap behaviour and understand whether or not you have a memory leak. In this post I will start to look at the next step - how to gather data to help to trace the root cause of your leak.

In some simple cases it is possible to use inspection of the source code, a debugger or trace statements to figure out what is going wrong by spotting where the developer has made a mistake. If this works for you, then that's great. I want to focus on the more complex situations where you are dealing with lots of code and lots of heap objects that may not be yours and you can't just go straight to a piece of Java and see what the problem is.

To understand a memory leak we usually need to use tools which are very different from those used to diagnose other issues. For most issues we want to understand the dynamic behaviour of our code - the steps that it performs, which paths it takes, how long it takes, where exceptions are thrown and so on. For memory leaks we need to start with something completely different - namely one or more snapshots of the heap at points in time which can give an insight into the way it is being used up. Only once we have this picture can we start to work back to the code to understand why the heap is being used up.

Before going any further, a small warning - most of the techniques from here on are not well suited to production environments. They can hang the JVM for anything between a few seconds and a few minutes and write large files to disk. You really need to reproduce your issue in a test environment and use these tools there. In a crisis, you may be allowed to use them in production if your app is already under severe heap stress and is about to be killed or restarted anyway.

The simplest type of snapshot is a heap 'histogram'. This gives you a quick summary (by class) of which objects are consuming the most space in the heap including how many bytes they are using and how many instances are in the heap.

There are a couple of ways to get a histogram. The simplest is jmap - a tool included in some JDK distributions from Java 1.5 onwards. Alternatively, you can get a histogram using a JVM startup argument -XX:+PrintClassHistogram and sending your app a Ctrl-Break (Windows) or kill -3 (Unix) signal. This JVM option is also settable via JMX if your JVM is recent enough.

Here is an example (from a test app that deliberately leaks memory)...

andy@belstone:~> jmap -histo 2937
Attaching to process ID 2937, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_09-b03
Iterating over heap. This may take a while...
Object Histogram:

Size Count Class description
-------------------------------------------------------
65419568 319406 char[]
7472856 311369 java.lang.String
1966080 61440 MemStress
596000 205 java.lang.Object[]
64160 4010 java.lang.Long
34248 13 byte[]
32112 2007 java.lang.StringBuffer
10336 34 * ObjArrayKlassKlass
4512 47 java.lang.Class
4360 29 int[]
(snip)
8 1 java.lang.reflect.ReflectAccess
8 1 java.util.Collections$ReverseComparator
8 1 java.util.jar.JavaUtilJarAccessImpl
8 1 java.lang.System$2
Heap traversal took 42.243 seconds.
andy@belstone:~>

There are a few things to note here:-
  • Using this tool will hang the JVM until it has finished the histogram dump - in this case for 42 seconds. Don't try this if you can't afford to hang your JVM.
  • The objects highest in the list are java Strings and their associated character arrays. This is very common. You may also see other built in types (e.g. collections) near the top of the list.
  • What you should usually look for are the classes which appear highest in the list over which you actually have some direct control. This should give you the best clue about what types of object are involved in your memory leak.
In the (rather contrived) example above, the class 'MemStress' looks like it could be the real consumer of the heap space and indeed it is. Even though we've found it, most of the space it is using is being used indirectly - in this case by the Strings it uses to hold its instance variables. We were fortunate that MemStress objects on their own are big enough to push the class high enough up the list to be noticed. In some cases you may not be so lucky - instances of the main culprit may be quite small and it may therefore be hiding lower down the list. Try importing the list into a spreadsheet and sorting by instance count to see if that offers any additional clues.

It's usually best to have several heap histograms from the same run of your app at different points in time. Comparing these will give you a much better picture of how your memory leak is building up over time. Try to arrange to trigger them so that your app is in a similar state (usually idle) for each dump - then you will be comparing 'like for like'.
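
Comparing histograms by eye gets tedious, so here is a sketch of a small diff tool. It assumes the 'Size Count Class description' layout shown in the example above (later jmap versions print the columns differently) and simply skips any line that doesn't start with two numbers. The class and method names are my own.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HistogramDiff {
    /** Parses lines of the form "<bytes> <count> <class description>" into class -> bytes. */
    static Map<String, Long> parseBytesByClass(List<String> lines) {
        Map<String, Long> bytes = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 3);
            if (parts.length < 3 || !parts[0].matches("\\d+") || !parts[1].matches("\\d+")) {
                continue; // headers, banners, "(snip)" and timing lines
            }
            bytes.merge(parts[2], Long.parseLong(parts[0]), Long::sum);
        }
        return bytes;
    }

    /** Growth in bytes per class between two snapshots, biggest growth first. */
    static Map<String, Long> growth(Map<String, Long> before, Map<String, Long> after) {
        List<Map.Entry<String, Long>> deltas = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            deltas.add(new AbstractMap.SimpleEntry<String, Long>(e.getKey(), delta));
        }
        deltas.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
        Map<String, Long> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : deltas) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }
}
```

Classes that only appear in the later snapshot are counted as growing from zero. Classes that have disappeared completely are ignored, which is fine when you are hunting for growth.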

A heap histogram may be all you need to figure out what's going wrong, in which case you don't need to worry about the next step.

What a histogram can't tell you is why each object is staying in the heap. This happens because at least one other object is still holding a reference to it. A full heap dump has the information needed to trace these references and also look in detail at the values of individual attributes in every object on the heap. It's also a big step up in size and complexity from a heap histogram. I'll look at how to take and analyse a full heap dump in the next installment.

Monday, 10 August 2009

How to find Java Memory Leaks - Part 1

Java heap consumption and memory leak diagnosis is something that (judging by other posts around the net) is often misunderstood - I've lost count of the number of searches that have turned up advice to simply increase the heap size and hope the problem goes away.

Let's address this one question straight away - if you really have a memory leak, increasing the heap size will at best only delay the appearance of an OutOfMemoryError. If your issue is just a temporary spell of high demand for heap memory then a bigger heap may help. If you take it too far you may see much worse performance because of swapping or even crash your JVM.

So what should you do? In essence, I would recommend three steps:-

1. Monitor your application in test and/or production to get an understanding of its heap behaviour.
2. If you really do have a problem, you need to gather data to help to track down the cause.
3. Finally having got the data, you need to figure out what it means - what is causing your problem and how to fix it.

Sadly these steps are not always easy to do - particularly step 3.

I'll be talking about the Sun JVM from here on, although many of the same principles and sometimes tools apply to other JVM implementations.

First let's look at monitoring...

To get a clear picture, you really need to look in detail at the heap stats from your JVM, but I quite often find that these are not available when I'm first asked to look at a problem, so what can we tell without this?

Clearly if you see this in your application logs then there is a problem...

java.lang.OutOfMemoryError: Java heap space

The usual response if you have real users is to restart the application straight away. You should then continue to monitor because you're probably going to need to know more before you get to the end of the diagnosis.

I often start by looking at CPU consumption using top, vmstat or whatever tools are available. What I'm looking for is periods where the Java app is using close to 100% of a single CPU. While it's not conclusive proof (being stuck in a loop could cause the same effect), this often means that the app is spending a lot of time doing full garbage collections - the most commonly used garbage collectors are single threaded and therefore tend to use 100% of a CPU.

Monitoring memory usage from the operating system is not very informative - Java will typically grow the heap to the maximum size long before there is a problem. If you have oversized your heap compared to physical memory you will probably see heavy swapping activity, but that's a different issue.

Some applications and app servers will make calls to java.lang.Runtime.freeMemory(). While this may be better than nothing, it's fairly uninformative and provides very little info about which generations have free space, how much time is being spent garbage collecting and so on.

The JVM can provide much more detailed info about what is going on. The JVM itself provides two main ways to get hold of this:-

1. JVM Startup arguments to write information to log files. Here are some suggestions, but check the docs for your specific JVM version (or try here) because the available options vary:-

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution

You may also want to add -Xloggc:myapp-gc.log to send your GC info to a dedicated log file

You can then analyse the log files to produce a graph using a tool such as gcviewer. This approach gives you detailed information about every GC run but be careful to check that you have enough disk space - these options can produce a lot of data. Typically you will also need to restart your app to enable these options, although they are also controllable via JMX for more recent JVM versions.

2. JDK tools - jvmstat in Java 1.4 and jstat in Java 1.5 and higher. These are tools that you run from the command line. They connect to your Java app via one of the JVM debugging APIs and will then report on heap and GC stats according to the options you specify. I usually use jstat -gcutil -t <pid> 60s which produces a line of output once per minute showing the percentage utilisation of each heap space, plus the number of young and full GC events and the cumulative GC times. You can then open the resulting text file in Excel for analysis. Note that jstat is capable of monitoring Java 1.4 VMs, which is handy if you have multiple versions. Alternatively (if your environment allows) you can try visualgc for an instant GUI view but I prefer using jstat to collect logs for offline analysis.

There are other commercial tools (e.g. Wily Introscope) which can provide the same data as the JDK tools. They may provide better visualisation, historical reporting and perhaps proactive alerting but I don't intend to cover them in detail here because most readers probably don't have them.

So what should you look for to know whether you have a problem?
  • OutOfMemoryErrors in your application logs - clearly you have a problem if you are getting these.
  • A persistent upwards trend in the Old Generation usage after each full garbage collection.
  • Heavy GC activity. What 'heavy' means is rather dependent on your application. A rule of thumb for a non-interactive app might be 15 seconds of GC time during each minute - which means that your app only has 75% of its time left to do useful work. You might choose a lower figure or look at the duration of individual full GC runs, especially if your app has real users waiting for it.
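
The 'seconds of GC per minute' check is easy to compute from a jstat log. The sketch below assumes output from jstat -gcutil -t, i.e. a header line containing Timestamp and GCT columns followed by one sample per line, and finds the columns by name because the other columns vary between JVM versions. The class name is my own invention.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GcOverhead {
    /** Fraction of elapsed time spent in GC between each pair of consecutive samples. */
    static List<Double> perInterval(List<String> jstatLines) {
        List<String> header = Arrays.asList(jstatLines.get(0).trim().split("\\s+"));
        int tsCol = header.indexOf("Timestamp");
        int gctCol = header.indexOf("GCT");
        if (tsCol < 0 || gctCol < 0) {
            throw new IllegalArgumentException("expected Timestamp and GCT columns in the header");
        }
        List<Double> fractions = new ArrayList<>();
        double prevTs = 0;
        double prevGct = 0;
        boolean havePrev = false;
        for (String line : jstatLines.subList(1, jstatLines.size())) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length <= Math.max(tsCol, gctCol)) {
                continue; // blank or truncated line
            }
            double ts = Double.parseDouble(cols[tsCol]);   // seconds since JVM start
            double gct = Double.parseDouble(cols[gctCol]); // cumulative GC seconds
            if (havePrev && ts > prevTs) {
                fractions.add((gct - prevGct) / (ts - prevTs));
            }
            prevTs = ts;
            prevGct = gct;
            havePrev = true;
        }
        return fractions;
    }
}
```

A value of 0.25 for an interval corresponds to the 15 seconds per minute rule of thumb above.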
Finally, some pictures. Here is a graph showing jstat data from an app on WebLogic 9.2 with a heap memory leak - you can see that the Old Generation usage is climbing, even after garbage collection.

Here is another graph from the same app. The original leak has been fixed and there is now no evidence of a cumulative memory leak, but there are some intermittent spikes in heap usage. If the spike is severe enough it can still cause an error. Tracking this problem down will need some knowledge about what unusual types of system activity are occurring at the time of the spike.


Here is a graph drawn by gcviewer after using JVM startup arguments to produce a log file. The graph is quite 'busy' and includes a lot of extra info - for example this app's arguments allow the heap to grow as more is needed. The graph shows the growth in overall heap size as well as the growth in consumption. This info is available using jstat too, but if you really need it all then using gcviewer will probably be more convenient.

In the next post I'll look at how to gather data to help trace the cause of your memory leak.

Tuesday, 28 July 2009

Java Heaps and Virtual Memory - Part 2

The story so far... in my earlier post I described a memory stress test that demonstrated the soundness of the advice to keep your Java heap size smaller than your physical memory.

I also wanted to check on the very limited explanations that I'd found and get a better understanding of what was going on. In particular, I wanted to know how Java (i.e. the Sun JVM) allocates heap memory and whether it adopts any strategies to avoid swap thrashing.

I did some further digging using the various things under the /proc file system and the JDK source code to find out.

The first surprise was in /proc/meminfo - the only counter that was going up significantly during the test was 'Mapped' - i.e. memory mapped files. I was expecting this approach to be used for reading in .jar files and native libraries (and it was), but I wasn't expecting this for the heap. Digging into the source code explains why - the JDK uses the mmap system call to request more heap memory from the O/S.

I also took several snapshots of /proc/PID/smaps to see exactly what memory regions were being used in the process's address space. What this showed was:-
  1. There was a memory region (in my case starting from 0x51840000) that was clearly growing as the app allocated more and more heap. During the early part of the app's execution this would show up with a resident size 7MB less than its overall size and with all of the resident pages showing up as being dirty.
  2. Once memory started to become scarce, many of the other memory regions started to show a reduction in their resident sizes and their shared sizes.
  3. Once swap thrashing was happening, the memory region which had been growing still had a 6-7MB difference between the resident size and the allocated size. The big difference, however, was that 15MB of the space was now showing up as 'Private_Clean'.
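
I compared the snapshots by hand, but the numbers for one region can be pulled out with a few lines of code. This sketch assumes the usual smaps layout (a header line starting with the address range, followed by 'Name: value kB' lines) and picks out a region by its start address. The class name is invented for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SmapsRegion {
    /** Returns the kB counters (Size, Rss, Private_Clean, ...) for the region starting at the given hex address. */
    static Map<String, Long> fieldsFor(List<String> smapsLines, String startAddressHex) {
        Map<String, Long> fields = new LinkedHashMap<>();
        boolean inRegion = false;
        for (String line : smapsLines) {
            if (line.matches("[0-9a-f]+-[0-9a-f]+ .*")) {
                // Region header, e.g. "51840000-55840000 rw-p 00000000 00:00 0"
                inRegion = line.startsWith(startAddressHex + "-");
            } else if (inRegion) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length >= 2 && parts[0].endsWith(":") && parts[1].matches("\\d+")) {
                    fields.put(parts[0].substring(0, parts[0].length() - 1), Long.parseLong(parts[1]));
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SmapsRegion /proc/<pid>/smaps 51840000
        System.out.println(fieldsFor(Files.readAllLines(Paths.get(args[0])), args[1]));
    }
}
```

Subtracting Rss from Size gives the 'allocated but not resident' figure discussed above, and Private_Clean and Private_Dirty show how the resident part is split.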
So what does it all mean? Here's what I think is happening...
  1. During the early stages of execution the app is getting as much memory as it asks for but Linux delays giving it real physical memory until the specific pages are really accessed. This explains why the resident size is less than the allocated size - Java has probably extended the heap but hasn't yet accessed all of the allocated space.
  2. Memory is getting scarce, so Linux starts to reclaim pages that have the smallest impact. In the first instance it is hunting around for less critical pages (e.g. pages of jar files or libraries that haven't been used recently) that it can reclaim.
  3. This behaviour surprised me a little - I was expecting the resident size of the heap to have reduced, but this doesn't seem to have happened. What we can see is that part of the heap is now 'clean' - this tells us that Linux has indeed flushed part of the heap out to the swap file. The fact that the resident size has not reduced significantly tells us that we aren't getting much benefit - basically I think that the swapper is trying to swap pages out but the garbage collector is pulling them all back in again.
Finally I went back to the JDK sources again to see whether these would help me to understand what was going on. What I really wanted to understand was where the per-object data used by the garbage collector resides. The answer appears to be that it resides at the beginning of the memory block allocated to the object itself. The implication of this is that with 'normal' sized objects, the garbage collector run will need to access a few fields at the start of every single object on the heap, thus generating read and write accesses to practically every page contained in the heap.

So it would seem to me that the design choices in terms of the in-memory layout of objects and their garbage collector data mean that the garbage collectors really do conflict with the swapper once memory becomes tight. Based on the simple test that I did earlier, this happens both suddenly and with a severe impact.

In real life situations there may be several other Java and non-Java apps running on the same machine. I think this has a couple of implications:-
  1. The requirements of other apps may mean that memory becomes scarce much sooner - i.e. well before your Java heap size reaches the amount of physical memory.
  2. The swapper is not redundant - there may be plenty of 'low risk' pages belonging to other apps (or JAR mappings used only at startup time) that can be swapped out before the system gets to the point of swap thrashing.

Java Heaps and Virtual Memory - Part 1

Virtual memory has been around for a long time - Wikipedia reckons that it was first introduced in the early 1960s and it's still with us today. When we start using Java for large scale applications, however, it seems that virtual memory is perhaps not such a good thing. Several sources around the Internet recommend sizing the Java heap so that it fits within physical memory. The reason given is that the Java garbage collector likes to visit every page, so if some pages have been swapped out the GC will take a long time to run.

This question has cropped up on several occasions in my current project. While I have no reason to disagree with the advice on heap sizing, I was a little uncomfortable that I hadn't seen much real evidence to back it up or indication of how bad things would be once the limit was reached, so I decided to find out for myself.

The first thing I tried was creating a simple Java class to stress the heap. This class will progressively populate an ArrayList with a large number of Java objects each owning five 80 byte random strings. It can also be asked to 'churn' the objects by selecting and replacing groups of them, thus making the old ones eligible for garbage collection. I ran this on a small Linux box and watched what happened using 'top' and 'vmstat' ...
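
For anyone who wants to try something similar, here is a rough sketch of that kind of stress class. It is written from the description above and is not the original test code. The batch sizes and defaults are arbitrary, and I have used 80-character strings (a Java char takes two bytes, so the footprint per string is rather more than 80 bytes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class HeapStressSketch {
    /** One heap object owning five random 80-character strings. */
    static final class Payload {
        final String[] fields = new String[5];

        Payload(Random rnd) {
            for (int i = 0; i < fields.length; i++) {
                fields[i] = randomString(rnd, 80);
            }
        }
    }

    private final List<Payload> objects = new ArrayList<>();
    private final Random rnd = new Random();

    static String randomString(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    /** Adds 'count' new objects, all of which stay reachable from the list. */
    void populate(int count) {
        for (int i = 0; i < count; i++) {
            objects.add(new Payload(rnd));
        }
    }

    /** Replaces a group of objects at a random position, making the old ones garbage. */
    void churn(int groupSize) {
        if (objects.isEmpty()) {
            return;
        }
        int start = rnd.nextInt(objects.size());
        int end = Math.min(start + groupSize, objects.size());
        for (int i = start; i < end; i++) {
            objects.set(i, new Payload(rnd));
        }
    }

    int size() {
        return objects.size();
    }

    public static void main(String[] args) {
        int batches = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        boolean churn = args.length > 1 && Boolean.parseBoolean(args[1]);
        HeapStressSketch stress = new HeapStressSketch();
        for (int i = 0; i < batches; i++) {
            stress.populate(10_000);
            if (churn) {
                stress.churn(1_000);
            }
            System.out.println("objects on heap: " + stress.size());
        }
    }
}
```

To reproduce the swapping behaviour you would need to raise the number of batches and set -Xmx above the physical memory of the test machine, so only try that on a box you don't mind grinding to a halt.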

What I found was this...
  1. While there was plenty of free memory, the resident size of the process grew.
  2. Once free memory became short, the shared size started to shrink.
  3. Very soon after that, the swap file usage started to grow.
  4. If the 'churn' feature of the stress test was enabled, the system quickly got into heavy swap thrashing and the stress test ground to a halt.
  5. With no churn (probably not realistic for most real apps), the app could get a little further, but not much and would still get into swap thrashing.
My original intention was to capture some numbers and draw a graph or two to illustrate what happens. In practice what I found was that the results were rather variable, even on the same machine. In every case though there was a point soon after swapping started where the test tipped dramatically into swap thrashing and was unable to make any further progress.

I drew two conclusions from my simple test:-
  1. The advice to keep the Java heap smaller than physical memory is very sound.
  2. The degradation in performance if you let your Java heap grow bigger than physical memory is both sudden and severe.
I also wanted to check on the very limited explanations that I'd found and get a better understanding of what was going on. In particular, I wanted to know how Java (i.e. the Sun JVM) allocates heap memory and whether it adopts any strategies to avoid swap thrashing. I'll save this for a later post.