Wednesday 30 November 2011

Using Jstat with the CMS Collector

Several months back I was investigating high CPU utilisation on a customer system. One factor that I noticed was that the Concurrent Mark Sweep garbage collector on one of the Java 5 JVMs seemed to be using 100% of a CPU for much of the day. Checking the GC log file, I found that during this period the CMS collector was running almost continuously.

We also use jstat to monitor the GC behaviour because its output is much easier to interpret than the GC log. Jstat showed total GC times much lower than those shown in the GC logs. I also noticed that the number of 'full GCs' reported by jstat frequently increased by two or a larger even number and that odd numbers appeared only rarely and for short periods of time.

So what does this mean? Why does jstat apparently disagree with the information from the GC logs and why does it seem to report that full collections occur in pairs?

Although I've never seen the answer documented anywhere (no doubt some kind soul will now point it out to me), I had a hunch that I knew what it was. My theory was that jstat was reporting just the application pause time, not the total execution time of the collector.

So I set out to try to prove or disprove my theory by running a simple test application to exercise the java heap and collecting both the GC log file (-verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log) and the jstat (-gcutil -t) output for comparison.

I also did a little reading about the CMS collector to help me to identify the application pause time in the logs. This told me that each CMS cycle causes two application pauses - one for the 'initial mark' phase and one for the 'remark' phase.

Here is the jstat output:-

E:\source\memstress>jstat -gcutil -t 3136 10s

Timestamp S0 S1 E O P YGC YGCT FGC FGCT GCT
13.4 0.00 0.00 58.11 43.79 9.73 8 0.440 0 0.000 0.440
23.5 0.00 0.00 3.35 72.78 9.77 15 0.848 6 0.078 0.926
33.5 0.00 0.00 37.99 72.60 9.77 16 0.896 12 0.313 1.208

So between 13.4 seconds and 23.5 seconds of uptime, jstat reports 0.078s of full GC time.
The GC log is too verbose to reproduce here, even for this simple test, but the relevant messages for the pause phases were:-

15.410: [GC [1 CMS-initial-mark: 223173K(408576K)] 223672K(524224K), 0.0001739 secs]

[1 CMS-remark: 245213K(408576K)] 329218K(524224K), 0.0307485 secs]
19.074: [GC [1 CMS-initial-mark: 274783K(408576K)] 276925K(524224K), 0.0001132 secs]
[1 CMS-remark: 274783K(408576K)] 335949K(524224K), 0.0313580 secs]
21.232: [GC [1 CMS-initial-mark: 287928K(408576K)] 287928K(524224K), 0.0001128 secs]
[1 CMS-remark: 310245K(408576K)] 314122K(524224K), 0.0157990 secs]

So there were three CMS collection cycles in the tenured generation during the 10 second interval covered by the second line of jstat output. This corresponds to six pauses, which is what jstat reports as the 'Full GC count'.

Adding up the pause times, I reached a figure of 0.0779 seconds which agrees nicely with the 0.078 seconds reported by jstat.

So… when used with the CMS collector, jstat seems to report the number of application pauses for garbage collection and the amount of time for which the application was paused. The pause time will probably be much lower than the total amount of GC execution time.