Showing posts with label Monitoring.

Friday, 27 April 2012

How to cause a Native OOM

Why would you want to do that? - a java.lang.OutOfMemoryError (OOM) is a bad thing. A native OOM error is even worse because it means that the JVM has no more memory for its own internal purposes, and it results in the immediate death of the Virtual Machine, so in most situations you would not want to cause one deliberately.

In our case, we have a good reason to want to cause one - we have been seeing very occasional native OOM errors with part of our application. They are so rare that they are proving very difficult to diagnose. We would like to improve our monitoring to capture better data for diagnosis but to do so, we need to be able to test our monitoring.

Causing a Java heap OOM is pretty easy - just give your virtual machine a smallish heap size using the -Xmx switch and write some Java code that allocates lots of big Strings, populates them with random content (to defeat optimisations like interning) and stores them in a collection so that there is always a reference to them. As I've said previously, diagnosing these issues is a fairly well understood process, the usual solution being to get a heap dump using a switch such as -XX:+HeapDumpOnOutOfMemoryError and then using a tool such as Eclipse MAT to analyse the contents of the dump.
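If you want to try this for yourself, a minimal sketch along those lines might look like the following (the class name and sizes are illustrative, not code from our application). Run it with a small heap such as -Xmx32m and it should fail quickly with java.lang.OutOfMemoryError: Java heap space.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class HeapOomDemo {
    public static void main(String[] args) {
        List<String> hoard = new ArrayList<String>();
        Random random = new Random();
        while (true) {
            char[] chars = new char[1024 * 1024];                 // roughly 2MB of character data per String
            for (int i = 0; i < chars.length; i++) {
                chars[i] = (char) ('a' + random.nextInt(26));     // random content to defeat interning
            }
            hoard.add(new String(chars));                         // keep a reference so nothing can be collected
        }
    }
}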

Causing a native OOM is less straightforward but there are ways to do it. A very comprehensive article on the subject of Native OOM errors can be found on the IBM site. This article also has some sample code which causes a native OOM by using some native code and JNI to allocate space in the native heap using malloc. It also suggests another way to chew up the native heap which doesn't need native code, so this is the way I did it. The java.nio.ByteBuffer class includes a method which allows the caller to allocate a 'direct' buffer which exists in the native heap. Here is a fragment of my code which does this...

private ByteBuffer createBuf() {
    // BUFFSIZE and r (a java.util.Random) are fields of the enclosing class
    ByteBuffer bb = ByteBuffer.allocateDirect(BUFFSIZE);   // allocated in the native heap, not the Java heap
    int idx = 0;

    while (idx < BUFFSIZE) {
        bb.putLong(idx, r.nextLong());                      // write to the buffer so the pages are really touched
        idx += 1024;
    }
    return bb;
}

Call this enough times in a loop, keep the resulting references in a collection and you have created your own native memory leak. I found when testing this that I needed to be careful to run it only in a 32-bit JVM on a machine with plenty of RAM; otherwise it won't hit the OOM error but it will use up all of your system's memory to the point where the system starts page thrashing. Don't try this on a production box!
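To make that concrete, here is a hedged sketch of the sort of driver I mean (the class name and buffer size are illustrative rather than my actual test harness). It simply keeps calling createBuf() and holds every returned buffer in a list so that the native memory behind the direct buffers can never be reclaimed. Depending on your JVM and settings such as -XX:MaxDirectMemorySize, the failure may surface as an OutOfMemoryError from the allocation itself.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NativeOomDemo {
    private static final int BUFFSIZE = 10 * 1024 * 1024;         // 10MB per direct buffer (illustrative)
    private final Random r = new Random();
    private final List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();

    private ByteBuffer createBuf() {
        ByteBuffer bb = ByteBuffer.allocateDirect(BUFFSIZE);      // allocated in the native heap
        for (int idx = 0; idx < BUFFSIZE; idx += 1024) {
            bb.putLong(idx, r.nextLong());                        // touch the buffer so the pages are really used
        }
        return bb;
    }

    public static void main(String[] args) {
        NativeOomDemo demo = new NativeOomDemo();
        while (true) {
            demo.buffers.add(demo.createBuf());                   // keep the reference - a deliberate native leak
            System.out.println("Direct buffers held: " + demo.buffers.size());
        }
    }
}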

Armed with this I was able to test questions like whether the -XX:+HeapDumpOnOutOfMemoryError switch will produce a heap dump for a native OOM. The answer is 'no', at least on Java 1.5.0_22.

Wednesday, 29 February 2012

Don't be Average

I've commented previously in these pages on the usefulness of the Oracle AWR and ASH reports but they are not the perfect answer to many questions. One of their limitations is that they present average figures over a period of time. If we are not very careful then we might start to believe that the average numbers are the whole story. Sometimes they are not.

One of my projects has had a long-standing performance issue with some Oracle code that has been very difficult to crack. One of the main challenges was that we did not have a detailed understanding of the problem. The information in AWR and ASH appeared to have a circumstantial connection to the problem but didn't really suggest how we could solve it. Our attempts to replicate it in a test environment were also unsuccessful - our tests always ran at least 10 times faster than the production system, even on a full-sized database.

This is a typical section of an AWR report from this issue:-



This shows the PL/SQL call (7q17xs6wcbra6) that was the subject of the problem plus another query (8x5z8zszym698) which always seemed to appear high in the AWR list whenever the PL/SQL call was busy. It was clear that there was a huge number of executions of the SQL, but was this the cause of the problem? As it turned out, the answer was 'no'.

The key to improving our understanding of the problem was to get away from looking at averages. The first step in this direction was to use CA Wily Introscope to look at the same PL/SQL calls from the WebLogic side. Looking at a quiet period in the early morning hours, this is what we saw:-



The graph shows that there is a repeating pattern consisting of a handful (usually 2-4) of calls which complete in less than one second followed by a single call that typically takes 30-40 seconds. This was a crucial piece of information which we had been missing. With this knowledge we were able to look at the business task that the procedure was doing and understand the pattern. For those interested, the system was working on a hierarchy of circuits. Deletion of the child circuits was very fast but deletion of the parent circuit required a full check of the hierarchy and was much slower.

Having identified the calls which took a long time we also extended the analysis by using SQL Trace to watch one of these and proved that the averages for the SQL operation were equally misleading for the same reason.

This understanding also allowed us to see what was missing from our test scenario and reproduce the issue in the test environment. Finally the understanding that we had gained allowed us to work with our supplier and find a fix for the issue.

Below is a graph from Wily for the past month showing the average response time of the PL/SQL call before and after we applied the fix. The improvement speaks for itself.

Wednesday, 30 November 2011

Using Jstat with the CMS Collector

Several months back I was investigating high CPU utilisation on a customer system. One factor that I noticed was that the Concurrent Mark Sweep garbage collector on one of the Java 5 JVMs seemed to be using 100% of a CPU for much of the day. Checking the GC log file, I found that during this period the CMS collector was running almost continuously.

We also use jstat to monitor the GC behaviour because its output is much easier to interpret than the GC log. Jstat showed total GC times much lower than those shown in the GC logs. I also noticed that the number of 'full GCs' reported by jstat frequently increased by two or a larger even number and that odd numbers appeared only rarely and for short periods of time.

So what does this mean? Why does jstat apparently disagree with the information from the GC logs and why does it seem to report that full collections occur in pairs?

Although I've never seen the answer documented anywhere (no doubt some kind soul will now point it out to me), I had a hunch that I knew what it was. My theory was that jstat was reporting just the application pause time, not the total execution time of the collector.

So I set out to try to prove or disprove my theory by running a simple test application to exercise the java heap and collecting both the GC log file (-verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log) and the jstat (-gcutil -t) output for comparison.
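My test application was nothing sophisticated; the sketch below gives a rough idea of the kind of heap churn involved (the class name, sizes and thresholds are illustrative, not my exact code). Running something like this with the CMS collector enabled (-XX:+UseConcMarkSweepGC) and a heap of around 512MB, which is consistent with the figures in the jstat output further down, produces regular CMS cycles to compare against.

import java.util.ArrayList;
import java.util.List;

public class MemStress {
    public static void main(String[] args) throws InterruptedException {
        List<byte[]> live = new ArrayList<byte[]>();
        while (true) {
            live.add(new byte[64 * 1024]);                   // grow the live set so objects get tenured
            if (live.size() > 4000) {                        // around 256MB retained, then release half
                live.subList(0, live.size() / 2).clear();    // creates old-generation garbage for CMS to collect
            }
            Thread.sleep(1);                                 // slow the loop down a little
        }
    }
}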

I also did a little reading about the CMS collector to help me to identify the application pause time in the logs. This told me that each CMS cycle causes two application pauses - one for the 'initial mark' phase and one for the 'remark' phase.

Here is the jstat output:-

E:\source\memstress>jstat -gcutil -t 3136 10s

Timestamp         S0      S1      E       O       P     YGC    YGCT   FGC    FGCT     GCT
     13.4       0.00    0.00   58.11   43.79    9.73      8   0.440     0   0.000   0.440
     23.5       0.00    0.00    3.35   72.78    9.77     15   0.848     6   0.078   0.926
     33.5       0.00    0.00   37.99   72.60    9.77     16   0.896    12   0.313   1.208

So between 13.4 seconds and 23.5 seconds of uptime, jstat reports 0.078s of full GC time.
The GC log is too verbose to reproduce here, even for this simple test, but the relevant messages for the pause phases were:-

15.410: [GC [1 CMS-initial-mark: 223173K(408576K)] 223672K(524224K), 0.0001739 secs]

[1 CMS-remark: 245213K(408576K)] 329218K(524224K), 0.0307485 secs]
19.074: [GC [1 CMS-initial-mark: 274783K(408576K)] 276925K(524224K), 0.0001132 secs]
[1 CMS-remark: 274783K(408576K)] 335949K(524224K), 0.0313580 secs]
21.232: [GC [1 CMS-initial-mark: 287928K(408576K)] 287928K(524224K), 0.0001128 secs]
[1 CMS-remark: 310245K(408576K)] 314122K(524224K), 0.0157990 secs]

So there were three CMS collection cycles in the tenured generation during the 10 second interval covered by the second line of jstat output. This corresponds to six pauses, which is what jstat reports as the 'Full GC count'.

Adding up the pause times, I reached a figure of 0.0779 seconds which agrees nicely with the 0.078 seconds reported by jstat.

So… when used with the CMS collector, jstat seems to report the number of application pauses for garbage collection and the amount of time for which the application was paused. The pause time will probably be much lower than the total amount of GC execution time.

Monday, 31 October 2011

Things That go Dead in the Night

Happy Halloween.

This post will be brief. Many people will think that it's a statement of the obvious but it's amazing how much time can be wasted on issues like this, especially in large organisations.

Sometimes we see systems behave rather strangely during performance tests and the teams managing the tests can sometimes be very inventive in their explanations of the issue. This particular issue has crossed my desk a couple of times (too many) in recent weeks and has been presented as 'a strange unexplained issue', 'file corruption', 'network issue' or 'SAN failure'.

Without building up any further mystery or invoking the supernatural, the answer was simple - the file system was allowed to become full. This can cause all sorts of interesting and varied problems to your system and the execution of any tests. In some cases it will fail completely, in others it might continue almost without anything appearing to be wrong and there are any number of variations in between these extremes.

One common factor is that it is quite normal for there to be no helpful error message like 'disk full' visible because often the file system that becomes full is the one being used to write the log file. This is where the creativity comes into play - people don't see an obvious error, so their imagination gets to work.

Scheduled housekeeping and other such automated cleanup solutions can add to the fun because they can sometimes free up space before somebody notices the problem... so then people may tell you that they are certain that the disk wasn't full because it isn't full now.

So if your test goes dead in the night, either temporarily or permanently, please do check your disk space monitoring (you do monitor your disk space utilisation don't you?). Alternatively, scrutinise your log files very carefully - are there any strange 'silent' periods that should contain messages?

Don't have Nightmares.

Wednesday, 18 May 2011

Monitoring Application Locks in Oracle

Oracle provides a variety of forms of locking protection for SQL operations to ensure that correct transactional semantics are provided to applications. Nine times out of ten when looking at issues related to locking, we will be interested in the locking facilities used by SQL, for example by monitoring row lock contention and tracking down which objects are the subject of the contention.

For some purposes, the locking provided by SQL operations does not meet the needs of the application. For these situations, Oracle provides the DBMS_LOCK package which provides application level locking.

This posting describes how to understand application level locking activity from AWR, ASH and other sources of information. It will help you to understand Oracle wait events like 'enq: UL - contention' and to diagnose one possible cause of 'missing' response time, a situation in which some of the information provided by the Oracle AWR report may not be quite what you would expect.

Locking Strategies

There are two types of strategy that I've encountered when looking at application level locking. The first strategy uses DBMS_LOCK.REQUEST to try to acquire a lock and optionally specify the length of time for which the application is prepared to wait if another session already has the lock.

The second strategy is one that I would call 'polling' in which the application performs a rapid check to try to acquire a shared resource. This may use DBMS_LOCK with a very short timeout or it could (more likely) use a completely different implementation. If the resource cannot be acquired then the application waits using DBMS_LOCK.SLEEP for a period of time before retrying.

Trying it Out

I wanted to be able to run some predictable tests and then look at the results using various monitoring tools. For this purpose, I used some sample code, in the form of a PL/SQL package, that creates and then acquires an application lock called 'printer_lock' using DBMS_LOCK. I then started two sessions using SQL*Plus. In the first session, I created the lock and acquired it. In the second session I tried again to acquire the same lock, which of course caused this session to wait because the lock was held by the other session.

I also added my own procedure to call DBMS_LOCK.SLEEP. In this case I didn't do any real 'locking' - I just wanted to see how a session calling this procedure would show up in the monitoring reports.

Visibility in Performance Reports

AWR can shed light on the wait time for application level locking using DBMS_LOCK.REQUEST. From this we can get a good understanding of whether our application might be getting delayed by application locks.

The first sections that contain information we're interested in are the 'wait class' and 'wait events' sections. Let's look at the more detailed 'wait events' section first. Here we will see the wait event 'enq: UL - contention', which gives the total time waited while using APIs such as DBMS_LOCK.REQUEST (but not SLEEP) to acquire a lock. This information is also summarised into the 'Application' wait class.


If the locking is being executed by PL/SQL code then AWR may also show us the top-level calling PL/SQL procedure in the 'SQL by Elapsed Time' section. The time spent waiting to acquire the lock will be counted in the elapsed time of this procedure.


The ASH report provides similar visibility, again highlighting the same wait event and the calling procedure. It can also provide details of the session that is waiting. Because ASH works by sampling it may give a slightly different picture from that shown by AWR and can be a better tool for some types of problem.



Because the meaning of any lock is application-defined, neither AWR nor ASH provides any information about the resource that is the subject of the contention. We will look at how to diagnose this later.

For applications using a polling strategy, the visibility is not so good. In fact it would be very easy to miss it completely. The database performance reports seem to assume that time spent in DBMS_LOCK.SLEEP is of little or no interest to the audience of an AWR or ASH report. It is visible in the AWR report as the wait event 'PL/SQL lock timer', but you will need to look carefully to find this event because it appears very low down in the table (possibly right at the end), even if the time waited is very high.


The thing which is perhaps more surprising is that time spent in DBMS_LOCK.SLEEP is not counted in the elapsed time of the calling procedure. This means that for any procedure that sleeps in this way, you can't rely on the elapsed time reported by AWR. The same principle seems to apply to ASH - if the ASH sample catches the procedure sleeping like this then it doesn't seem to register a 'hit' on the calling procedure at all. This may make sense to someone looking at a problem from a database infrastructure perspective, but to an application developer looking at an AWR or ASH report it is very counter-intuitive.

System Tables and Views

As I said earlier, neither AWR nor ASH will shed much light on the resources that are the subject of application level locking conflicts. To get any detail on this we will need to query some system tables and views.

The first query looks for user locks. The most interesting columns are CTIME (which I think is the number of seconds since the lock was acquired), SID (the session currently holding the lock) and ID1 (which is used in the second query).

SQL> select * from v$lock where TYPE = 'UL';
ADDR     KADDR           SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
-------- -------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
1EF3E270 1EF3E29C         35 UL 1073741921          0          6          0         76          0

The second query looks up the lock name using the ID from the previous query. Clearly since we are dealing with application level locking, the meaning of the lock name is application-defined, so you will need to use your knowledge of the application to figure out what it means and whether you can do anything to improve the situation. Note that the 'expiration' column does not relate to the tenure of the lock by any session, but rather the lifetime of this particular application-level lock. It can be set when the lock is created by DBMS_LOCK.ALLOCATE_UNIQUE and defaults to 10 days.

SQL> select * from dbms_lock_allocated where lockid=1073741921;

NAME
--------------------------------------------------------------------------------

LOCKID EXPIRATIO
---------- ---------
printer_lock
1073741921 21-MAY-11

Naturally you could join the two queries, but I will leave that as an exercise for the reader.

Lock Housekeeping


The previous section mentioned the expiration time of an application level lock. In some circumstances this may also have a performance impact for your application. The reason is buried in the following sentence from the Oracle documentation…

"Subsequent calls to ALLOCATE_UNIQUE may delete expired locks to recover space."

The implication of this is that the housekeeping of the dbms_lock_allocated system table can also get mixed up with your application's execution. The system table is indexed only by name, but the deletion statement used for housekeeping accesses the table by expiration date, so the housekeeping delete may cause a performance problem if your application creates a lot of locks with different names.

Here is an example AWR report from an application that is being hit by this issue:-

Thursday, 31 March 2011

Getting the Big Picture from AWR

Oracle AWR reports are one of the first tools that I turn to when I am asked to look at a database performance issue. They contain a huge amount of very useful information and can often lead me straight to the cause of the problem. But sometimes they don't - there are some situations where the data provided by a single AWR report just starts to feel like a whole bunch of meaningless numbers.

The reason for this is simple - some problems don't show up as obviously 'bad' things such as excessive elapsed time, I/O or CPU consumption. We only know what 'bad' looks like if we can also see what 'good' looks like.

Sometimes when I am in this situation the answer is quick and easy - I can identify another time period when performance was 'good', run an AWR and compare the two results. If I have the right Oracle version and sufficient access then I can even ask Enterprise Manager to do it for me.

But there are still some situations where this approach doesn't give a clear answer. Perhaps there are multiple types of workload running and we can't find an obvious time period to use as our 'good' baseline.

There is a third approach that I have used to good effect in these situations - run a set of AWR (and possibly ASH) reports for consecutive time periods (let's say one hour each) and get hold of transaction throughput and response time stats from your application for the same time periods. Create a simple spreadsheet with the interesting stats - application throughput, response time, database CPU usage, wait events and perhaps some of the performance stats for any queries you suspect. You can then graph how these things vary over time. Looking at a graph like this often helps me to figure out the cause of the problem.

Let's look at an example. The Excel graph below was done to investigate a temporary drop in application throughput that didn't show a very obvious cause in a single AWR due to the level of other activity happening at the same time. Picking 6-7PM as a 'good' period and comparing the 5PM and 6PM AWR reports didn't give a clear answer either.



The graph plots (on the primary axis) the SQL elapsed time as a line and the database CPU and key wait times as a stacked area chart - these should almost add up to the elapsed time. On the secondary axis you can see an orange line showing the drop in throughput at 4PM and 5PM.

So how do we use this graph to find the root cause? - any wait events that are not causing the issue are likely to drop as application throughput drops. A wait event that is causing the issue will probably stay constant or increase. From this graph we can see that row lock contention is the likely cause. We can also see that row lock waits are still significant at other times, which is why comparing just two AWRs didn't give a clear answer in this case.

You can use some clever ways to get the data for the graphs such as querying the AWR views directly or writing a script to parse the AWR report and extract the info. These are fine and dandy if you are ready to do them before you are under pressure to solve the issue. I didn't do either - I just manually typed the info into Excel - it didn't take very long.

Monday, 31 January 2011

Dashboards for Production Monitoring

We've been using CA Wily Introscope to monitor one of our production systems for several years. Some dashboards had been developed when the tool was first installed and these have been in use ever since. These dashboards provided useful information, but had been built from a technical perspective and didn't really reflect the way that we usually think about our system. For example, the top level dashboard showed items like 'JMS Subsystem' and 'Java Heap'. When one of these indicated an issue, it wasn't really very clear which service was being impacted… and in some cases whether there was really a problem at all.

Following the experience with Wily for performance test monitoring (see last month's posting) I decided to spend some time creating some new dashboards that were a better match for the way we think about our system - i.e. as a collection of services each one of which might be working well or having some problems. I therefore wanted a top-level dashboard with one traffic light indicator for each of our key services and drill-down into a more detailed dashboard for each individual service to allow problems to be tracked down. I will return to this later.

This posting describes what I wanted to do, what I did and what I learned in the process. It's mostly about Wily but other tools are available so I hope (dear reader) that the principles described would apply reasonably well to a different tool.

Individual Service Dashboards

To figure out whether an individual service is working properly we need to monitor several things…
  • What is the response time of the service and is it within acceptable limits?
  • What is the throughput for the service?
  • For a JMS messaging service, how many messages are sitting in the queue?
  • For a JMS service, does the queue have one or more listeners connected?
  • Are there any messages in the error queue?
  • Is the application reporting an abnormal number of errors?
All of the above are available in what Wily terms the 'front end' or in the JMX data for the relevant JMS destination.

Our application is clustered, so in most cases I needed to create a Metric Group to bring together the numbers for a particular metric from each member of the cluster. For most of these I then created an Alert to contain the threshold values that would trigger an amber or red warning.

Sometimes I was able to find a Service Level Agreement (SLA) to set the alert thresholds. In cases where there was no SLA I used the Wily Investigator to get a picture of the average and maximum values for the metric over the past month and set thresholds based on this.

For some metrics (e.g. throughput) it was not appropriate to create an Alert. For the error alerts on our user interface I needed to ask Wily to count how many occurrences happened in a given period of time (e.g. 5 minutes) because in many cases a small error rate is quite normal and acceptable.

The service dashboards typically contain a graph to track the values of each metric across all managed servers plus a traffic light showing the status of the alert. In most cases I found it best to bind the graph to the alert rather than the metric group so that I got an amber and a red line on the graph to show the trigger points for each alert like the example below.


Laying out the dashboards was challenging in some cases and for some more complex services I needed to resort to multiple layers of drill-down to show all of the graphs that were needed. In some cases I did this by having a higher level dashboard with 'throughput', 'response time' and 'queue backlog' panels and providing drill-down to more detailed dashboards from here. For other services it was more appropriate to break the service down into functional elements (sub-services) and provide a second level drill-down dashboard for each of these. Here's an example.


Top Level Dashboards

Wily allows us to combine individual alerts into higher level summary alerts - in multiple tiers if necessary. This is exactly what I did to create the top level dashboard and also for the summary dashboard where I broke things down into sub-services. The top level dashboard with one traffic light per service is shown below. I also added a CPU usage graph below the traffic lights for information (and because there was some free space).


Tricky Things

So far so good, but not everything went totally as I wanted…
  • I couldn't find a way to combine metrics using any kind of logic - for example if I wanted to raise an alert if a queue had zero listeners AND had a non-zero message count.
  • I have not so far found any way to turn cumulative metrics (e.g. events since startup) into a number of events per interval. This would be very useful for some JMX metrics that are only available as a cumulative number.
  • For WebLogic JMS queues it is really necessary to know both the 'current' message count and the 'pending' (i.e. being processed or buffered for delivery) count and to sum these together. It is possible to do this using Wily calculators but this gets complicated in a cluster because we really need a separate calculator per metric per cluster member… which is pretty time-consuming to set up. In practice I have ignored the pending count in most cases - except for those parts of the application where it is really critical.
  • Debugging summary alerts can be a bit tricky - I found one that kept going amber or red on the top level dashboard but after drilling down into the detailed dashboard everything was green. In the end I traced it to a duplicate alert that was misconfigured and not shown on the dashboard but it took a lot of manual checking to track this down.
  • Wily can't 'see' everything. This isn't really a tool issue, but it does mean that 'green' on the Wily dashboard doesn't necessarily mean that everything is fine. For example, some parts of our application keep internal 'transaction queues' in application database tables and other parts use background jobs in the database which may go wrong in interesting ways without Wily having any visibility. The Wily Oracle Agent might help here, but at the moment we don't have it installed.
Future Possibilities

As I said at the beginning, my dashboard is focused on monitoring the services supported by our application. As a result, the dashboards focus on the 'front end' metrics. Wily also provides a lot of visibility of the 'back end' (e.g. JDBC calls into Oracle) and infrastructure (e.g. Java garbage collector behaviour) but none of these are used by my dashboards. I figured that if there is a backend issue then it must have a visible effect on the front end service otherwise it isn't really an issue at all. I would therefore expect the dashboard to detect a problem and flag it as an amber or red condition. It would, of course, be nice to be able to drill down and isolate this problem to the database (or whatever), but I've left that as a possible improvement for the future.

Friday, 24 December 2010

JRockit Mission Control vs Wily - Postscript

We've just completed our project to performance test an Apache Camel integration solution hosted on WebLogic. I thought it would be useful to write a few words about our experiences using JRMC and Wily to monitor the tests as a postscript to 'JRockit Mission Control vs Wily'.

CA Wily Introscope

We wound up using Wily for most of our monitoring needs in the WebLogic tier. On the whole it did the job that we expected of it very well.

Our main thoughts on our Wily experience are...
  • The fact that Wily's monitoring is 'always on' was a big advantage, as well as the ability to get historical results in graphical or CSV format.
  • We were not able to use any custom probes because Wily was installed and configured by the customer's infrastructure team and getting custom probes created, tested and deployed was just going to need too many phone calls and email round trips.
  • Initially there were quite a lot of navigation steps and other clicking around needed to capture all of the metrics that we needed into graphs or CSV form. We largely overcame this by creating some metric groupings and custom dashboards to bring all of the necessary data together into one place.
  • For some reason, the JMX data from the Camel route MBeans was not captured by Wily during our tests (although the processor MBeans worked fine and the route MBeans worked fine before we started the tests).
  • Wily didn't recognise our Apache CXFServlet as a 'Frontend' (but it knew it was a Servlet) nor did it recognise outbound web service and TIBCO calls as 'Backends', so some of the more advanced features (e.g. transaction tracing) were not usable.
  • Wily often picks a non-intuitive sampling interval for graphs (e.g. 6 minutes) but it was quite easy to override this and use something more sensible.
  • Data captured from JMX MBeans needed some further analysis afterwards to calculate the throughput and average response times during the test period. It may have been possible to use Wily calculators to automate this, but we didn't have time to get that far.

JRockit Mission Control

We had a couple of issues with JRMC right from the start. These were probably not faults with the tool, but they severely limited its usefulness:-
  • Our customer had an older version of JRockit than that supported by the latest version of JRMC, so we needed to downgrade to JRMC 3.1.2 before most of the JRMC features would work.
  • We never managed to get any kind of recording capability to work. JRMC 3.1.2 has JRA recording, but whenever we tried this it would cause the WebLogic managed server to go into an 'UNKNOWN' state and stop working. I've tested JRA recording on my local setup and it works fine, so this issue seems to be specific to our customer's managed WebLogic setup. It may have been fixable but we didn't get this far.
So in the end we used Wily for most of our WebLogic monitoring needs. The biggest gap in our Wily capability was measurement of callouts by the application to other web services. These were not being made directly by Camel components, so we had no visibility using JMX either.

We were able to use JRMC to fill this gap by using JRMC method profiling to trap the web service and TIBCO callouts. This wasn't a perfect setup because it could only give us an average response time for the whole test. It also had a few other issues:-
  • Someone needs to remember to start the method profiler before the test and save the results afterwards.
  • It isn't easy to share JRMC method profiling configurations (or any other JRMC configuration) with other team members, so each person had to set the probes up manually.
  • Only one person can use method profiling on a managed server at a time. If someone accidentally starts profiling while another session is in progress the first session is terminated.
  • We had one occasion on which the Windows machine running JRMC automatically rebooted itself part way through a 24 hour test run so we lost all of the JRMC data prior to the reboot.
It would have been nice to have been able to use the latest version of JRMC and be able to use all of the features, but circumstances dictated otherwise.

Summary

On the whole, our experience has reinforced my original conclusions that JRMC is a good tool for single developers or testers to quickly get a handle on how their Java code is performing but that Wily is much stronger in a team environment or for production monitoring.

In the end it wasn't an 'either/or' decision for us - we needed to use both tools to capture the data that we needed.

Wednesday, 17 November 2010

How Fast is my Camel?

I'm working on a project using Apache Camel to do some web service and messaging bus transformation and integration. My part of the project is mainly concerned with performance testing, so I needed to figure out how we could measure the performance of various components within our Camel setup - both in terms of how our integration piece was performing and also how the various services on which we depend were performing.

For those not familiar with Camel you can find an overview here

To figure out how to make these measurements I set up a couple of sample contexts and routes and deployed them into a WebLogic 10.3 server. For no particularly good reason, both provide a service based on the Spring Web Services echo sample.

One route is based on the Camel CXF sample and uses a CXF endpoint, a custom transformation bean and a custom backend bean to generate the response.


The second route uses the Camel HTTP components to provide a servlet and then proxy the SOAP call to the same app running on another (Jetty) app server.


What we need to measure is:-
  • how many times is each component/route called in a given time period?
  • how long does each route take to respond back to the client?
  • where is the response time being spent? - in our own components or in calls to other systems?
I will look at two ways to make these performance measurements in this posting - using JMX and by trapping and measuring invocations to individual Java methods. Because the project is using JRockit, I will be showing how to make these measurements using JRockit Mission Control (JRMC), but the techniques are equally applicable to other tools.

JMX

I was pleasantly surprised to find a bunch of Camel related JMX MBeans appear in the JRMC console without any need for me to make any changes to the application. Camel enables JMX by default and it seems that it also creates some usable default MBeans for each component.


Browsing through these it seems that most of them relate to the configuration of the Camel components rather than their runtime behaviour. Some of these include options to turn tracing on or off and reconfigure the detail level - useful but not quite what we are looking for. The two relevant parts of the tree for performance measurement are 'routes' and 'processors' because both of these contain invocation counts and timings. As their names suggest, 'routes' relates to the behaviour of the entire route and 'processors' to the behaviour of individual components. The 'processors' tree doesn't cover every component - things like servlets that receive messages via HTTP and initiate the route processing are not available here, but their response time will in most cases equate to the response time of the whole route.

Below is a screenshot of one of the components in my proxying route showing the attributes that will allow us to monitor the number of invocations and response time. You can see the two custom MBeans in the echo_context route and the HTTP client in the servlet_echo_context route. Unfortunately the names of the MBeans don't seem to be very meaningful, but by browsing into the attributes it is usually possible to figure out which one is which. The attributes provided by the route level MBean are very similar.


We can therefore use JMX to monitor the performance of our Camel routes and individual components.
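Of course you don't have to use the JRMC console to read these MBeans - any JMX client can do it. Below is a rough sketch of a minimal remote client (the service URL, port and ObjectName pattern are assumptions for illustration; check the names under the org.apache.camel domain in your own console) that prints the ExchangesTotal counter for each route.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CamelJmxPeek {
    public static void main(String[] args) throws Exception {
        // Host, port and the MBean name pattern are illustrative - adjust them for your own server
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7091/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName routes = new ObjectName("org.apache.camel:type=routes,*");
            for (ObjectName name : mbs.queryNames(routes, null)) {
                Object exchanges = mbs.getAttribute(name, "ExchangesTotal");
                System.out.println(name + " ExchangesTotal=" + exchanges);
            }
        } finally {
            connector.close();
        }
    }
}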

If you press the 'Visualize' button, you can also chart the value of one or more MBean attributes to see how they change over time. The chart will appear on the 'General' view in the console, something like the one below...


(Incidentally, if you want to graph behaviour over time and be able to zoom in and out etc then I think that Wily Introscope does a much better job - see my post comparing JRMC and Wily)

I noticed one bit of strangeness - the HTTP client MBean seems to report double the number of invocations that I would expect - each time the route runs the HTTP client's 'ExchangesTotal' count goes up by two.

Profiling Traps

The JMX MBeans will probably serve most of our day to day needs, but sometimes I expect we will want to 'drill down' a bit further to investigate performance issues. For this purpose, we can place traps or probes into the code which can make measurements at a level of detail lower than those provided by JMX.

These traps can be inserted using Aspect Oriented Programming technologies such as aspectj, Spring's built in AOP capabilities or using a monitoring tool such as Wily or JRMC.
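As an illustration of the AOP option, a timing aspect around one of the methods listed further down might look something like the sketch below. This is a hedged example rather than anything we actually deployed, and it assumes AspectJ weaving (load-time or compile-time) is available, since plain Spring AOP only advises Spring-managed beans.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class CalloutTimingAspect {

    // Times the Camel HTTP callout method mentioned in the list below
    @Around("execution(* org.apache.camel.component.http.HttpProducer.executeMethod(..))")
    public Object timeCallout(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
            return pjp.proceed();                                        // run the intercepted method
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1000000L;
            // In real use this would feed a metrics store rather than stdout
            System.out.println(pjp.getSignature().toShortString() + " took " + elapsedMillis + "ms");
        }
    }
}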

If you want to use this approach, you need to have some knowledge of the code of the framework that you are using. You can get this by browsing the Javadocs or the source code itself or by running your app in a debugger, placing breakpoints at 'interesting' places and examining the call stack. Setting up a debugger takes some time, but I think it provides the answers that I want more quickly than trying to figure things out from the documentation or the source.

Here are some methods where I've placed traps to try out this technique...
  • org.apache.cxf.transport.servlet.AbstractHTTPServlet.handleRequest(HttpServletRequest, HttpServletResponse) - The entry point into my CXF driven route
  • org.apache.camel.component.http.CamelServlet.service(HttpServletRequest, HttpServletResponse) - The entry point into my HTTP servlet driven route
  • org.apache.camel.component.http.HttpProducer.executeMethod(HttpMethod) - The point where my HTTP servlet driven proxying route calls out to an external server
  • my own custom bean methods
The above points will capture data similar to that captured by JMX. These can be used along with the source code as a starting point for 'drilling down' into the detail.

Creating profiling traps in JRMC is fairly simple - go to the 'Advanced' view in the console and click the 'Add...' button under 'Select methods to profile'. JRMC will ask you for the class name. Assuming that the class is found, it will then allow you to select the methods that you want to profile.

Once the methods are selected, you need to click the 'Start Profiling' button (green triangle) to start capturing data. You should see something like the screenshot below.



Profiling traps can be used to get a more detailed picture of where response time is being spent. They are more complex to set up than JMX and probably incur some measurement overhead. They do have one other big drawback - they are based only on class and method, so you won't be able to separate out the stats for different component instances if you have a number of routes using the same type of component. For all of these reasons, method profiling is probably best restricted to very targeted tests in which you can focus your workload and your profiling traps purely on a small area of interest.

Sunday, 31 October 2010

JRockit Mission Control vs Wily

We’ve been using CA Wily Introscope to monitor one of our production systems for a couple of years. It has been very useful in providing a detailed insight into what is going on with the Java tier including the ability to look back in time and do root cause analysis after a problem has happened.

In particular, we use Wily to monitor the throughput and response time of our ‘front ends’ (servlets, EJBs etc), the ‘back end’ (JDBC calls) as well as keeping an eye on key JMX metrics such as the number of messages in JMS queues and the behaviour of the heap and garbage collector. All in all, it’s a very useful tool that I wouldn’t like to be without.



Another customer that I’ve been working with more recently has suggested either Wily or Oracle JRockit Mission Control (JRMC) as monitoring solutions. I’ve mostly worked with the Sun JVM until now so I didn’t know much about JRMC. I decided to try to figure out how it stacked up against Wily.


This post is about my first impressions of JRMC and its suitability for my own needs. I wouldn’t claim that it’s a comprehensive comparison, so please forgive any errors or omissions.

Pros
  • It’s much easier to get a copy of JRMC than Wily
  • No interceptor setup needed
  • Easier to use for a 'quick look'
  • Probably a better tool for general profiling when you’re not sure where a performance issue might be
  • I like the ‘wait events’ profiling – this seems rather similar to what an Oracle AWR report shows in the database… it remains to be seen whether it proves as useful in the Java tier.
  • Looks like it has some useful features for diagnosing memory leaks, but I’ll reserve judgement until I’ve used them on a real life problem.
  • Can trigger thread/heap dumps etc from the same console
  • Probably better targeted garbage collection visibility for JRockit
Cons
  • Many features only work with the JRockit 1.6 VM
  • It doesn’t seem to support the notion of a centralized permanent data repository
  • Without a data repository, I suspect that any long term trend reporting will be very difficult
  • It doesn’t seem to have the capability to build ‘traffic light’ dashboards and drill down from these to the detailed monitoring data
In summary, I think that both Wily and JRMC are useful tools. JRMC probably scores more highly on more detailed JVM analysis and ease of getting started. Wily does a better job of monitoring a production system and historical trend reporting.

My first choice would be to have both tools available. If I had to choose between them I’d probably stick with Wily for production system monitoring and possibly go for JRMC for monitoring performance tests.

Tuesday, 14 September 2010

Capturing Local Network Traffic on Windows

I often find myself using a network packet sniffer to capture traffic between two network applications. This can be a very powerful technique, both for debugging and for diagnosing performance issues, because it shows exactly what is happening on the wire without any need for debugging or logging code in the application and without any room for doubt about whether the debugging code is itself causing a problem or masking the issue.

One serious limitation with this technique is that it is not usually possible to monitor traffic between two applications running on the same Windows machine. I think that the reason is that the Windows loopback adapter doesn’t support packet sniffing. This is rather a pain for me because I quite often want to look at two network applications ‘in captivity’ by running them on my laptop.

Until recently I had just accepted this limitation and found other ways around it such as using another machine to run one of the applications, but this needs extra effort to get the application working on another machine. Sometimes I simply abandoned the packet sniffing route and tried to solve the problem using other techniques.

I recently wanted to monitor and debug some Oracle database link traffic. Installing Oracle on another machine was really going to be overkill so I spent some time thinking about other ways to see the traffic.

I figured out a very simple way around the problem that allows packet sniffing without installing one of the applications on a different machine. The basic principle is to make the network traffic from the client go to a different machine and then have that machine send the traffic back to the target application. This forces the traffic between the two applications ‘onto the wire’ where it can be captured by a packet sniffer (Wireshark in my case). In my setup the extra machine is running Linux, but it should be possible to adapt this technique for other operating systems.

I tried a couple of ways to do this without success before settling on a really simple answer. The first two ways should also work and may be more suitable in some situations. Having found a simpler way, I didn’t pursue them further, but here they are:-
  1. Use iptables port forwarding to identify the relevant network packets and route them back to the intended host.
  2. Use netcat as an application level proxy configured to open a TCP connection to the target host when it receives an inbound connection and forward traffic in both directions.
The technique that I settled on doesn’t require any clever stuff on the Linux machine:-
  1. Use Putty on the Windows machine to open an SSH session onto the Linux machine and log in.
  2. Create a tunnel in Putty, providing a free port number (let’s say we choose 8521) on the Windows machine as the ‘source port’ and the IP address of the Windows machine and the real port number of the application (e.g. 1521) as the ‘destination’
  3. Reconfigure the client application to talk to localhost:8521
  4. Set up the packet sniffer to capture network traffic on the real application port number (1521).
  5. Run the client application and watch the packets arrive.
The screenshot below shows the Putty settings window with the tunnel info being set up. Don't forget to click the 'Add' and then 'Apply' buttons or else it won't work.


This works because Putty listens on port 8521 on the Windows box, sends all traffic received over to the Linux box via its SSH connection, and the Linux box then opens a connection back to the real application port (1521) on the Windows box. The server application and packet sniffer both see a connection coming from the Linux box and send all responses back the same way, even though the client is really on the Windows box.

Step 3 is of course very dependent on the client application. In the case of a database link the 'client' is actually an Oracle database instance. I had to set up an extra aliased entry in tnsnames.ora and create a database link to talk to the aliased TNS name instead of going direct to the real target instance.

Monday, 10 August 2009

How to find Java Memory Leaks - Part 1

Java heap consumption and memory leak diagnosis is something that (judging by other posts around the net) is often misunderstood - I've lost count of the number of searches that have turned up advice to simply increase the heap size and hope the problem goes away.

Let's address this one question straight away - if you really have a memory leak, increasing the heap size will at best only delay the appearance of an OutOfMemoryError. If your issue is just a temporary spell of high demand for heap memory then a bigger heap may help. If you take it too far you may see much worse performance because of swapping or even crash your JVM.

So what should you do? In essence, I would recommend three steps:-

1. Monitor your application in test and/or production to get an understanding of its heap behavior
2. If you really do have a problem, you need to gather data to help to track down the cause.
3. Finally having got the data, you need to figure out what it means - what is causing your problem and how to fix it.

Sadly these steps are not always easy to do - particularly step 3.

I'll be talking about the Sun JVM from here on, although many of the same principles and sometimes tools apply to other JVM implementations.

First let's look at monitoring...

To get a clear picture, you really need to look in detail at the heap stats from your JVM, but I quite often find that these are not available when I'm first asked to look at a problem, so what can we tell without this?

Clearly if you see this in your application logs then there is a problem...

java.lang.OutOfMemoryError: Java heap space

The usual response if you have real users is to restart the application straight away. You should then continue to monitor because you're probably going to need to know more before you get to the end of the diagnosis.

I often start by looking at CPU consumption using top, vmstat or whatever tools are available. What I'm looking for is periods where the Java app is using close to 100% of a single CPU. While it's not conclusive proof (being stuck in a loop could cause the same effect), this often means that the app is spending a lot of time doing full garbage collections - the most commonly used garbage collectors are single threaded and therefore tend to use 100% of a CPU.

Monitoring memory usage from the operating system is not very informative - Java will typically grow the heap to the maximum size long before there is a problem. If you have oversized your heap compared to physical memory you will probably see heavy swapping activity, but that's a different issue.

Some applications and app servers will make calls to java.lang.Runtime.freeMemory(). While this may be better than nothing, it's fairly uninformative and provides very little info about which generations have free space, how much time is being spent garbage collecting and so on.
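For what it's worth, that sort of in-application reporting only takes a few lines of Java - a rough sketch (the class name and log destination are illustrative) is shown below - but it suffers from exactly the limitations just described.

import java.util.Timer;
import java.util.TimerTask;

public class HeapLogger {
    public static void start(final long intervalMillis) {
        Timer timer = new Timer("heap-logger", true);                 // daemon thread so it won't block shutdown
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                Runtime rt = Runtime.getRuntime();
                long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
                // One coarse figure for the whole heap - no per-generation or GC-time detail
                System.out.println("Heap used=" + usedMb + "MB"
                        + " total=" + (rt.totalMemory() / (1024 * 1024)) + "MB"
                        + " max=" + (rt.maxMemory() / (1024 * 1024)) + "MB");
            }
        }, 0L, intervalMillis);
    }
}

Calling something like HeapLogger.start(60000L) at startup gives one figure per minute, which is roughly comparable to the jstat approach described below, only with far less detail.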

The JVM can provide much more detailed info about what is going on. The JVM itself provides two main ways to get hold of this:-

1. JVM Startup arguments to write information to log files. Here are some suggestions, but check the docs for your specific JVM version (or try here) because the available options vary:-

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintTenuringDistribution

You may also want to add -Xloggc:myapp-gc.log to send your GC info to a dedicated log file

You can then analyse the log files to produce a graph using a tool such as gcviewer. This approach gives you detailed information about every GC run but be careful to check that you have enough disk space - these options can produce a lot of data. Typically you will also need to restart your app to enable these options, although they are also controllable via JMX for more recent JVM versions.

2. JDK tools - jvmstat in Java 1.4 and jstat in Java 1.5 and higher. These are tools that you run from the command line. They connect to your Java app via one of the JVM debugging APIs and will then report on heap and GC stats according to the options you specify. I usually use jstat -gcutil -t 60s which produces a line of output once per minute showing the percentage utilisation of each heap space, plus the number of young and full GC events and the cumulative GC times. You can then open the resulting text file in Excel for analysis. Note that jstat is capable of monitoring Java 1.4 VMs, which is handy if you have multiple versions. Alternatively (if your environment allows) you can try visualgc for an instant GUI view but I prefer using jstat to collect logs for offline analysis.

There are other commercial tools (e.g. Wily Introscope) which can provide the same data as the JDK tools. They may provide better visualisation, historical reporting and perhaps proactive alerting, but I don't intend to cover them in detail here because most readers probably don't have them.

So what should you look for to know whether you have a problem?
  • OutOfMemoryErrors in your application logs - clearly you have a problem if you are getting these.
  • A persistent upwards trend in the Old Generation usage after each full garbage collection.
  • Heavy GC activity. What 'heavy' means is rather dependent on your application. A rule of thumb for a non-interactive app might be 15 seconds during each minute - which means that your app only has 75% of its time left to do useful work. You might choose a lower figure or look at the duration of individual full GC runs, especially if your app has real users waiting for it.
Finally, some pictures. Here is a graph showing jstat data from an app on WebLogic 9.2 with a heap memory leak - you can see that the Old Generation usage is climbing, even after garbage collection.

Here is another graph from the same app. The original leak has been fixed and there is now no evidence of a cumulative memory leak, but there are some intermittent spikes in heap usage. If the spike is severe enough it can still cause an error. Tracking this problem down will need some knowledge about what unusual types of system activity are occurring at the time of the spike.


Here is a graph drawn by gcviewer after using JVM startup arguments to produce a log file. The graph is quite 'busy' and includes a lot of extra info - for example this app's arguments allow the heap to grow as more is needed. The graph shows the growth in overall heap size as well as the growth in consumption. This info is available using jstat too, but if you really need it all then using gcviewer will probably be more convenient.

In the next post I'll look at how to gather data to help trace the cause of your memory leak.