Wednesday 29 February 2012

Don't be Average

I've commented previously in these pages on the usefulness of the Oracle AWR and ASH reports but they are not the perfect answer to many questions. One of their limitations is that they present average figures over a period of time. If we are not very careful then we might start to believe that the average numbers are the whole story. Sometimes they are not.

One of my projects has had a long-standing performance issue with some Oracle code that has been very difficult to crack. One of the main challenges was that we did not have a detailed understanding of the problem. The information in AWR and ASH appeared to have a circumstantial connection to the problem but didn't really suggest how we could solve it. Our attempts to replicate it in a test environment were also unsuccessful - our tests always ran at least 10 times faster than the production system, even on a full-sized database.

This is a typical section of an AWR report from this issue:-



This shows the PL/SQL call (7q17xs6wcbra6) that was the subject of the problem plus another query (8x5z8zszym698) which always seemed to appear high in the AWR list whenever the PL/SQL call was busy. It was clear that there was a huge number of executions of the SQL, but was this the cause of the problem? As it turned out, the answer was 'no'.

The key to improving our understanding of the problem was to get away from looking at averages. The first step in this direction was to use CA Wily Introscope to look at the same PL/SQL calls from the WebLogic side. Looking at a quiet period in the early morning hours, this is what we saw:-



The graph shows that there is a repeating pattern consisting of a handful (usually 2-4) of calls which complete in less than one second followed by a single call that typically takes 30-40 seconds. This was a crucial piece of information which we had been missing. With this knowledge we were able to look at the business task that the procedure was doing and understand the pattern. For those interested, the system was working on a hierarchy of circuits. Deletion of the child circuits was very fast but deletion of the parent circuit required a full check of the hierarchy and was much slower.

Having identified the calls which took a long time we also extended the analysis by using SQL Trace to watch one of these and proved that the averages for the SQL operation were equally misleading for the same reason.

This understanding also allowed us to see what was missing from our test scenario and reproduce the issue in the test environment. Finally the understanding that we had gained allowed us to work with our supplier and find a fix for the issue.

Below is a graph from Wily for the past month showing the average response time of the PL/SQL call before and after we applied the fix. The improvement speaks for itself.