Thursday, 31 March 2011

Getting the Big Picture from AWR

Oracle AWR reports are one of the first tools that I turn to when I am asked to look at a database performance issue. They contain a huge amount of very useul information and can often lead me straight to the cause of the problem. But sometimes they don't - there are some situations where the data provided by a single AWR report just starts to feel like a whole bunch of meaningless numbers.

The reason for this is simple - some problems don't show up as obviously 'bad' things such as excessive elapsed time, I/O or CPU consumption. We only know what 'bad' looks like if we can also see what 'good' looks like.

Sometimes when I am in this situation the answer is quick and easy - I can identify another time period when performance was 'good', run an AWR and compare the two results. If I have the right Oracle version and sufficient access then I can even ask Enterprise Manager to do it for me.

But there are still some situations where this approach doesn't give a clear answer. Perhaps there are multiple types of workload running and we can't find an obvious time period to use as our 'good' baseline.

There is a third approach that I have used to good effect in these situations - run a set of AWR (and possibly ASH) reports for consecutive time periods (let's say one hour each) and get hold of thansaction throughput and response time stats from your application for the same time periods. Create a simple spreadsheet with the interesting stats - application throughput, response time, database CPU usage, wait events and perhaps some of the performance stats for any queries you suspect. You can then graph how these things vary over time. Looking at a graph like this often helps me to figure out the cause of the problem.

Let's look at an example. The Excel graph below was done to investigate a temporary drop in application throughput that didn't show a very obvious cause in a single AWR due to the level of other activity happening at the same time. Picking 6-7PM as a 'good' period and comparing the 5PM and 6PM AWR reports didn't give a clear answer either.



The graph plots (on the primary axis) the SQL elapsed time as a line and the database CPU and key wait times as a stacked area chart - these should almost add up to the elapsed time. On the secondary axis you can see an orange line showing the drop in throughput at 4PM and 5PM.

So how do we use this graph to find the root cause? - any wait events that are not causing the issue are likely to drop as application throughput drops. A wait event that is causing the issue will probably stay constant or increase. From this graph we can see that row lock contention is the likely cause. We can also see that row lock waits are still significant at other times, which is why comparing just two AWRs didn't give a clear answer in this case.

You can use some clever ways to get the data for the graphs such as querying the AWR views directly or writing a script to parse the AWR report and extract the info. These are fine and dandy if you are ready to do them before you are under pressure to solve the issue. I didn't do either - I just manually typed the info into Excel - it didn't take very long.