Thursday 18 August 2011

Why does a query change plans when the stats have not changed?

I was recently asked to investigate a performance issue on one of our customer's production systems. Some comparative analysis of AWR reports between 'good' and 'bad' periods showed that the queries run by the part of the application that had problems were circa 50% slower, but what really caught my attention was an unrelated query that had jumped from about 8th place in the AWR response time 'offenders' list up to first place. It was also accounting for huge amounts of logical I/O and a fair bit of physical I/O.

Looking in more detail at this query, we got AWR SQL reports that showed its performance during the 'good' and 'bad' periods. These showed that a different plan was in use during the 'bad' period: that plan used a full table scan whereas the 'good' plan used an index lookup. It has to be said, though, that neither plan can have been all that great - if the index plan was 'good' then the query shouldn't have been in the AWR top 10 offenders list at all.
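
As a rough sketch of how those per-statement AWR reports can be pulled (the dbid, instance, snapshot ids and sql_id below are all placeholders; the interactive script @?/rdbms/admin/awrsqrpt.sql does the same job):

  SELECT output
  FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_SQL_REPORT_TEXT(
                 1234567890,         -- dbid           (placeholder)
                 1,                  -- instance number (placeholder)
                 12345,              -- begin snap_id   (placeholder)
                 12346,              -- end snap_id     (placeholder)
                 'abcd1234efgh5'));  -- sql_id          (placeholder)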

So we knew that the query was changing plans. We've seen this problem before, and the usual first step is to go and check the statistics on the table. We did this and found that the stats were just over a week old (i.e. not really 'stale') and that they had not changed at all between the 'good' and 'bad' periods of performance.
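
That check is just a matter of looking at when the optimizer statistics were last gathered - something along these lines, with a placeholder schema and table name:

  SELECT table_name,
         num_rows,
         last_analyzed,
         stale_stats
  FROM   dba_tab_statistics
  WHERE  owner      = 'APP_OWNER'   -- placeholder schema
  AND    table_name = 'MY_TABLE';   -- placeholder table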

So, that brought us back to the frequently asked question 'why does a query change plans when the stats have not changed?'. Richard Foote has written a good article about this and suggests one explanation…

http://richardfoote.wordpress.com/2010/02/16/how-does-an-execution-plan-suddenly-change-when-the-statistics-remain-the-same-in-limbo/

We did some more checking and proved to ourselves that Richard's explanation (that the data being queried is newer than the histograms in the table stats so the optimizer can't get a good cost estimate) did not fit our situation.

Sometimes our data was older than the stats, sometimes newer, so in principle Richard's explanation might have fitted. We checked the AWR data for this query manually over the past several days. In each AWR snapshot period we found just one plan, and that plan seemed to stay stable for more than 12 hours before switching again. This didn't fit Richard's explanation - the application would typically be hitting a mix of older and newer data, so we should have seen both plans being used during the same period.
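
The manual check amounted to looking at which plan hash value each AWR snapshot had recorded for the statement - roughly this sort of query, with a placeholder sql_id:

  SELECT sn.snap_id,
         sn.end_interval_time,
         st.plan_hash_value,
         st.executions_delta,
         st.buffer_gets_delta
  FROM   dba_hist_sqlstat  st,
         dba_hist_snapshot sn
  WHERE  st.snap_id         = sn.snap_id
  AND    st.dbid            = sn.dbid
  AND    st.instance_number = sn.instance_number
  AND    st.sql_id          = 'abcd1234efgh5'   -- placeholder sql_id
  ORDER  BY sn.snap_id;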

We shifted over to a test system and found a couple of key values - one that would access data older than the stats and one that would access newer data. The shared pool was flushed before each test to force Oracle to 'forget' any previous decisions about the plan. We found that both the older and the newer key values chose the index plan, in spite of the lack of histogram data for the newer key value.
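
The test harness was nothing more sophisticated than the following sketch - the table, column and key values here are made up, and the real query was rather bigger:

  SET SERVEROUTPUT OFF   -- so DISPLAY_CURSOR reports the query, not a DBMS_OUTPUT call

  ALTER SYSTEM FLUSH SHARED_POOL;

  VARIABLE key NUMBER
  EXEC :key := 1001;   -- key whose data is older (or newer) than the stats (placeholder)

  SELECT /* peek_test */ MAX(payload_col)   -- payload_col: placeholder non-indexed column
  FROM   my_table
  WHERE  parent_id = :key;

  -- show the plan the cursor actually used; in our tests both keys showed the index plan here
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);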

We then looked in more detail at the query and how the application used it. We found that it was being used to query a relationship with very variable cardinality: sometimes it would find just 2 or 3 related objects; other times (by design) it would find tens of thousands or even hundreds of thousands. We then tested key values that would find low cardinality against ones that would find high cardinality. When we picked a high cardinality key we found that the optimizer chose the table scan plan - so there was the initial answer to the question…

  • Q: 'why does a query change plans when the stats have not changed?'
  • A: 'because the optimizer uses bind variable peeking and can quite correctly choose different plans when different key values result in large differences in cardinality'
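
The test that showed this was the same harness as above pointed at a high cardinality key (placeholders again):

  SET SERVEROUTPUT OFF

  ALTER SYSTEM FLUSH SHARED_POOL;

  VARIABLE key NUMBER
  EXEC :key := 2002;   -- key with hundreds of thousands of related rows (placeholder)

  SELECT /* peek_test */ MAX(payload_col)
  FROM   my_table
  WHERE  parent_id = :key;

  -- in our tests this now showed a full table scan instead of the index range scan
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);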

The snag was that our earlier AWR checks also meant that this was not the full explanation of our problem - if bind variable peeking were the full explanation then we should have seen both plans being used during each snapshot.

We did some more tests and discovered that if we ran the high cardinality test first and then ran the low cardinality key immediately afterwards (without flushing the shared pool), the low cardinality key would also use the table scan plan (and run slowly). We could also reverse the order of the two tests and cause the high cardinality key to use the index lookup plan.
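
In test form that was roughly the following (placeholders as before); the second execution simply reuses the cursor built for the first, so no fresh peek takes place:

  SET SERVEROUTPUT OFF

  ALTER SYSTEM FLUSH SHARED_POOL;

  VARIABLE key NUMBER

  EXEC :key := 2002;   -- high cardinality key goes first (placeholder)
  SELECT /* peek_test */ MAX(payload_col) FROM my_table WHERE parent_id = :key;
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);   -- in our case: full table scan

  EXEC :key := 1001;   -- low cardinality key, no flush in between (placeholder)
  SELECT /* peek_test */ MAX(payload_col) FROM my_table WHERE parent_id = :key;
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR);   -- same shared plan reused, runs slowly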

The Oracle documentation seems to provide the explanation for this…

http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm

"On subsequent invocations of the cursor, no peeking takes place, and the cursor is shared, based on the standard cursor-sharing criteria, even if subsequent invocations use different bind values"

So basically, if the two executions pass the cursor sharing test (which in our case they would) then bind variable peeking only happens on the first, hard-parsed run of the query, and the optimizer assumes that the resulting plan is good for any subsequent invocations.
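
One way to observe this from the outside is via V$SQL: if only one child cursor exists for the statement, then every execution is sharing the plan that was fixed by whichever bind value was peeked at hard parse time. A sketch, with a placeholder sql_id:

  SELECT sql_id,
         child_number,
         plan_hash_value,
         executions,
         buffer_gets
  FROM   v$sql
  WHERE  sql_id = 'abcd1234efgh5';   -- placeholder sql_id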

So what was happening in our production system was that…
  • Usually the low cardinality query would run and cause the optimizer to choose the index lookup plan.
  • Sometimes the query would be evicted from the library cache by other activity.
  • If the query was not in the library cache and the next run of the query happened to use a high cardinality key, the optimizer would choose the table scan plan.
  • All subsequent invocations of the query would then use the table scan plan until the query was again evicted from the library cache. For low cardinality keys this resulted in the performance issue that we saw.

There are probably lots of potential solutions to this; SQL hints, SQL profiles and the adaptive cursor sharing introduced in Oracle 11 spring to mind. In our case neither upgrading to Oracle 11 nor changing the application was practical, so we created a new composite index that covered more of the 'where' clause and would persuade the optimizer to choose an index lookup regardless of whether the cardinality was high or low.
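
A sketch of that kind of fix, with hypothetical table, column and index names standing in for the real ones:

  -- Hypothetical names: the real index covered additional columns from the
  -- query's 'where' clause so that the index path stayed cheap even for
  -- high cardinality keys.
  CREATE INDEX my_table_parent_ix
    ON my_table (parent_id, status, record_type);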