Thursday, April 10, 2008
Problem with _gby_hash_aggregation_enabled parameter
Here's a tale about an Oracle initialisation parameter...and a lesson we should all take note of...
For three days, my colleagues in the support team on one of the warehouses I'm involved in, were struggling with a piece of code which was exhausting the available temp space and after trying everything they could think of, they asked me to take a look. I must admit I was a little baffled at first because the piece of code in question had been happily running for some time now and every time I'd run it, I'd never noticed that TEMP was anywhere near being exhausted so, whilst I could tell the process had some kind of problem, I was in the dark as to exactly what that problem was.
After trying to break down the large query into several smaller steps, I realised that it was an early step in the query that was exhibiting the problem of temp exhaustion - the step being a pivot of around 300 million measure rows into a pivot target table.
This process runs monthly and it had run without issue on all of the previous 20 occurrences and for roughly the same, if not more rows to pivot...so it didn't appear to be the volume that was causing the problem.
There had been no changes in the code itself for many months and the current version of the code had been run successfully in previous months, so it didn't appear to be a code fault.
I obtained the actual execution path for the statement whilst it was running and it looked reasonable, although looking at it, triggered a thought in my mind...what if something had changed in the database configuration?
Why would I get that thought from looking at the execution path?
Well, a while back, we had received some advice that a line in our init.ora as follows, should be changed to set the feature to TRUE instead of FALSE, so that the feature became active:
_gby_hash_aggregation_enabled = FALSE
This results in a line in the plan that reads:
HASH GROUP BY
instead of
SORT GROUP BY
The parameter was set to FALSE due to a known bug and the issues we'd seen with it, however the recent advice we'd received, indicated that the bug had been resolved at the version level we were on and that by enabling the feature - which enables GROUP BY and Aggregation using a hash scheme - we'd gain a performance boost for certain queries.
So, the DBA team researched the advice and it appeared to be the case, that the bug (4604970) which led to the disabling of the feature was fixed at our version level (10.2.0.3 on HP-UX). We duly turned on the feature in a pre production environment and ran it for a while without noticing any issues. We then enabled it in production and again, for a while, we've not noticed any issues...until now.
After a check back through the logs, it appeared that since the parameter was first enabled, the queries which were now failing, had not been run at all...they had only run prior to the parameter change...so with my suspicions aroused further, I disabled the feature at the session level and reran the process. It completed in a normal time frame and used a small amount of TEMP - hooray!
So, now we have to go back to support to try and understand if the original bug is not quite fixed or whether this is a different scenario...in any event, we're going to disable the feature for now, even though we're only getting problems with the feature on 2 processes out of perhaps thousands.
So, what's the lesson to learn?
Well, quite simply, that you need to have a thorough audit of what configuration changes you've made together with a good audit of the processes you've run so that you can work out what has changed since the last time you successfully ran a process. This gives you a fighting chance of spotting things like the above.
For three days, my colleagues in the support team on one of the warehouses I'm involved in, were struggling with a piece of code which was exhausting the available temp space and after trying everything they could think of, they asked me to take a look. I must admit I was a little baffled at first because the piece of code in question had been happily running for some time now and every time I'd run it, I'd never noticed that TEMP was anywhere near being exhausted so, whilst I could tell the process had some kind of problem, I was in the dark as to exactly what that problem was.
After trying to break down the large query into several smaller steps, I realised that it was an early step in the query that was exhibiting the problem of temp exhaustion - the step being a pivot of around 300 million measure rows into a pivot target table.
This process runs monthly and it had run without issue on all of the previous 20 occurrences and for roughly the same, if not more rows to pivot...so it didn't appear to be the volume that was causing the problem.
There had been no changes in the code itself for many months and the current version of the code had been run successfully in previous months, so it didn't appear to be a code fault.
I obtained the actual execution path for the statement whilst it was running and it looked reasonable, although looking at it, triggered a thought in my mind...what if something had changed in the database configuration?
Why would I get that thought from looking at the execution path?
Well, a while back, we had received some advice that a line in our init.ora as follows, should be changed to set the feature to TRUE instead of FALSE, so that the feature became active:
_gby_hash_aggregation_enabled = FALSE
This results in a line in the plan that reads:
HASH GROUP BY
instead of
SORT GROUP BY
The parameter was set to FALSE due to a known bug and the issues we'd seen with it, however the recent advice we'd received, indicated that the bug had been resolved at the version level we were on and that by enabling the feature - which enables GROUP BY and Aggregation using a hash scheme - we'd gain a performance boost for certain queries.
So, the DBA team researched the advice and it appeared to be the case, that the bug (4604970) which led to the disabling of the feature was fixed at our version level (10.2.0.3 on HP-UX). We duly turned on the feature in a pre production environment and ran it for a while without noticing any issues. We then enabled it in production and again, for a while, we've not noticed any issues...until now.
After a check back through the logs, it appeared that since the parameter was first enabled, the queries which were now failing, had not been run at all...they had only run prior to the parameter change...so with my suspicions aroused further, I disabled the feature at the session level and reran the process. It completed in a normal time frame and used a small amount of TEMP - hooray!
So, now we have to go back to support to try and understand if the original bug is not quite fixed or whether this is a different scenario...in any event, we're going to disable the feature for now, even though we're only getting problems with the feature on 2 processes out of perhaps thousands.
So, what's the lesson to learn?
Well, quite simply, that you need to have a thorough audit of what configuration changes you've made together with a good audit of the processes you've run so that you can work out what has changed since the last time you successfully ran a process. This gives you a fighting chance of spotting things like the above.
Subscribe to Posts [Atom]
