Tag Archives: DW

SCD2s and their effect on the CBO

We’ve got lots of SCD2-type tables in our warehouse and I’ve been wondering how much effect they have on the CBO. Essentially, my concern is that when you query an SCD2 you generally look for records as they were on a specific date – the analysis date, as we call it – so you end up writing a predicate such as:

and [analysis_date] BETWEEN from_date and to_date

Now, how many rows will the optimiser think are going to be returned from the table?
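To make the problem concrete, here is a toy model of the issue (this is not Oracle internals – just the classic textbook behaviour with an unknown bind variable, ignoring bind peeking and histograms, with made-up data): the optimiser falls back to a default selectivity of about 5% for each open-ended inequality and multiplies them together as if from_date and to_date were independent, when in an SCD2 they are anything but.

```python
import datetime as dt
import random

random.seed(1)
DAY = dt.timedelta(days=1)
HIGH = dt.date(9999, 12, 31)  # open-ended "current" rows

# Build a toy SCD2 table: each key has a contiguous chain of versions,
# and the last version of each chain is open-ended (to_date = HIGH).
rows = []
for key in range(1000):
    d = dt.date(2000, 1, 1)
    for _ in range(random.randint(1, 5)):
        end = d + DAY * random.randint(30, 400)
        rows.append((key, d, end - DAY))
        d = end
    key_id, f, _ = rows[-1]
    rows[-1] = (key_id, f, HIGH)

analysis = dt.date(2002, 6, 1)

# Actual rows satisfying: analysis BETWEEN from_date AND to_date
# (exactly one active version per key, because chains have no gaps).
actual = sum(1 for _, f, t in rows if f <= analysis <= t)

# Default-selectivity model: ~5% per inequality, multiplied together
# under an (incorrect) independence assumption.
DEFAULT_SEL = 0.05
estimate = len(rows) * DEFAULT_SEL * DEFAULT_SEL

print(f"{len(rows)} rows; actual matches: {actual}; "
      f"CBO-style estimate: {estimate:.0f}")
```

The estimate comes out at a handful of rows while the true answer is a thousand – a couple of orders of magnitude out, which is exactly the kind of miss that pushes the CBO towards a nested-loop plan it shouldn’t choose.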

I figured that topic might have already been investigated by somebody, so I did a Google search, which produced interesting results.

What a pleasant surprise to find a reference from my own blog (via orafaq) showing as the number one hit… and Mark Rittman shortly after! Unfortunately my own post was on a different matter, and Mark’s was too, so I was still a bit stumped…

After a bit more research I found something from Wolfgang Breitling on this subject which confirmed my thoughts and discussed it very eloquently along with other fallacies of the CBO.

Now – Wolfgang tells us here what the problem is and that there isn’t really a remedy other than using hints or stored outlines to guide the CBO…and who am I to argue!
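Purely as an illustration of the hint approach (every object name below is hypothetical, not from our schema), a join-method hint can override the nested-loop plan that a low cardinality estimate would otherwise produce:

```sql
-- Hypothetical tables: force a hash join into the SCD2 table when the
-- CBO's low estimate for the BETWEEN predicate would pick nested loops.
SELECT /*+ use_hash(s) */
       c.customer_id, s.attribute_value
FROM   customers c
JOIN   customer_scd2 s
  ON   s.customer_id = c.customer_id
WHERE  :analysis_date BETWEEN s.from_date AND s.to_date;
```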

I did think that maybe we could create some interface tables holding all the possible from/to date permutations. When a user queries for a given analysis date, an equality predicate on the interface tables would convert it to pairs of from/to dates, which then get equality-matched to the target SCD2. It kind of works, but it means a lot of work crunching through the interface tables just to avoid the problem of the CBO being unable to work out the selectivity/cardinality and potentially making a bad plan choice. The more permutations there are, the more work it becomes, and in reality the number of permutations seems prohibitive, so I’ve binned that idea for our environment. I might try to catch up with Wolfgang at the UKOUG to discuss this further if I can grab his attention.
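For the record, the binned interface-table idea would have looked something like this (all object names are hypothetical and the SQL is only a sketch):

```sql
-- Pre-compute every (analysis_date, from_date, to_date) combination
-- that actually occurs, so queries can use equality predicates that
-- the CBO can estimate from ordinary column statistics.
CREATE TABLE analysis_date_map AS
SELECT d.analysis_date, s.from_date, s.to_date
FROM   date_dim d
JOIN   customer_scd2 s
  ON   d.analysis_date BETWEEN s.from_date AND s.to_date
GROUP  BY d.analysis_date, s.from_date, s.to_date;

-- Query for one analysis date via equality joins only.
SELECT s.*
FROM   analysis_date_map m
JOIN   customer_scd2 s
  ON   s.from_date = m.from_date
 AND   s.to_date   = m.to_date
WHERE  m.analysis_date = :analysis_date;
```

The cost, as noted above, is maintaining one mapping row per analysis date per SCD2 version, which is what made the permutation count prohibitive.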

Addendum – One of Wolfgang’s suggestions in his paper is to artificially set the stats on the table to some large number, so that even after the optimiser factors the number down in its probability calculation, the result is still large and it consequently chooses hash/merge joins over a nested-loop index-lookup approach. I tried this by setting large values for the table stats on the tables in my query using:

exec dbms_stats.set_table_stats(ownname => 'THE_SCHEMA', tabname => 'THE_TABLE', numrows => 3000000000, numblks => 24000000);

This seemed to work, but I still wasn’t overly keen on it, since it means the optimiser will be coerced for any access to such tables – even when there isn’t a join involved.

My colleague Tank then came up with a better idea. Given that most of our processes run off an “analysis date” which we store in a table, we could create a materialized view of the contents of that table and set its stats to artificially high numrows/numblks values. Since this table is used as the driver of most queries, the inflated cardinality would propagate through the plan, and the optimiser, even after applying heavy reductions for the probability, would still realise that there were a lot of rows to process and choose plans accordingly. It worked a treat.
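A sketch of that approach (object names hypothetical): the materialized view really holds just the analysis date, but advertises a huge row count to the CBO.

```sql
-- Materialise the analysis-date driver table...
CREATE MATERIALIZED VIEW analysis_date_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS SELECT analysis_date FROM analysis_date_param;

-- ...then lie about its size, so the inflated cardinality propagates
-- through the join estimates and steers the CBO towards hash/merge joins.
BEGIN
  dbms_stats.set_table_stats(
    ownname => 'THE_SCHEMA',
    tabname => 'ANALYSIS_DATE_MV',
    numrows => 3000000000,
    numblks => 24000000);
END;
/
```

Because only this one driver object carries the fake stats, the SCD2 tables themselves keep honest statistics, which avoids the coercion-on-every-access problem of the earlier approach.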