Tag Archives: OWB

Parallel downgrades to serial after an OWB 10.1.0.4 upgrade

For the last few days we’ve encountered a problem with our OWB mappings running with wildly varying performance from one day/mapping to the next and we were a little confused as to why…figured it out now and thought I’d blog what we found…

The problem we found was that we were getting lots of “Parallel operations downgraded to serial” – which, given that we’re not using adaptive multi user whilst running our batch suite is indicative of mappings starting up and finding that there are no parallel slaves left whatsoever to play with. This was further confirmed by the fact that there were not many occurrences, in comparison, of downgrades by 25/50/75%…

select name
, value
from v$sysstat
where name like ‘Parallel%’
/

Name Value
——————————————- ——-
Parallel operations not downgraded 18,694
Parallel operations downgraded to serial 476,364
Parallel operations downgraded 75 to 99 pct 15
Parallel operations downgraded 50 to 75 pct 137
Parallel operations downgraded 25 to 50 pct 214
Parallel operations downgraded 1 to 25 pct 85

So why would that be then when the batch suite has been working fine (i.e. getting finished in good time) for quite some time now ?

The obvious thing to ask would be “what has changed ?” but we thought we’d try and look at it from an analysis of database metrics – we’ve got Enterprise Manager and AWR and ASH stuff to look at…so we tried to find where it tells us whether a given job (i.e. a piece of DML SQL in an ETL mapping) ran Parallel or not ? and more importantly with what degree of parallel ? Now, we may be novice EM users but basically we couldn’t find what we wanted which was essentially the contents of V$PX_SESSION at the point in time that the query ran so we could determine the DEGREE (actual achieved) and the REQ_DEGREE (what DOP was asked for) and see if there is any difference or if there is no record at all in the V$PX_SESSION thereby indicating that the query ran serially.

I’m sure there is probably a way of getting this information from EM/AWR/ASH but I can’t find it so after a fruitless search I decided to build a simple capture process which recorded the contents of V$PX_SESSION in a history table and we scheduled an EM job to run it periodically (repeated every minute in our case).

After we’d recorded this information we were able to analyse it by creating a simple view which linked this history data (in MGMT_T_V$PX_SESSION_HISTORY) to some of the ASH views…

CREATE OR REPLACE VIEW mgmt_v_sql_dop_analysis AS
WITH sql_dop AS
(
SELECT /*+ NO_MERGE MATERIALIZE */
DISTINCT hsp.sql_id
FROM dba_hist_sql_plan hsp
, dba_tables t
WHERE hsp.object_owner = t.owner
AND hsp.object_name = t.table_name
AND hsp.object_owner IS NOT NULL
AND t.degree IS NOT NULL
AND LTRIM(RTRIM(t.degree)) != ’1′
)
SELECT /*+ ORDERED USE_HASH(dhs,sd,ps) */
hsh.sql_id
, hsh.sample_time
, TRUNC(hsh.sample_time) ash_sample_date
, dhs.sql_text
, (CASE WHEN sd.sql_id IS NULL
THEN ‘SERIAL JOB’
ELSE ‘PARALLEL JOB’
END) job_type
, (CASE WHEN ps.saddr IS NULL
THEN ‘NOPARALLEL’
ELSE ‘PARALLEL’
END) ran_parallel_flag
, ps.rundate,ps.saddr,ps.sid
, ps.serial#
, ps.qcsid
, ps.qcserial#
, ps.degree
, ps.req_degree
FROM dba_hist_active_sess_history hsh
, dba_hist_sqltext dhs
, sql_dop sd
, mgmt_t_v$px_session_history ps
WHERE ps.qcsid(+) = hsh.session_id
AND ps.qcserial#(+) = hsh.session_serial#
AND hsh.sql_id = dhs.sql_id
AND hsh.sql_id = sd.sql_id(+)
– we started collecting mgmt_t_v$px_session_history
– at this point…
AND hsh.sample_time >=
TO_DATE(’15-JUL-2006′,’DD-MON-YYYY’)
/

…and then run this query:


SELECT a.sql_id
, a.sample_time
, a.ash_sample_date
, a.sql_text
, a.rundate
, a.job_type
, a.ran_parallel_flag
, a.rundate
, a.saddr
, a.sid
, a.serial#
, a.qcsid
, a.qcserial#
, a.degree
, a.req_degree
FROM mgmt_v_sql_dop_analysis a
where ash_sample_date = ’21-JUL-2007′
and job_type=’PARALLEL JOB’
and (ran_parallel_flag = ‘NOPARALLEL’
or req_degree != degree)

/

which finds jobs which involve tables with a DOP set on them and therefore should be using parallel but which actually ran either without any parallel slaves [i.e. Downgraded to Serial]…and hence were probably really slow… or ran with less than the requested parallel slaves [i.e. Downgraded by some percentage (25/50/75)]…and hence were probably slower than we’d have liked.

This highlighted lots of jobs which were not getting any DOP (as per the high figures of “Parallel operations downgraded to serial”) and a small number of jobs which were being downgraded somewhat – the same information as from V$SYSSTAT but more detailed in terms of which jobs had been affected and how.

So, we’ve identified which jobs were getting downgraded and it was pretty much in line with the elapsed times of those jobs – the more the downgrade the slower they went. Problem is that doesn’t specifically identify why except we noticed that it looked like lots of the mappings were running with far higher DOP than we’d expected (24 v 8) and hence probably why we were exhausting the supply of slaves.

We couldn’t work out why some of these mappings were getting DOP of 24 instead of the 8 we were expecting and had set on the tables. It pointed to the fact that the table level DOP was being ignored or overridden which in turn pointed the finger of suspicion on the mapping code.

So, back to the original question of “What’s changed recently ?”…

Well, we’d undergone an upgrade of OWB from 10.1.0.2 to 10.1.0.4 very recently and to cut a long story short, it seems that the problem has occurred as a result of that. The reason being that 10.1.0.2 OWB corrupted the PARALLEL hints on the mapping and the DML SQL then resorted to using the table level degree which was 8. Unfortunately, the 10.1.0.4 version has fixed this hint corruption issue and some new mappings recently created in this version are now using the hints the mapping specifies which are of the form:

PARALLEL(table_alias DEFAULT)

…which on our box with CPU_COUNT = 24 and PARALLEL_THREADS_PER_CPU = 1 means those mappings now get DOP of 24 instead of the table level DOP 8.

The issue didn’t affect the mappings which were originally created in 10.1.0.2 as they had the corrupted hints still in place but new mappings created in OWB 10.1.0.4 suffered from the problem.

What happens then is that those new mappings when they run in the schedule, ask for and get, far more of the available pool of px slaves than we anticipated and thus the pool of slaves is depleted and later running mappings have no slaves to use thereby running serially.

The solution was to go through all the mappings and remove the PARALLEL hints – instead we’ll be working off table level parallel degree settings exclusively and our batch suite should settle down again.

TRUNCATE command marks previously UNUSABLE indexes as USABLE

Yes, I’m back…but on an adhoc basis as I still don’t have enough time for this really! Yeah – I know, you’re all in the same boat ;-)

I came across this problem on the warehouse the other day and was slightly surprised by what was going on…

We have an OWB mapping which does a TRUNCATE and then INSERT APPEND into a target table. The target table has some bitmap indexes on it. We run a little preprocess before calling the mapping to mark the bitmaps as unusable so it doesn’t maintain them and then we (hope to) get better performance during the INSERT APPEND…we then rebuild the indexes afterwards more efficiently than maintaining them on the fly…sounds simple enough…except that performance was dragging…so I had a quick shufty and observed that the indexes were still marked as USABLE. I asked the guy running the stuff to check the run script and he said “No, we marked them as UNUSABLE before we ran the mapping – I’m sure of it”.

That’s strange I thought…time for some debug…which proved that we were setting the indexes UNUSABLE but somehow by the time the mapping was running they had miraculously been marked as USABLE.

I decided to create a simple test script to run in SQL*Plus which would be doing the same things as the OWB mapping – to try and repeat it at the SQL level and also to prove/disprove whether OWB was having some affect.

It’s a useful thing to do this in any case as it’s helpful if you’re going to end up:

A) Testing it on other platforms/versions
B) Sending it to Oracle Support via an SR.

(Excuse formatting – either Blogger is crap or I don’t understand how to get this to look nice -take your pick)


drop table j4134_test_bmi;

create table j4134_test_bmi(col1 number not null
,col2 number not null);

create unique index jtb_pk_i on j4134_test_bmi(col1);

alter table j4134_test_bmi add constraint jtb_pk primary key(col1) using index;

create bitmap index jtb_bmi on j4134_test_bmi(col2);

select index_name,index_type,status,partitioned from user_indexes where table_name=’J4134_TEST_BMI’;

alter index jtb_bmi unusable;

alter table j4134_test_bmi disable constraint jtb_pk;

alter index jtb_pk_i unusable;

select index_name,index_type,status,partitioned from user_indexes where table_name=’J4134_TEST_BMI’;

truncate table j4134_test_bmi;

select index_name,index_type,status,partitioned from user_indexes where table_name=’J4134_TEST_BMI’;

insert /*+ append */ into j4134_test_bmi(col1,col2)
select rownum
, 100
from (select level l from dual connect by level


commit;

select index_name,index_type,status,partitioned from user_indexes where table_name=’J4134_TEST_BMI’;

A section of the output of the script follows:

Index IndexName Type STATUS PAR
—————————— —— ——– —
JTB_PK_I NORMAL UNUSABLE NO
JTB_BMI BITMAP UNUSABLE NO

2 rows selected.

Elapsed: 00:00:00.06

Table truncated.

Elapsed: 00:00:00.11

Index IndexName Type STATUS PAR
—————————— —— ——– —
JTB_PK_I NORMAL VALID NO
JTB_BMI BITMAP VALID NO

Which shows that immediately after the TRUNCATE command was issued, the previously UNUSABLE indexes were suddenly marked as USABLE!

Why ? It’s not documented anywhere that this would happen…so I created an SR with Oracle Support who initially created a Bug for it…but Development came back to say that it is “Expected behaviour”. The reasoning Support give for this being expected is that when the Table is truncated, the associated Index segments are also truncated…and if they have no data in them then they can’t possibly be UNUSABLE so they get marked as USABLE as a side effect of the process.

It’s actually a reasonable assumption for them (Oracle development) to make to be fair – but it would have helped if it was documented – they’re going to create a Metalink support note to document the issue.

Thanks to Doug for pushing me to blog this – I’ll have to buy you a drink at the OUG Scotland Conference 2006!