For the last few days we’ve encountered a problem with our OWB mappings running with wildly varying performance from one day/mapping to the next and we were a little confused as to why…figured it out now and thought I’d blog what we found…
The problem we found was that we were getting lots of “Parallel operations downgraded to serial” – which, given that we’re not using adaptive multi user whilst running our batch suite is indicative of mappings starting up and finding that there are no parallel slaves left whatsoever to play with. This was further confirmed by the fact that there were not many occurrences, in comparison, of downgrades by 25/50/75%…
where name like ‘Parallel%’
Parallel operations not downgraded 18,694
Parallel operations downgraded to serial 476,364
Parallel operations downgraded 75 to 99 pct 15
Parallel operations downgraded 50 to 75 pct 137
Parallel operations downgraded 25 to 50 pct 214
Parallel operations downgraded 1 to 25 pct 85
So why would that be then when the batch suite has been working fine (i.e. getting finished in good time) for quite some time now ?
The obvious thing to ask would be “what has changed ?” but we thought we’d try and look at it from an analysis of database metrics – we’ve got Enterprise Manager and AWR and ASH stuff to look at…so we tried to find where it tells us whether a given job (i.e. a piece of DML SQL in an ETL mapping) ran Parallel or not ? and more importantly with what degree of parallel ? Now, we may be novice EM users but basically we couldn’t find what we wanted which was essentially the contents of V$PX_SESSION at the point in time that the query ran so we could determine the DEGREE (actual achieved) and the REQ_DEGREE (what DOP was asked for) and see if there is any difference or if there is no record at all in the V$PX_SESSION thereby indicating that the query ran serially.
I’m sure there is probably a way of getting this information from EM/AWR/ASH but I can’t find it so after a fruitless search I decided to build a simple capture process which recorded the contents of V$PX_SESSION in a history table and we scheduled an EM job to run it periodically (repeated every minute in our case).
After we’d recorded this information we were able to analyse it by creating a simple view which linked this history data (in MGMT_T_V$PX_SESSION_HISTORY) to some of the ASH views…
CREATE OR REPLACE VIEW mgmt_v_sql_dop_analysis AS
WITH sql_dop AS
SELECT /*+ NO_MERGE MATERIALIZE */
FROM dba_hist_sql_plan hsp
, dba_tables t
WHERE hsp.object_owner = t.owner
AND hsp.object_name = t.table_name
AND hsp.object_owner IS NOT NULL
AND t.degree IS NOT NULL
AND LTRIM(RTRIM(t.degree)) != ’1′
SELECT /*+ ORDERED USE_HASH(dhs,sd,ps) */
, TRUNC(hsh.sample_time) ash_sample_date
, (CASE WHEN sd.sql_id IS NULL
THEN ‘SERIAL JOB’
ELSE ‘PARALLEL JOB’
, (CASE WHEN ps.saddr IS NULL
FROM dba_hist_active_sess_history hsh
, dba_hist_sqltext dhs
, sql_dop sd
, mgmt_t_v$px_session_history ps
WHERE ps.qcsid(+) = hsh.session_id
AND ps.qcserial#(+) = hsh.session_serial#
AND hsh.sql_id = dhs.sql_id
AND hsh.sql_id = sd.sql_id(+)
– we started collecting mgmt_t_v$px_session_history
– at this point…
AND hsh.sample_time >=
…and then run this query:
FROM mgmt_v_sql_dop_analysis a
where ash_sample_date = ’21-JUL-2007′
and job_type=’PARALLEL JOB’
and (ran_parallel_flag = ‘NOPARALLEL’
or req_degree != degree)
which finds jobs which involve tables with a DOP set on them and therefore should be using parallel but which actually ran either without any parallel slaves [i.e. Downgraded to Serial]…and hence were probably really slow… or ran with less than the requested parallel slaves [i.e. Downgraded by some percentage (25/50/75)]…and hence were probably slower than we’d have liked.
This highlighted lots of jobs which were not getting any DOP (as per the high figures of “Parallel operations downgraded to serial”) and a small number of jobs which were being downgraded somewhat – the same information as from V$SYSSTAT but more detailed in terms of which jobs had been affected and how.
So, we’ve identified which jobs were getting downgraded and it was pretty much in line with the elapsed times of those jobs – the more the downgrade the slower they went. Problem is that doesn’t specifically identify why except we noticed that it looked like lots of the mappings were running with far higher DOP than we’d expected (24 v 8) and hence probably why we were exhausting the supply of slaves.
We couldn’t work out why some of these mappings were getting DOP of 24 instead of the 8 we were expecting and had set on the tables. It pointed to the fact that the table level DOP was being ignored or overridden which in turn pointed the finger of suspicion on the mapping code.
So, back to the original question of “What’s changed recently ?”…
Well, we’d undergone an upgrade of OWB from 10.1.0.2 to 10.1.0.4 very recently and to cut a long story short, it seems that the problem has occurred as a result of that. The reason being that 10.1.0.2 OWB corrupted the PARALLEL hints on the mapping and the DML SQL then resorted to using the table level degree which was 8. Unfortunately, the 10.1.0.4 version has fixed this hint corruption issue and some new mappings recently created in this version are now using the hints the mapping specifies which are of the form:
…which on our box with CPU_COUNT = 24 and PARALLEL_THREADS_PER_CPU = 1 means those mappings now get DOP of 24 instead of the table level DOP 8.
The issue didn’t affect the mappings which were originally created in 10.1.0.2 as they had the corrupted hints still in place but new mappings created in OWB 10.1.0.4 suffered from the problem.
What happens then is that those new mappings when they run in the schedule, ask for and get, far more of the available pool of px slaves than we anticipated and thus the pool of slaves is depleted and later running mappings have no slaves to use thereby running serially.
The solution was to go through all the mappings and remove the PARALLEL hints – instead we’ll be working off table level parallel degree settings exclusively and our batch suite should settle down again.