One of those key procedures had to be executed in a kind of loop since many of the transformation steps depend on each other's results.
So although each transformation step was executed in a rather efficient way using Oracle best practices like parallel DML / exchange partition, the iterative approach still meant that the underlying server wasn't used as efficiently as possible, since many of those transformation steps also involved smaller data sets that didn't require all of the available resources.
In order to address that, they recently introduced a more sophisticated version of that key procedure that uses a clever algorithm based on a dependency tree to determine which steps can be executed concurrently because they do not depend on each other.
As part of that optimization they worked out that on their particular hardware they could run as many as 25 transformation steps concurrently, and furthermore up to four of those 25 as parallel execution steps.
So in principle you end up with two queues: one is dequeued by four consumers processing "big" slices via parallel execution, and a second queue is dequeued by approx. 20 consumers, each processing "small" slices using serial execution (it is quite a "big iron" server).
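The post doesn't show how the queuing itself was implemented; purely to illustrate the pattern, here is a minimal sketch using Oracle Advanced Queuing with single-consumer (point-to-point) queues, where each message is dequeued by exactly one of the competing consumers. All object names are made up and the actual implementation may well differ:
-- Hypothetical sketch only - one queue for the "big" slices,
-- an analogous SMALL_SLICE_Q would serve the approx. 20 serial consumers
create or replace type transformation_step_t as object (
  step_name varchar2(128)
);
/
begin
  dbms_aqadm.create_queue_table(
    queue_table        => 'big_slice_qt'
  , queue_payload_type => 'TRANSFORMATION_STEP_T'
  );
  dbms_aqadm.create_queue(
    queue_name  => 'big_slice_q'
  , queue_table => 'big_slice_qt'
  );
  dbms_aqadm.start_queue(queue_name => 'big_slice_q');
end;
/
-- Each of the four "big" consumers runs a dequeue loop like this
-- (termination handling omitted for brevity)
declare
  l_dequeue_options  dbms_aq.dequeue_options_t;
  l_msg_props        dbms_aq.message_properties_t;
  l_msgid            raw(16);
  l_step             transformation_step_t;
begin
  l_dequeue_options.wait := dbms_aq.forever;
  loop
    dbms_aq.dequeue(
      queue_name         => 'big_slice_q'
    , dequeue_options    => l_dequeue_options
    , message_properties => l_msg_props
    , payload            => l_step
    , msgid              => l_msgid
    );
    -- run the dequeued transformation step here, e.g. via parallel DML
    commit;
  end loop;
end;
/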
This allowed them to increase the resource utilization of the available server hardware and resulted in a significant drop in overall runtime - they saved more than 50% of the original runtime, which is quite a good result given that they achieved it on exactly the same hardware!
So far, so very good. Now they got an even "bigger" server with newer CPUs and ran the same database with the same configuration; the only change was the CPUs (which also required a newer version of the operating system, but that is a different story). Since most of the processes are heavy CPU users (due to the design of the system, running them in parallel makes good use of all the available CPU power, since I/O is not the only constraining factor), they expected a further drop in runtime, but to their surprise the key procedure they tested took far longer than on the old hardware.
An analysis showed a really interesting case: it turned out that one of the "weak" points of the system - rather inefficient user-defined PL/SQL functions that perform recursive SQL - went sideways in performance and took almost a hundred times longer per execution than before.
The explanation was rather stunning: although no change to the database - and in particular no code change to the user-defined functions - had been introduced, the recursive queries issued by the user-defined functions suddenly started to be executed in parallel using parallel query, which is a rather bad idea when executed 20 million times within an hour or so.
The interesting side effect was that on the operating system side kernel time could be seen to increase to more than 50% of the available CPU time, bringing the whole system more or less to a grinding halt.
Now the big question was: how could this have happened? It had never happened before on any of the previous systems, yet the same code / database on the new system showed this weird behaviour.
The first surprising finding was that the recursive queries showed a parallel execution plan when explaining a plan for them - simply because the underlying tables and indexes were marked as parallel. This was an obvious bug in the code that had apparently been there from the beginning but hadn't surfaced yet.
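Whether objects are marked that way can be verified in the data dictionary, for example:
-- DEGREE > 1 (or DEFAULT) makes statements against these objects
-- eligible for parallel execution
select table_name, degree from user_tables;
select index_name, degree from user_indexes;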
So the question became: why had this not happened all the time in the past?
Looking further, it turned out that the new CPUs resulted in a doubled CPU_COUNT being reported by the operating system to Oracle - and they had used a DEFAULT parallel degree for their parallel executions.
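With a DEFAULT degree the DOP is essentially derived as CPU_COUNT * PARALLEL_THREADS_PER_CPU (per instance), so a doubled CPU_COUNT directly doubles the default DOP. The relevant settings can be checked like this:
-- Default DOP is essentially CPU_COUNT * PARALLEL_THREADS_PER_CPU per instance
select name, value
from v$parameter
where name in ('cpu_count', 'parallel_threads_per_cpu', 'parallel_max_servers');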
This obviously resulted in a doubled degree of parallelism (DOP) for the parallel executions, while they still used the same instance configuration - in particular the same PARALLEL_MAX_SERVERS setting.
So what happened was that, using the same approach as described above (four consumers for the "big" queue), they simply ran out of parallel servers, resulting in severe downgrades for some of the intended parallel executions - some of them were even downgraded to serial.
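Such downgrades leave traces in V$SYSSTAT, so a simple query like the following shows whether and how severely parallel operations have been downgraded since instance startup:
select name, value
from v$sysstat
where name like 'Parallel operations%';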
And here these two issues together allowed the bug above to surface: once a parallel execution was downgraded to serial, the recursive query became eligible for parallel execution. Since the user-defined functions were marked PARALLEL_ENABLE, they usually were pushed into the parallel slaves and executed there. Oracle recognizes execution inside the slaves and implicitly prevents a recursive parallel execution - you could call this a limitation or restriction, but in this case it prevented the bug from showing up.
So, in summary the following happened:
- In the past the old implementation ran at most one parallel execution at a time, so due to the restrictions enforced via the Resource Manager for other sessions no serial downgrade was possible; the recursive query was pushed into the slaves and executed serially there - no problems so far
- A new implementation was introduced that used a clever queuing approach. As part of that queuing, multiple parallel executions happened simultaneously, but due to the carefully chosen settings these were not downgraded, at least not to serial, and the recursive query ran as intended in the parallel slaves - still no problems
- New hardware was introduced, doubling the CPU_COUNT and therefore doubling the default parallel degree, and bingo: with the same carefully chosen settings, parallel executions were now downgraded to serial due to the doubled DOP, and the recursive query eventually started to be executed in parallel - performance went south
After fixing the issue in the code by explicitly forcing serial execution of the recursive query, the problem didn't show up anymore, even with the unadjusted settings that still caused serial downgrades.
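The post doesn't show the exact code change, but a sketch of such a fix could simply hint the recursive query to run serially - for example, using the tables of the testcase below (the particular hints chosen here are an assumption, not necessarily the original fix):
-- Sketch of the fix: explicitly force serial execution of the recursive query
select /*+ no_parallel(f) no_parallel_index(f) no_parallel(d) */
       f.large_vc
into   s_value
from   t_driver d, t_parallel_index f
where  d.id = p_d_id
and    f.pkey = d.pkey
and    f.id = p_f_id;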
What can be taken away from this?
- Oracle treats things executed in parallel slaves differently from things executed in normal sessions
- Using a DEFAULT parallel degree can lead to interesting effects when modifying the underlying hardware / configuration
- Even if your code has run without problems for years, that doesn't mean there isn't a bug waiting for you
Just out of curiosity, here is a generic testcase that can be used to reproduce the issue at will:
-----------------------------
-- The complete testcase
-----------------------------
-- Tested with the following settings:
-- NO RAC
-- PARALLEL_MAX_SERVERS = 4
-- PARALLEL_ADAPTIVE_MULTI_USER = TRUE
-- PARALLEL_AUTOMATIC_TUNING = FALSE
-- If the above parameters are different, the setup needs to be adjusted
-----------------------------
-- The setup
-----------------------------
alter system set parallel_max_servers = 4 parallel_adaptive_multi_user = true scope = memory;
drop package pkg_recursive_query;
drop table t_parallel_index purge;
drop table t_driver purge;
create table t_parallel_index (
id integer not null,
large_vc varchar2(200),
pkey integer not null
)
partition by range (pkey)
(
partition pkey_0 values less than (1),
partition pkey_1 values less than (2),
partition pkey_2 values less than (3)
);
create index t_parallel_index_idx on t_parallel_index (id) local parallel 2;
exec dbms_random.seed(0)
insert into t_parallel_index (
id
, large_vc
, pkey
)
select
rownum as id
, dbms_random.string('X', 100)
, 1
from
dual
connect by
level <= 10000;
commit;
insert into t_parallel_index (
id
, large_vc
, pkey
)
select
rownum as id
, dbms_random.string('X', 100)
, 2
from
dual
connect by
level <= 10000;
commit;
exec dbms_stats.gather_table_stats(null, 't_parallel_index')
explain plan for
select
large_vc
from
t_parallel_index
where
id = 2;
-- This is going to be a parallel index range scan by default
select * from table(dbms_xplan.display);
create table t_driver (
id integer not null,
pkey integer not null
);
alter table t_driver add constraint pk_t_driver primary key (id);
insert into t_driver (
id
, pkey
) values (
100
, 1
);
insert into t_driver (
id
, pkey
) values (
200
, 2
);
commit;
exec dbms_stats.gather_table_stats(null, 't_driver')
explain plan for
select /*+ opt_param('parallel_execution_enabled', 'true') */
f.large_vc
from
t_driver d
, t_parallel_index f
where
d.id = 100
and f.pkey = d.pkey
and f.id = 2;
-- This is again going to be a parallel index range scan by default
select * from table(dbms_xplan.display);
-- The user-defined function performing recursive SQL
create or replace package pkg_recursive_query
as
function get_value(p_f_id in integer, p_d_id in integer) return varchar2 parallel_enable;
end pkg_recursive_query;
/
create or replace package body pkg_recursive_query
as
function get_value(p_f_id in integer, p_d_id in integer) return varchar2 parallel_enable
as
s_value varchar2(200);
begin
select /*+ opt_param('parallel_execution_enabled', 'true') */
/* note: the following is a plain comment, not a hint (the '+' is missing), so parallel execution of the recursive query is not suppressed */
/* no_parallel_index(f) */
f.large_vc
into
s_value
from
t_driver d
, t_parallel_index f
where
d.id = p_d_id
and f.pkey = d.pkey
and f.id = p_f_id;
return 'SID: ' || sys_context('USERENV', 'SID') || '|' || s_value;
exception
when NO_DATA_FOUND then
s_value := 'SID: ' || sys_context('USERENV', 'SID') || '|<NO DATA FOUND>' ;
return s_value;
end get_value;
end pkg_recursive_query;
/
-----------------------------
-- The steps to perform
-----------------------------
-- First session
-- The main query is executed in parallel
-- Due to PARALLEL_ENABLE the recursive query is pushed into the parallel slaves
-- Being executed in the slaves forces serial execution of the recursive query
-- Keep the result set open to allocate the parallel slaves
set echo on pause on arraysize 1 pagesize 3 linesize 120
column result format a50 truncated
select /*+ parallel(t, 2) full(t) */
pkg_recursive_query.get_value(t.id, 100) as result
from
t_parallel_index t
where
id <= 50;
-- Second session
-- Simply allocate the second set of parallel slaves
-- Now all parallel slaves are allocated
-- Any further request for parallel execution will be downgraded to serial
set echo on
variable c refcursor
begin
open :c for '
select /*+ parallel(t, 2) full(t) */
*
from
t_parallel_index t
where
id <= 50
';
end;
/
-- Third session
-- The main query is now downgraded to serial
-- Now that the main query is serialized the recursive query potentially can get executed in parallel
-- But no parallel slaves are available so it shares the serial execution with the existing child cursor in the library cache
-- Keep the result set open
set echo on pause on arraysize 1 pagesize 3 linesize 120
column result format a50 truncated
select /*+ parallel(t, 2) full(t) */
pkg_recursive_query.get_value(t.id, 100) as result
from
t_parallel_index t
where
id <= 50;
-- Fourth session forces invalidation of child cursor for recursive query
alter table t_parallel_index truncate partition pkey_0;
-- Now close the second session to free parallel slaves
-- Further executions of the recursive subquery of session 3 will be performed in parallel
-- as long as parallel slaves are available; otherwise it will be downgraded to serial
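While stepping through the sessions, the slave allocation (and the downgrades, visible as REQ_DEGREE greater than DEGREE) can be watched from a separate monitoring session, for example:
-- Monitor parallel slave allocation
-- REQ_DEGREE > DEGREE indicates a downgraded execution
select qcsid, sid, server_group, server_set, degree, req_degree
from v$px_session
order by qcsid, server_group, server_set;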