When certain SQL features that force non-parallel evaluation are used in pre-12c versions, like ROWNUM or certain Analytic Functions like LAG/LEAD, then - depending on the overall plan shape - Oracle can start to decompose the parallel execution plan into several so-called DFO trees (if you want to learn more about DFO trees and DFOs I recommend watching my online tutorial on my YouTube channel).
Now having multiple DFO trees in a single parallel execution plan comes with several side effects that are confusing and complicate matters unnecessarily: each DFO tree allocates its own PX slave set(s), so each one can potentially end up with a different DOP, which means you can have more than one DOP in a single parallel execution plan. Depending on the overall plan shape a DFO tree can also get started multiple times, and each time it is started / completed PX slaves need to be allocated and de-allocated, potentially causing significant coordination overhead that is not directly related to the actual execution.
Having multiple DFO trees can also lead to a situation where (a lot) more PX slaves are allocated than expected, in case multiple DFO trees are active at the same time - which means that if you believe you can limit the number of PX slaves allocated by a single parallel execution using Resource Manager directives, you might be wrong.
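As a quick way to see this for yourself, a query along these lines, run from a second session while the parallel statement executes, shows the slave sets and the actual DOP per DFO tree - the SERVER_GROUP column of V$PX_SESSION corresponds to the DFO tree, and &qc_sid is just a placeholder for the Query Coordinator's session id (a sketch, not part of the original test case):

select
        server_group                    -- corresponds to the DFO tree
      , server_set                      -- PX slave set within the DFO tree
      , count(*) as num_slaves
      , max(degree) as actual_dop
      , max(req_degree) as requested_dop
from
        v$px_session
where
        qcsid = &qc_sid
and     sid != qcsid                    -- exclude the Query Coordinator itself
group by
        server_group
      , server_set
order by
        server_group
      , server_set
;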
Since all these are undesirable side effects, starting with release 12c Oracle has put effort into new features that minimize the need for such a decomposition into multiple DFO trees. One of these new features is the so-called "1 SLAVE" distribution method that can be used if such a non-parallel evaluation is required. Quite similar to the recently described "PX SELECTOR" operator, the "1 SLAVE" distribution uses a single PX slave out of a slave set to execute the non-parallel operations instead of the Query Coordinator. The main difference this makes is that the parallel and serial operations now remain part of the same DFO tree, instead of having parts of the execution plan executed by the Query Coordinator and different DFO trees before and after such serial operations.
Let's have a look at a simple example to demonstrate the new feature. Here I use three identical tables, just to be able to differentiate them in the plan output - in principle re-using a single table three times would be sufficient.
create table t2
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;
exec dbms_stats.gather_table_stats(null, 't2')
create table t4
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;
exec dbms_stats.gather_table_stats(null, 't4')
create table t6
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
         connect by level <= 100000) a
      , (select /*+ cardinality(20) */ * from dual
         connect by level <= 20) b
;
exec dbms_stats.gather_table_stats(null, 't6')
explain plan for
select /*+ no_merge(x) */
        *
from
        (
          select /*+ parallel(t6 4)
                     --optimizer_features_enable('11.2.0.4')
                 */
                  *
          from
                  (
                    select /*+ parallel(t2 4) */
                            --lag(id) over (order by id) as v1_rn
                            rownum as v1_rn
                          , t2.id as v1_id
                          , t2.filler as v1_filler
                    from
                            t2
                  ) v1
                , (
                    select /*+ parallel(t4 2) */
                            --lag(id) over (order by id) as v2_rn
                            rownum as v2_rn
                          , t4.id as v2_id
                          , t4.filler as v2_filler
                    from
                            t4
                  ) v2
                , t6
          where
                  v1_id = v2_id
          and     v1_id = t6.id
        ) x
where
        rownum > 1
;
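The plan outputs below come from DBMS_XPLAN.DISPLAY; the exact format options are an assumption on my part, but something along these lines produces the columns shown:

select * from table(dbms_xplan.display(format => 'BASIC +PARALLEL +PREDICATE'));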
-- 11.2.0.4 plan shape
----------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ30002 | Q3,02 | P->S | QC (RAND) |
| 5 | VIEW | | Q3,02 | PCWP | |
|* 6 | HASH JOIN | | Q3,02 | PCWP | |
| 7 | PX RECEIVE | | Q3,02 | PCWP | |
| 8 | PX SEND HASH | :TQ30001 | Q3,01 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q3,01 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q3,01 | PCWP | |
| 11 | BUFFER SORT | | Q3,02 | PCWC | |
| 12 | PX RECEIVE | | Q3,02 | PCWP | |
| 13 | PX SEND HASH | :TQ30000 | | S->P | HASH |
|* 14 | HASH JOIN | | | | |
| 15 | VIEW | | | | |
| 16 | COUNT | | | | |
| 17 | PX COORDINATOR | | | | |
| 18 | PX SEND QC (RANDOM)| :TQ10000 | Q1,00 | P->S | QC (RAND) |
| 19 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 20 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 21 | VIEW | | | | |
| 22 | COUNT | | | | |
| 23 | PX COORDINATOR | | | | |
| 24 | PX SEND QC (RANDOM)| :TQ20000 | Q2,00 | P->S | QC (RAND) |
| 25 | PX BLOCK ITERATOR | | Q2,00 | PCWC | |
| 26 | TABLE ACCESS FULL| T4 | Q2,00 | PCWP | |
----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
14 - access("V1_ID"="V2_ID")
-- 12.1.0.2 plan shape
---------------------------------------------------------------------------------
| Id | Operation | Name | TQ |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | |
| 1 | COUNT | | | | |
|* 2 | FILTER | | | | |
| 3 | PX COORDINATOR | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10004 | Q1,04 | P->S | QC (RAND) |
| 5 | VIEW | | Q1,04 | PCWP | |
|* 6 | HASH JOIN BUFFERED | | Q1,04 | PCWP | |
| 7 | PX RECEIVE | | Q1,04 | PCWP | |
| 8 | PX SEND HASH | :TQ10002 | Q1,02 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | Q1,02 | PCWC | |
| 10 | TABLE ACCESS FULL | T6 | Q1,02 | PCWP | |
| 11 | PX RECEIVE | | Q1,04 | PCWP | |
| 12 | PX SEND HASH | :TQ10003 | Q1,03 | S->P | HASH |
|* 13 | HASH JOIN BUFFERED | | Q1,03 | SCWC | |
| 14 | VIEW | | Q1,03 | SCWC | |
| 15 | COUNT | | Q1,03 | SCWP | |
| 16 | PX RECEIVE | | Q1,03 | SCWP | |
| 17 | PX SEND 1 SLAVE | :TQ10000 | Q1,00 | P->S | 1 SLAVE |
| 18 | PX BLOCK ITERATOR | | Q1,00 | PCWC | |
| 19 | TABLE ACCESS FULL| T2 | Q1,00 | PCWP | |
| 20 | VIEW | | Q1,03 | SCWC | |
| 21 | COUNT | | Q1,03 | SCWP | |
| 22 | PX RECEIVE | | Q1,03 | SCWP | |
| 23 | PX SEND 1 SLAVE | :TQ10001 | Q1,01 | P->S | 1 SLAVE |
| 24 | PX BLOCK ITERATOR | | Q1,01 | PCWC | |
| 25 | TABLE ACCESS FULL| T4 | Q1,01 | PCWP | |
---------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(ROWNUM>1)
6 - access("V1_ID"="T6"."ID")
13 - access("V1_ID"="V2_ID")
Let's start with the 11.2.0.4 plan shape: we can see from the multiple occurrences of the PX COORDINATOR operator, the TQ column and the corresponding TQ entries in the Name column that this parallel execution plan consists of three DFO trees. When analyzing the actual run time activity it would also become obvious that the different DFO trees run at different DOPs and use different PX slave sets. In this particular case it would even become obvious that two DFO trees are active at the same time - so it pretty much demonstrates all the undesirable side effects of having multiple DFO trees mentioned above.
The HASH JOIN between V1 and V2 runs serially, whereas the HASH JOIN between T6 and the join result of V1 and V2 runs in parallel. Since there is a Serial->Parallel distribution between these two HASH JOINs, an additional BUFFER SORT operation gets added - as outlined in the PX SELECTOR note.
If we now look at the 12.1.0.2 plan shape we notice that the execution plan consists of a single DFO tree, and hence all the side effects outlined above are gone. The serial evaluation is done by a single PX slave of one slave set, again indicated by the SCWC/SCWP decorators in the IN-OUT column, similar to the PX SELECTOR operator.
However, the plan shape also demonstrates one possible disadvantage of the new feature: since everything is now part of a single DFO tree we end up with more re-distributions per DFO tree, and hence potentially an increased demand for the dreaded BUFFERing to prevent multiple re-distributions from being active at the same time. Here this can be seen at operation Id 13: the HASH JOIN executed serially by one PX slave is turned into a HASH JOIN BUFFERED (the worst case for a (not really) "parallel" hash join - executed by a single PX slave only and additionally having to buffer the right hand row source), and the second HASH JOIN is turned into its BUFFERED variant, too. Compared to the 11.2.0.4 plan shape, which contains no HASH JOIN BUFFERED but an additional BUFFER SORT operation, the 12.1.0.2 plan shape with this data pattern therefore requires more PGA / TEMP space due to the double buffering now necessary.
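To verify the difference in PGA / TEMP usage you could, for example, monitor the active workareas while each variant runs - a sketch, assuming you know the SQL_ID of the statement ('&sql_id' is just a placeholder):

select
        operation_type
      , operation_id
      , round(actual_mem_used / 1048576) as pga_mb
      , round(nvl(tempseg_size, 0) / 1048576) as temp_mb
from
        v$sql_workarea_active
where
        sql_id = '&sql_id'
order by
        operation_id
;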
Footnote
The new 1 SLAVE distribution isn't always used in 12c. If you run just the join between V1 and V2, for example, the old plan shape will be used, and there are again multiple DFO trees (see the sketch below). Furthermore, in this particular case, when you start changing the DOP used in the PARALLEL hints you might also end up with a plan shape where one view uses the 1 SLAVE distribution whereas the other one uses the old plan shape with Query Coordinator activity - I haven't investigated further why this happens.
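For reference, the reduced test case could look like this - just a sketch derived from the statement above by dropping T6 and the corresponding join predicate:

explain plan for
select /*+ no_merge(x) */
        *
from
        (
          select
                  *
          from
                  (
                    select /*+ parallel(t2 4) */
                            rownum as v1_rn
                          , t2.id as v1_id
                          , t2.filler as v1_filler
                    from
                            t2
                  ) v1
                , (
                    select /*+ parallel(t4 2) */
                            rownum as v2_rn
                          , t4.id as v2_id
                          , t4.filler as v2_filler
                    from
                            t4
                  ) v2
          where
                  v1_id = v2_id
        ) x
where
        rownum > 1
;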
If Analytic Functions are used, you might also see a "1 SLAVE (ORDER)" variation of the distribution that enforces a certain order when re-distributing the data, similar to the "PX SEND QC (ORDER)" operator.