Monday, June 22, 2015

12c Parallel Execution New Features: 1 SLAVE distribution

When certain SQL features that force non-parallel evaluation get used in pre-12c versions, like ROWNUM or certain Analytic Functions like LAG/LEAD, then - depending on the overall plan shape - Oracle can decompose the parallel execution plan into several so-called DFO trees (if you want to learn more about DFO trees and DFOs I recommend watching my online tutorial on my Youtube channel).

Having multiple DFO trees in a single parallel execution plan comes with several side effects that are confusing and complicate matters unnecessarily: each DFO tree allocates its own PX slave set(s), so each one can potentially end up with a different DOP, which means you can have more than one DOP in a single parallel execution plan.

Depending on the overall plan shape this might also mean that a DFO tree gets started multiple times, and each time it is started / completed PX slaves need to be allocated and de-allocated, potentially causing significant coordination overhead that is not directly related to the actual execution.

Having multiple DFO trees can also lead to a situation where (a lot) more PX slaves are allocated than expected, in case multiple DFO trees are active at the same time - which means that if you believe you can limit the number of PX slaves allocated by a single parallel execution using Resource Manager directives, you might be wrong.
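
One way to verify at run time how many PX slaves actually get allocated - and at what DOP per slave set - is to query V$PX_SESSION from a separate session while the statement is running. A simple sketch (the grouping by SERVER_GROUP assumes that each DFO tree maps to its own server group):

select
        qcsid
      , server_group
      , server_set
      , req_degree
      , degree
      , count(*) as num_slaves
from
        v$px_session
where
        sid <> qcsid
group by
        qcsid
      , server_group
      , server_set
      , req_degree
      , degree
order by
        qcsid
      , server_group
      , server_set
;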

Since all these are undesirable side effects, starting with release 12c Oracle has put effort into new features that minimize the need for such a decomposition into multiple DFO trees. One of these new features is the so-called "1 SLAVE" distribution method that can get used when such a non-parallel evaluation is required.

Quite similar to the recently described "PX SELECTOR" operator, the "1 SLAVE" distribution uses a single PX slave out of a slave set to execute the non-parallel operations instead of the Query Coordinator. The main difference this makes is that the parallel and serial operations are now still part of the same DFO tree, instead of having parts of the execution plan executed by the Query Coordinator and different DFO trees before and after such serial operations.

Let's have a look at a simple example to demonstrate the new feature. I use three identical tables here, just for the sake of being able to differentiate the tables in the plan output - in principle re-using a single table three times would be sufficient.
create table t2
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't2')

create table t4
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't4')

create table t6
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't6')

explain plan for
select /*+ no_merge(x) */
       *
from
     (
       select /*+ parallel(t6 4)
                  --optimizer_features_enable('11.2.0.4')
              */
               *
       from
            (
              select /*+ parallel(t2 4) */
                     --lag(id) over (order by id)    as v1_rn
                     rownum    as v1_rn
                   , t2.id     as v1_id
                   , t2.filler as v1_filler
              from
                     t2
            ) v1
          , (
              select /*+ parallel(t4 2) */
                     --lag(id) over (order by id)    as v2_rn
                     rownum    as v2_rn
                   , t4.id     as v2_id
                   , t4.filler as v2_filler
              from
                     t4
            ) v2
          , t6
       where
              v1_id = v2_id
       and    v1_id = t6.id
     ) x
where
      rownum > 1
;
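
The plan output below can be displayed with the usual DBMS_XPLAN call - something along these lines (the exact format options are just one possible choice that produces the parallel-related columns shown):

select * from table(dbms_xplan.display(format => 'BASIC +PARALLEL +PREDICATE'));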

-- 11.2.0.4 plan shape
----------------------------------------------------------------------------------
| Id  | Operation                        | Name     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |        |      |            |
|   1 |  COUNT                           |          |        |      |            |
|*  2 |   FILTER                         |          |        |      |            |
|   3 |    PX COORDINATOR                |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)          | :TQ30002 |  Q3,02 | P->S | QC (RAND)  |
|   5 |      VIEW                        |          |  Q3,02 | PCWP |            |
|*  6 |       HASH JOIN                  |          |  Q3,02 | PCWP |            |
|   7 |        PX RECEIVE                |          |  Q3,02 | PCWP |            |
|   8 |         PX SEND HASH             | :TQ30001 |  Q3,01 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR       |          |  Q3,01 | PCWC |            |
|  10 |           TABLE ACCESS FULL      | T6       |  Q3,01 | PCWP |            |
|  11 |        BUFFER SORT               |          |  Q3,02 | PCWC |            |
|  12 |         PX RECEIVE               |          |  Q3,02 | PCWP |            |
|  13 |          PX SEND HASH            | :TQ30000 |        | S->P | HASH       |
|* 14 |           HASH JOIN              |          |        |      |            |
|  15 |            VIEW                  |          |        |      |            |
|  16 |             COUNT                |          |        |      |            |
|  17 |              PX COORDINATOR      |          |        |      |            |
|  18 |               PX SEND QC (RANDOM)| :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|  19 |                PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  20 |                 TABLE ACCESS FULL| T2       |  Q1,00 | PCWP |            |
|  21 |            VIEW                  |          |        |      |            |
|  22 |             COUNT                |          |        |      |            |
|  23 |              PX COORDINATOR      |          |        |      |            |
|  24 |               PX SEND QC (RANDOM)| :TQ20000 |  Q2,00 | P->S | QC (RAND)  |
|  25 |                PX BLOCK ITERATOR |          |  Q2,00 | PCWC |            |
|  26 |                 TABLE ACCESS FULL| T4       |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(ROWNUM>1)
   6 - access("V1_ID"="T6"."ID")
  14 - access("V1_ID"="V2_ID")

-- 12.1.0.2 plan shape
---------------------------------------------------------------------------------
| Id  | Operation                       | Name     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |        |      |            |
|   1 |  COUNT                          |          |        |      |            |
|*  2 |   FILTER                        |          |        |      |            |
|   3 |    PX COORDINATOR               |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)         | :TQ10004 |  Q1,04 | P->S | QC (RAND)  |
|   5 |      VIEW                       |          |  Q1,04 | PCWP |            |
|*  6 |       HASH JOIN BUFFERED        |          |  Q1,04 | PCWP |            |
|   7 |        PX RECEIVE               |          |  Q1,04 | PCWP |            |
|   8 |         PX SEND HASH            | :TQ10002 |  Q1,02 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR      |          |  Q1,02 | PCWC |            |
|  10 |           TABLE ACCESS FULL     | T6       |  Q1,02 | PCWP |            |
|  11 |        PX RECEIVE               |          |  Q1,04 | PCWP |            |
|  12 |         PX SEND HASH            | :TQ10003 |  Q1,03 | S->P | HASH       |
|* 13 |          HASH JOIN BUFFERED     |          |  Q1,03 | SCWC |            |
|  14 |           VIEW                  |          |  Q1,03 | SCWC |            |
|  15 |            COUNT                |          |  Q1,03 | SCWP |            |
|  16 |             PX RECEIVE          |          |  Q1,03 | SCWP |            |
|  17 |              PX SEND 1 SLAVE    | :TQ10000 |  Q1,00 | P->S | 1 SLAVE    |
|  18 |               PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  19 |                TABLE ACCESS FULL| T2       |  Q1,00 | PCWP |            |
|  20 |           VIEW                  |          |  Q1,03 | SCWC |            |
|  21 |            COUNT                |          |  Q1,03 | SCWP |            |
|  22 |             PX RECEIVE          |          |  Q1,03 | SCWP |            |
|  23 |              PX SEND 1 SLAVE    | :TQ10001 |  Q1,01 | P->S | 1 SLAVE    |
|  24 |               PX BLOCK ITERATOR |          |  Q1,01 | PCWC |            |
|  25 |                TABLE ACCESS FULL| T4       |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(ROWNUM>1)
   6 - access("V1_ID"="T6"."ID")
  13 - access("V1_ID"="V2_ID")

Let's start with the 11.2.0.4 plan shape: we can see from the multiple occurrences of the PX COORDINATOR operator, the TQ column and the corresponding TQ entries in the NAME column that this parallel execution plan consists of three DFO trees. Analyzing the actual run time activity would also make it obvious that the different DFO trees run at a different DOP and use different PX slave sets. In this particular case it would even become obvious that two DFO trees are active at the same time - so this pretty much demonstrates all the undesirable side effects of having multiple DFO trees mentioned above.
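
One simple way to confirm the number of DFO trees and the PX processes involved at run time is to query V$PQ_TQSTAT in the same session right after actually executing (rather than just explaining) the statement - a sketch:

select
        dfo_number
      , tq_id
      , server_type
      , process
      , num_rows
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type
      , process
;

For the 11.2.0.4 plan shape this should report three different DFO_NUMBER values, corresponding to the Q1 / Q2 / Q3 table queue prefixes above.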

The HASH JOIN between V1 and V2 runs serially, whereas the HASH JOIN between T6 and the join result of V1 and V2 runs in parallel. Since we have a Serial->Parallel distribution between these two HASH JOINs, an additional BUFFER SORT operation gets added - as outlined in the PX SELECTOR note.

If we now look at the 12.1.0.2 plan shape we notice that the execution plan consists of a single DFO tree, and hence all the side effects I've outlined are gone. The serial evaluation is done by a single PX slave of one set, again indicated by the SCWC/SCWP decorator, similar to the PX SELECTOR operator.

However, the plan shape also demonstrates one possible disadvantage of the new feature: since everything is now part of a single DFO tree, we end up with more re-distributions per DFO tree, and hence might have an increased demand for the dreaded BUFFERing that prevents multiple re-distributions from being active at the same time. Here this can be seen at operation ID 13: the HASH JOIN executed serially by one PX slave is turned into a HASH JOIN BUFFERED - the worst case for a (not really) "parallel" hash join, executed by only a single PX slave and in addition having to buffer the right hand row source - and the second HASH JOIN is also turned into its BUFFERED variant. Compared to the 11.2.0.4 plan shape, which contains no HASH JOIN BUFFERED but an additional BUFFER SORT operation, this means that with this data pattern the 12.1.0.2 plan shape actually requires more PGA / TEMP space due to the double buffering now necessary.
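
If you want to compare the actual memory and TEMP footprint of the two plan shapes, one option is to check the workarea statistics after execution, for example like this (the &sql_id placeholder simply needs to be replaced with the SQL_ID of the statement):

select
        operation_id
      , operation_type
      , last_memory_used
      , last_tempseg_size
from
        v$sql_workarea
where
        sql_id = '&sql_id'
order by
        operation_id
;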

Footnote


The new 1 SLAVE distribution isn't always used in 12c. If you run just the join between V1 and V2, for example, then the old plan shape gets used and there are again multiple DFO trees. Furthermore, in this particular case, when you start changing the DOP used in the PARALLEL hints, you might also end up with a plan shape where one view uses the 1 SLAVE distribution whereas the other one uses the old plan shape with Query Coordinator activity - I haven't investigated further why this happens.
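
For reference, the "just the join between V1 and V2" variant mentioned above is simply the statement from the example with T6 and the corresponding join predicate removed:

select /*+ no_merge(x) */
       *
from
     (
       select
              *
       from
            (
              select /*+ parallel(t2 4) */
                     rownum    as v1_rn
                   , t2.id     as v1_id
                   , t2.filler as v1_filler
              from
                     t2
            ) v1
          , (
              select /*+ parallel(t4 2) */
                     rownum    as v2_rn
                   , t4.id     as v2_id
                   , t4.filler as v2_filler
              from
                     t4
            ) v2
       where
              v1_id = v2_id
     ) x
where
      rownum > 1
;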

If Analytic Functions get used, you might also see a "1 SLAVE (ORDER)" variation of the distribution that enforces a certain order when re-distributing the data, similar to the "PX SEND QC (ORDER)" operator.
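
To try this with the example above, you can swap the ROWNUM projections for the commented-out LAG expressions, for instance in V1 (and correspondingly in V2):

              select /*+ parallel(t2 4) */
                     lag(id) over (order by id) as v1_rn
                   , t2.id     as v1_id
                   , t2.filler as v1_filler
              from
                     t2

Depending on the resulting plan shape you might then see "PX SEND 1 SLAVE (ORDER)" instead of "PX SEND 1 SLAVE" for the corresponding re-distribution.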

3 comments:

  1. Hi Christian

    Couple of questions here.

    1. As you said, multiple DFO trees could be undesirable. Is there a way we can control the number of DFO trees on 11.2.0.4 ?

    2. And about this 1 SLAVE new feature which addresses the multiple DFO problem: is there a fix for this that could be backported to 11.2.0.4?

  2. Hi Vishal,

    Randolf here - Christian might be Christian Antognini? But that's not his blog :-)

    Anyway - if you read the blog carefully you'll notice that there are pros and cons having multiple DFO trees - one of the potential drawbacks of having everything in one DFO tree is that you can end up with more "BUFFERED" hash joins.

    What I'm trying to say here is that this isn't black and white and as so often depends on a lot of details which of the two variants might be more desirable and this can change from case to case.

    Having said that, to answer your questions:

    1. There is no direct way to control the number of DFO trees, no matter which version you run. The number of DFO trees depends on the SQL features used and the execution plan generated by the optimizer. You do have some indirect control by replacing one feature with another (for example don't use materialized WITH clauses, don't use certain kinds of Analytic Functions, don't make use of ROWNUM etc.)

    2. The 1 SLAVE feature described here isn't backported to pre-12c as far as I know. You need at least the 12c optimizer code to get support for that feature (which means that using, for example, OPTIMIZER_FEATURES_ENABLE = 11.2.0.4 in 12c and later will prevent the feature).

    Randolf

  3. Really sorry about that Randolf. I was reading couple of blogs around that time - one from Christian Antognini and I probably messed up. Apologies. Thank you for answering the questions.
