Wednesday, December 30, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 3: The Optimizer And Distribution Methods

As mentioned in the first and second part of this instalment the different available distribution methods of the new parallel FILTER are selected automatically by the optimizer - in this last post of this series I want to focus on that optimizer behaviour.

It looks like there are two new optimizer related parameters that control the behaviour of the new feature: "_px_filter_parallelized" is the overall switch to enable/disable the new parallel filter capability - and defaults to "true" in 12c, and "_px_filter_skew_handling" influences how the optimizer determines the distribution methods - the parameter naming suggests that it somehow has to do with some kind of "skew" - note that the internal parameter that handles the new automatic join skew handling is called "_px_join_skew_handling" - rather similar in name.

But even after playing around with the feature for quite a while I couldn't come up with a good test case where the optimizer chose a different distribution method based on the typical data distribution skew patterns - so that the expression used for the FILTER lookup had some more popular values than others. So I got in touch with Yasin Baskan - product manager for Parallel Execution at Oracle, asking what kind of skew is meant to see a difference in behaviour.

As it turns out "skew" means something different in this context here. When the mentioned parameter "_px_filter_skew_handling" is set to "true" (default value in 12c) the optimizer will choose a different distribution method based on the size of object driving the filter. According to my tests this effectively means: If the object is such small that only one granule (usually 13 blocks) per PX slave can be assigned the optimizer will use automatically a HASH distribution, otherwise - if the object is larger than this threshold - no re-distribution will be selected. I wasn't able to come up with an example where the optimizer automatically comes up with the other available distribution method, which is RANDOM / ROUND-ROBIN (see previous post). To demonstrate the point, here is a small example:
create table t2 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't2')

create table t3 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Default plan, no redistribution before parallel FILTER
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------

exec dbms_stats.set_table_stats(null, 't3', numblks => 52)

-- Setting stats of T3 to 52 (13 * DOP) blocks or smaller - HASH distribution will be used, 53 blocks or greater => no redistribution
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |     1 |     5 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |     5 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER               |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE          |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH       | :TQ10000 | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         PX BLOCK ITERATOR |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| T3       | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWP |            |
|* 10 |       TABLE ACCESS FULL   | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------
So this example shows that the HASH distribution will be used by the optimizer if the object T3 driving the FILTER operation is 52 blocks or smaller, which corresponds to 13 blocks per PX slave at a degree of 4.

Now I find this behaviour pretty odd to explain - since usually you wouldn't want to use Parallel Execution on such a small object anyway. But things become even worse: Not only to me the "skew" handling based on the object size is questionable, but the behaviour can become a potential threat if the row source driving the FILTER operator no longer is a plain table but the result of a more complex operation, which can be simply a join or non-mergeable view:
-- Resetting stats to true size of table - this would mean no redistribution at a DOP of 4, see above
exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        (select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- But simply using a NO_MERGE hint on the select from the simple T3 row source results in an unnecessary HASH re-distribution
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |     1 |    13 |  9755K  (1)| 00:06:22 |        |      |            |
|   1 |  SORT AGGREGATE            |          |     1 |    13 |            |          |        |      |            |
|   2 |   PX COORDINATOR           |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)     | :TQ10001 |     1 |    13 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE         |          |     1 |    13 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER                |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE           |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH        | :TQ10000 | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         VIEW               |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |          PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|* 11 |       TABLE ACCESS FULL    | T2       |     1 |     5 |   114   (0)| 00:00:01 |        |      |            |
--------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- If we use a simple join as driving row source again a HASH re-distribution before the FILTER gets added
-- As a result the dreaded HASH JOIN BUFFERED will be used instead of the plain HASH JOIN
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE                 |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)          | :TQ10003 |     1 |    10 |            |          |  Q1,03 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE              |          |     1 |    10 |            |          |  Q1,03 | PCWP |            |
|*  5 |      FILTER                     |          |       |       |            |          |  Q1,03 | PCWP |            |
|   6 |       PX RECEIVE                |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        PX SEND HASH             | :TQ10002 | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  8 |         HASH JOIN BUFFERED      |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |           PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  11 |            STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  12 |             PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  13 |              TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  14 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  15 |           PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  16 |            PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  17 |             TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 18 |       TABLE ACCESS FULL         | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Disabling the FILTER skew handling behaviour means no re-distribution before the FILTER, and hence no HASH JOIN BUFFERED
-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10002 |     1 |    10 |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |          |     1 |    10 |            |          |  Q1,02 | PCWP |            |
|*  5 |      FILTER                   |          |       |       |            |          |  Q1,02 | PCWC |            |
|*  6 |       HASH JOIN               |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|   9 |          STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  10 |           PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  12 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  13 |         PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  14 |          PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  15 |           TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 16 |       TABLE ACCESS FULL       | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------------
So it looks like if the row source driving the parallel FILTER operator is complex (in this case by complex I mean not a simple table) the optimizer will always add a HASH distribution unconditionally before the FILTER. It it obvious that such a re-distribution adds overhead - it requires resources to perform. What is even worse is that in general the rules is: The more redistributions the more likely the dreaded buffering will be added to the execution plans, as can be seen from the example above, where the HASH JOIN turns into a HASH JOIN BUFFERED due to the HASH distribution by default added by the optimizer after the join and before the FILTER. By disabling the filter "skew" handling this in my opinion unnecessary redistribution doesn't show up and hence the HASH JOIN without buffering can be used in this example.

Summary


The new parallel FILTER operator comes with different distribution methods available to the optimizer. However, at present the way the optimizer determines automatically if and how to re-distribute the data seems to be questionable to me.

The skew handling is based on the size of the driving object - for very small objects a re-distribution gets added before the FILTER. For row sources driving the filter that are no simple tables the skew handling seems to add a re-distribution unconditionally.

For the reasons outlined at present I would recommend considering to disable the filter skew handling by setting the parameter "_px_filter_skew_handling" to "false", of course not without getting the blessing from Oracle Support before doing so - this should allow minimising the number of re-distributions added to an execution plan. Losing the capability of handling the "skew" caused by very small objects in my opinion is negligible in most cases.

No comments: