Wednesday, December 30, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 3: The Optimizer And Distribution Methods

As mentioned in the first and second part of this instalment, the different distribution methods available for the new parallel FILTER operator are selected automatically by the optimizer - in this last post of the series I want to focus on that optimizer behaviour.

It looks like there are two new optimizer-related parameters that control the behaviour of the new feature: "_px_filter_parallelized" is the overall switch to enable/disable the new parallel FILTER capability and defaults to "true" in 12c, while "_px_filter_skew_handling" influences how the optimizer determines the distribution method. The parameter naming suggests that it somehow has to do with some kind of "skew" - note that the internal parameter controlling the new automatic join skew handling is called "_px_join_skew_handling", rather similar in name.
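
If you want to verify the settings and defaults of these parameters on your own system, something like the following query can be used - a minimal sketch that has to be run with SYS privileges since it accesses X$ fixed tables:

-- run as SYS; X$ fixed table layout may vary between versions
select a.ksppinm as parameter_name
     , b.ksppstvl as session_value
     , b.ksppstdf as is_default
from   x$ksppi a
     , x$ksppcv b
where  a.indx = b.indx
and    a.ksppinm like '\_px\_filter%' escape '\';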

But even after playing around with the feature for quite a while I couldn't come up with a good test case where the optimizer chose a different distribution method based on the typical data distribution skew patterns - that is, where the expression used for the FILTER lookup had some values much more popular than others. So I got in touch with Yasin Baskan - product manager for Parallel Execution at Oracle - asking what kind of skew is meant here and what it takes to see a difference in behaviour.

As it turns out, "skew" means something different in this context. When the mentioned parameter "_px_filter_skew_handling" is set to "true" (the default value in 12c) the optimizer will choose a different distribution method based on the size of the object driving the filter. According to my tests this effectively means: If the object is so small that only one granule (usually 13 blocks) per PX slave can be assigned, the optimizer will automatically use a HASH distribution; otherwise - if the object is larger than this threshold - no re-distribution will be selected. I wasn't able to come up with an example where the optimizer automatically chooses the other available distribution method, which is RANDOM / ROUND-ROBIN (see the previous post). To demonstrate the point, here is a small example:
create table t2 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't2')

create table t3 as select * from dba_objects where rownum <= 90000;

exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Default plan, no redistribution before parallel FILTER
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------

exec dbms_stats.set_table_stats(null, 't3', numblks => 52)

-- Setting stats of T3 to 52 (13 * DOP) blocks or smaller - HASH distribution will be used, 53 blocks or greater => no redistribution
-------------------------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |     1 |     5 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE           |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR          |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |     1 |     5 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |     1 |     5 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER               |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE          |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH       | :TQ10000 | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         PX BLOCK ITERATOR |          | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| T3       | 90000 |   439K|     5  (20)| 00:00:01 |  Q1,00 | PCWP |            |
|* 10 |       TABLE ACCESS FULL   | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------
So this example shows that the HASH distribution will be used by the optimizer if the object T3 driving the FILTER operation is 52 blocks or smaller, which corresponds to 13 blocks per PX slave at a degree of 4.
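
To predict on which side of the threshold an object falls, you can compare the block count recorded in its statistics against 13 * DOP - a small sanity check, assuming current statistics and a DOP of 4 as in this example (the granule size of 13 blocks is an assumption based on the observations above):

select table_name
     , blocks
     , case
         when blocks <= 13 * 4 then 'HASH distribution expected'
         else 'no re-distribution expected'
       end as prediction
from   user_tables
where  table_name = 'T3';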

Now I find this behaviour pretty odd to explain - since usually you wouldn't want to use Parallel Execution on such a small object anyway. But things become even worse: Not only is the "skew" handling based on the object size questionable to me, but the behaviour can become a potential threat if the row source driving the FILTER operator is no longer a plain table but the result of a more complex operation, which can be as simple as a join or a non-mergeable view:
-- Resetting stats to true size of table - this would mean no redistribution at a DOP of 4, see above
exec dbms_stats.gather_table_stats(null, 't3')

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        (select /*+ no_merge */ a.* from t3 a) t
        --(select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- But simply using a NO_MERGE hint on the select from the simple T3 row source results in an unnecessary HASH re-distribution
--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |     1 |    13 |  9755K  (1)| 00:06:22 |        |      |            |
|   1 |  SORT AGGREGATE            |          |     1 |    13 |            |          |        |      |            |
|   2 |   PX COORDINATOR           |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)     | :TQ10001 |     1 |    13 |            |          |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE         |          |     1 |    13 |            |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER                |          |       |       |            |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE           |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH        | :TQ10000 | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   8 |         VIEW               |          | 90000 |  1142K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |          PX BLOCK ITERATOR |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL| T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|* 11 |       TABLE ACCESS FULL    | T2       |     1 |     5 |   114   (0)| 00:00:01 |        |      |            |
--------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           --opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- If we use a simple join as the driving row source, again a HASH re-distribution gets added before the FILTER
-- As a result the dreaded HASH JOIN BUFFERED will be used instead of the plain HASH JOIN
-------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE                 |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR                |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)          | :TQ10003 |     1 |    10 |            |          |  Q1,03 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE              |          |     1 |    10 |            |          |  Q1,03 | PCWP |            |
|*  5 |      FILTER                     |          |       |       |            |          |  Q1,03 | PCWP |            |
|   6 |       PX RECEIVE                |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   7 |        PX SEND HASH             | :TQ10002 | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  8 |         HASH JOIN BUFFERED      |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |           PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  11 |            STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  12 |             PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  13 |              TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  14 |          PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  15 |           PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  16 |            PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  17 |             TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 18 |       TABLE ACCESS FULL         | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-------------------------------------------------------------------------------------------------------------------------

explain plan for
select /*+ monitor
           parallel(4)
           opt_param('_px_filter_skew_handling' 'false')
        */ count(*) from
        --t3 t
        --(select /*+ no_merge */ a.* from t3 a) t
        (select a.* from t3 a, t3 b where a.object_id = b.object_id) t
        where exists (select /*+ no_unnest */ 1 from t2 where t.object_id=t2.object_id);

-- Disabling the FILTER skew handling behaviour means no re-distribution before the FILTER, and hence no HASH JOIN BUFFERED
-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |     1 |    10 |    32M  (1)| 00:21:13 |        |      |            |
|   1 |  SORT AGGREGATE               |          |     1 |    10 |            |          |        |      |            |
|   2 |   PX COORDINATOR              |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)        | :TQ10002 |     1 |    10 |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE            |          |     1 |    10 |            |          |  Q1,02 | PCWP |            |
|*  5 |      FILTER                   |          |       |       |            |          |  Q1,02 | PCWC |            |
|*  6 |       HASH JOIN               |          | 90000 |   878K|   229   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HYBRID HASH   | :TQ10000 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|   9 |          STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  10 |           PX BLOCK ITERATOR   |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL  | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  12 |        PX RECEIVE             |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  13 |         PX SEND HYBRID HASH   | :TQ10001 | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  14 |          PX BLOCK ITERATOR    |          | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  15 |           TABLE ACCESS FULL   | T3       | 90000 |   439K|   114   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 16 |       TABLE ACCESS FULL       | T2       |     1 |     5 |   412   (1)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------------
So it looks like if the row source driving the parallel FILTER operator is complex (by which I mean here anything that is not a simple table) the optimizer will always add a HASH distribution unconditionally before the FILTER. It is obvious that such a re-distribution adds overhead - it requires resources to perform. What is even worse is that in general the rule is: The more redistributions, the more likely the dreaded buffering will be added to the execution plan, as can be seen from the example above, where the HASH JOIN turns into a HASH JOIN BUFFERED due to the HASH distribution added by default by the optimizer after the join and before the FILTER. By disabling the filter "skew" handling this - in my opinion unnecessary - redistribution doesn't show up, and hence the HASH JOIN without buffering can be used in this example.

Summary


The new parallel FILTER operator comes with different distribution methods available to the optimizer. However, at present the way the optimizer automatically determines if and how to re-distribute the data seems questionable to me.

The skew handling is based on the size of the driving object - for very small objects a re-distribution gets added before the FILTER. For row sources driving the filter that are not simple tables, the skew handling seems to add a re-distribution unconditionally.

For the reasons outlined, at present I would recommend considering disabling the filter skew handling by setting the parameter "_px_filter_skew_handling" to "false" - of course not without getting the blessing from Oracle Support before doing so. This should help to minimise the number of re-distributions added to an execution plan. Losing the capability of handling the "skew" caused by very small objects is in my opinion negligible in most cases.
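
With Oracle Support's blessing the change could be applied at session level as sketched below, or scoped to individual statements via the OPT_PARAM hint as shown in the examples above:

-- session level (hidden parameter, so quoted - check with Oracle Support first)
alter session set "_px_filter_skew_handling" = false;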

Monday, December 28, 2015

New Version Of XPLAN_ASH Utility

A new version 4.22 of the XPLAN_ASH utility is available for download.

As usual the latest version can be downloaded here.

This version primarily addresses an issue with 12c - if the HIST mode is used to pull ASH information from AWR in 12c it turns out that Oracle forgot to add the new "DELTA_READ_MEM_BYTES" column to DBA_HIST_ACTIVE_SESS_HISTORY - although it was officially added to V$ACTIVE_SESSION_HISTORY in 12c. So I had to implement several additional if/then/else constructs in the script to handle this inconsistency. It's the first time that the HIST view doesn't seem to reflect all columns from the V$ view - very likely an oversight rather than by design, I assume.
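
You can check whether your own release / patch level exposes the column in both views - a minimal sketch against the dictionary:

select table_name, column_name
from   dba_tab_columns
where  column_name = 'DELTA_READ_MEM_BYTES'
and    table_name in ('V_$ACTIVE_SESSION_HISTORY', 'DBA_HIST_ACTIVE_SESS_HISTORY')
order by table_name;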

Apart from that the I/O figures (Read Bytes / Write Bytes etc.) in the "Activity Timeline" make more sense for those cases where a process hasn't been sampled for several sample points (see below for more details).

Also, in case an execution plan could not be found, a corresponding message now makes it more obvious that you might be able to pull the execution plan from AWR by using a different ASH mode (MIXED / HIST).

Here are the notes from the change log:

- Fixed a funny bug: in 12c Oracle has forgotten to add the DELTA_READ_MEM_BYTES column to DBA_HIST_ACTIVE_SESS_HISTORY, so in HIST mode with 12c prior XPLAN_ASH versions could error out with an invalid column name

- Changed the way the I/O figures are treated in the "Activity Timeline based on ASH". Now the I/O per second is spread over the (previous) samples covered by DELTA_TIME. This should give a smoother representation of the I/O performed, much closer to what you see in Real-Time SQL Monitoring reports. The difference to prior versions is only visible in cases where a session wasn't sampled for quite a while and hence has a DELTA_TIME spanning multiple previous sample points. This also means that the I/O related columns in the "Activity Timeline based on ASH" now show only the PER SECOND values, no longer the totals as in prior versions
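
  (As a hypothetical illustration with made-up numbers, not taken from the actual script: a PX server sampled at second 10 with DELTA_TIME = 3 seconds and DELTA_READ_IO_BYTES = 30 MB previously showed the full 30 MB at second 10; now seconds 8, 9 and 10 each show 10 MB/s, i.e. 30 MB / 3 s)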

- Added a SET NULL "" in the configuration and initialization section for SQL*Plus environments that use a non-default SET NULL setting. This screwed up some internal switches so that XPLAN_ASH for example thought it was running in an S-ASH repository

- Added a note to the end of the output if no execution plan could be found and the script falls back to retrieving plan operation details from ASH. Also added a note to use the MIXED or HIST ASH source option if no execution plan could be found in CURR mode, in case the execution plan has been purged from the Shared Pool in the meantime

- Cloned the "cleanup" section from the end to the beginning of the script to ensure no current SQL*Plus environment settings influence the script execution. This is particularly relevant if the script execution gets cancelled before the final cleanup section is reached or some other, previous scripts left a mess behind

Sunday, December 20, 2015

IT Tage 2015 - "Analysing and troubleshooting Parallel Execution" presentation material

Thanks to all attending my presentation "Analysing and troubleshooting Parallel Execution" at the IT Tage conference 2015 in Frankfurt, Germany. You can download the presentation material here in PowerPoint or PDF format, as well as check the Slideshare upload.

Note that the PowerPoint format adds value in the sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.

Sunday, November 22, 2015

DOAG 2015 - "Oracle 12c Parallel Execution New Features" presentation material

Thanks to the many attendees that came to my presentation "Oracle 12c Parallel Execution New Features" at the DOAG conference 2015. You can download the presentation material here in PowerPoint or PDF format, as well as check the Slideshare upload.

Note that the PowerPoint format adds value in the sense that many of the slides come with additional explanations in the notes section.

If you are interested in more details I recommend visiting this post which links to many other posts describing the different new features in greater depth.

Friday, October 16, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 2: Distribution Methods

Picking up from the first part of this instalment I'll focus in this post on the available distribution methods for the new parallel FILTER subquery feature.

In this post I won't go into the details how the optimizer selects the distribution method automatically - this will be covered in the last part.

Here I merely describe the different available methods and how to control them using the new PQ_FILTER hint, which is also mentioned in the official documentation, although I find the description there a bit hard to follow.

There are four different options available to the PQ_FILTER hint, and only two of them actually describe a distribution method; another one tells the optimizer not to distribute the data at all, and the last one reverts to the former pre-12c plan shape:

SERIAL: This tells the optimizer to use the pre-12c plan shape where the FILTER operator is executed by the Query Coordinator. Depending on the SQL features used and combined, 12c sometimes still reverts to this plan shape and it looks like in those cases you can't force the parallel filter evaluation via the PQ_FILTER hint - the outline then contains a PQ_FILTER hint that is supposed to result in a parallel filter evaluation, but the serial plan shape gets used anyway despite the hint.

NONE: This tells the optimizer not to redistribute the data before executing the FILTER operator driving the subqueries, which means that however the data happens to be distributed, each Parallel Execution Server will execute the filter and the corresponding subqueries for the data it currently processes.

HASH: Redistribute the data by HASH before running the FILTER operator and the corresponding subqueries.

RANDOM: Redistribute the data by ROUND-ROBIN before running the FILTER operator and the corresponding subqueries. Despite being called RANDOM the distribution method shown in the plan is ROUND-ROBIN rather than RANDOM.

In this post I want to show an example for each of those variants and address the following questions in addition:

1. Where and how does the PQ_FILTER hint have to be specified? And how should the hint be used in case of multiple FILTERs with subqueries in the execution plan?

2. In case of a HASH distribution and multiple subqueries, which keys get used as input for the HASH function used for distribution?

Let's start with an example for each of the hints used to demonstrate the different plan shapes. For that purpose I'll re-use the same setup and query from the initial part of this instalment, with the index set to invisible:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+ 
           pq_filter(<DIST_METHOD>)
        */ count(*) from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(SERIAL)
------------------------------------------------------------------------
| Id  | Operation              | Name     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |        |      |            |
|   1 |  SORT AGGREGATE        |          |        |      |            |
|*  2 |   FILTER               |          |        |      |            |
|   3 |    PX COORDINATOR      |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)| :TQ20000 |  Q2,00 | P->S | QC (RAND)  |
|   5 |      PX BLOCK ITERATOR |          |  Q2,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T_1      |  Q2,00 | PCWP |            |
|   7 |    PX COORDINATOR      |          |        |      |            |
|   8 |     PX SEND QC (RANDOM)| :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   9 |      PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|* 10 |       TABLE ACCESS FULL| T_1      |  Q1,00 | PCWP |            |
------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(NONE)
-------------------------------------------------------------------------
| Id  | Operation               | Name     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |        |      |            |
|   1 |  SORT AGGREGATE         |          |        |      |            |
|   2 |   PX COORDINATOR        |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T_1      |        |      |            |
-------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(RANDOM)
---------------------------------------------------------------------------
| Id  | Operation                 | Name     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |        |      |            |
|   1 |  SORT AGGREGATE           |          |        |      |            |
|   2 |   PX COORDINATOR          |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER               |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE          |          |  Q1,01 | PCWP |            |
|   7 |        PX SEND ROUND-ROBIN| :TQ10000 |  Q1,00 | P->P | RND-ROBIN  |
|   8 |         PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| T_1      |  Q1,00 | PCWP |            |
|* 10 |       TABLE ACCESS FULL   | T_1      |        |      |            |
---------------------------------------------------------------------------

-- 12.1.0.2 plan shape with index invisible and PQ_FILTER(HASH)
---------------------------------------------------------------------------
| Id  | Operation                 | Name     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |          |        |      |            |
|   1 |  SORT AGGREGATE           |          |        |      |            |
|   2 |   PX COORDINATOR          |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)    | :TQ10001 |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE        |          |  Q1,01 | PCWP |            |
|*  5 |      FILTER               |          |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE          |          |  Q1,01 | PCWP |            |
|   7 |        PX SEND HASH       | :TQ10000 |  Q1,00 | P->P | HASH       |
|   8 |         PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   9 |          TABLE ACCESS FULL| T_1      |  Q1,00 | PCWP |            |
|* 10 |       TABLE ACCESS FULL   | T_1      |        |      |            |
---------------------------------------------------------------------------
So we can see that with PQ_FILTER(SERIAL) we get the former pre-12c plan shape, which in this case results in a parallel full table scan in the filter subquery and a decomposition into multiple DFO trees - both things that should be avoided in general.

I've omitted the cost estimates but there seems to be an odd inconsistency: The new parallel FILTER evaluation results in a higher cost estimate than the old serial one, simply due to the fact that the full table scan cost is lowered by the chunked parallel table scan (PX BLOCK ITERATOR) in the old plan shape, whereas the complete full table scans performed by the Parallel Execution Servers are costed like serial full table scans. Nevertheless, by default the plan with the higher cost is selected by the optimizer in 12c.

Let's address the first question - where and how to specify the hint: As the simple example here shows, when using the format of the hint mentioned in the official documentation it should be put into the main (driving) query, not into the filter subqueries. This makes some sense, since there could be multiple subqueries but only one FILTER operation in the main query driving them.
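
For illustration, this is how the hint placement described above looks in a concrete statement - a sketch re-using the T_1 setup from part 1 (the table is already altered to PARALLEL 4) and requesting a HASH distribution:

select /*+
           pq_filter(hash)
        */ count(*) from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);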

Let's have a look at a slightly more complicated query:
explain plan for
select /*+
           qb_name(main)
           -- When merging the two query blocks there will be a single FILTER in the main query block
           --pq_filter(@main hash)
           -- Global form of the PQ_FILTER hint
           no_merge(@qb1)
           no_merge(@qb2)
           pq_filter(@qb1 random)
           pq_filter(@qb2 hash)
       */
       *
       from
(
select /*+
           qb_name(qb1)
           -- Alternatively use local form of the hints
           --pq_filter(random)
           --no_merge
        */ *
        from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) a,
(
select /*+
           qb_name(qb2)
           -- Alternatively use local form of the hints
           --pq_filter(hash)
           --no_merge
        */ *
        from t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) b
where b.id = a.id
;

-- Plan shape with the hints as specified
-------------------------------------------------------------------------------
| Id  | Operation                     | Name     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |        |      |            |
|   1 |  PX COORDINATOR               |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)         | :TQ10004 |  Q1,04 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED         |          |  Q1,04 | PCWP |            |
|   4 |     PX RECEIVE                |          |  Q1,04 | PCWP |            |
|   5 |      PX SEND HYBRID HASH      | :TQ10002 |  Q1,02 | P->P | HYBRID HASH|
|   6 |       STATISTICS COLLECTOR    |          |  Q1,02 | PCWC |            |
|   7 |        BUFFER SORT            |          |  Q1,02 | PCWP |            |
|   8 |         VIEW                  |          |  Q1,02 | PCWP |            |
|*  9 |          FILTER               |          |  Q1,02 | PCWP |            |
|  10 |           PX RECEIVE          |          |  Q1,02 | PCWP |            |
|  11 |            PX SEND ROUND-ROBIN| :TQ10000 |  Q1,00 | P->P | RND-ROBIN  |
|  12 |             PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  13 |              TABLE ACCESS FULL| T_1      |  Q1,00 | PCWP |            |
|* 14 |           TABLE ACCESS FULL   | T_1      |        |      |            |
|  15 |     PX RECEIVE                |          |  Q1,04 | PCWP |            |
|  16 |      PX SEND HYBRID HASH      | :TQ10003 |  Q1,03 | P->P | HYBRID HASH|
|  17 |       BUFFER SORT             |          |  Q1,03 | PCWP |            |
|  18 |        VIEW                   |          |  Q1,03 | PCWP |            |
|* 19 |         FILTER                |          |  Q1,03 | PCWP |            |
|  20 |          PX RECEIVE           |          |  Q1,03 | PCWP |            |
|  21 |           PX SEND HASH        | :TQ10001 |  Q1,01 | P->P | HASH       |
|  22 |            PX BLOCK ITERATOR  |          |  Q1,01 | PCWC |            |
|  23 |             TABLE ACCESS FULL | T_1      |  Q1,01 | PCWP |            |
|* 24 |          TABLE ACCESS FULL    | T_1      |        |      |            |
-------------------------------------------------------------------------------
So in this example here we can see that multiple FILTER operations with subqueries are possible - in principle one per query block. Here I've used the global form of the hint, which includes a query block name to assign the hint to the corresponding FILTER operation. The same plan shape could be achieved by using the local hints within the query blocks. If no NO_MERGE hint gets used, both query blocks will be merged and there will be only a single FILTER operation - which then can be influenced by a corresponding PQ_FILTER hint in the main/outer query block.

Regarding the second question, what keys get used as input for the HASH distribution, we can check the corresponding "Projection information" section of the DBMS_XPLAN output, and we can see there that it is pretty straightforward and as expected: All columns / expressions used in the filter subqueries for correlation will be used as input. Using the variation of above query with merged query blocks and HASH distribution of the single FILTER with two subqueries results in the following:
explain plan for
select /*+
           qb_name(main)
           pq_filter(@main hash)
           --no_merge(@qb1)
           --no_merge(@qb2)
           --pq_filter(@qb1 random)
           --pq_filter(@qb2 hash)
       */
       *
       from
(
select /*+
           qb_name(qb1)
           --pq_filter(random)
           --no_merge
        */ *
        from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) a,
(
select /*+
           qb_name(qb2)
           --pq_filter(hash)
           --no_merge
        */ *
        from t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id)
) b
where b.id = a.id
;

--------------------------------------------------------------------------------
| Id  | Operation                      | Name     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |        |      |            |
|   1 |  PX COORDINATOR                |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)          | :TQ10003 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    BUFFER SORT                 |          |  Q1,03 | PCWP |            |
|*  4 |     FILTER                     |          |  Q1,03 | PCWP |            |
|   5 |      PX RECEIVE                |          |  Q1,03 | PCWP |            |
|   6 |       PX SEND HASH             | :TQ10002 |  Q1,02 | P->P | HASH       |
|*  7 |        HASH JOIN BUFFERED      |          |  Q1,02 | PCWP |            |
|   8 |         PX RECEIVE             |          |  Q1,02 | PCWP |            |
|   9 |          PX SEND HYBRID HASH   | :TQ10000 |  Q1,00 | P->P | HYBRID HASH|
|  10 |           STATISTICS COLLECTOR |          |  Q1,00 | PCWC |            |
|  11 |            PX BLOCK ITERATOR   |          |  Q1,00 | PCWC |            |
|  12 |             TABLE ACCESS FULL  | T_1      |  Q1,00 | PCWP |            |
|  13 |         PX RECEIVE             |          |  Q1,02 | PCWP |            |
|  14 |          PX SEND HYBRID HASH   | :TQ10001 |  Q1,01 | P->P | HYBRID HASH|
|  15 |           PX BLOCK ITERATOR    |          |  Q1,01 | PCWC |            |
|  16 |            TABLE ACCESS FULL   | T_1      |  Q1,01 | PCWP |            |
|* 17 |      TABLE ACCESS FULL         | T_1      |        |      |            |
|* 18 |      TABLE ACCESS FULL         | T_1      |        |      |            |
--------------------------------------------------------------------------------

Column Projection Information (identified by operation id):
-----------------------------------------------------------
...
   6 - (#keys=2) "T"."ID"[NUMBER,22], "T"."ID"[NUMBER,22], "T"."FILLER"[VARCHAR2,100], "T"."FILLER"[VARCHAR2,100]
...
So the T.ID correlation expressions from both subqueries are used as keys ("#keys=2") for the hash function.
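
As a side note, the "Column Projection Information" section isn't part of the default DBMS_XPLAN output - a minimal sketch of how to request it after the EXPLAIN PLAN above:

-- the ALL and ADVANCED formats include the projection section as well
select * from table(dbms_xplan.display(format => 'TYPICAL +PROJECTION'));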

In the last part of this instalment I'll focus on how the optimizer chooses the distribution method for the parallel filter operation.

Sunday, August 16, 2015

Parallel Projection

A recent case at a client reminded me of something that isn't really new but not so well known - Oracle by default performs evaluation at the latest possible point in the execution plan.

So if you happen to have expressions in the projection of a simple SQL statement that runs parallel it might be counter-intuitive that by default Oracle won't evaluate the projection in the Parallel Slaves but in the Query Coordinator - even if it was technically possible - because the latest possible point is the SELECT operation with the ID = 0 of the plan, which is always performed by the Query Coordinator.

Of course, if you make use of expressions that can't be evaluated in parallel or aren't implemented for parallel evaluation, then there is no other choice than doing this in the Query Coordinator.

The specific case in question was a generic export functionality that allowed exporting report results to some CSV or Excel like format, and some of these reports had a lot of rows and complex - in that case CPU intensive - expressions in their projection clause.

When looking at the run time profile of such an export query it became obvious that although it was a (very simple) parallel plan, all of the time was spent in the Query Coordinator, effectively turning this at runtime into a serial execution.

This effect can be reproduced very easily:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel cache;

-- Run some CPU intensive expressions in the projection 
-- of a simple parallel Full Table Scan
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
;

-- The plan is clearly parallel
--------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |  2000K|   192M|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------

-- But the runtime profile looks more serial
-- although the Parallel Slaves get used to run the Full Table Scan
-- All time spent in the operation ID = 0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation            | Name     | Execs | A-Rows| ReadB | ReadReq | Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                     | Parallel Execution Skew ASH      | Activity Graph ASH          | Top 5 Activity ASH  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT     |          |    5  | 2000K |       |         |     3 |   136 |   120 | #################### |   1:sqlplus.exe(120)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]       | ################################ | @@@@@@@@@@@@@@@@@@@ ( 98%)  | ON CPU(120)         |
|   1 |   0 |  PX COORDINATOR      |          |    5  | 2000K |       |         |   119 |     1 |     1 |                 #    |   1:sqlplus.exe(1)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]         |                                  |                     ( .8%)  | ON CPU(1)           |
|   2 |   1 |   PX SEND QC (RANDOM)| :TQ10000 |    4  | 2000K |       |         |    66 |    11 |     2 |          ##          |   2:P00B(1)[508K],P00A(1)[490K],P008(0)[505K],P009(0)[497K],sqlplus.exe(0)[0] |                                  |                     (1.6%)  | PX qref latch(2)    |
|   3 |   2 |    PX BLOCK ITERATOR |          |    4  | 2000K |       |         |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
|*  4 |   3 |     TABLE ACCESS FULL| T_1      |   52  | 2000K |   23M |     74  |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fortunately there is a simple and straightforward way to make use of the Parallel Slaves for evaluating projection expressions that can be evaluated in parallel - simply add a suitable NO_MERGE hint for the query block whose projection you want to be evaluated in the Parallel Slaves.

If you don't want side effects on the overall plan shape from not merging views, you could always wrap the original query in an outer SELECT and prevent merging of the now inner query block. There seems to be a rule that the projection of a view always gets evaluated at the VIEW operator, and if we check the execution plan we can see that the VIEW operator is marked parallel:
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select /*+ no_merge(x) */ * from (
select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
) x
;

-- View operator is marked parallel
-- This is where the projection clause of the VIEW will be evaluated
---------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  2000K|    11G|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    VIEW               |          |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|   4 |     PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------

-- Runtime profile now shows effective usage of Parallel Slaves 
-- for doing the CPU intensive work
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation             | Name     | Execs | A-Rows| Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                         | Parallel Execution Skew ASH| Activity Graph ASH          | Top 5 Activity ASH               |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT      |          |    5  | 2000K |       |       |       |                      |   0:sqlplus.exe(0)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]             |                            |                             |                                  |
|   1 |   0 |  PX COORDINATOR       |          |    5  | 2000K |    17 |    63 |    10 |    # ##   #    ####  |   1:sqlplus.exe(10)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]            | ####                       | *                   (5.6%)  | resmgr:cpu quantum(10)           |
|   2 |   1 |   PX SEND QC (RANDOM) | :TQ10000 |    4  | 2000K |     5 |    61 |    10 | ## #  ## ## ## #     |   3:P002(5)[544K],P001(4)[487K],P000(1)[535K],P003(0)[434K],sqlplus.exe(0)[0]     | #                          |                     (5.6%)  | ON CPU(7),resmgr:cpu quantum(3)  |
|   3 |   2 |    VIEW               |          |    4  | 2000K |     2 |    82 |    69 | #################### |   4:P003(42)[434K],P001(35)[487K],P000(26)[535K],P002(22)[544K],sqlplus.exe(0)[0] | ############               | @@@@@@@@@@@@@@@@@@@ ( 70%)  | ON CPU(125)                      |
|   4 |   3 |     PX BLOCK ITERATOR |          |    4  | 2000K |       |       |       |                      |   0:P002(0)[544K],P000(0)[535K],P001(0)[487K],P003(0)[434K],sqlplus.exe(0)[0]     |                            |                             |                                  |
|*  5 |   4 |      TABLE ACCESS FULL| T_1      |   52  | 2000K |     3 |    78 |    29 | ###### ####### # ### |   4:P000(11)[535K],P002(8)[544K],P001(8)[487K],P003(7)[434K],sqlplus.exe(0)[0]    | ###                        | *****               ( 19%)  | resmgr:cpu quantum(30),ON CPU(4) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
At runtime the duration of the query now gets reduced significantly and we can see the Parallel Slaves getting used when the VIEW operator gets evaluated. Although the overall CPU time used is similar to the previous example, the duration of the query execution is lower since this CPU time is now spent in parallel in the slaves instead of in the Query Coordinator.

Summary


By default Oracle performs evaluation at the latest possible point of the execution plan. Sometimes you can improve runtime by actively influencing when the projection will be evaluated by preventing view merging and introducing a VIEW operator that will be used to evaluate the projection clause.

The optimizer so far doesn't seem to incorporate such possibilities in its evaluation of possible plan shapes, so this is something you need to do manually up to and including Oracle 12c (version 12.1.0.2 as of the time of writing).

Sunday, July 26, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 1: Introduction

12c introduces another interesting new Parallel Execution feature - the parallel evaluation of FILTER subqueries. In pre-12c FILTER subqueries always had to be evaluated in the Query Coordinator. This had several consequences, in particular the data driving the FILTER subquery always had to flow through the Query Coordinator, and hence represented a forced serial execution part of a parallel execution plan. This limitation also meant that depending on the overall plan shape the parallel plan was possibly decomposed into multiple DFO trees, leading to other side effects I've outlined in some of my other publications already.

In 12c now the FILTER subquery can be evaluated in the Parallel Slaves, and the driving data no longer needs to be processed in the Query Coordinator. However, the resulting plan shape can be a little bit confusing. Let's have a look at a simple example:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+ 
           --optimizer_features_enable('11.2.0.4')
        */ count(*) from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 11.2.0.4 plan shape with index invisible
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     6 |   440M  (2)| 04:47:04 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     6 |            |          |        |      |            |
|*  2 |   FILTER               |          |       |       |            |          |        |      |            |
|   3 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)| :TQ20000 |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   5 |      PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | PCWP |            |
|   7 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |     PX SEND QC (RANDOM)| :TQ10000 |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   9 |      PX BLOCK ITERATOR |          |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | PCWC |            |
|* 10 |       TABLE ACCESS FULL| T_1      |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
  10 - filter("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index invisible
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     6 |  1588M  (2)| 17:14:09 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     6 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     6 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     6 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T_1      |     1 |     6 |   798   (2)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   8 - filter("T_1"."ID"=:B1)

-- 11.2.0.4 plan shape with index visible
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     6 |  5973K  (1)| 00:03:54 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     6 |            |          |        |      |            |
|*  2 |   FILTER               |          |       |       |            |          |        |      |            |
|   3 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)| :TQ10000 |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   5 |      PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  7 |    INDEX RANGE SCAN    | T_1_IDX  |     1 |     6 |     3   (0)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   7 - access("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index visible
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     6 |  5973K  (1)| 00:03:54 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     6 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     6 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     6 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       INDEX RANGE SCAN  | T_1_IDX  |     1 |     6 |     3   (0)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   8 - access("T_1"."ID"=:B1)

I've included two variations of the setup: one without an index available for evaluating the FILTER subquery, and one with such an index.
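The exact DDL isn't repeated in this post, so here is a hedged reconstruction of the kind of setup the plans are based on - the table and index names are taken from the plans above, the row generator and the filler column are just assumptions: a two million row table T_1 with an index on ID whose visibility gets toggled between the two variations.
create table t_1
as
select
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1000) */ * from dual
connect by
        level <= 1000) a, (select /*+ cardinality(2000) */ * from dual connect by level <= 2000) b
;

exec dbms_stats.gather_table_stats(null, 't_1')

create index t_1_idx on t_1 (id);

-- toggle for the "index invisible" / "index visible" variations
alter index t_1_idx invisible;
-- alter index t_1_idx visible;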

The pre-12c plan shape without index makes the former limitation particularly obvious: The FILTER operator is above the PX COORDINATOR and marked serial, and the table scan in the FILTER subquery gets parallelized as a separate DFO tree (indicated, among other things, by the two PX COORDINATOR operators). This means that each time this separate DFO tree starts, a separate set of Parallel Slaves has to be allocated and deallocated, potentially adding a lot of overhead to an execution plan that is probably already inefficient - assuming the FILTER subquery needs to be evaluated many times.

In 12c the FILTER operator is marked parallel and the need for a separate DFO tree is gone. What might be confusing about this plan shape is that the operations of the FILTER subquery are not marked parallel. In my opinion this is misleading - they should actually be marked parallel, because at runtime the operations will be performed by the Parallel Slaves, and in the case of a Full Table Scan each slave will run the entire scan (so no PX BLOCK ITERATOR dividing the scan into chunks / granules) - which is comparable to what happens when a parallel Nested Loop join runs or the new PQ_REPLICATE feature gets used, and in those cases the operations are marked parallel:
-- 11.2.0.4 / 12.1.0.2 plan shape with index invisible
-- and subquery unnested using NL SEMI join
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |    12 |   442M  (2)| 04:48:03 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |    12 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |    12 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |    12 |            |          |  Q1,00 | PCWP |            |
|   5 |      NESTED LOOPS SEMI  |          |  2000K|    22M|   442M  (2)| 04:48:03 |  Q1,00 | PCWP |            |
|   6 |       PX BLOCK ITERATOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T_1      |  2000K|    11M|   796   (2)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   8 - filter("T"."ID"="T_1"."ID")

Summary


So the operators of the FILTER subquery can now run in the slaves, which is the main point of this feature, even though they are represented in a confusing way in the execution plan. Avoiding the potential decomposition into multiple DFO trees is another positive side effect. Query duration should decrease if the evaluation of the FILTER subquery takes significant time and can now run in the Parallel Slaves instead of being executed serially through the Query Coordinator.

Note that depending on the plan shape and SQL features used, it's still possible that 12c reverts to the old serial FILTER subquery evaluation plan shape, so the new feature isn't always used.

There is more to say about this feature. In the next part of this instalment I'll focus on the different distribution methods possible with the new parallel FILTER operator - there is a new PQ_FILTER hint that allows controlling the distribution, and there are also some interesting points to make about how the optimizer automatically chooses a distribution method. In the examples shown here there's no separate distribution for the FILTER, by the way, but this can look different, as I'll show in the next part.
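As a hedged preview of that hint (assuming the same T_1 setup as used for the plans above - the available distribution keywords are covered in the next part, HASH being one of them):
select /*+ monitor
           parallel(4)
           pq_filter(hash)
        */ count(*)
from
        t_1 t
where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);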

Sunday, June 28, 2015

Video Tutorial: XPLAN_ASH Active Session History - Part 6

The next part of the video tutorial explaining the XPLAN_ASH Active Session History functionality is now available, continuing the actual walk-through of the script output.

More parts to follow.

Monday, June 22, 2015

12c Parallel Execution New Features: 1 SLAVE distribution

When certain SQL features that force non-parallel evaluation get used in pre-12c versions, like ROWNUM or certain Analytic Functions such as LAG/LEAD, then - depending on the overall plan shape - Oracle can start to decompose the parallel execution plan into several so-called DFO trees (if you want to learn more about DFO trees and DFOs I recommend watching my online tutorial on my Youtube channel).

Now having multiple DFO trees in a single parallel execution plan comes with several side effects that are confusing and complicate matters unnecessarily: each DFO tree allocates its own PX slave set(s), so each one can potentially end up with a different DOP, which means you can have more than one DOP in a single parallel execution plan.

Depending on the overall plan shape this might also mean that a DFO tree gets started multiple times, and each time it is started / completed PX slaves need to be allocated and de-allocated, potentially causing significant coordination overhead that is not directly related to the actual execution.

Having multiple DFO trees can also lead to a situation where (a lot) more PX slaves than expected are allocated, in case multiple DFO trees are active at the same time - which in turn means that if you believe you can limit the number of PX slaves allocated by a single parallel execution using Resource Manager directives, you might be wrong.

Since all these are undesirable side effects, starting with release 12c Oracle has put effort into new features that minimize the need for such a decomposition into multiple DFO trees. One of these new features is the so-called "1 SLAVE" distribution method that can be used when such a non-parallel evaluation is required.

Quite similar to the recently described "PX SELECTOR" operator, the "1 SLAVE" distribution uses a single PX slave out of a slave set to execute the non-parallel operations instead of the Query Coordinator. The main difference this makes is that the parallel and serial operations now remain part of the same DFO tree, instead of having parts of the execution plan executed by the Query Coordinator and separate DFO trees before and after such serial operations.

Let's have a look at a simple example to demonstrate the new feature. I use three identical tables here, just for the sake of being able to differentiate them in the plan output - in principle re-using a single table three times would be sufficient.
create table t2
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't2')

create table t4
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't4')

create table t6
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't6')

explain plan for
select /*+ no_merge(x) */
       *
from
     (
       select /*+ parallel(t6 4)
                  --optimizer_features_enable('11.2.0.4')
              */
               *
       from
            (
              select /*+ parallel(t2 4) */
                     --lag(id) over (order by id)    as v1_rn
                     rownum    as v1_rn
                   , t2.id     as v1_id
                   , t2.filler as v1_filler
              from
                     t2
            ) v1
          , (
              select /*+ parallel(t4 2) */
                     --lag(id) over (order by id)    as v2_rn
                     rownum    as v2_rn
                   , t4.id     as v2_id
                   , t4.filler as v2_filler
              from
                     t4
            ) v2
          , t6
       where
              v1_id = v2_id
       and    v1_id = t6.id
     ) x
where
      rownum > 1
;
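The trimmed plan output below shows only the Parallel Execution related columns - a minimal sketch of how such output can be displayed via DBMS_XPLAN, assuming the EXPLAIN PLAN above has just been executed:
select * from table(dbms_xplan.display(format => 'BASIC +PARALLEL +PREDICATE'));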

-- 11.2.0.4 plan shape
----------------------------------------------------------------------------------
| Id  | Operation                        | Name     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |        |      |            |
|   1 |  COUNT                           |          |        |      |            |
|*  2 |   FILTER                         |          |        |      |            |
|   3 |    PX COORDINATOR                |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)          | :TQ30002 |  Q3,02 | P->S | QC (RAND)  |
|   5 |      VIEW                        |          |  Q3,02 | PCWP |            |
|*  6 |       HASH JOIN                  |          |  Q3,02 | PCWP |            |
|   7 |        PX RECEIVE                |          |  Q3,02 | PCWP |            |
|   8 |         PX SEND HASH             | :TQ30001 |  Q3,01 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR       |          |  Q3,01 | PCWC |            |
|  10 |           TABLE ACCESS FULL      | T6       |  Q3,01 | PCWP |            |
|  11 |        BUFFER SORT               |          |  Q3,02 | PCWC |            |
|  12 |         PX RECEIVE               |          |  Q3,02 | PCWP |            |
|  13 |          PX SEND HASH            | :TQ30000 |        | S->P | HASH       |
|* 14 |           HASH JOIN              |          |        |      |            |
|  15 |            VIEW                  |          |        |      |            |
|  16 |             COUNT                |          |        |      |            |
|  17 |              PX COORDINATOR      |          |        |      |            |
|  18 |               PX SEND QC (RANDOM)| :TQ10000 |  Q1,00 | P->S | QC (RAND)  |
|  19 |                PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  20 |                 TABLE ACCESS FULL| T2       |  Q1,00 | PCWP |            |
|  21 |            VIEW                  |          |        |      |            |
|  22 |             COUNT                |          |        |      |            |
|  23 |              PX COORDINATOR      |          |        |      |            |
|  24 |               PX SEND QC (RANDOM)| :TQ20000 |  Q2,00 | P->S | QC (RAND)  |
|  25 |                PX BLOCK ITERATOR |          |  Q2,00 | PCWC |            |
|  26 |                 TABLE ACCESS FULL| T4       |  Q2,00 | PCWP |            |
----------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(ROWNUM>1)
   6 - access("V1_ID"="T6"."ID")
  14 - access("V1_ID"="V2_ID")

-- 12.1.0.2 plan shape
---------------------------------------------------------------------------------
| Id  | Operation                       | Name     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |        |      |            |
|   1 |  COUNT                          |          |        |      |            |
|*  2 |   FILTER                        |          |        |      |            |
|   3 |    PX COORDINATOR               |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)         | :TQ10004 |  Q1,04 | P->S | QC (RAND)  |
|   5 |      VIEW                       |          |  Q1,04 | PCWP |            |
|*  6 |       HASH JOIN BUFFERED        |          |  Q1,04 | PCWP |            |
|   7 |        PX RECEIVE               |          |  Q1,04 | PCWP |            |
|   8 |         PX SEND HASH            | :TQ10002 |  Q1,02 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR      |          |  Q1,02 | PCWC |            |
|  10 |           TABLE ACCESS FULL     | T6       |  Q1,02 | PCWP |            |
|  11 |        PX RECEIVE               |          |  Q1,04 | PCWP |            |
|  12 |         PX SEND HASH            | :TQ10003 |  Q1,03 | S->P | HASH       |
|* 13 |          HASH JOIN BUFFERED     |          |  Q1,03 | SCWC |            |
|  14 |           VIEW                  |          |  Q1,03 | SCWC |            |
|  15 |            COUNT                |          |  Q1,03 | SCWP |            |
|  16 |             PX RECEIVE          |          |  Q1,03 | SCWP |            |
|  17 |              PX SEND 1 SLAVE    | :TQ10000 |  Q1,00 | P->S | 1 SLAVE    |
|  18 |               PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  19 |                TABLE ACCESS FULL| T2       |  Q1,00 | PCWP |            |
|  20 |           VIEW                  |          |  Q1,03 | SCWC |            |
|  21 |            COUNT                |          |  Q1,03 | SCWP |            |
|  22 |             PX RECEIVE          |          |  Q1,03 | SCWP |            |
|  23 |              PX SEND 1 SLAVE    | :TQ10001 |  Q1,01 | P->S | 1 SLAVE    |
|  24 |               PX BLOCK ITERATOR |          |  Q1,01 | PCWC |            |
|  25 |                TABLE ACCESS FULL| T4       |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter(ROWNUM>1)
   6 - access("V1_ID"="T6"."ID")
  13 - access("V1_ID"="V2_ID")
Let's start with the 11.2.0.4 plan shape: We can see from the multiple occurrences of the PX COORDINATOR operator, the TQ column and the corresponding TQ entries in the NAME column that this parallel execution plan consists of three DFO trees. When analyzing the actual runtime activity it would also become obvious that the different DFO trees run at different DOPs and use different PX slave sets. In this particular case it would even become obvious that two DFO trees are active at the same time - so it pretty much demonstrates all the undesirable side effects of having multiple DFO trees that I mentioned above.
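A hedged sketch of how that runtime activity can be observed: querying V$PX_SESSION from a second session while the statement is running lists each PX slave together with the slave set it belongs to and the DOP it actually runs at (DEGREE vs. REQ_DEGREE):
select
        qcsid
      , server_group
      , server_set
      , server#
      , degree
      , req_degree
from
        v$px_session
order by
        qcsid, server_group, server_set, server#;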

The HASH JOIN between V1 and V2 runs serially, whereas the HASH JOIN between T6 and the join result of V1 and V2 runs parallel. Since we have a Serial->Parallel distribution between these two HASH JOINs, an additional BUFFER SORT operation gets added - as outlined in the PX SELECTOR note.

If we now look at the 12.1.0.2 plan shape we notice that the execution plan consists of a single DFO tree, and hence all the side effects I've outlined are gone. The serial evaluation is done by a single PX slave of one set, again indicated by the SCWC/SCWP decorator, similar to the PX SELECTOR operator.

However, the plan shape also demonstrates one possible disadvantage of the new feature: Since everything is now part of a single DFO tree we end up with more re-distributions per DFO tree, and hence might see an increased demand for the dreaded BUFFERing that prevents multiple re-distributions from being active at the same time. Here this can be seen at operation ID 13: the HASH JOIN executed serially by one PX slave is turned into a HASH JOIN BUFFERED (the worst case for a (not really) "parallel" hash join - executed by only a single PX slave and additionally having to buffer the right hand row source), and the second HASH JOIN is turned into its BUFFERED variant as well. Compared to the 11.2.0.4 plan shape, which contains no HASH JOIN BUFFERED but an additional BUFFER SORT operation, the 12.1.0.2 plan shape with this data pattern therefore requires more PGA / TEMP space due to the double buffering now necessary.
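A hedged sketch of how that additional PGA / TEMP demand can be observed: V$SQL_WORKAREA_ACTIVE lists the currently active workareas, including those of the PX slaves, so querying it from a second session while the statement is running shows the memory and any TEMP segments used by the HASH JOIN BUFFERED operations:
select
        sid
      , operation_type
      , operation_id
      , round(actual_mem_used / 1024 / 1024) as mem_mb
      , round(tempseg_size / 1024 / 1024) as temp_mb
from
        v$sql_workarea_active
order by
        operation_id, sid;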

Footnote


The new 1 SLAVE distribution isn't always used in 12c. If you run just the join between V1 and V2, for example, the old plan shape with multiple DFO trees is used again. Furthermore, in this particular case, when you start changing the DOP used in the PARALLEL hints you might also end up with a plan shape where one view uses the 1 SLAVE distribution whereas the other one uses the old plan shape with Query Coordinator activity - I haven't investigated further why this happens.

If Analytic Functions get used, you might also see a "1 SLAVE (ORDER)" variation of the distribution that enforces a certain order when re-distributing the data, similar to the "PX SEND QC (ORDER)" operator.

Sunday, June 14, 2015

12c Parallel Execution New Features: PX SELECTOR

Continuing my series on new 12c Parallel Execution features: I've already mentioned the new PX SELECTOR operator as part of the new Concurrent UNION ALL feature where it plays a key role. However, more generally, starting from 12c this new operator will usually be used when a serial part of the execution plan needs to be executed, like a full scan of an object not marked parallel, or an index-based operation that can't be parallelized.

In pre-12c such serial parts get executed by the Query Coordinator itself, and the new PX SELECTOR changes that so that one of the PX slaves of a PX slave set is selected to execute that serial part.

There is not much left to say about that functionality, except that it isn't always used - depending on the SQL constructs used and combined, there are still plan shapes possible in 12c that show the pre-12c behaviour where the Query Coordinator executes the serial part.

Let's have a look at a simple example to see in more detail what difference the new operator makes to the overall plan shape and runtime behaviour:
create table t1 as select * from dba_objects;

exec dbms_stats.gather_table_stats(null, 't1')

alter table t1 parallel;

create table t2 as select * from dba_objects;

exec dbms_stats.gather_table_stats(null, 't2')

create index t2_idx on t2 (object_name);

select /*+ optimizer_features_enable('11.2.0.4') */ 
        * 
from 
        t1
      , t2 
where 
        t1.object_id = t2.object_id 
and     t2.object_name like 'BLUB%'
;

-- 11.2.0.4 plan shape
-----------------------------------------------------------------------------------
| Id  | Operation                         | Name     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |          |        |      |            |
|   1 |  PX COORDINATOR                   |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)             | :TQ10001 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                      |          |  Q1,01 | PCWP |            |
|   4 |     BUFFER SORT                   |          |  Q1,01 | PCWC |            |
|   5 |      PX RECEIVE                   |          |  Q1,01 | PCWP |            |
|   6 |       PX SEND BROADCAST           | :TQ10000 |        | S->P | BROADCAST  |
|   7 |        TABLE ACCESS BY INDEX ROWID| T2       |        |      |            |
|*  8 |         INDEX RANGE SCAN          | T2_IDX   |        |      |            |
|   9 |     PX BLOCK ITERATOR             |          |  Q1,01 | PCWC |            |
|* 10 |      TABLE ACCESS FULL            | T1       |  Q1,01 | PCWP |            |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")
   8 - access("T2"."OBJECT_NAME" LIKE 'BLUB%')
       filter("T2"."OBJECT_NAME" LIKE 'BLUB%')
  10 - filter(SYS_OP_BLOOM_FILTER(:BF0000,"T1"."OBJECT_ID"))

-- 12.1.0.2 plan shape
--------------------------------------------------------------------------------------------
| Id  | Operation                                  | Name     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                           |          |        |      |            |
|   1 |  PX COORDINATOR                            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)                      | :TQ10001 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                               |          |  Q1,01 | PCWP |            |
|   4 |     JOIN FILTER CREATE                     | :BF0000  |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                            |          |  Q1,01 | PCWP |            |
|   6 |       PX SEND BROADCAST                    | :TQ10000 |  Q1,00 | S->P | BROADCAST  |
|   7 |        PX SELECTOR                         |          |  Q1,00 | SCWC |            |
|   8 |         TABLE ACCESS BY INDEX ROWID BATCHED| T2       |  Q1,00 | SCWC |            |
|*  9 |          INDEX RANGE SCAN                  | T2_IDX   |  Q1,00 | SCWP |            |
|  10 |     JOIN FILTER USE                        | :BF0000  |  Q1,01 | PCWP |            |
|  11 |      PX BLOCK ITERATOR                     |          |  Q1,01 | PCWC |            |
|* 12 |       TABLE ACCESS FULL                    | T1       |  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("T1"."OBJECT_ID"="T2"."OBJECT_ID")
   9 - access("T2"."OBJECT_NAME" LIKE 'BLUB%')
       filter("T2"."OBJECT_NAME" LIKE 'BLUB%')
  12 - filter(SYS_OP_BLOOM_FILTER(:BF0000,"T1"."OBJECT_ID"))
The pre-12c plan shape here shows two significant things that I want to emphasize:

First, this plan shape only requires a single PX slave set, since the Query Coordinator takes over the part whose output needs to be re-distributed - so although we have a plan shape that requires re-distribution, there's only a single PX slave set involved. Normally, as soon as at least one operation that is executed in parallel requires re-distribution, two PX slave sets will be used.

Second, the plan shape demonstrates that parts of a Parallel Execution plan that are executed serially by the Query Coordinator require an additional BUFFER SORT operation. The HASH JOIN operation itself is blocking while it is consuming the left row source for building the hash table, so there is no true requirement to add another BUFFER SORT after the PX RECEIVE operation, but it looks like a pretty strict rule that any serial activity involving the Query Coordinator adds a BUFFER SORT operation after re-distribution - I assume the reasoning for this is that the Query Coordinator isn't available for "coordinating" the PX slaves as long as it is actively involved in executing serial operations, hence the need to block any other parallel activity.

This normally shouldn't be too relevant to performance, since you should only execute operations serially that are tiny and not worth running in parallel, so buffering them shouldn't add much overhead - but it's just another reason why you see additional BUFFER SORT operations in parallel plans that are not there in serial-only plans.

The 12c plan shape shows the new PX SELECTOR operator that now executes the serial part of the execution plan instead of the Query Coordinator. This also adds new decorators in the IN-OUT column called "SCWC" and "SCWP" respectively, which you won't find in pre-12c plans - they are probably meant to read "Serial Combined With Child / Parent", similar to "PCWC / PCWP".

The good thing about the new PX SELECTOR is that the need for an additional BUFFER SORT operator is now gone.

However, one side-effect of the new operator for this particular plan shape here is that now a second PX slave set is allocated, although only one PX slave actually will get used at runtime. Note that for other plan shapes that need two PX slave sets anyway this doesn't matter.
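A hedged way to verify this at runtime: V$PQ_TQSTAT, queried in the same session right after the parallel statement has completed, lists per table queue which processes acted as producers and consumers. With the pre-12c plan shape the QC should show up as the producer of :TQ10000, whereas with the 12c plan shape a single slave of the second set should appear there instead:
select
        dfo_number
      , tq_id
      , server_type
      , process
      , num_rows
      , bytes
from
        v$pq_tqstat
order by
        dfo_number, tq_id, server_type desc, process;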

Another good thing about the new PX SELECTOR operator is that it avoids an odd bug that sometimes shows up with Serial->Parallel redistributions when the Query Coordinator is involved. This bug delays the overall execution by approximately 1-2 seconds per occurrence - usually not too relevant and rarely noticed when a Parallel Execution typically takes several seconds or minutes, but it can occur several times per execution, so these seconds can add up. I might cover this bug in a separate blog post.

Unrelated to the PX SELECTOR operator, the 12c plan shape also demonstrates that the way Bloom filters are shown in the plan has been improved in 12c. The 11.2.0.4 version uses the same Bloom filter, as you can see from the "Predicate Information" section of the plan, but doesn't make it that obvious from the plan shape that it is there (and sometimes in pre-12c it doesn't even show up in the "Predicate Information" section although it is still used).

Friday, May 29, 2015

Temp Table Transformation Cardinality Estimates - 2

Continuing from the previous part - which was about the Temp Table Transformation and join cardinality estimates - and using the same simple table setup, here is a slight variation of the previously used query to demonstrate the potential impact on single table cardinality estimates:
explain plan for
with
cte as (
select /* inline */ id from t1 t
where 1 = 1
)
select /*+
           no_merge(a) no_merge(b)
       */ * from cte a, cte b
where a.id = b.id
and a.id > 990 and b.id > 990
;

-- 11.2.0.x Plan with TEMP transformation
--------------------------------------------------------------------------------
| Id  | Operation                  | Name                      | Rows  | Bytes |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                           |  1000 | 26000 |
|   1 |  TEMP TABLE TRANSFORMATION |                           |       |       |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D661C_275FD9 |       |       |
|   3 |    TABLE ACCESS FULL       | T1                        |  1000 |  4000 |
|*  4 |   HASH JOIN                |                           |  1000 | 26000 |
|*  5 |    VIEW                    |                           |  1000 | 13000 |
|   6 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D661C_275FD9 |  1000 |  4000 |
|*  7 |    VIEW                    |                           |  1000 | 13000 |
|   8 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D661C_275FD9 |  1000 |  4000 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("A"."ID"="B"."ID")
   5 - filter("A"."ID">990)
   7 - filter("B"."ID">990)

-- 11.2.0.x Plan with INLINE hint
----------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |
----------------------------------------------------
|   0 | SELECT STATEMENT    |      |    10 |   260 |
|*  1 |  HASH JOIN          |      |    10 |   260 |
|   2 |   VIEW              |      |    10 |   130 |
|*  3 |    TABLE ACCESS FULL| T1   |    10 |    40 |
|   4 |   VIEW              |      |    10 |   130 |
|*  5 |    TABLE ACCESS FULL| T1   |    10 |    40 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("A"."ID"="B"."ID")
   3 - filter("ID">990)
   5 - filter("ID">990)
Again it's obvious that the Temp Table Transformation can have a significant impact on the single table cardinality estimates.

In particular:

- Although the same filter is applied to the rowsources A and B in both cases, with the Temp Table Transformation it doesn't reduce the cardinality estimate. So it's not uncommon to end up with significant cardinality overestimates when the transformation gets used

- Particularly bad for Exadata environments is that the filter isn't pushed into the TABLE ACCESS FULL operator but only applied at the VIEW operator above, which means that it can't be offloaded - all the data needs to be sent from the Storage Cells to the Compute Nodes and filtered there. Not a very efficient way to operate on Exadata

The behaviour is still the same in 12c.
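Just as a reminder of how to switch between the two plan shapes shown above: the /* inline */ in the CTE is only a plain comment acting as a placeholder - turning it into the (undocumented) INLINE hint prevents the Temp Table Transformation, while MATERIALIZE forces it. A minimal sketch:
explain plan for
with
cte as (
select /*+ inline */ id from t1 t
where 1 = 1
)
select /*+
           no_merge(a) no_merge(b)
       */ * from cte a, cte b
where a.id = b.id
and a.id > 990 and b.id > 990
;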

Monday, May 25, 2015

Temp Table Transformation Cardinality Estimates - 1

Having recently published two notes about the Temp Table Transformation highlighting the heuristics-based decision and other weaknesses, for example regarding the projection of columns, it's time to publish some more notes about it.

The transformation can also have a significant impact on cardinality estimates, both join and single table cardinality.

Looking at the difference in the join cardinality estimates of the following simple example:
create table t1
as
select
        rownum as id
      , mod(rownum, 10) + 1 as id2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1000
;

exec dbms_stats.gather_table_stats(null, 't1')

alter session set tracefile_identifier = 'temp_trans_join_card';

alter session set events '10053 trace name context forever, level 1';

explain plan for
with
cte as (
select /* inline */ id + 1 as id from t1 t
where 1 = 1
)
select /*+
           --opt_estimate(@"SEL$2" join("A"@"SEL$2" "B"@"SEL$2") rows=1000)
           no_merge(a) no_merge(b)
       */ * from cte a, cte b
where a.id = b.id
;

alter session set events '10053 trace name context off';

-- 11.2.0.x Plan with TEMP transformation
--------------------------------------------------------------------------------
| Id  | Operation                  | Name                      | Rows  | Bytes |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                           |     1 |    26 |
|   1 |  TEMP TABLE TRANSFORMATION |                           |       |       |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9D660C_27269C |       |       |
|   3 |    TABLE ACCESS FULL       | T1                        |  1000 |  4000 |
|*  4 |   HASH JOIN                |                           |     1 |    26 |
|   5 |    VIEW                    |                           |  1000 | 13000 |
|   6 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D660C_27269C |  1000 |  4000 |
|   7 |    VIEW                    |                           |  1000 | 13000 |
|   8 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9D660C_27269C |  1000 |  4000 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("A"."ID"="B"."ID")

-- 11.2.0.x Plan with INLINE hint
----------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |
----------------------------------------------------
|   0 | SELECT STATEMENT    |      | 10000 |   253K|
|*  1 |  HASH JOIN          |      | 10000 |   253K|
|   2 |   VIEW              |      |  1000 | 13000 |
|   3 |    TABLE ACCESS FULL| T1   |  1000 |  4000 |
|   4 |   VIEW              |      |  1000 | 13000 |
|   5 |    TABLE ACCESS FULL| T1   |  1000 |  4000 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("A"."ID"="B"."ID")
the following becomes obvious:

- There are vastly different cardinality estimates possible depending on whether the transformation gets used or not. In principle due to the NO_MERGE hints used, the transformation shouldn't have any impact on the estimates, but it does

- Looking at the optimizer trace file some information seems to get lost when the transformation gets used, in particular column related statistics

- This information loss, like in the example here, can lead to join cardinality estimates of 0 (rounded to 1 in the plan output)

- And even worse, at least in pre-12c versions, such a 0 cardinality estimate can't be corrected via OPT_ESTIMATE hints, since the correction somehow gets ignored / lost, although it is mentioned in the optimizer trace:
11.2.0.1:
  Column (#1): ID(
    AvgLen: 22 NDV: 1 Nulls: 0 Density: 0.000000
  Column (#1): ID(
    AvgLen: 22 NDV: 1 Nulls: 0 Density: 0.000000
Join Card:  0.000000 = = outer (1000.000000) * inner (1000.000000) * sel (0.000000)
>> Join Card adjusted from 0.000000  to: 1000.000000, prelen=2
Adjusted Join Cards: adjRatio=1.00 cardHjSmj=0.000000 cardHjSmjNPF=0.000000 cardNlj=0.000000 cardNSQ=1000.000000 cardNSQ_na=0.000000
Join Card - Rounded: 1 Computed: 0.00
The behaviour regarding the OPT_ESTIMATE hint changes in 12c, but 12c introduces other oddities that are not there in pre-12c versions: have a look at the "Query Block" section when using the INLINE variant of the query - there are two identical fully qualified object names, clearly a bug, making it impossible to hint that query block using the global hint syntax.

Although my simple example here can be corrected via extended statistics on the join column expression used in the CTE query, my point is that vastly different and extreme cardinality estimates are possible depending on whether the transformation gets used or not - and those extreme cases can't even be corrected in pre-12c versions.
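As a hedged sketch of the extended statistics just mentioned, assuming the T1 table from the example and the expression used in the CTE:
select dbms_stats.create_extended_stats(null, 't1', '(id + 1)') from dual;

exec dbms_stats.gather_table_stats(null, 't1')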

For example, I recently had a real-life case where two columns with a significant number of NULL values were joined, one coming from a temp table transformation row source. Without the transformation the join cardinality estimates were reasonable, but the transformation again led to such a 0 cardinality estimate (that couldn't be corrected via a (correctly specified) OPT_ESTIMATE hint), ruining the whole plan.