Sunday, August 16, 2015

Parallel Projection

A recent case at a client reminded me of something that isn't really new but not so well known - Oracle by default performs evaluation at the latest possible point in the execution plan.

So if you happen to have expressions in the projection of a simple SQL statement that runs parallel it might be counter-intuitive that by default Oracle won't evaluate the projection in the Parallel Slaves but in the Query Coordinator - even if it was technically possible - because the latest possible point is the SELECT operation with the ID = 0 of the plan, which is always performed by the Query Coordinator.

Of course, if you make use of expressions that can't be evaluated in parallel or aren't implemented for parallel evaluation, then there is no other choice than doing this in the Query Coordinator.

The specific case in question was a generic export functionality that allowed exporting report results to some CSV or Excel like format, and some of these reports had a lot of rows and complex - in that case CPU intensive - expressions in their projection clause.

When looking at the run time profile of such an export query it became obvious that although it was a (very simple) parallel plan, all of the time was spent in the Query Coordinator, effectively turning this at runtime into a serial execution.

This effect can be reproduced very easily:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel cache;

-- Run some CPU intensive expressions in the projection 
-- of a simple parallel Full Table Scan
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
;

-- The plan is clearly parallel
--------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |  2000K|   192M|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------

-- But the runtime profile looks more serial
-- although the Parallel Slaves get used to run the Full Table Scan
-- All time spent in the operation ID = 0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation            | Name     | Execs | A-Rows| ReadB | ReadReq | Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                     | Parallel Execution Skew ASH      | Activity Graph ASH          | Top 5 Activity ASH  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT     |          |    5  | 2000K |       |         |     3 |   136 |   120 | #################### |   1:sqlplus.exe(120)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]       | ################################ | @@@@@@@@@@@@@@@@@@@ ( 98%)  | ON CPU(120)         |
|   1 |   0 |  PX COORDINATOR      |          |    5  | 2000K |       |         |   119 |     1 |     1 |                 #    |   1:sqlplus.exe(1)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]         |                                  |                     ( .8%)  | ON CPU(1)           |
|   2 |   1 |   PX SEND QC (RANDOM)| :TQ10000 |    4  | 2000K |       |         |    66 |    11 |     2 |          ##          |   2:P00B(1)[508K],P00A(1)[490K],P008(0)[505K],P009(0)[497K],sqlplus.exe(0)[0] |                                  |                     (1.6%)  | PX qref latch(2)    |
|   3 |   2 |    PX BLOCK ITERATOR |          |    4  | 2000K |       |         |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
|*  4 |   3 |     TABLE ACCESS FULL| T_1      |   52  | 2000K |   23M |     74  |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fortunately there is a simple and straightforward way to make use of the Parallel Slaves for evaluation of projection expressions that can be evaluated in parallel - simply add a suitable NO_MERGE hint for the query block that you want the projection to be evaluated for in the Parallel Slaves.

If you don't want to have side effects on the overall plan shape by not merging views you could always wrap the original query in an outer SELECT and not merging the now inner query block. There seems to be a rule that the projection of a view always get evaluated at the VIEW operator, and if we check the execution plan we can see that the VIEW operator is marked parallel:
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select /*+ no_merge(x) */ * from (
select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
) x
;

-- View operator is marked parallel
-- This is were the projection clause of the VIEW will be evaluated
---------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  2000K|    11G|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    VIEW               |          |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|   4 |     PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------

-- Runtime profile now shows effective usage of Parallel Slaves 
-- for doing the CPU intensive work
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation             | Name     | Execs | A-Rows| Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                         | Parallel Execution Skew ASH| Activity Graph ASH          | Top 5 Activity ASH               |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT      |          |    5  | 2000K |       |       |       |                      |   0:sqlplus.exe(0)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]             |                            |                             |                                  |
|   1 |   0 |  PX COORDINATOR       |          |    5  | 2000K |    17 |    63 |    10 |    # ##   #    ####  |   1:sqlplus.exe(10)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]            | ####                       | *                   (5.6%)  | resmgr:cpu quantum(10)           |
|   2 |   1 |   PX SEND QC (RANDOM) | :TQ10000 |    4  | 2000K |     5 |    61 |    10 | ## #  ## ## ## #     |   3:P002(5)[544K],P001(4)[487K],P000(1)[535K],P003(0)[434K],sqlplus.exe(0)[0]     | #                          |                     (5.6%)  | ON CPU(7),resmgr:cpu quantum(3)  |
|   3 |   2 |    VIEW               |          |    4  | 2000K |     2 |    82 |    69 | #################### |   4:P003(42)[434K],P001(35)[487K],P000(26)[535K],P002(22)[544K],sqlplus.exe(0)[0] | ############               | @@@@@@@@@@@@@@@@@@@ ( 70%)  | ON CPU(125)                      |
|   4 |   3 |     PX BLOCK ITERATOR |          |    4  | 2000K |       |       |       |                      |   0:P002(0)[544K],P000(0)[535K],P001(0)[487K],P003(0)[434K],sqlplus.exe(0)[0]     |                            |                             |                                  |
|*  5 |   4 |      TABLE ACCESS FULL| T_1      |   52  | 2000K |     3 |    78 |    29 | ###### ####### # ### |   4:P000(11)[535K],P002(8)[544K],P001(8)[487K],P003(7)[434K],sqlplus.exe(0)[0]    | ###                        | *****               ( 19%)  | resmgr:cpu quantum(30),ON CPU(4) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
At runtime the duration of the query now gets reduced significantly and we can see the Parallel Slaves getting used when the VIEW operator gets evaluated. Although the overall CPU time used is similar to the previous example, the duration of the query execution is less since this CPU time is now spent in parallel in the slaves instead in the Query Coordinator.

Summary


By default Oracle performs evaluation at the latest possible point of the execution plan. Sometimes you can improve runtime by actively influencing when the projection will be evaluated by preventing view merging and introducing a VIEW operator that will be used to evaluate the projection clause.

The optimizer so far doesn't seem to incorporate such possibilities in its evaluations of possible plan shapes, so this is something you need to do manually up to and including Oracle 12c (version 12.1.0.2 as of time of writing this).

Sunday, July 26, 2015

12c Parallel Execution New Features: Parallel FILTER Subquery Evaluation - Part 1: Introduction

12c introduces another interesting new Parallel Execution feature - the parallel evaluation of FILTER subqueries. In pre-12c FILTER subqueries always had to be evaluated in the Query Coordinator. This had several consequences, in particular the data driving the FILTER subquery always had to flow through the Query Coordinator, and hence represented a forced serial execution part of a parallel execution plan. This limitation also meant that depending on the overall plan shape the parallel plan was possibly decomposed into multiple DFO trees, leading to other side effects I've outlined in some of my other publications already.

In 12c now the FILTER subquery can be evaluated in the Parallel Slaves, and the driving data no longer needs to be processed in the Query Coordinator. However, the resulting plan shape can be a little bit confusing. Let's have a look at a simple example:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel 4;

create index t_1_idx on t_1 (id) invisible;

explain plan for
select /*+ 
           --optimizer_features_enable('11.2.0.4')
        */ count(*) from
        t_1 t
        where exists (select /*+ no_unnest */ null from t_1 where t.id = t_1.id);

-- 11.2.0.4 plan shape with index invisible
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     6 |   440M  (2)| 04:47:04 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     6 |            |          |        |      |            |
|*  2 |   FILTER               |          |       |       |            |          |        |      |            |
|   3 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)| :TQ20000 |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | P->S | QC (RAND)  |
|   5 |      PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q2,00 | PCWP |            |
|   7 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   8 |     PX SEND QC (RANDOM)| :TQ10000 |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   9 |      PX BLOCK ITERATOR |          |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | PCWC |            |
|* 10 |       TABLE ACCESS FULL| T_1      |     1 |     6 |   222   (2)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
  10 - filter("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index invisible
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     6 |  1588M  (2)| 17:14:09 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     6 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     6 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     6 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T_1      |     1 |     6 |   798   (2)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   8 - filter("T_1"."ID"=:B1)

-- 11.2.0.4 plan shape with index visible
----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     6 |  5973K  (1)| 00:03:54 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     6 |            |          |        |      |            |
|*  2 |   FILTER               |          |       |       |            |          |        |      |            |
|   3 |    PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   4 |     PX SEND QC (RANDOM)| :TQ10000 |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   5 |      PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  7 |    INDEX RANGE SCAN    | T_1_IDX  |     1 |     6 |     3   (0)| 00:00:01 |        |      |            |
----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   7 - access("T_1"."ID"=:B1)

-- 12.1.0.2 plan shape with index visible
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |     6 |  5973K  (1)| 00:03:54 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |     6 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |     6 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |     6 |            |          |  Q1,00 | PCWP |            |
|*  5 |      FILTER             |          |       |       |            |          |  Q1,00 | PCWC |            |
|   6 |       PX BLOCK ITERATOR |          |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       INDEX RANGE SCAN  | T_1_IDX  |     1 |     6 |     3   (0)| 00:00:01 |        |      |            |
-----------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - filter( EXISTS (SELECT /*+ NO_UNNEST */ 0 FROM "T_1" "T_1" WHERE "T_1"."ID"=:B1))
   8 - access("T_1"."ID"=:B1)

I've included two variations of the setup, one without available index for evaluating the FILTER subquery and one with index.

The pre-12c plan shape without index makes the former limitation particularly obvious: The FILTER operator is above the PX COORDINATOR and marked serial, and the table scan in the FILTER subquery gets parallelized as separate DFO tree (indicated among others by the two PX COORDINATOR operators), which means that each time this separate DFO tree starts, a separate set of Parallel Slave will be allocated/deallocated, adding possibly a lot of overhead to a probably already inefficient execution plan anyway - assuming the FILTER subquery needs to be evaluated many times.

In 12c the FILTER operator is marked parallel and the need for a separate DFO tree is gone. What might be confusing with this plan shape is that the operations of the FILTER subquery are not marked parallel. In my opinion this is misleading and should actually be marked parallel, because at runtime the operations will be performed by the Parallel Slaves, and in case of a Full Table Scan each slave will run the entire full table scan (so no PX ITERATOR for dividing the scan into chunks / granules), which is comparable to what happens when a parallel Nested Loop join runs or the new PQ_REPLICATE feature gets used - and in those cases the operations are marked parallel:
-- 11.2.0.4 / 12.1.0.2 plan shape with index invisible
-- and subquery unnested using NL SEMI join
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |     1 |    12 |   442M  (2)| 04:48:03 |        |      |            |
|   1 |  SORT AGGREGATE         |          |     1 |    12 |            |          |        |      |            |
|   2 |   PX COORDINATOR        |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)  | :TQ10000 |     1 |    12 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE      |          |     1 |    12 |            |          |  Q1,00 | PCWP |            |
|   5 |      NESTED LOOPS SEMI  |          |  2000K|    22M|   442M  (2)| 04:48:03 |  Q1,00 | PCWP |            |
|   6 |       PX BLOCK ITERATOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T_1      |  2000K|    11M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |       TABLE ACCESS FULL | T_1      |  2000K|    11M|   796   (2)| 00:00:01 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   8 - filter("T"."ID"="T_1"."ID")

Summary


So the operators of the FILTER subquery can now be run in the slaves, which is the main point of this feature, although being represented in a confusing way in the execution plan. Avoiding the potential decomposition into multiple DFO trees is another possible side effect. Decreased query duration should be possible if the evaluation of the FILTER subquery requires significant time and can now be run in the Parallel Slaves instead of serial execution through the Query Coordinator.

Note that depending on the plan shape and SQL features used, it's still possible that 12c reverts to the old serial FILTER subquery evaluation plan shape, so the new feature doesn't get used always.

There is more to say about this feature. In the next part of this instalment I'll focus on the different distribution methods possible with the new parallel FILTER operator - there is a new PQ_FILTER hint that allows controlling the distribution, but there are also some interesting points to make about how the optimizer seems to make its choice which distribution method to use automatically. In the examples shown here there's no separate distribution for the FILTER, by the way, but this can look differently, as I'll show in the next part.