Saturday, March 29, 2014

New Version Of XPLAN_ASH Utility

A minor update, version 4.01, of the XPLAN_ASH utility is available for download.

As usual, the latest version can be downloaded here.

These are the notes from the change log:

- More info for RAC Cross Instance Parallel Execution: many sections now show GLOBAL aggregate information in addition to the instance-specific data

- The Parallel Execution Server Set detection and ASSUMED_DEGREE info now make use of the undocumented PX_STEP_ID and PX_STEPS_ARG info (bit mask part of the PX_FLAGS column) on 11.2.0.2+

- Since version 4.0 added (from 11.2.0.2 on) the PX *MAX* DOP to the "SQL statement execution ASH Summary", based on the new PX_FLAGS column of ASH, it makes sense to also add a PX *MIN* DOP to the summary, so you can see at a glance whether different DOPs were used or not (see the sketches after these notes)

- The "Active DOPs" column in the "Activity Timeline based on ASH" was extended/modified: The number in parantheses is no longer the simple count of samples but the Average Active Sessions (AAS) per DFO / bucket.

From 11.2.0.2 on it now also shows the DOP of the DFO in brackets, so the
output can now look like this:

1[16] (14.5)

which means DFO 1 at a DOP of 16 had an AAS value of 14.5 for this time
bucket. If there are multiple DFOs active in the time bucket, they are
separated by commas:

1[16] (3.5),2[4] (1.5)

which means DFO 1 at a DOP of 16 had an AAS value of 3.5 and DFO 2 at a
DOP of 4 had an AAS value of 1.5 for this time bucket.
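
For illustration, here are two minimal sketches of the ideas behind these changes. Both rely on the undocumented PX_FLAGS column of (G)V$ACTIVE_SESSION_HISTORY (so a Diagnostic Pack license is required), and in particular on the assumption - based on observation, not on any documentation - that the actual DOP of a sample is encoded in the upper bits as TRUNC(PX_FLAGS / 2097152):

-- PX *MIN* / *MAX* DOP of one SQL execution (sketch, 11.2.0.2+)
-- The divisor 2097152 = 2^21 is an assumption about the bit mask layout
select
        min(trunc(px_flags / 2097152)) as px_min_dop
      , max(trunc(px_flags / 2097152)) as px_max_dop
from
        v$active_session_history
where
        sql_id = '&sql_id'
and     sql_exec_id = &sql_exec_id
and     trunc(px_flags / 2097152) > 0;

The Average Active Sessions figure per time bucket follows from the fact that ASH samples each active session approximately once per second, so the number of samples per bucket divided by the bucket size in seconds yields the AAS. This sketch uses ten-second buckets and omits the per-DFO breakdown, which requires mapping plan lines to DFOs:

-- AAS per ten-second bucket of one SQL execution (sketch)
select
        floor((cast(sample_time as date) - sql_exec_start) * 86400 / 10) as bucket
      , round(count(*) / 10, 1) as aas
from
        v$active_session_history
where
        sql_id = '&sql_id'
and     sql_exec_id = &sql_exec_id
group by
        floor((cast(sample_time as date) - sql_exec_start) * 86400 / 10)
order by
        bucket;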

A new version 4.1 that includes new 12c features is already underway, so stay tuned.

Tuesday, March 18, 2014

Parallel Execution Overhead

I've started a new mini series about "Parallel Execution Skew" at AllThingsOracle.com. In order to avoid bloating the articles there too much, I'll post some accompanying notes here on my blog.

If you follow the initial post you'll see that the sample query demonstrated there scales almost perfectly with Parallel Execution - the serial execution takes 57 seconds on my test system, whereas a Parallel Execution at a DOP of 4 takes between 14 and 15 seconds, so almost four times faster.

Here, again, are the corresponding two-table join queries along with their execution plans, which I haven't included in the "Parallel Execution Skew" article - the plans were generated using Oracle database version 12c, as you can see from the new HYBRID HASH distribution method for parallel plans:
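
In case you want to reproduce the plan output shown below: it corresponds to what a plain EXPLAIN PLAN followed by DBMS_XPLAN.DISPLAY produces - a minimal sketch (using a trivial statement as placeholder; substitute the full query text as listed below):

explain plan for
select count(*) from t_1;

select * from table(dbms_xplan.display);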

Serial query:
select count(t_2_filler) from (
select  /*+ monitor
            no_parallel
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
        */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_2
where
        t_2.fk_id_uniform = t_1.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Execution plan:
------------------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |   213 |       | 24508   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |   213 |       |            |          |
|*  2 |   HASH JOIN         |      |   100K|    20M|   225M| 24508   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T_1  |  2000K|   202M|       |   796   (2)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T_2  |  2000K|   204M|       |  1289   (2)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
       filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','    
                  \1',1,1,'c')>= REGEXP_REPLACE ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','
                      \1',1,1,'c'))

Parallel query:
select count(t_2_filler) from (
select  /*+ monitor
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
            pq_distribute(t_2 hash hash)
        */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_2
where
        t_2.fk_id_uniform = t_1.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Execution plan:
------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     1 |   213 |       |  6186   (1)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE              |          |     1 |   213 |       |            |          |        |      |            |
|   2 |   PX COORDINATOR             |          |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)       | :TQ10002 |     1 |   213 |       |            |          |  Q1,02 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE           |          |     1 |   213 |       |            |          |  Q1,02 | PCWP |            |
|*  5 |      HASH JOIN               |          |   100K|    20M|    56M|  6186   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   6 |       PX RECEIVE             |          |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX SEND HYBRID HASH   | :TQ10000 |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|   8 |         STATISTICS COLLECTOR |          |       |       |       |            |          |  Q1,00 | PCWC |            |
|   9 |          PX BLOCK ITERATOR   |          |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL  | T_1      |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |       PX RECEIVE             |          |  2000K|   204M|       |   358   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |        PX SEND HYBRID HASH   | :TQ10001 |  2000K|   204M|       |   358   (2)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  13 |         PX BLOCK ITERATOR    |          |  2000K|   204M|       |   358   (2)| 00:00:01 |  Q1,01 | PCWC |            |
|  14 |          TABLE ACCESS FULL   | T_2      |  2000K|   204M|       |   358   (2)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
       filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c')>= REGEXP_REPLACE 
              ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c'))


Note: If you're interested in learning more about reading Parallel Execution plans, Jonathan Lewis recently completed a series of posts about that topic.

Looking at a corresponding Real-Time SQL Monitoring report of the Parallel Execution (see the post at AllThingsOracle.com), it can also be seen that it took approx. 58 seconds of Database Time (spread across four worker processes), so really pretty close to the duration / Database Time of the serial execution.

However, in real-life cases that have to process large amounts of data and use more complex execution plans, you'll usually see Parallel Execution not scaling perfectly, even if work distribution is not the problem, as discussed in the article series.

To some degree this is because Parallel Execution comes with overhead, so the database actually has to perform more work than with a comparable serial execution.
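
One way to get an impression of part of that additional work - the data redistribution between the Parallel Execution Server sets - is the dynamic performance view V$PQ_TQSTAT, which shows the rows and bytes that flowed through each Table Queue. Note that it only returns data when queried from the same session right after the parallel execution has completed - a minimal sketch:

-- Run in the same session immediately after the parallel query
select
        dfo_number
      , tq_id
      , server_type
      , process
      , num_rows
      , bytes
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type desc
      , process;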

The purpose of this post is to demonstrate this overhead using the set-up from the article series, by simply extending the two-table join used there to a three-table join.

Here is the corresponding serial query:

select count(t_2_filler) from (
select  /*+ monitor
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
            use_hash(a)
            no_swap_join_inputs(a)
            opt_estimate(join (t_1 t_2) scale_rows=20)
            no_parallel
         */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_1 a
      , t_2
where
        t_2.fk_id_uniform = t_1.id
and     t_2.id = a.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Execution plan:
-------------------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |   225 |       | 49560   (1)| 00:00:02 |
|   1 |  SORT AGGREGATE      |      |     1 |   225 |       |            |          |
|*  2 |   HASH JOIN          |      |  2006K|   430M|   441M| 49560   (1)| 00:00:02 |
|*  3 |    HASH JOIN         |      |  2003K|   418M|   225M| 25172   (1)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T_1  |  2000K|   202M|       |   796   (2)| 00:00:01 |
|   5 |     TABLE ACCESS FULL| T_2  |  2000K|   215M|       |  1289   (2)| 00:00:01 |
|   6 |    TABLE ACCESS FULL | T_1  |  2000K|    11M|       |   795   (2)| 00:00:01 |
-------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("T_2"."ID"="A"."ID")
   3 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
       filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','     
                 \1',1,1,'c')>= REGEXP_REPLACE ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','  
                    \1',1,1,'c'))

And here is the parallel query:
select count(t_2_filler) from (
select  /*+ monitor
            leading(t_1 t_2)
            use_hash(t_2)
            no_swap_join_inputs(t_2)
            pq_distribute(t_2 hash hash)
            use_hash(a)
            no_swap_join_inputs(a)
            pq_distribute(a hash hash)
            opt_estimate(join (t_1 t_2) scale_rows=20)
         */
        t_1.id as t_1_id
      , t_1.filler as t_1_filler
      , t_2.id as t_2_id
      , t_2.filler as t_2_filler
from    t_1
      , t_1 a
      , t_2
where
        t_2.fk_id_uniform = t_1.id
and     t_2.id = a.id
and     regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);

Execution plan:
----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |     1 |   225 |       | 12473   (1)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                  |          |     1 |   225 |       |            |          |        |      |            |
|   2 |   PX COORDINATOR                 |          |       |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)           | :TQ10004 |     1 |   225 |       |            |          |  Q1,04 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE               |          |     1 |   225 |       |            |          |  Q1,04 | PCWP |            |
|*  5 |      HASH JOIN                   |          |  2006K|   430M|   110M| 12473   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   6 |       PX RECEIVE                 |          |  2003K|   418M|       |  6354   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   7 |        PX SEND HYBRID HASH       | :TQ10002 |  2003K|   418M|       |  6354   (1)| 00:00:01 |  Q1,02 | P->P | HYBRID HASH|
|   8 |         STATISTICS COLLECTOR     |          |       |       |       |            |          |  Q1,02 | PCWC |            |
|*  9 |          HASH JOIN BUFFERED      |          |  2003K|   418M|    56M|  6354   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |           PX RECEIVE             |          |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  11 |            PX SEND HYBRID HASH   | :TQ10000 |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  12 |             STATISTICS COLLECTOR |          |       |       |       |            |          |  Q1,00 | PCWC |            |
|  13 |              PX BLOCK ITERATOR   |          |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  14 |               TABLE ACCESS FULL  | T_1      |  2000K|   202M|       |   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  15 |           PX RECEIVE             |          |  2000K|   215M|       |   358   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|  16 |            PX SEND HYBRID HASH   | :TQ10001 |  2000K|   215M|       |   358   (2)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  17 |             PX BLOCK ITERATOR    |          |  2000K|   215M|       |   358   (2)| 00:00:01 |  Q1,01 | PCWC |            |
|  18 |              TABLE ACCESS FULL   | T_2      |  2000K|   215M|       |   358   (2)| 00:00:01 |  Q1,01 | PCWP |            |
|  19 |       PX RECEIVE                 |          |  2000K|    11M|       |   221   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|  20 |        PX SEND HYBRID HASH       | :TQ10003 |  2000K|    11M|       |   221   (1)| 00:00:01 |  Q1,03 | P->P | HYBRID HASH|
|  21 |         PX BLOCK ITERATOR        |          |  2000K|    11M|       |   221   (1)| 00:00:01 |  Q1,03 | PCWC |            |
|  22 |          TABLE ACCESS FULL       | T_1      |  2000K|    11M|       |   221   (1)| 00:00:01 |  Q1,03 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
 
   5 - access("T_2"."ID"="A"."ID")
   9 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
       filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c')>= REGEXP_REPLACE 
              ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','        \1',1,1,'c'))

Note: The query uses an undocumented OPT_ESTIMATE hint to correct the bad join cardinality estimate caused by the complex REGEXP_REPLACE expression that is part of the join condition - since the execution plan is pretty much dictated by the hints, this is not strictly necessary in this case. Out of curiosity you could remove the OPT_ESTIMATE and PQ_DISTRIBUTE(A HASH HASH) hints. If you do so, you'll notice that the optimizer then uses a BROADCAST distribution for the join result of T_1 and T_2 due to the incorrect cardinality estimate - so choosing the appropriate distribution method is another area where cardinality estimates can cause a lot of trouble, in particular under-estimates that lead to large row sources being distributed via broadcast.

On my test system the serial execution still takes approx. 59 seconds, so not much longer than the serial two-table join, whereas the Parallel Execution takes between 28 and 30 seconds at a DOP of 4 - clearly not the four-times improvement seen with the simple two-table join (14-15 seconds), although the work was distributed evenly across the worker processes.

We can see this confirmed by looking at a corresponding Real-Time SQL Monitoring report:


Notice how for the majority of the time four Parallel Execution Servers were active, so work distribution is not the problem here. We can get a clue that overhead is the problem by looking at the "Time & Wait Statistics" section in the "Overview" at the top of the report: it shows a "Database Time" of two minutes (of which almost 100% is CPU time), so clearly the Parallel Execution had to work much more than the serial execution, which took close to one minute.
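
The "Database Time" and its CPU / wait breakdown reported there can also be approximated from ASH (again requiring a Diagnostic Pack license), since each sample of an active session represents roughly one second of Database Time - a minimal sketch:

-- Approximate Database Time split by session state for one SQL execution
select
        session_state
      , count(*) as db_time_secs_approx
from
        v$active_session_history
where
        sql_id = '&sql_id'
and     sql_exec_id = &sql_exec_id
group by
        session_state;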

What seems to happen here is this: the HASH JOIN BUFFERED operation actually processes the probe row source of the hash join twice - once to buffer it, keeping only rows that might survive the join according to the internal bitmap generated as part of the hash table build process (so false positives are possible; see the HASH JOIN BUFFERED post for more background information), and a second time to perform the actual join by processing the previously buffered data. As a consequence, the costly regular expression that is part of the join condition is evaluated twice as many times as in the serial execution.

Now this particular case exaggerates the overhead due to the CPU-intensive expression - nevertheless it is a good demonstration that you shouldn't always expect Parallel Execution to scale perfectly, even with an almost perfect work distribution.

Analysing Parallel Execution Skew - Video Tutorial

Alongside the new mini series "Parallel Execution Skew" at AllThingsOracle.com, which provides information on what you can do if you happen to have a parallel SQL execution that is affected by work distribution problems, I'm publishing a series of video tutorials that explain how you can actually detect and measure whether a SQL execution is affected by skew or not.

Originally this tutorial was planned as one part (Part 5, actually) of the XPLAN_ASH video tutorial series; however, so far I've only managed to publish the initial two parts of that series, and these are already a bit outdated, as I've released new versions of the XPLAN_ASH tool with significant changes and enhancements since then.

Since this tutorial covers a lot of information and I don't want to bother you watching a single multi-hour video, I'm splitting this tutorial into four parts:
  • Brief introduction
  • Explaining Data Flow Operations (DFOs)
  • Analysing Parallel Execution Skew without Diagnostic / Tuning Pack license
  • Analysing Parallel Execution Skew with Diagnostic / Tuning Pack license
So here is the initial part on my YouTube channel, a brief introduction to what the "Parallel Execution Skew" problem is about.

The links mentioned at the beginning of the video are:
Webinar recording "Analysing and Troubleshooting Parallel Execution"
OTN mini series "Understanding Parallel Execution"