I've started a new mini series about
"Parallel Execution Skew" at
AllThingsOracle.com. In order to avoid bloating the articles too much there I'll post some accompanying notes here on my blog.
If you follow the
initial post you'll see that the sample query demonstrated there scales almost perfectly with Parallel Execution - the
serial execution takes
57 seconds on my test system, whereas the a Parallel Execution at a
DOP of 4 takes something between
14 and 15 seconds, so almost four times faster.
Here are again the corresponding two table join queries along with their execution plans that I haven't included in the "Parallel Execution Skew" article - the plans are generated using Oracle database version 12c which you can see from the new HYBRID HASH distribution method for parallel plans:
Serial query:
select count(t_2_filler) from (
select /*+ monitor
no_parallel
leading(t_1 t_2)
use_hash(t_2)
no_swap_join_inputs(t_2)
*/
t_1.id as t_1_id
, t_1.filler as t_1_filler
, t_2.id as t_2_id
, t_2.filler as t_2_filler
from t_1
, t_2
where
t_2.fk_id_uniform = t_1.id
and regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);
Execution plan:
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 213 | | 24508 (1)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 213 | | | |
|* 2 | HASH JOIN | | 100K| 20M| 225M| 24508 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL| T_1 | 2000K| 202M| | 796 (2)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T_2 | 2000K| 204M| | 1289 (2)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','
\1',1,1,'c')>= REGEXP_REPLACE ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','
\1',1,1,'c'))
Parallel Query:
select count(t_2_filler) from (
select /*+ monitor
leading(t_1 t_2)
use_hash(t_2)
no_swap_join_inputs(t_2)
pq_distribute(t_2 hash hash)
*/
t_1.id as t_1_id
, t_1.filler as t_1_filler
, t_2.id as t_2_id
, t_2.filler as t_2_filler
from t_1
, t_2
where
t_2.fk_id_uniform = t_1.id
and regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);
Execution plan:
------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 213 | | 6186 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | 213 | | | | | | |
| 2 | PX COORDINATOR | | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 1 | 213 | | | | Q1,02 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 213 | | | | Q1,02 | PCWP | |
|* 5 | HASH JOIN | | 100K| 20M| 56M| 6186 (1)| 00:00:01 | Q1,02 | PCWP | |
| 6 | PX RECEIVE | | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,02 | PCWP | |
| 7 | PX SEND HYBRID HASH | :TQ10000 | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 8 | STATISTICS COLLECTOR | | | | | | | Q1,00 | PCWC | |
| 9 | PX BLOCK ITERATOR | | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 10 | TABLE ACCESS FULL | T_1 | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | PCWP | |
| 11 | PX RECEIVE | | 2000K| 204M| | 358 (2)| 00:00:01 | Q1,02 | PCWP | |
| 12 | PX SEND HYBRID HASH | :TQ10001 | 2000K| 204M| | 358 (2)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 13 | PX BLOCK ITERATOR | | 2000K| 204M| | 358 (2)| 00:00:01 | Q1,01 | PCWC | |
| 14 | TABLE ACCESS FULL | T_2 | 2000K| 204M| | 358 (2)| 00:00:01 | Q1,01 | PCWP | |
------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$',' \1',1,1,'c')>= REGEXP_REPLACE
("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$',' \1',1,1,'c'))
Note: If you're interesting in learning more about reading Parallel Execution plans, Jonathan Lewis recently completed a series of posts about that topic
Looking at a corresponding Real-Time SQL Monitoring report (see the
post at AllThingsOracle.com) of the Parallel Execution it can also be seen that it took approx.
58 seconds of Database Time (spread across
four worker processes), so really pretty close to the duration / Database Time of the serial execution.
However in real-life cases that have to process large amounts of data and use more complex execution plans you'll usually see Parallel Execution not scaling perfectly, even if the work distribution is not the problem as discussed in the article series.
To some degree this is because Parallel Execution comes with
overhead, and therefore the database actually has to work more than with a comparable serial execution.
The purpose of this post is to demonstrate this using the set-up used in the article series, by simply extending the
two table join used there to a
three table join.
Here is the corresponding serial query:
select count(t_2_filler) from (
select /*+ monitor
leading(t_1 t_2)
use_hash(t_2)
no_swap_join_inputs(t_2)
use_hash(a)
no_swap_join_inputs(a)
opt_estimate(join (t_1 t_2) scale_rows=20)
no_parallel
*/
t_1.id as t_1_id
, t_1.filler as t_1_filler
, t_2.id as t_2_id
, t_2.filler as t_2_filler
from t_1
, t_1 a
, t_2
where
t_2.fk_id_uniform = t_1.id
and t_2.id = a.id
and regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);
Execution plan:
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 225 | | 49560 (1)| 00:00:02 |
| 1 | SORT AGGREGATE | | 1 | 225 | | | |
|* 2 | HASH JOIN | | 2006K| 430M| 441M| 49560 (1)| 00:00:02 |
|* 3 | HASH JOIN | | 2003K| 418M| 225M| 25172 (1)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T_1 | 2000K| 202M| | 796 (2)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T_2 | 2000K| 215M| | 1289 (2)| 00:00:01 |
| 6 | TABLE ACCESS FULL | T_1 | 2000K| 11M| | 795 (2)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("T_2"."ID"="A"."ID")
3 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$','
\1',1,1,'c')>= REGEXP_REPLACE ("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$','
\1',1,1,'c'))
And here is the parallel query:
select count(t_2_filler) from (
select /*+ monitor
leading(t_1 t_2)
use_hash(t_2)
no_swap_join_inputs(t_2)
pq_distribute(t_2 hash hash)
use_hash(a)
no_swap_join_inputs(a)
pq_distribute(a hash hash)
opt_estimate(join (t_1 t_2) scale_rows=20)
*/
t_1.id as t_1_id
, t_1.filler as t_1_filler
, t_2.id as t_2_id
, t_2.filler as t_2_filler
from t_1
, t_1 a
, t_2
where
t_2.fk_id_uniform = t_1.id
and t_2.id = a.id
and regexp_replace(t_2.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') >= regexp_replace(t_1.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
);
Execution plan:
----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 225 | | 12473 (1)| 00:00:01 | | | |
| 1 | SORT AGGREGATE | | 1 | 225 | | | | | | |
| 2 | PX COORDINATOR | | | | | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10004 | 1 | 225 | | | | Q1,04 | P->S | QC (RAND) |
| 4 | SORT AGGREGATE | | 1 | 225 | | | | Q1,04 | PCWP | |
|* 5 | HASH JOIN | | 2006K| 430M| 110M| 12473 (1)| 00:00:01 | Q1,04 | PCWP | |
| 6 | PX RECEIVE | | 2003K| 418M| | 6354 (1)| 00:00:01 | Q1,04 | PCWP | |
| 7 | PX SEND HYBRID HASH | :TQ10002 | 2003K| 418M| | 6354 (1)| 00:00:01 | Q1,02 | P->P | HYBRID HASH|
| 8 | STATISTICS COLLECTOR | | | | | | | Q1,02 | PCWC | |
|* 9 | HASH JOIN BUFFERED | | 2003K| 418M| 56M| 6354 (1)| 00:00:01 | Q1,02 | PCWP | |
| 10 | PX RECEIVE | | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,02 | PCWP | |
| 11 | PX SEND HYBRID HASH | :TQ10000 | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | P->P | HYBRID HASH|
| 12 | STATISTICS COLLECTOR | | | | | | | Q1,00 | PCWC | |
| 13 | PX BLOCK ITERATOR | | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | PCWC | |
| 14 | TABLE ACCESS FULL | T_1 | 2000K| 202M| | 221 (1)| 00:00:01 | Q1,00 | PCWP | |
| 15 | PX RECEIVE | | 2000K| 215M| | 358 (2)| 00:00:01 | Q1,02 | PCWP | |
| 16 | PX SEND HYBRID HASH | :TQ10001 | 2000K| 215M| | 358 (2)| 00:00:01 | Q1,01 | P->P | HYBRID HASH|
| 17 | PX BLOCK ITERATOR | | 2000K| 215M| | 358 (2)| 00:00:01 | Q1,01 | PCWC | |
| 18 | TABLE ACCESS FULL | T_2 | 2000K| 215M| | 358 (2)| 00:00:01 | Q1,01 | PCWP | |
| 19 | PX RECEIVE | | 2000K| 11M| | 221 (1)| 00:00:01 | Q1,04 | PCWP | |
| 20 | PX SEND HYBRID HASH | :TQ10003 | 2000K| 11M| | 221 (1)| 00:00:01 | Q1,03 | P->P | HYBRID HASH|
| 21 | PX BLOCK ITERATOR | | 2000K| 11M| | 221 (1)| 00:00:01 | Q1,03 | PCWC | |
| 22 | TABLE ACCESS FULL | T_1 | 2000K| 11M| | 221 (1)| 00:00:01 | Q1,03 | PCWP | |
----------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("T_2"."ID"="A"."ID")
9 - access("T_2"."FK_ID_UNIFORM"="T_1"."ID")
filter( REGEXP_REPLACE ("T_2"."FILLER",'^\s+([[:alnum:]]+)\s+$',' \1',1,1,'c')>= REGEXP_REPLACE
("T_1"."FILLER",'^\s+([[:alnum:]]+)\s+$',' \1',1,1,'c'))
Note: The query uses an undocumented OPT_ESTIMATE hint to correct the bad join cardinality estimate caused by the complex REGEXP_REPLACE expression as part of the join condition - since the execution plan is pretty much dictated by the hints this is not strictly necessary in this case. For curiosity you could remove the OPT_ESTIMATE and PQ_DISTRIBUTE(A HASH HASH) hints. If you do so you'll notice that the optimizer uses then a BROADCAST distribution for the join result of T_1 and T_2 due to the incorrect cardinality estimate - so choosing the appropriate distribution method is another area where the cardinality estimates can cause a lot of trouble, in particular for under-estimates that lead to large row sources being distributed via broadcast.
On my test system the serial execution still takes
approx. 59 seconds, so not much longer than the serial two table join, whereas the Parallel Execution takes between
28 and 30 seconds at a
DOP of 4 - clearly not a four times improvement as for the simple two table join (14-15 seconds), although the work was distributed evenly across the worker processes.
We can see this confirmed by looking at a corresponding Real-Time SQL Monitoring report (click on the picture to enlarge):
Notice how for the majority of the time four Parallel Execution Servers were active, so work distribution is not the problem here. We can get a clue that some overhead was the problem here by looking at the
"Time & Wait Statistics" section in the
"Overview" at the top of the report: It shows a "Database Time" of
2 minutes (of which almost 100% are CPU time), so clearly the Parallel Execution had to work much more than the serial execution, which just took close to one minute.
What seems to happen here is that the overhead of the
HASH JOIN BUFFERED operation that actually processes the
probe row source of the hash join twice, once to buffer it (and by doing so buffering only rows that will survive the join according to the internal bitmap generated as part of the hash table build process (so false positives are possible), see the HASH JOIN BUFFERED post for more background information), and a second time to perform the actual join by processing the previously buffered data, causes the costly regular expression that is part of the join condition to be evaluated twice as many times as the serial execution has to do.
Now this particular case is an exaggeration of the overhead due to the CPU intensive expression - nevertheless it is a good demonstration that you shouldn't always expect Parallel Execution to scale perfectly, even with almost perfect work distribution.