Introduction
A few years ago Jonathan Lewis published a blog post that described one of the interesting side effects of Oracle's Parallel Execution implementation: Sometimes operations that usually are non-blocking will be turned into blocking ones. Mostly these show up as additional BUFFER SORT operations in the parallel version of an execution plan from 10g on (pre-10g does the same internally but doesn't show it in the execution plan), but there is a special case, which is the HASH JOIN BUFFERED operation that gets used with the HASH data distribution of the join row sources.
Jonathan came to the conclusion that the HASH JOIN BUFFERED turns into a blocking operation by buffering the result set of the join. He based this on different trace files (the 10104 hash join trace and the 10046 extended trace) and on the fact that when the join produced no result (no matches between the two row sources) the obvious spill to disk no longer happened.
He showed this using an example that simply joined two tables using Parallel Execution and returned the result of the join to the client. Another interesting point of this example is that the BUFFERED operation takes place although it is not entirely obvious why, if you follow the explanation that "at most two Parallel Slave Sets can be active per Data Flow Operation" and hence sometimes both sets are busy and a third set would be required to receive the data produced by the other two. This is not the case with this simple example, as the data produced by the two sets simply needs to be returned to the client by sending it to the Query Coordinator.
While preparing my "Parallel Execution" seminar I wanted to demonstrate this via a simple test case where two rather small tables are joined but produce a quite large result set. Since the HASH JOIN BUFFERED is supposed to buffer this large result set before returning it to the client, this should make this special behaviour quite obvious.
Test Case Results
However, the test case showed some interesting results that seemed to suggest that Jonathan's conclusions weren't entirely correct:
1. Although the 10104 hash join trace file reads "BUFFER(Compile) output of the join for Parallel Query" and "BUFFER(Execution) output of the join for PQ", it looks like the HASH JOIN BUFFERED operation in fact buffers the second row source rather than the result of the join operation. I'll demonstrate below why I believe this is so.
2. Although I could reproduce Jonathan's test case, in particular that no spill to disk takes place when the two row sources do not match, I believe that the point of the HASH JOIN BUFFERED is that it can take advantage of the fact that the hash table for the first row source is already built when the second row source is accessed. So in principle it looks like the operation only buffers data from the second row source that has a match in the hash table. Data that doesn't match isn't buffered - that's the special functionality of the HASH JOIN BUFFERED that makes it different from separate BUFFER SORT operations, which buffer unconditionally, and it explains why no obvious buffering takes place if the two row sources don't match.
3. When looking into the 10104 hash join trace file it becomes obvious that the spilling to disk of the odd 9th partition as described in Jonathan's post takes place before the actual probe phase seems to begin (kxhfSetPhase: phase=PROBE_2), which again I believe suggests that it cannot be the result set that gets buffered, since that is only produced once the probe phase begins.
4. The implementation restriction of Oracle's Parallel Execution that requires these additional, artificial blocking operations does not seem to be "at most two Parallel Slave Sets can be active at the same time", but more precisely it seems to be "at most one data distribution can be active at the same time".
This includes the final data distribution to the Query Coordinator process in case of queries and explains why even the simple case of a two table join using HASH distribution results in a BUFFERED operation: The PX SEND HASH operation for the second row source would have to be active at the same time as the PX SEND QC operation returning data to the Query Coordinator, because the HASH JOIN by default is only blocking while consuming the first row source to build the hash table - it isn't blocking while processing the second row source probing the hash table.
Since having two PX SEND operations active at the same time doesn't seem to be supported, some artificial blocking operation needs to be introduced, in this case the HASH JOIN BUFFERED, which first consumes the second row source completely before starting the actual probe phase. By doing so, the PX SEND operation used to distribute the second row source to the Parallel Slaves performing the hash join is no longer active when the actual probe phase starts, and therefore the result set can be produced and sent to the Query Coordinator using the then only active PX SEND QC operation.
The following formatted execution plan highlights the two PX SEND operations that would have to be active at the same time if there wasn't a blocking operation in between:
-------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | |
| 1 | COUNT | | 1 | | | |
|* 2 | FILTER | | 1 | | | |
| 3 | PX COORDINATOR | | 1 | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10002 | 0 | Q1,02 | P->S | QC (RAND) | <---------
| 5 | VIEW | | 2 | Q1,02 | PCWP | |
|* 6 | HASH JOIN BUFFERED | | 2 | Q1,02 | PCWP | | <--------- Artificial blocking
| 7 | PX RECEIVE | | 2 | Q1,02 | PCWP | |
| 8 | PX SEND HASH | :TQ10000 | 0 | Q1,00 | P->P | HASH |
| 9 | PX BLOCK ITERATOR | | 2 | Q1,00 | PCWC | |
|* 10 | TABLE ACCESS FULL| T2 | 26 | Q1,00 | PCWP | |
| 11 | PX RECEIVE | | 2 | Q1,02 | PCWP | |
| 12 | PX SEND HASH | :TQ10001 | 0 | Q1,01 | P->P | HASH | <---------
| 13 | PX BLOCK ITERATOR | | 2 | Q1,01 | PCWC | |
|* 14 | TABLE ACCESS FULL| T4 | 26 | Q1,01 | PCWP | |
-------------------------------------------------------------------------------------
Test Case Details
Here are the preparation steps for the simple test case I used:
create table t2
compress
as
select
rownum as id
, mod(rownum, 1000) + 1 as fk
, rpad('x', 100) as filler
from
dual
connect by
level <= 1000000
;
exec dbms_stats.gather_table_stats(null, 't2')
-- Create a copy of T2
create table t4
compress as
select * from t2;
-- Add rows to T4 that don't match T2
-- This shows that the BUFFERED operation discards those non-matching rows
-- rather than buffering them when using T4 as second row source
insert /*+ append */
into t4
select
1000000 + rownum as id
, 1000 + mod(rownum, 1000) + 1 as fk
, rpad('x', 100) as filler
from
dual
connect by
level <= 1000000
;
commit;
exec dbms_stats.gather_table_stats(null, 't4')
alter table t2 parallel 2;
alter table t4 parallel 2;
And this is basically the simple script that gets executed:
set echo on timing on time on
alter session set workarea_size_policy = manual;
alter session set hash_area_size = 80000000;
alter session set sort_area_size = 80000000;
alter session set statistics_level = all;
alter session set tracefile_identifier = 'HJ_trace1';
alter session set events '10104 trace name context forever, level 10';
--explain plan for
--select max(a_filler), max(b_filler) from (
select /* HJ_trace1 */ * from (
select /*+ no_merge use_hash(a b) no_cpu_costing leading(a b) pq_distribute(b, hash, hash) */
a.filler as a_filler, b.filler as b_filler
from
t2 a
, t4 b
where
-- a.id = b.id
a.fk = b.fk
)
where rownum > 1
;
alter session set events '10104 trace name context off';
set linesize 200 pagesize 0
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS ALL ADVANCED'));
I've used manual workarea sizing to make the test repeatable. Note that I've added a comment to the query that should make the query text unique in order to generate a new parent cursor for each test run (so you would need to modify this comment for each test run). The only reason for this is the limitation of DBMS_XPLAN.DISPLAY_CURSOR with Parallel Execution as outlined in one of my previous posts; otherwise the "ALLSTATS ALL" option of DISPLAY_CURSOR would aggregate the statistics over all executions rather than only the last parallel execution.
Because I also tested some other costing related issues I disabled CPU costing for this test; however, the results should be exactly the same with CPU costing enabled.
So there are basically two sets of data: T2 is 1,000K rows and approx. 100MB in size, and T4 has twice the rows and size; however, only 100MB out of the 200MB represents data that matches T2 on either ID or FK. The result set is approx. 200 bytes per row, so for example 200MB if 1,000K rows are produced.
The tables are compressed using BASIC compression, which in this case results in a very good compression ratio, as the FILLER column is 100 bytes in size but has only one distinct value and therefore benefits a lot from the symbol replacement performed by BASIC compression.
The point of this compression is that it makes very obvious in this particular case that, while you can benefit a lot from compression at the storage, I/O and Buffer Cache level in general, at SQL execution time Oracle has to process uncompressed row sources (and not compressed blocks). So all the workareas that have to be used for hash tables, sorts or simple buffering won't benefit from the compression but have to be big enough for the uncompressed row source data. Likewise, any data that gets distributed using PX SEND also represents uncompressed data volume.
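As a side note, the difference between the compressed segment size and the uncompressed data volume that workareas and PX SEND distributions have to cope with can be checked with a simple query like the following (just a sketch, assuming the tables reside in the current schema and statistics have been gathered as shown above):
select
        t.table_name
        -- compressed size on disk
      , round(s.bytes / 1024 / 1024) as segment_mb
        -- approximate uncompressed data volume based on the statistics
      , round(t.num_rows * t.avg_row_len / 1024 / 1024) as uncompressed_mb
from
        user_tables t
      , user_segments s
where
        t.table_name = s.segment_name
and     t.table_name in ('T2', 'T4')
;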
There are two important variations possible to this test case query:
1. Instead of joining on ID, which produces 1,000K rows, join the two tables on the FK column, which results in a huge result set of 1,000M rows (each row matches 1,000 rows from the other row source)
2. Use this variation of the query that doesn't require artificial blocking, because a true blocking operation gets used:
select max(a_filler), max(b_filler) from (
select /*+ no_merge use_hash(a b) no_cpu_costing leading(a b) pq_distribute(b, hash, hash) */
a.filler as a_filler, b.filler as b_filler
from
t2 a
, t4 b
where
-- a.id = b.id
a.fk = b.fk
)
;
This way the impact of the BUFFERED operation can easily be separated from other potential activity, like a workarea that is too small for an optimal hash join. If the operation completes without TEMP I/O activity when using a true blocking operation but spills to TEMP when running in BUFFERED mode, then the I/O activity very likely comes from the additional buffering of data.
Detailed Analysis
Small Result Set
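For reference, this variation is simply the script shown above with the ID join predicate active instead of the FK predicate:
select /* HJ_trace1 */ * from (
select /*+ no_merge use_hash(a b) no_cpu_costing leading(a b) pq_distribute(b, hash, hash) */
a.filler as a_filler, b.filler as b_filler
from
t2 a
, t4 b
where
a.id = b.id
-- a.fk = b.fk
)
where rownum > 1
;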
Running this variation in BUFFERED mode, where only 1,000K rows get produced, the following can be seen from the DBMS_XPLAN.DISPLAY_CURSOR output (assuming you don't run this cross-instance in RAC, as DBMS_XPLAN can only show relevant statistics for the local instance), using a block size of 8KB and a 32bit version of Oracle:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | O/1/M | Max-Tmp |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 3053 | | | | 0 |00:00:27.10 | 10 | 0 | 0 | | | | |
| 1 | COUNT | | 1 | | | | | | | 0 |00:00:27.10 | 10 | 0 | 0 | | | | |
|* 2 | FILTER | | 1 | | | | | | | 0 |00:00:27.10 | 10 | 0 | 0 | | | | |
| 3 | PX COORDINATOR | | 1 | | | | | | | 1000K|00:00:25.45 | 10 | 0 | 0 | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1001K| 99M| 3053 | Q1,02 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 5 | VIEW | | 2 | 1001K| 99M| 3053 | Q1,02 | PCWP | | 1000K|00:00:48.29 | 0 | 13208 | 13208 | | | | |
|* 6 | HASH JOIN BUFFERED | | 2 | 1001K| 203M| 3053 | Q1,02 | PCWP | | 1000K|00:00:44.91 | 0 | 13208 | 13208 | 130M| 8379K| 2/0/0| 54272 |
| 7 | PX RECEIVE | | 2 | 1000K| 101M| 171 | Q1,02 | PCWP | | 1000K|00:00:04.07 | 0 | 0 | 0 | | | | |
| 8 | PX SEND HASH | :TQ10000 | 0 | 1000K| 101M| 171 | Q1,00 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 9 | PX BLOCK ITERATOR | | 2 | 1000K| 101M| 171 | Q1,00 | PCWC | | 1000K|00:00:06.15 | 2066 | 2014 | 0 | | | | |
|* 10 | TABLE ACCESS FULL| T2 | 26 | 1000K| 101M| 171 | Q1,00 | PCWP | | 1000K|00:00:02.29 | 2066 | 2014 | 0 | | | | |
| 11 | PX RECEIVE | | 2 | 2000K| 204M| 352 | Q1,02 | PCWP | | 1000K|00:00:16.85 | 0 | 0 | 0 | | | | |
| 12 | PX SEND HASH | :TQ10001 | 0 | 2000K| 204M| 352 | Q1,01 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 13 | PX BLOCK ITERATOR | | 2 | 2000K| 204M| 352 | Q1,01 | PCWC | | 2000K|00:00:12.68 | 4212 | 4158 | 0 | | | | |
|* 14 | TABLE ACCESS FULL| T4 | 26 | 2000K| 204M| 352 | Q1,01 | PCWP | | 2000K|00:00:05.09 | 4212 | 4158 | 0 | | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The statement was executed with a Parallel Degree of 2. Notice that only about 2,000 and 4,000 blocks (16MB and 32MB respectively) were processed for reading the two row sources, but the hash join had to read and write two times approx. 50MB (100MB of uncompressed data volume in total, corresponding to the 13,208 blocks read and written at operation Id 6). If it was the result set that got buffered I would expect it to read/write 200MB in total.
If you repeat this variation with the true blocking operation then it shouldn't spill to disk at all, as indicated above by the "2/0/0" (two optimal hash joins) in the "O/1/M" column. This is interesting in its own right, since it confirms that the HASH JOIN operated in optimal mode despite the fact that it spilled to disk due to the BUFFERED operation.
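The same information can be cross-checked outside of DBMS_XPLAN with a query along the following lines against V$SQL_WORKAREA (just a sketch; the &sql_id placeholder stands for the SQL_ID of the test statement), which shows the optimal executions of the hash join workarea together with the temp segment size caused by the buffering:
select
        operation_type
      , operation_id
      , policy
      , round(last_memory_used / 1024 / 1024) as last_mem_mb
      , optimal_executions
      , onepass_executions
      , multipasses_executions
        -- temp segment size of the last execution, caused here by the BUFFERED operation
      , round(last_tempseg_size / 1024 / 1024) as last_temp_mb
from
        v$sql_workarea
where
        sql_id = '&sql_id'
order by
        operation_id
;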
Large Result Set
Running the variation in BUFFERED mode that joins on FK and therefore produces a huge result set looks like this (cancelled after a short while, so not run to completion here):
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | O/1/M |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 58314 | | | | 0 |00:00:00.01 | 10 | 0 | 0 | | | |
| 1 | COUNT | | 1 | | | | | | | 0 |00:00:00.01 | 10 | 0 | 0 | | | |
|* 2 | FILTER | | 1 | | | | | | | 0 |00:00:00.01 | 10 | 0 | 0 | | | |
| 3 | PX COORDINATOR | | 1 | | | | | | | 3694K|00:00:35.06 | 10 | 0 | 0 | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1000M| 96G| 58314 | Q1,02 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 5 | VIEW | | 2 | 1000M| 96G| 58314 | Q1,02 | PCWP | | 3694K|00:01:03.14 | 0 | 508 | 12827 | | | |
|* 6 | HASH JOIN BUFFERED | | 2 | 1000M| 195G| 58314 | Q1,02 | PCWP | | 3694K|00:00:49.43 | 0 | 508 | 12827 | 133M| 8387K| 2/0/0|
| 7 | PX RECEIVE | | 2 | 1000K| 100M| 171 | Q1,02 | PCWP | | 1000K|00:00:09.37 | 0 | 0 | 0 | | | |
| 8 | PX SEND HASH | :TQ10000 | 0 | 1000K| 100M| 171 | Q1,00 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 9 | PX BLOCK ITERATOR | | 2 | 1000K| 100M| 171 | Q1,00 | PCWC | | 1000K|00:00:06.77 | 2066 | 2014 | 0 | | | |
|* 10 | TABLE ACCESS FULL| T2 | 26 | 1000K| 100M| 171 | Q1,00 | PCWP | | 1000K|00:00:02.87 | 2066 | 2014 | 0 | | | |
| 11 | PX RECEIVE | | 2 | 2000K| 200M| 352 | Q1,02 | PCWP | | 2000K|00:00:16.63 | 0 | 0 | 0 | | | |
| 12 | PX SEND HASH | :TQ10001 | 0 | 2000K| 200M| 352 | Q1,01 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | |
| 13 | PX BLOCK ITERATOR | | 2 | 2000K| 200M| 352 | Q1,01 | PCWC | | 2000K|00:00:12.80 | 4212 | 4158 | 0 | | | |
|* 14 | TABLE ACCESS FULL| T4 | 26 | 2000K| 200M| 352 | Q1,01 | PCWP | | 2000K|00:00:05.25 | 4212 | 4158 | 0 | | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
There are a few interesting points to mention:
1. If the query is executed without the outer query that filters all data, the first rows are returned pretty quickly. If it was the result set that got buffered, this shouldn't be the case; instead, a huge TEMP space usage should be observed until finally the result set is returned to the parent operation/client
2. The second row source is consumed completely before the join operation is completed, and a steady TEMP read activity can be observed while the data is returned to the client
3. The data volume written to TEMP corresponds roughly to what was written to TEMP in the first example and stays like that during the whole execution; it doesn't increase any further during the join operation.
The slight difference in TEMP usage compared to the first example might come from the fact that I used 75000000 as workarea size for some of my initial test runs, and the output above comes from such an early run.
So this variation pretty clearly shows that it is not the result set that gets buffered. It looks like the second row source is what gets buffered, as it is already consumed completely before the join is completed.
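If such a run is allowed to complete, the data volume actually distributed for each row source can also be cross-checked via V$PQ_TQSTAT, which is only populated in the Query Coordinator session after the parallel statement has finished (again just a sketch):
select
        dfo_number
      , tq_id
      , server_type
      , process
      , num_rows
        -- uncompressed data volume sent through each table queue
      , round(bytes / 1024 / 1024) as mb
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type desc
      , process
;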
Large Result Set, true blocking operation
Repeating the same variant with a true blocking operation that doesn't require the BUFFERED mode of the HASH JOIN looks like this (again not run to completion here):
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | O/1/M |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 58441 | | | | 0 |00:00:00.05 | 10 | 0 | | | |
| 1 | SORT AGGREGATE | | 1 | 1 | 104 | | | | | 0 |00:00:00.05 | 10 | 0 | | | |
| 2 | PX COORDINATOR | | 1 | | | | | | | 0 |00:00:00.05 | 10 | 0 | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1 | 104 | | Q1,02 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | | | |
| 4 | SORT AGGREGATE | | 2 | 1 | 104 | | Q1,02 | PCWP | | 0 |00:00:00.01 | 0 | 0 | | | |
| 5 | VIEW | | 2 | 1000M| 96G| 58441 | Q1,02 | PCWP | | 11M|00:01:14.09 | 0 | 0 | | | |
|* 6 | HASH JOIN | | 2 | 1000M| 195G| 58441 | Q1,02 | PCWP | | 11M|00:00:37.73 | 0 | 0 | 133M| 8133K| 2/0/0|
| 7 | PX RECEIVE | | 2 | 1000K| 100M| 171 | Q1,02 | PCWP | | 1000K|00:00:09.03 | 0 | 0 | | | |
| 8 | PX SEND HASH | :TQ10000 | 0 | 1000K| 100M| 171 | Q1,00 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | | | |
| 9 | PX BLOCK ITERATOR | | 2 | 1000K| 100M| 171 | Q1,00 | PCWC | | 1000K|00:00:06.69 | 2066 | 2014 | | | |
|* 10 | TABLE ACCESS FULL| T2 | 26 | 1000K| 100M| 171 | Q1,00 | PCWP | | 1000K|00:00:02.89 | 2066 | 2014 | | | |
| 11 | PX RECEIVE | | 2 | 2000K| 200M| 352 | Q1,02 | PCWP | | 11309 |00:00:00.05 | 0 | 0 | | | |
| 12 | PX SEND HASH | :TQ10001 | 0 | 2000K| 200M| 352 | Q1,01 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | | | |
| 13 | PX BLOCK ITERATOR | | 2 | 2000K| 200M| 352 | Q1,01 | PCWC | | 12602 |00:00:00.13 | 29 | 46 | | | |
|* 14 | TABLE ACCESS FULL| T4 | 2 | 2000K| 200M| 352 | Q1,01 | PCWP | | 12602 |00:00:00.09 | 29 | 46 | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Notice the difference: No TEMP activity, and the second row source gets gradually consumed as the join progresses and data is returned to the parent operations.
BUFFERED vs. BUFFER SORT
In order to demonstrate the cunning optimization the HASH JOIN BUFFERED can perform, let's use a different distribution method that results in the second row source getting distributed via BROADCAST and therefore, in this case, requires an artificial, separate BUFFER SORT operation when receiving the broadcasted data (as otherwise again two PX SEND operations would be active at the same time):
select /* HJ_trace1 */ * from (
select /*+ no_merge use_hash(a b) no_cpu_costing leading(a b) pq_distribute(b, none, broadcast) */
a.filler as a_filler, b.filler as b_filler
from
t2 a
, t4 b
where
a.id = b.id
-- a.fk = b.fk
)
where rownum > 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | O/1/M | Max-Tmp |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2922 | | | | 0 |00:00:51.64 | 10 | 0 | 0 | | | | |
| 1 | COUNT | | 1 | | | | | | | 0 |00:00:51.64 | 10 | 0 | 0 | | | | |
|* 2 | FILTER | | 1 | | | | | | | 0 |00:00:51.64 | 10 | 0 | 0 | | | | |
| 3 | PX COORDINATOR | | 1 | | | | | | | 1000K|00:00:50.10 | 10 | 0 | 0 | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10001 | 0 | 1001K| 99M| 2922 | Q1,01 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 5 | VIEW | | 2 | 1001K| 99M| 2922 | Q1,01 | PCWP | | 1000K|00:01:37.72 | 2070 | 58364 | 56350 | | | | |
|* 6 | HASH JOIN | | 2 | 1001K| 203M| 2922 | Q1,01 | PCWP | | 1000K|00:01:34.58 | 2070 | 58364 | 56350 | 129M| 8125K| 2/0/0| |
| 7 | PX BLOCK ITERATOR | | 2 | 1000K| 101M| 171 | Q1,01 | PCWC | | 1000K|00:00:05.48 | 2066 | 2014 | 0 | | | | |
|* 8 | TABLE ACCESS FULL | T2 | 26 | 1000K| 101M| 171 | Q1,01 | PCWP | | 1000K|00:00:02.30 | 2066 | 2014 | 0 | | | | |
| 9 | BUFFER SORT | | 2 | | | | Q1,01 | PCWC | | 4000K|00:01:02.00 | 4 | 56350 | 56350 | 495M| 7329K| 2/0/0| 221K|
| 10 | PX RECEIVE | | 2 | 2000K| 204M| 352 | Q1,01 | PCWP | | 4000K|00:00:21.14 | 0 | 0 | 0 | | | | |
| 11 | PX SEND BROADCAST | :TQ10000 | 0 | 2000K| 204M| 352 | Q1,00 | P->P | BROADCAST | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 12 | PX BLOCK ITERATOR | | 2 | 2000K| 204M| 352 | Q1,00 | PCWC | | 2000K|00:00:12.81 | 4212 | 4158 | 0 | | | | |
|* 13 | TABLE ACCESS FULL| T4 | 26 | 2000K| 204M| 352 | Q1,00 | PCWP | | 2000K|00:00:04.83 | 4212 | 4158 | 0 | | | | |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Notice how now the complete T4 row source was buffered twice (that's the side effect of broadcasting it to each Parallel Slave, in this case with a degree of 2), resulting in more than four times the TEMP space usage of the BUFFERED variant. So the separate BUFFER SORT operation obviously wasn't able to avoid buffering the data from T4 that doesn't match T2 (in which case it should have buffered only approx. 100MB of data twice), whereas the HASH JOIN BUFFERED simply discarded that data from T4 immediately without bothering to buffer it.
It is also interesting to note that the BUFFER SORT operation was reported as "optimal" although it obviously spilled to disk (roughly two times 221MB; the 221K is the old defect of DBMS_XPLAN.DISPLAY_CURSOR reporting the TEMP usage in the wrong unit).
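To watch the TEMP usage of such a run while it is still executing, a simple query like the following can be used from a separate session (a sketch, assuming an 8KB block size for the MB conversion; it sums the temporary space allocated per session, including the Parallel Slave sessions):
select
        se.sid
      , su.segtype
        -- temporary space currently allocated, converted to MB assuming 8KB blocks
      , round(sum(su.blocks) * 8 / 1024) as temp_mb
from
        v$tempseg_usage su
      , v$session se
where
        su.session_addr = se.saddr
and     su.segtype in ('HASH', 'SORT')
group by
        se.sid
      , su.segtype
order by
        se.sid
;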
Very Small Result Set
I also did a complementary test where the result set generated is much smaller than the second row source, just in case there is another cunning optimization in place that could decide to buffer either the second row source or the result set, depending on which is estimated to be smaller.
select /* HJ_trace2 */ * from (
select /*+ no_merge use_hash(a b) no_cpu_costing leading(a b) pq_distribute(b, hash, hash) */
a.id as a_id, b.id as b_id
from
t2 a
, t4 b
where
a.id = b.id
and a.filler = b.filler
)
where rownum > 1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | O/1/M | Max-Tmp |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | | 2922 | | | | 0 |00:00:27.73 | 10 | 0 | 0 | | | | |
| 1 | COUNT | | 1 | | | | | | | 0 |00:00:27.73 | 10 | 0 | 0 | | | | |
|* 2 | FILTER | | 1 | | | | | | | 0 |00:00:27.73 | 10 | 0 | 0 | | | | |
| 3 | PX COORDINATOR | | 1 | | | | | | | 1000K|00:00:26.08 | 10 | 0 | 0 | | | | |
| 4 | PX SEND QC (RANDOM) | :TQ10002 | 0 | 1001K| 24M| 2922 | Q1,02 | P->S | QC (RAND) | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 5 | VIEW | | 2 | 1001K| 24M| 2922 | Q1,02 | PCWP | | 1000K|00:00:51.28 | 0 | 12319 | 12319 | | | | |
|* 6 | HASH JOIN BUFFERED | | 2 | 1001K| 203M| 2922 | Q1,02 | PCWP | | 1000K|00:00:47.69 | 0 | 12319 | 12319 | 130M| 8379K| 2/0/0| 51200 |
| 7 | PX RECEIVE | | 2 | 1000K| 101M| 171 | Q1,02 | PCWP | | 1000K|00:00:09.69 | 0 | 0 | 0 | | | | |
| 8 | PX SEND HASH | :TQ10000 | 0 | 1000K| 101M| 171 | Q1,00 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 9 | PX BLOCK ITERATOR | | 2 | 1000K| 101M| 171 | Q1,00 | PCWC | | 1000K|00:00:06.87 | 2066 | 2014 | 0 | | | | |
|* 10 | TABLE ACCESS FULL| T2 | 26 | 1000K| 101M| 171 | Q1,00 | PCWP | | 1000K|00:00:02.91 | 2066 | 2014 | 0 | | | | |
| 11 | PX RECEIVE | | 1 | 2000K| 204M| 352 | Q1,02 | PCWP | | 2000K|00:00:16.74 | 0 | 0 | 0 | | | | |
| 12 | PX SEND HASH | :TQ10001 | 0 | 2000K| 204M| 352 | Q1,01 | P->P | HASH | 0 |00:00:00.01 | 0 | 0 | 0 | | | | |
| 13 | PX BLOCK ITERATOR | | 2 | 2000K| 204M| 352 | Q1,01 | PCWC | | 2000K|00:00:12.43 | 4212 | 4158 | 0 | | | | |
|* 14 | TABLE ACCESS FULL| T4 | 26 | 2000K| 204M| 352 | Q1,01 | PCWP | | 2000K|00:00:04.77 | 4212 | 4158 | 0 | | | | |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
So although the result set is estimated to be only 24MB in size in this case, the amount of data that spilled to disk is still roughly 100MB, which seems to suggest that it is always the second row source that gets buffered.
10104 Trace File Snippet
Finally, here is a sample extract from the 10104 hash join trace file showing that spilling to disk takes place before the PROBE_2 phase begins:
.
.
.
kxhfWrite: hash-join is spilling to disk
kxhfWrite: Writing dba=130176 slot=56 part=8
kxhfIsDone: waiting slot=56 lbs=000000001CCB7308
kxhfWrite: Writing dba=129920 slot=57 part=8
kxhfIsDone: waiting slot=57 lbs=000000001CCB7308
.
.
.
kxhfIsDone: waiting slot=61 lbs=000000001CD37308
kxhfWrite: Writing dba=118528 slot=62 part=8
*** 2012-11-15 18:27:12.706
qerhjFetch: max probe row length (mpl=0)
kxhfSetPhase: phase=PROBE_2
qerhjFetch: PQ Partition rows:500826 clusters:56 in-memory slots 9
kxhfIsDone: waiting slot=62 lbs=000000001CD37308
kxhfWrite: Writing dba=118272 slot=63 part=8
kxhfWrite: Writing dba=118144 slot=64 part=8
kxhfRead(): Reading dba=130432 into slot=62
kxhfIsDone: waiting slot=62 lbs=000000001CD37308
kxhfIsDone: waiting slot=63 lbs=000000001CD37308
kxhfRead(): Reading dba=130304 into slot=63
.
.
.
Footnote
The tests were performed on recent (11.2) releases as well as on 10.2.0.3, which is one of the versions Jonathan used for testing in his original post. All versions tested showed the same behaviour, so it doesn't look like the buffering of the second row source is a change that was introduced in recent releases.
Furthermore, please note that my OTN mini series on Parallel Execution, which I wrote a couple of months ago but which was only published recently, doesn't include this knowledge and therefore explains the BUFFERED operation and the reasons for the blocking operations partially incorrectly.
Summary
The test case results show that the HASH JOIN BUFFERED operation seems to buffer the second row source. In principle it operates like a BUFFER SORT on the second row source, but takes advantage of the fact that it only needs to buffer data that matches data in the first row source.
The reason why the artificial blocking operations are introduced seems to be that at most a single PX SEND operation can be active at the same time.