Monday, January 5, 2015

Unnecessary BUFFER SORT Operations - Parallel Concatenation Transformation

When using Parallel Execution, depending on the plan shape and the operations used, Oracle sometimes needs to turn non-blocking operations into blocking operations, which means in this case that the row source no longer passes its output data directly to the parent operation but buffers some data temporarily in PGA memory / TEMP. This is either accomplished via the special HASH JOIN BUFFERED operation, or simply by adding BUFFER SORT operations to the plan.

The reason for such a behaviour in parallel plans is the limitation of Oracle Parallel Execution that allows only a single data redistribution to be active concurrently. You can read more about that here.

However, sometimes the optimizer adds unnecessary BUFFER SORT operations to parallel execution plans, and one of the most obvious examples is when the so called "concatenation" query transformation is applied by the optimizer and Parallel Execution is involved.

UPDATE Please note: As mentioned below by Martin (thanks) what I call here "concatenation transformation" typically is called "OR expansion transformation" in CBO speak, and this term probably much better describes what the transformation is about. So whenever I wrote here "concatenation transformation" this can be substituted with "OR expansion transformation".

To understand the issue, first of all, what is the concatenation transformation about?

Whenever there are predicates combined with OR there is the possibility to rewrite the different conditions as separate queries unioned together.

In order to ensure that the result of the rewritten query doesn't contain any unwanted duplicates, the different branches of the UNIONed statement need to filter out any data fulfillinh conditions of previous branches - this is probably where originally the (at first sight) odd (and in the meanwhile documented) LNNVL function came into existence.

The predicates can be either single-table filters, where the concatenation might open up different access paths to the same table (like different indexes), or it might be predicates combining multiple tables, like joins or subqueries.

Here is a short example of the latter (the parallel hints are commented out but are used in the further examples to demonstrate the issue with Parallel Execution) - using version 12.1.0.2:

select 
       max(id) 
from 
(
  select /* parallel(t1 8) parallel(t2 8) */
         t2.* 
  from 
         t1
       , t2
  where 
         (t1.id = t2.id or t1.id = t2.id2)
);
In this example the join condition using an OR prevents any efficient join method between T1 and T2 when not re-writing the statement - Oracle can only resort to a NESTED LOOP join with a repeated full table scan of one of the tables, which is reflected in a rather high estimated cost:

----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    16 |  2177M  (2)| 23:37:34 |
|   1 |  SORT AGGREGATE     |      |     1 |    16 |            |          |
|   2 |   NESTED LOOPS      |      |  3999K|    61M|  2177M  (2)| 23:37:34 |
|   3 |    TABLE ACCESS FULL| T2   |  2000K|    19M|  1087   (1)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| T1   |     2 |    12 |  1089   (2)| 00:00:01 |
----------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - filter("T1"."ID"="T2"."ID" OR "T1"."ID"="T2"."ID2")
The same statement could be expressed by the following manual rewrite:

select max(id) from (
  select /* parallel(t1 8) parallel(t2 8) */
         t2.* 
  from 
         t1
       , t2
  where 
         t1.id = t2.id2
  ---------
  union all
  ---------
  select /* parallel(t1 8) parallel(t2 8) */
         t2.* 
  from 
         t1
       , t2
  where 
         t1.id = t2.id
  and    lnnvl(t1.id = t2.id2)
);
Notice the LNNVL function in the second branch of the UNION ALL that filters out any rows fulfilling the condition used in the first branch.

Also note that using UNION instead of UNION ALL plus LNNVL(s) to filter out any duplicate rows is also potentially incorrect as each query branch might produce duplicate rows that need to be retained as they are also part of the original query result.

At the expense of visiting the tables multiple times we now get at least efficient join methods in each branch (and hence a significantly lower cost estimate):

--------------------------------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |    13 |       | 11945   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE       |      |     1 |    13 |       |            |          |
|   2 |   VIEW                |      |  2100K|    26M|       | 11945   (1)| 00:00:01 |
|   3 |    UNION-ALL          |      |       |       |       |            |          |
|*  4 |     HASH JOIN         |      |  2000K|    30M|    34M|  5972   (1)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| T1   |  2000K|    11M|       |  1086   (1)| 00:00:01 |
|   6 |      TABLE ACCESS FULL| T2   |  2000K|    19M|       |  1087   (1)| 00:00:01 |
|*  7 |     HASH JOIN         |      |   100K|  1562K|    34M|  5972   (1)| 00:00:01 |
|   8 |      TABLE ACCESS FULL| T1   |  2000K|    11M|       |  1086   (1)| 00:00:01 |
|   9 |      TABLE ACCESS FULL| T2   |  2000K|    19M|       |  1087   (1)| 00:00:01 |
--------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   4 - access("T1"."ID"="T2"."ID2")
   7 - access("T1"."ID"="T2"."ID")
       filter(LNNVL("T1"."ID"="T2"."ID2"))
And in fact, when not preventing the concatenation transformation (NO_EXPAND hint), the optimizer comes up with the following execution plan for the original statement:

-------------------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |     1 |    16 |       | 11945   (1)| 00:00:01 |
|   1 |  SORT AGGREGATE      |      |     1 |    16 |       |            |          |
|   2 |   CONCATENATION      |      |       |       |       |            |          |
|*  3 |    HASH JOIN         |      |  2000K|    30M|    34M|  5972   (1)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| T1   |  2000K|    11M|       |  1086   (1)| 00:00:01 |
|   5 |     TABLE ACCESS FULL| T2   |  2000K|    19M|       |  1087   (1)| 00:00:01 |
|*  6 |    HASH JOIN         |      |   100K|  1562K|    34M|  5972   (1)| 00:00:01 |
|   7 |     TABLE ACCESS FULL| T1   |  2000K|    11M|       |  1086   (1)| 00:00:01 |
|   8 |     TABLE ACCESS FULL| T2   |  2000K|    19M|       |  1087   (1)| 00:00:01 |
-------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("T1"."ID"="T2"."ID2")
   6 - access("T1"."ID"="T2"."ID")
       filter(LNNVL("T1"."ID"="T2"."ID2"))
The only difference between those two plans for the manual and automatic rewrite is the CONCATENATION operator instead of UNION ALL, and that the subquery isn't merged in case of the UNION ALL (additional VIEW operator).

So far everything works as expected and you have seen the effect and rationale of the concatenation transformation.

If we run now the original statement using Parallel Execution (turn comments into hints), depending on the exact version used the resulting execution plans show various inefficiencies.

For reference, this is the parallel execution plan I get from 12.1.0.2 when using above manual rewrite:

------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |          |     1 |    13 |   606   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                |          |     1 |    13 |            |          |        |      |            |
|   2 |   PX COORDINATOR               |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)         | :TQ10004 |     1 |    13 |            |          |  Q1,04 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE             |          |     1 |    13 |            |          |  Q1,04 | PCWP |            |
|   5 |      VIEW                      |          |  2100K|    26M|   606   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|   6 |       UNION-ALL                |          |       |       |            |          |  Q1,04 | PCWP |            |
|*  7 |        HASH JOIN               |          |  2000K|    30M|   303   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|   8 |         PX RECEIVE             |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   9 |          PX SEND HYBRID HASH   | :TQ10000 |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  10 |           STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  11 |            PX BLOCK ITERATOR   |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  12 |             TABLE ACCESS FULL  | T1       |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  13 |         PX RECEIVE             |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|  14 |          PX SEND HYBRID HASH   | :TQ10001 |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  15 |           PX BLOCK ITERATOR    |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | PCWC |            |
|  16 |            TABLE ACCESS FULL   | T2       |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|* 17 |        HASH JOIN               |          |   100K|  1562K|   303   (2)| 00:00:01 |  Q1,04 | PCWP |            |
|  18 |         PX RECEIVE             |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|  19 |          PX SEND HYBRID HASH   | :TQ10002 |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,02 | P->P | HYBRID HASH|
|  20 |           STATISTICS COLLECTOR |          |       |       |            |          |  Q1,02 | PCWC |            |
|  21 |            PX BLOCK ITERATOR   |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,02 | PCWC |            |
|  22 |             TABLE ACCESS FULL  | T1       |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  23 |         PX RECEIVE             |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|  24 |          PX SEND HYBRID HASH   | :TQ10003 |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,03 | P->P | HYBRID HASH|
|  25 |           PX BLOCK ITERATOR    |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,03 | PCWC |            |
|  26 |            TABLE ACCESS FULL   | T2       |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,03 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   7 - access("T1"."ID"="T2"."ID2")
  17 - access("T1"."ID"="T2"."ID")
       filter(LNNVL("T1"."ID"="T2"."ID2"))
This is a pretty straightforward parallel plan, with the only possibly noteable exception of the new 12c "HYBRID HASH" distribution feature being used.

Now let's have a look at the resulting execution plan when the concatenation transformation gets used:

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |          |     1 |    16 |   606   (2)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE                      |          |     1 |    16 |            |          |        |      |            |
|   2 |   PX COORDINATOR                     |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM)               | :TQ20003 |     1 |    16 |            |          |  Q2,03 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                   |          |     1 |    16 |            |          |  Q2,03 | PCWP |            |
|   5 |      CONCATENATION                   |          |       |       |            |          |  Q2,03 | PCWP |            |
|*  6 |       HASH JOIN                      |          |  2000K|    30M|   303   (2)| 00:00:01 |  Q2,03 | PCWP |            |
|   7 |        PX RECEIVE                    |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q2,03 | PCWP |            |
|   8 |         PX SEND HYBRID HASH          | :TQ20001 |  2000K|    11M|   151   (1)| 00:00:01 |  Q2,01 | P->P | HYBRID HASH|
|   9 |          STATISTICS COLLECTOR        |          |       |       |            |          |  Q2,01 | PCWC |            |
|  10 |           BUFFER SORT                |          |     1 |    16 |            |          |  Q2,01 | PCWP |            |
|  11 |            PX BLOCK ITERATOR         |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q2,01 | PCWC |            |
|  12 |             TABLE ACCESS FULL        | T1       |  2000K|    11M|   151   (1)| 00:00:01 |  Q2,01 | PCWP |            |
|  13 |        PX RECEIVE                    |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q2,03 | PCWP |            |
|  14 |         PX SEND HYBRID HASH          | :TQ20002 |  2000K|    19M|   151   (1)| 00:00:01 |  Q2,02 | P->P | HYBRID HASH|
|  15 |          BUFFER SORT                 |          |     1 |    16 |            |          |  Q2,02 | PCWP |            |
|  16 |           PX BLOCK ITERATOR          |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q2,02 | PCWC |            |
|  17 |            TABLE ACCESS FULL         | T2       |  2000K|    19M|   151   (1)| 00:00:01 |  Q2,02 | PCWP |            |
|  18 |       BUFFER SORT                    |          |       |       |            |          |  Q2,03 | PCWC |            |
|  19 |        PX RECEIVE                    |          |   100K|  1562K|   303   (2)| 00:00:01 |  Q2,03 | PCWP |            |
|  20 |         PX SEND ROUND-ROBIN          | :TQ20000 |   100K|  1562K|   303   (2)| 00:00:01 |        | S->P | RND-ROBIN  |
|  21 |          BUFFER SORT                 |          |     1 |    16 |            |          |        |      |            |
|  22 |           PX COORDINATOR             |          |       |       |            |          |        |      |            |
|  23 |            PX SEND QC (RANDOM)       | :TQ10002 |   100K|  1562K|   303   (2)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|  24 |             BUFFER SORT              |          |     1 |    16 |            |          |  Q1,02 | PCWP |            |
|* 25 |              HASH JOIN BUFFERED      |          |   100K|  1562K|   303   (2)| 00:00:01 |  Q1,02 | PCWP |            |
|  26 |               PX RECEIVE             |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  27 |                PX SEND HYBRID HASH   | :TQ10000 |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | P->P | HYBRID HASH|
|  28 |                 STATISTICS COLLECTOR |          |       |       |            |          |  Q1,00 | PCWC |            |
|  29 |                  BUFFER SORT         |          |     1 |    16 |            |          |  Q1,00 | PCWP |            |
|  30 |                   PX BLOCK ITERATOR  |          |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  31 |                    TABLE ACCESS FULL | T1       |  2000K|    11M|   151   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|  32 |               PX RECEIVE             |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|  33 |                PX SEND HYBRID HASH   | :TQ10001 |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | P->P | HYBRID HASH|
|  34 |                 BUFFER SORT          |          |     1 |    16 |            |          |  Q1,01 | PCWP |            |
|  35 |                  PX BLOCK ITERATOR   |          |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | PCWC |            |
|  36 |                   TABLE ACCESS FULL  | T2       |  2000K|    19M|   151   (1)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   6 - access("T1"."ID"="T2"."ID2")
  25 - access("T1"."ID"="T2"."ID")
       filter(LNNVL("T1"."ID"="T2"."ID2"))
This looks a bit weird, and when comparing it to the plan gotten from the manual rewrite, it shows the following unnecessary differences:

- There are various BUFFER SORT operations that don't make a lot of sense, for example each parallel table scan is followed by a BUFFER SORT operation, and even the HASH JOIN BUFFERED in the lower part of the plan is followed by a BUFFER SORT (double buffering?)

- The plan is decomposed into two so called DFO trees, which you can see for example from the two PX COORDINATOR operators (operation id 2 and 22), which adds another unnecessary serial execution part to the plan and can have additional side effects I explain in one of my video tutorials.

This means that such execution plan shapes possibly will have a much higher demand for PGA memory than necessary (the BUFFER SORT operation will attempt to keep the data produced by the child row source in PGA), and also might cause additional I/O to and from TEMP. Since PGA memory consumed by one session influences also the Auto PGA allocation of other sessions this means that such executions not only affect the particular SQL execution in question but also any other concurrent executions allocating PGA memory.

Depending on the amount of data to be buffered BUFFER SORT operations closer to the root of the execution plan are more likely to have significant impact performance-wise, as they might have to buffer large amounts of data.

One very obvious sign of inefficiency are double BUFFERing operations, like a HASH JOIN BUFFERED followed by a BUFFER SORT as parent operation, which you can spot in the sample plan shown above.

Another interesting point is that the parallel plans differ from point release to point release and show different levels of inefficiencies, for example, 10.2.0.5, 11.1.0.7 and 11.2.0.1 produce different plans than 11.2.0.2, which is again different from what 11.2.0.3 & 11.2.0.4 produce - and using OPTIMIZER_FEATURES_ENABLE in newer versions to emulate older versions doesn't always reproduce the exact plans produced by the actual, older versions. So all in all this looks like a pretty messy part of the optimizer.

Furthermore the problem doesn't always show up - it seems to depend largely on the exact version and the plan shape used. For example, replacing the SELECT MAX(ID) FROM () outermost query in above example with a simple SELECT ID FROM () results in a plan where the concatenation transformation doesn't produce all those strange BUFFER SORTS - although it still produces a plan decomposed into two DFO trees in some versions.

It also interesting to note that depending on version and plan shape sometimes the manual rewrite using UNION ALL is also affected by either unluckily placed or unnecessary BUFFER SORT operations, but not to the same extent as the plans resulting from the CONCATENATION transformation.

In the next post I'll show how this inefficiency can have some interesting side effects when being triggered by / combined with other transformations.

Footnote


Table structures used in the test cases:

create table t1
compress
as
select
        (rownum * 2) + 1 as id
      , mod(rownum, 2000) + 1 as id2
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(100000) */ * from dual
connect by
        level <= 100000) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't1')

create table t2
compress
as
select * from t1;

exec dbms_stats.gather_table_stats(null, 't2')

3 comments:

Martin Preiss said...

Hi Randolf,
a very interesting article - as usual. I guess "concatenation transformation" is another name for "OR expansion": https://blogs.oracle.com/optimizer/entry/or_expansion_transformation? Both labels have their rationale - but I think "OR expansion" is very easily to remember.
Regards
Martin

Randolf said...

Hi Martin, yes of course, it's the same and OR expansion sounds much better than concatenation transformation - maybe I like it somehow more because it's kind of a tongue twister (try saying "concatenation transformation" ten times in a row :-)

Randolf

Martin Preiss said...

thanks - I would have problems to say it one time in a row :-)