Monday, August 15, 2016

Nested Loop Join Physical I/O Optimizations

Having done my mini-series on Nested Loop join logical I/O optimizations a while ago I unfortunately never managed to publish anything regarding the Nested Loop join physical I/O optimizations, which are certainly much more relevant to real-life performance.

Therefore the main purpose of this blog post is to point you to Nikolay Savvinov's (whose blog I can recommend in general) great mini-series covering various aspects of these optimizations:

Part 1
Part 2
Part 3
Summary

One point that - at least to me - isn't entirely clear when reading Nikolay's series is which specific plan shape he refers to, in particular since in 12c even more plan shapes for a Nested Loop join are possible.

Hence here is an attempt to summarize the various two table Nested Loop join plan shapes as of 12c and what kind of physical I/O optimizations one can expect from them:

1. Nested Loop Join Batching, the default in most cases since 11g and also in 12c
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    NESTED LOOPS               |                   |
|   4 |     INDEX FULL SCAN           | SYS_IOT_TOP_97632 |
|*  5 |     INDEX RANGE SCAN          | T_I6_IDX          |
|   6 |    TABLE ACCESS BY INDEX ROWID| T_I6              |
-----------------------------------------------------------
This is the plan shape that Nikolay calls "batch" or "batching" in his posts. It can provide batched I/O (mainly "db file parallel read" issuing multiple I/O requests in a single call, potentially asynchronous I/O) on both the INDEX RANGE SCAN as well as the TABLE ACCESS BY ROWID operation as described in Nikolay's blog post.

Please note that I deliberately write "can provide batched I/O", because, as Nikolay also points out in his posts, the runtime engine monitors the execution and can dynamically adjust the behaviour, which also means that it might decide to use conventional single block reads ("db file sequential read").

Please also note that in addition to the "db file parallel read" operation that submits multiple I/O requests in a single I/O submit call (see Frits Hoogland's blog post from some time ago about Linux internals of this I/O operation) the runtime engine might use "db file scattered read" multi-block I/Os under certain circumstances, in particular when performing "cache warmup prefetching".

Also note that as Nikolay points out when enabling SQL Trace or "rowsource statistics" the "Batched I/O" optimization for some reason gets disabled.

Also this plan shape at least in 12c seems to lead to inconsistencies in the Real-Time SQL Monitoring in several ways. First the number of "Executions" for the INDEX RANGE SCAN and TABLE ACCESS component of the inner row source is not necessarily consistent with the number of rows in the driving row source, second almost all activity and I/O volume seems to be contributed to the "INDEX RANGE SCAN" plan operation and not the "TABLE ACCESS" operation, even if it's the "TABLE ACCESS" that causes physical I/O.

2. Nested Loop Join Prefetch plan shape including BATCHED ROWID table access (only from 12c on)
------------------------------------------------------------------
| Id  | Operation                            | Name              |
------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                   |
|   1 |  SORT AGGREGATE                      |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|   3 |    NESTED LOOPS                      |                   |
|   4 |     INDEX FULL SCAN                  | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN                 | T_I5_IDX          |
------------------------------------------------------------------
This is the Nested Loop prefetching plan shape introduced in Oracle 9i combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c. It's the plan shape Nikolay refers to with "BATCHED ROWID" in his posts. In my (and Nikolay's) tests this provides the most aggressive batching of I/O (highest number of I/O requests submitted per call in "db file parallel read") for the TABLE ACCESS BY ROWID BATCHED, but didn't perform the same on the index ("db file sequential read", single block reads on the index segment), which doesn't mean that different test and data setups might provide that, too.
Note you should get this plan shape only with explicit hinting or disabling Nested Loop Join Batching via parameter.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
--batch_table_access_by_rowid(b)
The "batch_table_access_by_rowid" isn't strictly necessary since it's the default behaviour in 12c.

Other variants are possible, like "use_nl_with_index" instead of "use_nl" and "index" separately or "opt_param('_nlj_batching_enabled', 0)" instead of "no_nlj_batching" to disable the batching (but then batching is disabled for the whole execution plan, not just for a particular join).

3. Classic Nested Loop Join Prefetch introduced in 9i
----------------------------------------------------------
| Id  | Operation                    | Name              |
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |
|   1 |  SORT AGGREGATE              |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_I5              |
|   3 |    NESTED LOOPS              |                   |
|   4 |     INDEX FULL SCAN          | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN         | T_I5_IDX          |
----------------------------------------------------------
This is what you get in pre-12c when preventing Nested Loop Join Batching, in 12c the BATCHED table access needs to be disabled in addition.

This plan shape provides the less aggressive table prefetching as described in Nikolay's posts (he refers to this plan shape as "Prefetch"), maximum number of requests submitted per call in a "db file parallel read" operation seems to 39.

As mentioned above, it didn't provide index prefetching in my tests.

The session statistics don't mention "Batched I/O", so although the "db file parallel read" wait event is the same the internal implementation obviously is a different code path.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

4. Classic Nested Loop Join plan shape with BATCHED ROWID table access (only from 12c on)
-------------------------------------------------------------------
| Id  | Operation                             | Name              |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                   |
|   1 |  SORT AGGREGATE                       |                   |
|   2 |   NESTED LOOPS                        |                   |
|   3 |    INDEX FULL SCAN                    | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|*  5 |     INDEX RANGE SCAN                  | T_I5_IDX          |
-------------------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i) combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c.

In my tests it provides only "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID BATCHED can make use of "db file parallel read" to batch these I/O requests to the table blocks, and the session statistics show "Batched I/O" counters increasing.

However, it doesn't attempt to optimize "inter-loop" / multiple loop iterations, so if each loop iteration via the INDEX RANGE SCAN only points to a single table block, only single block reads ("db file sequential read") on the TABLE ACCESS BY INDEX ROWID BATCHED can be seen.

In my tests this plan shape didn't provide index prefetching / I/O optimizations and performed single block reads ("db file sequential read") on the INDEX RANGE SCAN operation.

For a two table join it needs explicit hinting in 12c to arrive at this plan shape, but it is the default plan shape for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
--batch_table_access_by_rowid(b)
See my older posts regarding Nested Loop join logical I/O optimizations why the combination of OPT_PARAM nd NO_NLJ_PREFETCH is required to arrive at this plan shape - in short, specifying the obvious NO_NLJ_PREFETCH plus NO_NLJ_BATCHING doesn't work. Other variants as above.

5. Classic Nested Loop Join plan shape
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    INDEX FULL SCAN            | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T_I5              |
|*  5 |     INDEX RANGE SCAN          | T_I5_IDX          |
-----------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i).

Interestingly in my tests at least in 12c it provides also "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID can make use of "db file parallel read" to submit multiple of these I/O requests to the table blocks in a single call, but the session statistics don't show "Batched I/O" counters increasing and the maximum number of requests seem to be 39, so it is less aggressive and looks very similar to the internal implementation used for the classic "table prefetching" plan shape above.

There is no sign of "inter-loop" optimizations and the INDEX RANGE SCAN seems to make use of single block reads only ("db file sequential read").

For pre-12c this is default plan shape used for the "inner" joins in case of nested Nested Loop joins (multiple Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

Multiple, consecutive Nested Loop Joins


Another point that I think it is important to mention when describing these "inter-loop" prefetching and batching Nested Loop join optimization techniques is that they only apply to the "outer-most" Nested Loop join in case of multiple, consecutive Nested Loop joins, which makes their "game changing" character that Nikolay mentions in his posts less strong to me.

For example, this is the 12c default shape of a four table join using three Nested Loop joins:
----------------------------------------------------------------------
| Id  | Operation                                | Name              |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                   |
|   1 |  SORT AGGREGATE                          |                   |
|   2 |   NESTED LOOPS                           |                   |
|   3 |    NESTED LOOPS                          |                   |
|   4 |     NESTED LOOPS                         |                   |
|   5 |      NESTED LOOPS                        |                   |
|   6 |       INDEX FULL SCAN                    | SYS_IOT_TOP_97632 |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T_I1              |
|*  8 |        INDEX RANGE SCAN                  | T_I1_IDX          |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED | T_I2              |
|* 10 |       INDEX RANGE SCAN                   | T_I2_IDX          |
|* 11 |     INDEX RANGE SCAN                     | T_I3_IDX          |
|  12 |    TABLE ACCESS BY INDEX ROWID           | T_I3              |
----------------------------------------------------------------------
As it can be seen only the outer-most Nested Loop joins gets the "batching" plan shape and can benefit from the optimizations described above. The inner Nested Loop joins show the classic plan shape, in case of 12c with the ROWID BATCHED option, so they can only benefit from the "intra-loop" optimizations described above, if a single loop iteration INDEX RANGE SCAN points to several table blocks.

Note that even when using explicit, additional NLJ_BATCHING hints for the inner tables joined I wasn't able to enforce any other plan shape.

If there are multiple blocks of (consecutive) Nested Loop joins (for example a HASH JOIN in between), then in each block the outer-most Nested Loop join gets the optimization, so multiple of those are possible per execution plan:
---------------------------------------------------------------------
| Id  | Operation                               | Name              |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                   |
|   1 |  SORT AGGREGATE                         |                   |
|   2 |   NESTED LOOPS                          |                   |
|   3 |    NESTED LOOPS                         |                   |
|*  4 |     HASH JOIN                           |                   |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_I2              |
|   6 |       INDEX FULL SCAN                   | T_I2_IDX          |
|   7 |      NESTED LOOPS                       |                   |
|   8 |       NESTED LOOPS                      |                   |
|   9 |        INDEX FULL SCAN                  | SYS_IOT_TOP_97632 |
|* 10 |        INDEX RANGE SCAN                 | T_I1_IDX          |
|  11 |       TABLE ACCESS BY INDEX ROWID       | T_I1              |
|* 12 |     INDEX RANGE SCAN                    | T_I3_IDX          |
|  13 |    TABLE ACCESS BY INDEX ROWID          | T_I3              |
---------------------------------------------------------------------
As it can be seen now that T_I2 is joined via a HASH JOIN both Nested Loop joins to T_I1 and T_I3 get the "batched" optimization plan shape.

I don't know why this limitation is there that only the outer-most in a block of Nested Loop joins gets the optimized plan shape, but it certainly can reduce the possible benefit.

Since I don't think there is a costing difference on CBO level for the different Nested Loop join plan shapes (so the decision which shape to use isn't cost-driven) it might be possible to see a significant performance difference depending on which join gets the optimization - there might be room for improvement by manually influencing the join order in case of multiple, consecutive Nested Loop joins.

No comments: