Monday, July 25, 2011

Logical I/O - Evolution: Part 2 - 9i, 10g Prefetching

In the initial part of this series I've explained some details regarding logical I/O using a Nested Loop Join as example.

To recap I've shown in particular:

- Oracle can re-visit pinned buffers without performing logical I/O

- There are different variants of consistent gets - a "normal" one involving buffer pin/unpin cycles requiring two latch acquisitions and a short-cut variant that visits the buffer while holding the corresponding "cache buffers chains" child latch ("examination") and therefore only requiring a single latch acquisition

- Although two statements use a similar execution plan and produce the same number of logical I/Os one is significantly faster and scales better than the other one

The initial part used the "classic" shape of the Nested Loop Join, but Oracle introduced in recent releases various enhancements in that area - in particular in 9i the "Table Prefetching" and in 11g the Nested Loop Join Batching using "Vector/Batched I/O".

Although these enhancements have been introduced primarily to optimize the physical I/O patterns, they could also have an influence on logical I/O.

The intention of Prefetching and Batching seems to be the same - they both are targeted towards the usually most expensive part of the Nested Loop Join: The random table lookup as part of the inner row source. By trying to "prefetch" or "batch" physical I/O operations caused by this random block access Oracle attempts to minimize the I/O waits.

I might cover the effect on physical I/O of both "Prefetching" and "Batching" in separate posts, here I'll only mention that you might see "db file scattered read" or "db file parallel read" multi-block I/O operations instead of single block "db file sequential read" operations for the random table access with those optimizations (Index prefetching is also possible, by the way). Note also that if you see the Prefetching or Batching plan shape it does not necessarily mean that it is actually going to happen at execution time - Oracle monitors the effectiveness of the Prefetching and can dynamically decide whether it will be used or not.

10.2.0.4 Table Prefetching - Random order

Let's enable table prefetching in 10.2 and re-run the original test case. The first run will use the different order variant of T1 and T2:

Inner row source Unique Index - T1 different order than T2


---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.12 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.90 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.71 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.20 | 200K|
---------------------------------------------------------------------------------------------------------------


Inner row source Non-Unique Index - T1 different order than T2


--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:05.03 | 311K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:04.40 | 311K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:03.02 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.49 | 201K|
--------------------------------------------------------------------------------------------------------------


As you'll see, in 10g even with table prefetching enabled the unique index variant does look the same and performs similar as in the original post.

This changes in 11g by the way, where the unique index variant also supports the table prefetching plan shape.

For the non-unique variant you'll see a different shape of the execution plan where the inner row source random table lookup is actually a parent operation to the Nested Loop Join (and hence will only be started once and consumes the information generated by the child Nested Loop operation).

Note that in case of nested Nested Loop Joins only the inner-most row source will make use of the Table Prefetching shape. The same applies to the 11g Nested Loop Join Batching. If you happen to have several Nested Loops Joins that are not directly nested then each of the inner-most row sources might use the Table Prefetching/Batching shape - which means that it can be used more than once as part of a single execution plan.

If you compare the Runtime profile of the non-unique index variant with the original Runtime profile without Table Prefetching you'll not see any difference in terms of logical I/O, however it becomes obvious that the overall execution is actually slightly faster (more significant with row source sampling overhead enabled). In particular the random table access requires significantly less time than in the original Runtime profile, so it seems to be more efficient, although it is still slower than the unique index variant.

Begin Update

Having focused on the logical I/O I completely forgot to mention the inconsistency in the A-Rows column (thanks to Flado who pointed this out in his comment below), which shows 200K rows for the Nested Loop operation although only 100K rows have been identified in the inner index lookup. I believe this is an inconsistency that also shows up when performing an SQL trace so it seems to be a problem with the row source statistics. In principle with this plan shape the Nested Loop Join operation seems to account for the sum of both the rows identified in the driving row source and the inner index lookup, rather than the expected number of rows identified in the inner index lookup only.

However, as mentioned below in the "Statistics" section there is another anomaly - a consistent get and "buffer is pinned count" for every row looked up in the inner table, so this might not be just coincidence but another indicator that there is really some excess work happening with Table Prefetching.

By the way - both anomalies are still present in 11.1 / 11.2 when using Table Prefetching there.

End Update

Let's have a look at the session statistics.


Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
STAT..consistent gets 310,012 311,101 1,089
STAT..consistent gets from cache 310,012 311,101 1,089
STAT..session logical reads 310,012 311,101 1,089
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..buffer is not pinned count 200,001 99,997 -100,004
STAT..buffer is pinned count 189,998 290,006 100,008
STAT..consistent gets - examination 300,001 100,007 -199,994
STAT..no work - consistent read gets 10,001 211,084 201,083
LATCH.cache buffers chains 320,031 522,195 202,164


So the only significant difference in this case is the increased "buffer is pinned count" / decreased "buffer is not pinned" count statistics, although the number of logical I/O stays the same. I don't know if this really means excess work with Table Prefetching enabled or whether this is an instrumentation problem. Nevertheless with Table Prefetching enabled in this case you'll end up with both a "buffer is pinned count" and "consistent get" for each row looked up in the inner row source table operation. The number of logical I/O and latch acquisitions stays the same, so it's not obvious from the statistics why this performs better than the non-Table Prefetching case - according to the statistics it even performs more work, but may be the table random access as parent operation to the Nested Loop allows a more efficient processing requiring less CPU cycles.

10.2.0.4 Table Prefetching - Same (Random) order

Let's change the data order and use either the same "Pseudo-Random" order (by uncommenting the second "dbms_random.seed(0)" call) or order by ID - it doesn't matter with Table Prefetching in 10g.

Inner row source Unique Index - T1 and T2 same order


---------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
---------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:03.91 | 310K|
| 2 | NESTED LOOPS | | 1 | 100K| 202K (1)| 100K|00:00:03.70 | 310K|
| 3 | TABLE ACCESS FULL | T2 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
| 4 | TABLE ACCESS BY INDEX ROWID| T1 | 100K| 1 | 2 (0)| 100K|00:00:02.54 | 300K|
|* 5 | INDEX UNIQUE SCAN | T1_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.14 | 200K|
---------------------------------------------------------------------------------------------------------------


Inner row source Non-Unique Index - T1 and T2 same order


--------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | Cost (%CPU)| A-Rows | A-Time | Buffers |
--------------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | | 1 |00:00:04.54 | 221K|
| 2 | TABLE ACCESS BY INDEX ROWID| T2 | 1 | 1 | 2 (0)| 100K|00:00:03.90 | 221K|
| 3 | NESTED LOOPS | | 1 | 100K| 202K (1)| 200K|00:00:02.82 | 211K|
| 4 | TABLE ACCESS FULL | T1 | 1 | 100K| 2716 (1)| 100K|00:00:00.30 | 10010 |
|* 5 | INDEX RANGE SCAN | T2_IDX | 100K| 1 | 1 (0)| 100K|00:00:01.40 | 201K|
--------------------------------------------------------------------------------------------------------------


Now we really see a difference: The unique index variant still shows the same results, but the non-unique variant saves logical I/O on the random table access - and is faster than with random order - coming closer to the unique index variant performance.

Whereas the index range scan still requires approx. 200,000 logical I/Os the random table access only requires 10,000 logical I/Os instead of 100,000.

The session statistics:


Statistics Name Unique Non-Unique Difference
----------------------------------------------------- -------- ----------- -----------
STAT..table scan blocks gotten 10,000 10,000 0
STAT..table scan rows gotten 100,000 100,000 0
STAT..table fetch by rowid 100,000 100,002 2
LATCH.cache buffers chains 320,023 342,213 22,190
STAT..consistent gets 310,012 221,110 -88,902
STAT..consistent gets from cache 310,012 221,110 -88,902
STAT..session logical reads 310,012 221,110 -88,902
STAT..index fetch by key 100,000 2 -99,998
STAT..rows fetched via callback 100,000 2 -99,998
STAT..index scans kdiixs1 0 100,000 100,000
STAT..no work - consistent read gets 10,001 121,093 111,092
STAT..buffer is not pinned count 200,001 10,006 -189,995
STAT..buffer is pinned count 189,998 379,997 189,999
STAT..consistent gets - examination 300,001 100,007 -199,994


The session statistics confirm this: The "buffer is pinned count" increases by another 90,000 for the non-unique index variant which corresponds to the 90,000 logical I/Os performed less as part of the random table access operation.

The number of latch acquisitions decreases accordingly so that we end up with a comparable number as with the unique index variant.

Scalability

If you run the non-unique index Table Prefetching variant with the concurrent execution test harness you'll see a corresponding slightly increased scalability although it still scales not as good as the unique index variant.

Summary

Table Prefetching has been introduced in Oracle 9i in order to optimize the random physical access in Nested Loop Joins, however it also seems to have a positive effect on logical I/O. The effectiveness of this optimization depends on the data order - if the data from the driving row source is in the same order as the inner row source table buffers can be kept pinned. Note that the same doesn't apply to the index lookup - even if the data is ordered by ID and consequently the same index branch and leaf blocks will be accessed repeatedly with each iteration, a buffer pinning optimization could not be observed.

In the next part we'll see what happens with this example in Oracle 11g and its new features.

4 comments:

Flado said...

Randolf,

I noticed a strange thing in both execution plans using the non-unique index: The actual number of rows reported to flow out of the NESTED LOOPS step (ID 3) is 200K. I cannot wrap my head around what is this supposed to mean: that the join produces 200K ROWIDs, but half of them could not be found in the table? How can it be?
I'd appreciate it if you could throw some light.

Cheers!
Flado

Randolf said...

Flado,

thanks a lot for pointing out this anomaly. I was so focused on the logical I/O part that I completely forgot to mention this important point.

Rather than putting my reply here I've updated the post accordingly.

Thanks,
Randolf

Jonathan Lewis said...

Randolf,
I don't know if you've ever spotted this - but I have seen cases where Oracle switches between an index unique scan on the innner table to an index range scan so that it can take advantage of pre-fetching.

This is the same query running against the same version of the database using small variations in the volume of data avaiable in the driving table.

Unfortunately it seems to be very sensitive to historical behaviour, so I've never been able to produce a stable repeatable demonstration.

Randolf said...

Jonathan,

I think the first time I came across a description of this unique scan -> range scan conversion was when reading chapter 11 of your CBO - Fundamentals.

Since I haven't been able to reproduce this during my tests I didn't mention it here since I wasn't sure if it is still reproducible in 10g.

I'm not entirely sure about your comment - did you mean to post an execution plan from 10g showing such a case meaning that you could reproduce in 10g?

If that was the case I would probably update the post.

Note that 11g supports the "Prefetching" shape for the unique scan out of the box, so there no need for conversion any longer there.

Randolf