Thursday, July 7, 2011

Logical I/O - Evolution: Part 1 - Baseline

Forward to Part 2

This is the first part in a series of blog posts that shed some light on the enhancements Oracle has introduced in recent releases regarding the optimization of logical I/O.

Before we can appreciate the enhancements, though, we need to understand the baseline. This is what this blog post is about.

The example used throughout this post is based on a simple Nested Loop Join which is one area where Oracle has introduced significant enhancements.

It started its life as a comparison of using unique vs. non-unique indexes as part of a Nested Loop Join and their influence on performance and scalability.

This comparison on its own is quite educational and also allows demonstrating and explaining some of the little details regarding logical I/O.

Here is the basic script that gets used. It creates two tables with a primary key defined, one table using a unique index, the other one a non-unique index.

The tables are specifically crafted to have exactly 100,000 rows with 10 rows per block resulting in 10,000 blocks (using the MINIMIZE RECORDS_PER_BLOCK option). These "obvious" numbers hopefully allow for nice pattern recognition in the resulting figures. Using the default 8K block size the resulting indexes will have slightly more than 1,000 blocks.

It will then run a Nested Loop Join from one table to the other and then the other way around, along with a snapshot of the session statistics using Adrian Billington's RUNSTATS package, which is based on Tom Kyte's well known package of the same name. You can get it from here.

If you run this against 9i to 10.2 you'll need to disable table prefetching to get the results explained here. This can only be done by setting the static parameter "_table_lookup_prefetch_size" to 0, which requires restarting the instance.

11g allows controlling the behaviour via various hints and parameters, see the script for more details.

In order to be in line with the baseline explanations presented here, the script should be executed against pre-11g versions, since 11g introduces some significant changes that will be covered in upcoming posts.


--------------------------------------------------------------------------------
--
-- File name: unique_non_unique_index_difference.sql
--
-- Purpose: Compare the efficiency of NESTED LOOP joins via index lookup
-- between unique and non-unique indexes
--
-- Author: Randolf Geist http://oracle-randolf.blogspot.com
--
-- Prereqs: RUNSTATS_PKG by Adrian Billington / Tom Kyte
--
-- Last tested: June 2011
--
-- Versions: 10.2.0.4
-- 10.2.0.5
-- 11.1.0.7
-- 11.2.0.1
-- 11.2.0.2
--------------------------------------------------------------------------------

set echo on timing on linesize 130 pagesize 999 trimspool on tab off serveroutput off doc on

doc
From 9i to 10.2 you need to disable table prefetching
to get the "original" version of NL joins

-- Disable table prefetching
alter system set "_table_lookup_prefetch_size" = 0 scope = spfile;

-- Back to defaults
alter system reset "_table_lookup_prefetch_size" scope = spfile sid = '*';

From 11g on this can be handled via the nlj_prefetch and nlj_batching hints

But they work a bit counterintuitively when combined, therefore

opt_param('_nlj_batching_enabled', 0)

is also required to get exactly the NL join optimization requested

Since this is about logical I/O, not physical I/O, you need a sufficiently sized cache
defined (256M should be fine), otherwise the results will differ
when physical I/O happens
#

spool unique_non_unique_index_difference.log

drop table t1;

purge table t1;

exec dbms_random.seed(0)

-- Random order
-- Create 10 rows in a single block
create table t1
--pctfree 0
as
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 10
order by
-- id
dbms_random.value
;

-- Tell Oracle to store at most 10 rows per block
alter table t1 minimize records_per_block;

truncate table t1;

-- Populate the table, resulting in exactly 10,000 blocks with MSSM
insert /*+ append */ into t1
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 100000
order by
-- id
dbms_random.value
;

-- Force BLEVEL 2 for UNIQUE index (with 8K blocks, root->branch->leaf)
create unique index t1_idx on t1 (id) pctfree 80;

-- Avoid any side effects of dynamic sampling
-- (and perform delayed block cleanout when not using direct-path load)
exec dbms_stats.gather_table_stats(null, 't1', estimate_percent => null)

-- Add PK constraint
alter table t1 add constraint t1_pk primary key (id);

drop table t2;

purge table t2;

-- exec dbms_random.seed(0)

-- Random order (but different from T1 order)
-- Create 10 rows in a single block
create table t2
--pctfree 0
as
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 10
order by
-- id
dbms_random.value
;

-- Tell Oracle to store at most 10 rows per block
alter table t2 minimize records_per_block;

truncate table t2;

-- Populate the table, resulting in exactly 10,000 blocks with MSSM
insert /*+ append */ into t2
select
rownum as id
, rpad('x', 100) as filler
from
dual
connect by
level <= 100000
order by
-- id
dbms_random.value
;

-- Force BLEVEL 2 for NON-UNIQUE index (with 8K blocks, root->branch->leaf)
create index t2_idx on t2 (id) pctfree 80;

-- Avoid any side effects of dynamic sampling
-- (and perform delayed block cleanout when not using direct-path load)
exec dbms_stats.gather_table_stats(null, 't2', estimate_percent => null)

-- Add PK constraint based on non-unique index
alter table t2 add constraint t2_pk primary key (id);

alter session set statistics_level = all;

-- Run the commands once to cache the blocks and get a runtime profile
select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t2 a
, t1 b
where
a.id = b.id
);

select * from table(dbms_xplan.display_cursor(null, null, '+COST +OUTLINE ALLSTATS LAST'));

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t1 a
, t2 b
where
a.id = b.id
);

select * from table(dbms_xplan.display_cursor(null, null, '+COST +OUTLINE ALLSTATS LAST'));

-- Eliminate row source statistics overhead
-- for the "real" test
alter session set statistics_level = typical;

exec runstats_pkg.rs_start

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t2 a
, t1 b
where
a.id = b.id
);

exec runstats_pkg.rs_middle

select
max(b_filler), max(a_filler)
from (
select /*+ leading(a) use_nl(a b) opt_param('_nlj_batching_enabled', 0) no_nlj_prefetch(b) */
a.id as a_id, a.filler as a_filler, b.id as b_id, b.filler as b_filler
from
t1 a
, t2 b
where
a.id = b.id
);


set serveroutput on

exec runstats_pkg.rs_stop(-1)

spool off


Expectations

What are the expected results - explained here based on 10.2? This is a nested loop join between two tables of 100,000 rows each, each allocating 10,000 blocks. The inner row source will use the available index and perform a table lookup by ROWID 100,000 times.

The script specifically crafts the indexes to have a height of 3 (BLEVEL = 2) when using a default block size of 8K which means that they have a root block with a number of branch blocks on the second level and finally the leaf blocks on the third level. Note that different block sizes can lead to different index heights and therefore different results.
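
As a quick, optional sanity check (not part of the original script) you can verify the resulting structures via the data dictionary after gathering statistics - the views and columns used below are the standard USER_TABLES / USER_INDEXES ones:

-- Optional check: confirm 10,000 table blocks and BLEVEL = 2 for the indexes
select table_name, blocks, num_rows from user_tables where table_name in ('T1', 'T2');

select index_name, blevel, leaf_blocks, distinct_keys from user_indexes where index_name in ('T1_IDX', 'T2_IDX');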

In terms of "buffer visits" required to complete the statement we could think of the following:

- 100,000 block visits for the outer row source running a simple full table scan. For every iteration of the loop we need to visit the buffer and read the next row that will be used for the lookup into the inner row source

- 300,000 block visits for the inner row source index lookup, since for every index lookup we need to traverse the index from root to branch to leaf

- 100,000 block visits for the inner row source table lookup by ROWID

So according to this model in total we need to "explain" 500,000 block visits for this example.

Let's have a look at the various results from the script.

1. The runtime profile

a) Running the Nested Loop Join using the "Unique Index" inner row source


Plan hash value: 3952364803

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |      1 |00:00:03.95 |     310K|
|   2 |   NESTED LOOPS                |        |      1 |    100K|   202K  (1)|    100K|00:00:03.70 |     310K|
|   3 |    TABLE ACCESS FULL          | T2     |      1 |    100K|  2716   (1)|    100K|00:00:00.30 |   10010 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T1     |    100K|      1 |     2   (0)|    100K|00:00:02.59 |     300K|
|*  5 |     INDEX UNIQUE SCAN         | T1_IDX |    100K|      1 |     1   (0)|    100K|00:00:01.14 |     200K|
---------------------------------------------------------------------------------------------------------------


If everything went as expected you'll see here the "classic" shape of a Nested Loop Join using an index lookup for the inner row source. The loop is driven by the full table scan of the T2 table, and for every row produced by that row source the inner row source is examined, starting with an index unique scan (in this case) followed by a table access by ROWID for those rows found in the index.

Comparing the runtime profile to the model described above, one significant difference becomes immediately obvious: The profile only shows 310,000 logical I/Os, not 500,000. So either the above model is incorrect or Oracle has introduced some "short-cuts" that avoid approx. 190,000 out of the 500,000 logical I/Os. The difference of 190,000 seems to come from the index unique scan, which only reports 200,000 logical I/Os instead of the expected 300,000, and from the full table scan of T2 driving the nested loop, which reports only 10,000 logical I/Os instead of the 100,000 pictured above. More on those differences in a moment.

b) Running the Nested Loop Join using the "Non-Unique Index" inner row source


Plan hash value: 537985513

---------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------
|   1 |  SORT AGGREGATE               |        |      1 |      1 |            |      1 |00:00:06.31 |     311K|
|   2 |   NESTED LOOPS                |        |      1 |    100K|   202K  (1)|    100K|00:00:06.10 |     311K|
|   3 |    TABLE ACCESS FULL          | T1     |      1 |    100K|  2716   (1)|    100K|00:00:00.30 |   10010 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T2     |    100K|      1 |     2   (0)|    100K|00:00:04.91 |     301K|
|*  5 |     INDEX RANGE SCAN          | T2_IDX |    100K|      1 |     1   (0)|    100K|00:00:01.77 |     201K|
---------------------------------------------------------------------------------------------------------------


This is very interesting: First of all the cost calculation is the same, so in terms of the optimizer's cost estimates there is no difference between the unique and non-unique case.

However the runtime is significantly different: The non-unique variant is consistently slower than the unique variant.

Furthermore, another minor difference is a slightly increased number of logical I/Os that seems to be caused by the INDEX RANGE SCAN operation (201K vs. 200K).

Why is that? Although we have defined a non-deferrable primary key constraint that guarantees uniqueness, in the case of an index range scan Oracle still searches for the next index entry that does not satisfy the access predicate - for every iteration of the loop Oracle looks at the next index entry to check whether it still satisfies the predicate or not. For the last index entry in each leaf block this means it actually has to check the next leaf block's first entry for this comparison, hence we end up with approximately as many additional logical I/Os as there are index leaf blocks. This is also the first part of the explanation why Oracle has to perform more work for the non-unique variant. From the runtime profile, however, we can tell that although we lose time at the index range scan vs. index unique scan operation, we lose even more time at the table access by ROWID operation.

Remember for a better understanding that the A-Time and Buffers columns are cumulative - every parent operation includes the runtime/logical I/Os of its child operations, so in order to understand the runtime/logical I/Os of an operation itself you need to subtract the values of its direct descendant operation(s).
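
For example, in the non-unique runtime profile above the INDEX RANGE SCAN reports 201K buffer gets and 00:00:01.77, so the TABLE ACCESS BY INDEX ROWID operation itself accounts for roughly 301K - 201K = 100K buffer gets and approx. 4.91 - 1.77 = 3.1 seconds of the elapsed time.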

2. Session Statistics

Let's have a look at the relevant session statistics:


Statistics Name                            Unique  Non-Unique  Difference
---------------------------------------- -------- ----------- -----------
STAT..buffer is pinned count              189,998     189,998           0
STAT..table scan blocks gotten             10,000      10,000           0
STAT..table scan rows gotten              100,000     100,000           0
STAT..table fetch by rowid                100,000     100,002           2
STAT..buffer is not pinned count          200,001     200,005           4
STAT..consistent gets                     310,012     311,110       1,098
STAT..consistent gets from cache          310,012     311,110       1,098
STAT..session logical reads               310,012     311,110       1,098
STAT..index fetch by key                  100,000           2     -99,998
STAT..rows fetched via callback           100,000           2     -99,998
STAT..index scans kdiixs1                       0     100,000     100,000
STAT..consistent gets - examination       300,001     100,007    -199,994
STAT..no work - consistent read gets       10,001     211,093     201,092
LATCH.cache buffers chains                320,023     522,213     202,190


a) Pinned Buffers

Now isn't that an interesting coincidence? Maybe not. The "buffer is pinned count" statistic quite nicely matches the missing 190,000 buffer visits. So Oracle managed to keep a buffer pinned 190,000 times instead of re-locating it in the buffer cache by hashing the database block address to find the corresponding hash bucket, grabbing the corresponding "cache buffers chains" child latch and so on.

Which buffers does Oracle keep pinned? Further modifications of the test case and investigating the logical I/O details using events 10200/10202 allow drawing the conclusion that Oracle keeps the buffers of the driving table T2 pinned, as well as the root block of the inner index. Pinning the root block of the index is a particularly good idea since it saves one logical I/O per loop iteration and the index root block is also quite unlikely to change frequently.
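
For reference, here is a minimal sketch of how such a logical I/O trace could be enabled for the test query - the content and format of the resulting trace file varies considerably by version, so treat this purely as an illustration:

-- Sketch only: dump logical I/O details for the following statement(s) to the session trace file
alter session set events '10200 trace name context forever, level 1';
alter session set events '10202 trace name context forever, level 1';

-- ... run the Nested Loop Join query here ...

alter session set events '10200 trace name context off';
alter session set events '10202 trace name context off';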

Why does Oracle not simply keep all of the buffers pinned rather than going through the hash/latch/pin exercise again and again? Very likely for various scalability/concurrency reasons, for example:

- A pinned buffer cannot be removed/replaced even if it were eligible according to the LRU logic, hence it potentially prevents other blocks from being cached

- A pinned buffer cannot be accessed by other sessions that want to pin it in an incompatible mode (exclusive vs. shared), although multiple sessions can pin it concurrently in a compatible mode (shared). Those sessions either have to queue behind (that is what a "buffer busy wait" is about) or they may be able to create a clone copy of the block and continue their work on the clone copy. Although the "clone copy" trick is a nice one, it is undesirable for several reasons:

* The "clone" copies require each a buffer from the cache effectively reducing the number of different blocks that can be held in the buffer cache. They are also the reason why an object might require much more cache than its original segment size in order to stay completely in the cache.

* They increase the "length" of the "cache buffers chains", leading to longer search times when locating a block's buffer in the cache while holding the "cache buffers chains" latch, hence increasing the potential for latch contention

So here is an important point: If you want to understand the work Oracle has performed in terms of buffer visits, you need to consider both the number of logical I/Os and the number of buffers visited without involving logical I/O - the latter is represented by the "buffer is pinned count" statistic.

Quite often this fact is overlooked and people only focus on the logical I/Os - which is not unreasonable - but this misses the buffers that are re-visited while kept pinned, without doing logical I/O.

Note that buffer pinning is not possible across fetch calls - if control is returned to the client the buffers will no longer be kept pinned. This is the explanation why the "fetchsize" or "arraysize" for bulk fetches can influence the number of logical I/Os required to process a result set.
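
A simple way to see this effect for yourself (a sketch, not part of the test case) is to run the same statement in SQL*Plus with different arraysize settings and compare the "consistent gets" reported by AUTOTRACE - the larger the array size, the fewer fetch calls and typically the fewer logical I/Os for the same amount of data:

set autotrace traceonly statistics

set arraysize 15
select * from t1;

set arraysize 500
select * from t1;

set autotrace off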

b) "consistent gets - examination"

There is another significant difference between the two runs that explains most of the remaining runtime difference between the unique and the non-unique index variant: The unique index variant performs approx. 310,000 logical I/Os, quite similar to the non-unique index variant, yet it grabs the corresponding "cache buffers chains" child latches only 320,000 times vs. 520,000 times for the non-unique index variant.

How is that possible? The explanation can be found in the statistics: When dealing with a unique index, the Nested Loop Join performs all logical I/Os of the inner row source as "short-cut" consistent gets, which are called "consistent gets - examination". Oracle uses this shortcut whenever it knows that the block visit will be of very short duration. Oracle knows that in this particular case because the unique index guarantees that there will be at most one matching entry in the index structure and at most one row for the subsequent table lookup. So there is no need to perform a "range" scan on the index, and it is guaranteed that only a single row per iteration can be returned from the index unique scan for the table lookup by ROWID.

Hence Oracle makes use of this knowledge and works on the buffer contents while holding the latch - this is what the "consistent gets - examination" statistic is about. A "normal" consistent get grabs the latch initially and releases it after having the buffer pinned. It then works on the buffer and afterwards "unpins" it, which requires another latch acquisition. Therefore a "non-shortcut" consistent get requires two latch acquisitions per logical I/O. This explains why we have 10,000 non-shortcut consistent gets for the driving full table scan (accompanied by 90,000 buffer visits that avoid logical I/O by keeping the buffer pinned), resulting in 20,000 latch acquisitions, plus 300,000 latch acquisitions for the remaining 300,000 "short-cut" consistent gets, which makes 320,000 latch acquisitions in total for the unique index variant.

The non-unique index variant performs 200,000 "non-shortcut" logical I/Os on the inner index and the table lookup, responsible for 400,000 latch acquisitions, plus another 10,000 for the driving full table scan (this part is no different from the unique index variant), good for another 20,000 latch acquisitions. But it also performs 100,000 "short-cut" consistent gets, resulting in the remaining 100,000 latch acquisitions. Modifying the test case by creating the index with a height of 2 (BLEVEL = 1) shows that Oracle uses the "short-cut" consistent gets on the branch blocks of the index, so this is another area where Oracle makes use of the "optimized" version of logical I/O even with the non-unique index variant.
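
If you want to double-check these figures without the RUNSTATS package, one possible approach (a sketch based on the standard V$MYSTAT, V$STATNAME and V$LATCH views) is to snapshot the relevant statistics before and after each execution and compare the deltas:

select sn.name, ms.value
from v$mystat ms, v$statname sn
where ms.statistic# = sn.statistic#
and sn.name in (
  'session logical reads'
, 'consistent gets'
, 'consistent gets - examination'
, 'buffer is pinned count'
, 'buffer is not pinned count'
);

select gets from v$latch where name = 'cache buffers chains';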

Scalability

What do these subtle differences mean in terms of scalability? Well, you can download a bunch of scripts that allow running the same test case with as many concurrent sessions as desired. They will show that there is a significant difference between the two cases: The unique index variant is not only faster in "single-user" mode but also scales much better than the non-unique index variant when performing the test concurrently (and completely cached, i.e. purely logical I/O based). The provided test suite could be modified to use a more realistic scenario that runs the statement multiple times in each concurrent session with a random sleep in between, but that is left as an exercise for the interested reader.

Summary

The baseline results show that Oracle uses many built-in features to optimize logical I/O, either by avoiding the logical I/O altogether or by using "short-cut" versions of logical I/O where applicable.

These optimizations allow the "unique index" variant to perform significantly better than the "non-unique index" variant of this particular test case. Note that the difference is only that significant when dealing with the pure logical I/O variant - introducing physical I/O makes the difference far less impressive, since the majority of the time is then spent on physical I/O, not logical I/O.

In the upcoming parts of this series I'll focus on further enhancements introduced in the recent releases, like table prefetching, Nested Loop Join batching (aka I/O batching) and an obviously significantly enhanced buffer pinning mechanism introduced in Oracle 11g.

Footnote

If you study the script carefully, you'll notice that it allows for different orderings of the data - randomly ordered, randomly ordered but with the same (pseudo-random) order for T1 and T2, or ordered by ID.

If you run the test with these different orderings of data you'll notice no difference in the results with 10g (and table prefetching disabled), but it might give a clue where this is heading in the upcoming posts.

Forward to Part 2

9 comments:

  1. Very interesting post, Randolf.
    Thanks!

  2. Excellent! Since I'm working for a bank most of the time and banks are not keen on upgrading, most of our databases are 9i and 10g (there is no business critical DB on 11g now).
    So I never had time to investigate 11g new features like consistent gets (fast path) and the difference between nlj prefetching and nlj batching (we turned off nlj batching on some 11g DBs because it could return wrong results under some circumstances in 11.2.0.1).


    I'm looking forward to your next posts. It seems you are going to explain some of these topics.

    Pavol Babel

  3. Great analysis. Look forward to investigating your examples more in depth.
    Thanks
    Kyle Hailey

  4. Nice & informative.
    Based on your findings, I think a large DB system with a limited cache size would perform significantly better with a reduction in I/Os (logical or not). I have systems like these, so thank you for giving me an opportunity to rejuvenate them a little.

  5. Randolf,
    thank you for the interesting post - to which I have a - more or less closely related - question: when I create table t1 with only a single row (without the connect by part), the result of: alter table t1 minimize records_per_block; and the following: insert /*+ append */ into t1 ... is a segment with 50.000 blocks. The documentation (http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_3001.htm#i2125860) states: "Specify MINIMIZE to instruct Oracle Database to calculate the largest number of records in any block in the table and to limit future inserts so that no block can contain more than that number of records." So I would have expected a 100.000 block segment (with one block per row). Do you know a reason for the difference?

    Regards
    Martin

  6. Hi Martin,

    I think there are two important things to consider:

    1. MINIMIZE RECORDS_PER_BLOCK scans the row directory of all blocks for the maximum number of row entries per block - it is not counting the current number of rows in the block. So you can end up with unexpected results if you happen to have blocks with more row entries in the row directory than actual rows. I don't think that applies to your example though - it's just an important point to consider when dealing with this option.

    2. I thought I had read about this anomaly already in Jonathan's "Practical Oracle 8i", but I can't find the reference at present. I might be confusing this with some other similarly strange effect. You can check the SYS.TAB$.SPARE1 column that holds the current Hakan factor for the table - bits 32768 and higher are to be ignored, I think - to see if this gives you a clue whether Oracle doesn't allow this to go below two (hey, that rhymes!)
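
    A hedged sketch of such a dictionary check (requires access to the SYS-owned tables; the bit interpretation is an assumption based on the above):

    -- Sketch only: inspect the Hakan factor stored in SYS.TAB$, masking off the 0x8000 flag bit
    select o.name, t.spare1, bitand(t.spare1, 32767) as hakan_factor
    from sys.tab$ t, sys.obj$ o
    where t.obj# = o.obj#
    and o.name = 'T1';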

    Randolf

  7. Randolf,
    thank you for the additional details. For T1 I see a SPARE1 value of 32769 in SYS.TAB$. According to Julian Dyke's Bitmap Index Internals presentation (http://www.juliandyke.com/Presentations/BitmapIndexInternals.ppt, p. 36) the 0x8000 bit is set to indicate that "minimize records_per_block" is active. Building a table T2 with connect by level <= 2 I get the same 32769; with connect by level <= 3 I get SPARE1 = 32770 (and 3 rows per block) and with your example (connect by level <= 10) SPARE1 = 32777 (and of course 10 rows per block).

    My calculator claims: 32777 - 32768 = 9 - but the result is 10 rows per block. With 32769 (- 32768 = 1) there are 2 rows per block. Even if I use bigger rows (rpad('x', 5000) - so only one row fits into the 8K block) SPARE1 remains 32769. I assume that it's not possible to get SPARE1 = 32768 and to force a one-row-per-block filling with minimize records_per_block (if more than one row fits into a block).

    Not really important but entertaining.

    Regards
    Martin

  8. Hi,
    Thanks for quite an interesting post.
    There's one thing I couldn't figure out from your post: the way Oracle calculates the "buffer is not pinned count" statistic.
    I've set up a similar environment (my leading table has got only one block with 10 rows; the other table has got 100k rows and 100k blocks) on my machine and tried to touch the base with that statistic but alas I can't state that I could. Although there are some findings that you and your readers might find useful.
    I've run two types of queries, the first akin to yours t1->idx->t2, and the second one just t1->idx.
    First off, it seemed like the number of CR in the trace is two fewer than in the corresponding stat. Second off, Oracle built a CR copy of the root index block and of the (only) t1 table block two times each (before pinning it; from the trace). Finally, it seems like Oracle reported the (not-)pinned count only for table blocks, not for the index root block; moreover for the t2 table (in the first type of query) Oracle reported not pinned count = 2 * number of actually accessed blocks (as I can see you've got pretty much the same picture as you've got 200k not pinned buffers).
    Looks like the not pinned count stat is not that reliable after all.

  9. Alexandr,

    I'm glad that finally someone took up that point - I actually expected this question much earlier.

    But I have the same problem as you outline: I've run several tests and it looked like I couldn't figure out a straightforward explanation of when and why the "buffer is not pinned count" statistic gets increased or not.

    In principle you could argue that it should increase each time a block is visited and is not already pinned - but then this seems to correspond to the definition of a "logical I/O", so clearly I would expect this statistic to express something different, but it is unclear to me what exactly it is supposed to represent.

    So the term "unreliable" according to some tests seems to fit - as long as no-one is able to figure out the underlying rules this statistics is supposed to follow.

    Randolf
