Sunday, May 24, 2009

Understanding the different modes of System Statistics aka. CPU Costing and the effects of multiple blocksizes - part 4

Back to part 3

Using objects residing in multiple blocksizes

I've already mentioned it several times on my blog but I would like to take the chance here again to stress the point that the cost based optimizer does a bad job when it comes to calculating costs for full table scans of objects residing in non-default block sizes. It really looks like that this feature has been introduced to support transportable tablespaces but it obviously hasn't been tested very thoroughly when it comes to cost calculation.

Each of the different modes has its deficiencies when dealing with objects in non-default blocksizes. The somehow odd thing is that the traditional I/O costing does the best job, and all system statistics based calculations are utterly wrong.

Traditional I/O based costing

The traditional I/O based costing simply scales the MBRC up or down according to the non-default blocksize to come to the same I/O read request size. So if you e.g. have a MBRC of 8 and a default blocksize of 8KB and now calculate the cost for an object residing in a 2KB tablespace, the MBRC will be multiplied 4, which results in a MBRC of 32. The I/O cost will be different although due to the different adjustment used with the higher MBRC setting. The adjusted MBRC for 32 is 16.39 whereas the adjusted MBRC for 8 is 6.59, so the calculated cost for the full table scan of the object residing in the 2KB tablespace will be higher. Likewise the same happens when using an object in a 16KB non-default tablespace. The MBRC will be reduced accordingly to 4 to get the same I/O read size again. Since adjusted MBRC for MBRC = 4 is 4.17, the cost calculated will actually be less for the object residing the 16KB tablespace.

System statistics

Things get more complicated when using NOWORKLOAD or WORKLOAD system statistics.
To recall the formula to calculate the I/O cost of a full table scan with system statistics is:

Number of blocks / MBRC * MREADTIM / SREADTIM

And in case of NOWORKLOAD system statistics the MREADTIM and SREADTIM are synthesized using the following formula:

SREADTIM = IOSEEKTIM + DB_BLOCK_SIZE / IOTRFSPEED

MREADTIM = IOSEEKTIM + DB_BLOCK_SIZE * MBRC / IOTRFSPEED

Now if the object resides in a non-default blocksize tablespace, the following inconsistent adjustments are applied to the formula:

NOWORKLOAD system statistics:

SREADTIM as above, using _DEFAULT_ DB_BLOCK_SIZE

MREADTIM = IOSEEKTIM + _DEFAULT_ DB_BLOCK_SIZE * scaled MBRC / IOTRFSPEED

I/O cost = Number of blocks / scaled MBRC * MREADTIM / SREADTIM

So obviously something is odd in above formulas: The SREADTIM and MREADTIM values are synthesized with a mixture of a scaled MBRC (according to the block size) but a non-adjusted default DB_BLOCK_SIZE, resulting in a large variation in cost, so a full table scan in a small blocksize is much more expensive than in the default blocksize, and likewise a full table scan in a large blocksize is much cheaper. Unfortunately this doesn't reflect at all the runtime behaviour, since Oracle acutally scales the I/O read request size accordingly meaning that the runtime difference usually is negligible, but the cost calculated is dramatically different.

WORKLOAD system statistics:

MBRC as measured/set

SREADTIM as measured/set

MREADTIM as measured/set

I/O cost = Number of blocks / MBRC * MREADTIM / SREADTIM

This is even worse than above NOWORKLOAD result because the I/O cost calculated simply is different by the factor of number of blocks in non-default block size / number of blocks in default block size, e.g. an object residing in a 2KB block size will have an I/O cost four times higher than an object residing in a 8KB default blocksize, and the MBRC is not adjusted at all for the calculation.

Some calculation examples

The following examples attempt to summarize what you get if you're using an object in a non-default blocksize tablespace:

1. I/O Cost of a full table scan of a 10,000 blocks / 80MB segment residing in a default blocksize of 8KB (in case of system statistics excluding the CPU cost) and using an MBRC of 8:

traditional I/O:
round(10,000 blocks / 6.59) = 1,517

(default) NOWORKLOAD system statistics:
SREADTIM = 10 + 8192/4096 = 10 + 2 = 12
MREADTIM = 10 + 8192*8/4096 = 10 + 16 = 26

round((10,000 / 8) * 26 / 12) = 1,250 * 26 / 12 = 2,708

WORKLOAD system statistics:
same as above provided you use the same for MREADTIM and SREADTIM otherwise as measured

2. I/O Cost of a full table scan of a 40,000 blocks / 80MB segment residing in a non-default blocksize of 2KB (in case of system statistics excluding the CPU cost) and using a default blocksize of 8KB and a MBRC of 8:

traditional I/O:
MBRC used for 2KB blocksize = 8 * 8KB / 2KB = 8 * 4 = 32

adjusted MBRC for MBRC = 32: 16.39

round(40,000 blocks / 16.39) = 2,441

(default) NOWORKLOAD system statistics:
SREADTIM = 10 + 8192/4096 = 10 + 2 = 12
(should be 10 + 2048/4096 = 10 + 0.5 = 10.5)

MREADTIM = 10 + 8192*32/4096 = 10 + 64 = 74
(should be 10 + 2048*32/4096 = 10 + 16 = 26)

round((40,000 / 32) * 74 / 12) = 1,250 * 74 / 12 = 7,708
(should be round((40,000 / 32) * 26 / 10.5) = 1,250 * 26 / 10.5 = 3,095

WORKLOAD system statistics (for the sake of comparison using the NOWORKLOAD defaults):
MBRC = 8
SREADTIM = 12
MREADTIM = 26

round((40,000 / 8) * 26 / 12) = 5000 * 26 / 12 = 10,833
(should be round((40,000 / 32) * 26 / 12) = 1,250 * 26 / 12 = 2,708

3. I/O Cost of a full table scan of a 5,000 blocks / 80MB segment residing in a non-default blocksize of 16KB (in case of system statistics excluding the CPU cost) and using a default blocksize of 8KB and a MBRC of 8:

traditional I/O:
MBRC used for 16KB blocksize = 8 * 8KB / 16KB = 8 * 0.5 = 4

adjusted MBRC for MBRC = 4: 4.17

round(5,000 blocks / 4.17) = 1,199

(default) NOWORKLOAD system statistics:
SREADTIM = 10 + 8192/4096 = 10 + 2 = 12
(should be 10 + 16384/4096 = 10 + 4 = 14)

MREADTIM = 10 + 8192*4/4096 = 10 + 8 = 18
(should be 10 + 16384*4/4096 = 10 + 16 = 26)

round((5,000 / 4) * 18 / 12) = 1,250 * 18 / 12 = 1,875
(should be round((5,000 / 4) * 26 / 14) = 1,250 * 26 / 14 = 2,321

WORKLOAD system statistics:
MBRC = 8
SREADTIM = 12
MREADTIM = 26

round((5,000 / 8) * 26 / 12) = 625 * 26 / 12 = 1,354
(should be (5,000 / 4) * 26 / 12) = 1,250 * 26 / 12 = 2,708

Plan change by moving to non-default block size

And here is a final example that shows by simply moving an object to a non-default blocksize tablespace a change in the execution plan can be observed. An index scan is used instead of the full table scan since the cost of the full table scan operation is significantly increased. The SQL trace will reveal that at runtime however there is no significant difference between the full table scan executed in the default tablespace and non-standard blocksize tablespace.


SQL> exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>
SQL> create table t1
2 as
3 select
4 trunc(dbms_random.value(0,25)) as col1
5 , trunc(dbms_random.value(0,25)) as col2
6 , rpad('x', 200, 'x') as filler
7 from
8 (
9 select
10 level as id
11 from
12 dual
13 connect by
14 level <= 10000
15 );

Table created.

Elapsed: 00:00:00.49
SQL>
SQL> create index t1_idx on t1 (col1, col2)
2 ;

Index created.

Elapsed: 00:00:00.14
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.55
SQL>
SQL> explain plan for
2 select /*+ full(t1) */
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.02
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 83 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 96 | 19776 | 83 (2)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COL1"=1 AND ("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR
"COL2"=4 OR "COL2"=5 OR "COL2"=6))

14 rows selected.

Elapsed: 00:00:00.13
SQL>
SQL> explain plan for
2 select /*+ index(t1) */
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 546753835

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 96 (0)| 00:00:02 |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 96 | 19776 | 96 (0)| 00:00:02 |
|* 2 | INDEX RANGE SCAN | T1_IDX | 96 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("COL1"=1)
filter("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR "COL2"=4 OR "COL2"=5 OR
"COL2"=6)

16 rows selected.

Elapsed: 00:00:00.11
SQL>
SQL> explain plan for
2 select
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.01
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 83 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 96 | 19776 | 83 (2)| 00:00:01 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COL1"=1 AND ("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR
"COL2"=4 OR "COL2"=5 OR "COL2"=6))

14 rows selected.

Elapsed: 00:00:00.05
SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.02
SQL>
SQL> alter session set tracefile_identifier = 'fts_8k';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> alter session set events '10046 trace name context forever, level 8';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set termout off
SQL>
SQL> select /*+ full(t1) */
2 *
3 from
4 t1
5 where
6 col1 = 1
7 and col2 in (1,2,3,4,5,6);

SQL>
SQL> set termout on
SQL>
SQL> alter session set events '10046 trace name context off';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set tracefile_identifier = 'idx_8k';

Session altered.

Elapsed: 00:00:00.02
SQL>
SQL> alter session set events '10046 trace name context forever, level 8';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set termout off
SQL>
SQL> select /*+ index(t1) */
2 *
3 from
4 t1
5 where
6 col1 = 1
7 and col2 in (1,2,3,4,5,6);

SQL>
SQL> set termout on
SQL>
SQL> alter session set events '10046 trace name context off';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> drop table t1 purge;

Table dropped.

Elapsed: 00:00:00.70
SQL>
SQL> exec dbms_random.seed(0)

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>
SQL> create table t1
2 tablespace test_2k
3 as
4 select
5 trunc(dbms_random.value(0,25)) as col1
6 , trunc(dbms_random.value(0,25)) as col2
7 , rpad('x', 200, 'x') as filler
8 from
9 (
10 select
11 level as id
12 from
13 dual
14 connect by
15 level <= 10000
16 );

Table created.

Elapsed: 00:00:00.46
SQL>
SQL> create index t1_idx on t1 (col1, col2)
2 tablespace test_2k;

Index created.

Elapsed: 00:00:00.13
SQL>
SQL> exec dbms_stats.gather_table_stats(null, 't1')

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.69
SQL>
SQL> explain plan for
2 select /*+ full(t1) */
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.07
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3617692013

--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 245 (1)| 00:00:03 |
|* 1 | TABLE ACCESS FULL| T1 | 96 | 19776 | 245 (1)| 00:00:03 |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("COL1"=1 AND ("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR
"COL2"=4 OR "COL2"=5 OR "COL2"=6))

14 rows selected.

Elapsed: 00:00:00.11
SQL>
SQL> explain plan for
2 select /*+ index(t1) */
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.06
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 546753835

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 102 (0)| 00:00:02 |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 96 | 19776 | 102 (0)| 00:00:02 |
|* 2 | INDEX RANGE SCAN | T1_IDX | 96 | | 6 (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("COL1"=1)
filter("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR "COL2"=4 OR "COL2"=5 OR
"COL2"=6)

16 rows selected.

Elapsed: 00:00:00.10
SQL>
SQL> explain plan for
2 select
3 *
4 from
5 t1
6 where
7 col1 = 1
8 and col2 in (1,2,3,4,5,6);

Explained.

Elapsed: 00:00:00.02
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 546753835

--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 96 | 19776 | 102 (0)| 00:00:02 |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 96 | 19776 | 102 (0)| 00:00:02 |
|* 2 | INDEX RANGE SCAN | T1_IDX | 96 | | 6 (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("COL1"=1)
filter("COL2"=1 OR "COL2"=2 OR "COL2"=3 OR "COL2"=4 OR "COL2"=5 OR
"COL2"=6)

16 rows selected.

Elapsed: 00:00:00.10
SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.03
SQL>
SQL> alter session set tracefile_identifier = 'fts_2k';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set events '10046 trace name context forever, level 8';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> set termout off
SQL>
SQL> select /*+ full(t1) */
2 *
3 from
4 t1
5 where
6 col1 = 1
7 and col2 in (1,2,3,4,5,6);

SQL> set termout on
SQL>
SQL> alter session set events '10046 trace name context off';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set tracefile_identifier = 'idx_2k';

Session altered.

Elapsed: 00:00:00.01
SQL>
SQL> alter session set events '10046 trace name context forever, level 8';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> set termout off
SQL>
SQL> select /*+ index(t1) */
2 *
3 from
4 t1
5 where
6 col1 = 1
7 and col2 in (1,2,3,4,5,6);

SQL> set termout on
SQL>
SQL> alter session set events '10046 trace name context off';

Session altered.

Elapsed: 00:00:00.00
SQL>
SQL> spool off


And here's what we get from the generated trace files.

The 8KB FTS does these multi-block reads:

WAIT #35: nam='db file scattered read' ela= 2296 file#=8 block#=12050 blocks=8 obj#=61857 tim=107988983808
WAIT #35: nam='db file scattered read' ela= 916 file#=8 block#=12058 blocks=8 obj#=61857 tim=107988984970
WAIT #35: nam='db file scattered read' ela= 911 file#=8 block#=12066 blocks=8 obj#=61857 tim=107988986104


And the 2KB FTS does these multi-block reads:

WAIT #37: nam='db file scattered read' ela= 798 file#=6 block#=66 blocks=32 obj#=61859 tim=107991673671
WAIT #37: nam='db file scattered read' ela= 828 file#=6 block#=98 blocks=32 obj#=61859 tim=107991674822
WAIT #37: nam='db file scattered read' ela= 829 file#=6 block#=130 blocks=32 obj#=61859 tim=107991675980


And the TKPROF output confirms that there is not a relevant difference in the wait times.

The 8KB output:

********************************************************************************

select /*+ full(t1) */
*
from
t1
where
col1 = 1
and col2 in (1,2,3,4,5,6)

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.00 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.01 0.06 296 299 0 93
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.01 0.06 296 299 0 93

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61

Rows Row Source Operation
------- ---------------------------------------------------
93 TABLE ACCESS FULL T1 (cr=299 pr=296 pw=0 time=21098 us)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
db file sequential read 1 0.00 0.00
db file scattered read 37 0.02 0.05
SQL*Net message from client 2 0.00 0.00
********************************************************************************


The 2KB output:


********************************************************************************

select /*+ full(t1) */
*
from
t1
where
col1 = 1
and col2 in (1,2,3,4,5,6)

call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.00 0 0 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 2 0.02 0.07 1251 1254 0 93
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 4 0.03 0.07 1251 1254 0 93

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 61

Rows Row Source Operation
------- ---------------------------------------------------
93 TABLE ACCESS FULL T1 (cr=1254 pr=1251 pw=0 time=21314 us)


Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
SQL*Net message to client 2 0.00 0.00
db file sequential read 1 0.00 0.00
db file scattered read 40 0.02 0.05
SQL*Net message from client 2 0.00 0.00
********************************************************************************


Don't be mislead by the different number of consistent gets, this is just due to the different block size. As can be seen from the raw trace files the multi-block reads in the 2KB block size are reading 32 blocks at a time, whereas in the 8KB default block size 8 blocks are read at a time, so the number of I/O requests will be similar given a segment of the same size.

3 comments:

  1. Nice analysis!

    I have never witnessed these execution plans changes, very interesting.

    I wonder if this happens in the real world? I've never seen it, but I am working with very large tables.

    >>> Don't be mislead by the different number of consistent gets, this is just due to the different block size. <<<

    Yes, but consistent gets have an expense, mostly CPU, setting locks & latches. Less gets is less CPU overhead, the question is the size of the change.

    ReplyDelete
  2. Apologies for the late reply.

    >>> Yes, but consistent gets have an expense, mostly CPU, setting locks & latches. Less gets is less CPU overhead, the question is the size of the change. <<<

    That is basically true, but the point that I tried to make here was a different one: Quite often the larger block sizes are recommended because they would offer a "faster I/O" simply due to fact that the blocks are larger.

    As the test case demonstrates this is not true, because the I/O request size for multi-block reads is always the one defined by the default block size.

    Regarding your point of less locks & latches: Usually when performing multi-block I/O most of the time is spent on waiting on the I/O and the difference in latch activity between the different block sizes is negligible, although it might be measurable.

    If you have a system that performs so many consistent gets that you get contention due to latch activity moving the objects to a larger block size will probably only offer a small relief.

    Very often this is caused by inefficient execution plans (e.g. NESTED LOOP operations with millions of iterations of random table lookups or full table scans) and tuning these plans will be the more scalable solution in most cases.

    Another point to consider when using larger block sizes: If you have a workload that also consists of random row table access by rowids (usually when performing NESTED LOOP operations and/or index scans) you might be unlucky and the table rows accessed are scattered across many blocks which required more space in the cache buffer with the larger block size, competing for the buffer cache with other operations.

    Of course you might also be lucky and the number of blocks accessed is lower with the larger block size than with a smaller block size potentially reducing the number of consistent gets (e.g. in case of a index range scan).

    So as you can see, there is no black and white, but as always: It depends!

    Regards,
    Randolf

    ReplyDelete
  3. Hi Randolf,

    Nice Posts :)

    It would be nice if you could summarize all of the formulas you used from IO cost model to CPU cost model, and some links/readables regarding these computations.

    Just for a quick reference and further reading on these areas :)

    ReplyDelete