Wednesday, July 27, 2011

Cost Is Time: Next Generation

It looks like Oracle has introduced with the Oracle 11.2.0.2 patch set a new "cost is time" model for the time estimate of the Cost-Based Optimizer (CBO).

In order to understand the implications let me summarize the evolution of the CBO in terms of cost / time estimate so far:

1. Oracle 7 and 8

The cost estimate generated by the Cost-Based Optimizer (CBO) has always been a time estimate, although expressed in a slightly obscure unit, which is number of single block reads.

The traditional I/O based costing introduced with Oracle 7 in principle counted the number of required single and multi-block reads to arrive at the final cost. A potential drawback of this approach was the missing differentiation of multi- and single-block reads - one multi-block read created the same cost as a one single-block read. Although the model used an "adjusted" multi-block read count to make full table scans more costlier than indicated by larger "db_file_multiblock_read_count" settings (and hence accounted for smaller extents and blocks already cached in the buffer cache making multi-block reads smaller than requested) it still potentially favoured full table scans over index access paths.

From Oracle 8 on one could play with the OPTIMIZER_INDEX_COST_ADJ / OPTIMIZER_INDEX_CACHING parameter to adjust this shortcoming of the costing model in particular for OLTP biased applications.

2. Oracle 9i

Oracle introduced with Oracle 9i the System Statistics along with a more sophisticated cost calculation model.

In short, System Statistics offer the following features:

- Different treatment of single-block and multi-block operations
- Time-based optimization using average timings for single- and multi-block reads
- Cost calculation includes a CPU cost component
- Gather actual hardware capabilities to base the calculations on actual system capabilities and workload pattern

More details can be found for example in my "Understanding System Statistics" blog series.

So with System Statistics the CBO actually calculates an estimated execution time - you can see this in the EXPLAIN PLAN output: With System Statistics enabled it includes a TIME column.

Simplified spoken the time estimate is simply the average time for a single block read (SREADTIM) times the number of single block reads plus the average time for a multi-block read (MREADTIM) times the number of multi-block reads plus the estimated number of cpu operations divided by the cpu operations per second (CPUSPEED / CPUSPEEDNW). So the cost with System Statistics is actually based on a time estimation.

For consistency reasons it has been decided to use the same unit as before, so the estimated time is simply divided by the SREADTIM to arrive at the same cost unit as with traditional I/O based costing which is number of single-block reads (although plans involving full segment scan operations usually arrive at different costs than the traditional costing, so consistency is hardly given anyway).

Right from the beginning in Oracle 9i the System Statistics could be gathered in WORKLOAD mode, which means that Oracle takes two snapshots of certain performance statistics and calculates the System Statistics parameters like SREADTIM, MREADTIM, MBRC etc. from the delta values.

3. Oracle 10g

Starting with Oracle 10g System Statistics were enabled by default with so called default NOWORKLOAD settings. It even allowed to generate an artificial load on the system by gathering the NOWORKLOAD System Statistics simply using a predefined I/O pattern to gather the disk transfer speed (IOTFRSPEED) and disk seek time (IOSEEKTIM) - these values are then used to derive the SREADTIM and MREADTIM values - the two most important ingredients of the enhanced cost/time calculation.

So since Oracle 9i there is a built-in functionality to measure the capabilities of the underlying hardware - from 10g on either based on a particular workload pattern or by submitting an artificial predefined load.

Furthermore Oracle provides a well-defined API as part of the DBMS_STATS package for dealing with System Statistics: They can be gathered, deleted, exported, imported, manually defined and even gathered directly into a separate statistics table to build a history of System Statistics gathered.

4. Oracle 11g

In 11g Oracle introduced the I/O calibration routine as part of the Resource Manager. Note that so far this feature didn't have a direct relationship to the Cost Based Optimizer - it could be used however to control the maximum parallel degree using the PARALLEL_IO_CAP_ENABLED parameter.

The first thing that puzzled me when dealing with that new functionality was the lack of a well-defined API to maintain the gathered information. There is a single call in the resource manager package (DBMS_RESOURCE_MANAGER.CALIBRATE_IO) to run the I/O calibration, but apart from that no additional functionality for maintenance. No way to delete the calibration results, export or import them, or even manually override.

If you want to get an understanding what this means, have a look at the MOS document "Automatic Degree of Parallelism in 11.2.0.2 [ID 1269321.1]" that, besides stating that there can be problems with the actual I/O calibration like gathering unreasonable values or not running to completion, shows you how to manipulate an internal SYS table to override the values gathered which also requires to bounce the instance in order to become effective.

I find it hard to understand why Oracle hasn't address these handling shortcomings in the meantime, particularly given the fact that with Oracle 11.2.0.2 the I/O resource calibration becomes mandatory if you want to make use of the new Auto-DOP feature that has been introduced with 11.2.0.1. Fiddling with a SYS-owned table doesn't sound like a well-designed feature to me, and the calibration functionality is not exactly "brand-new".

5. Oracle 11.2.0.2

So starting with 11.2.0.2 the new "cost is time" calculation comes into the picture. If you have values in the corresponding SYS.RESOURCE_IO_CALIBRATE$ table (that is simply externalized by the DBA_RSRC_IO_CALIBRATE view) then something really odd happens:

The cost that has been calculated according to the System Statistics model - which is already a time estimate based on three different components - time for single-block and multi-block reads as well as the estimated CPU time is now converted into a data volume by simply multiplying the resulting cost with the default block size. Dividing this data volume by the throughput as indicated by the I/O calibration results (it looks like the value MAX_PMBPS is relevant) arrives at a new estimated execution time.

Let's have a look at a working example: With default NOWORKLOAD System Statistics, 8KB default block size and a unset db_file_multiblock_read_count that results in a MultiBlockReadCount (MBRC) of 8 to be used internally for calculation of a full table scan (FTS) the time estimate for a FTS of 10,000 blocks (80MB) will be based on 1,250 multi-block reads, which are estimated to take 26ms each - this gives us a time estimate of 32.5 seconds. The CPU time associated with that full table scan operation will be added on top so that the final result will be something between 32.5 and 33 seconds. Let's stick to the 32.5 seconds - this time estimate corresponds to approx. 2,710 single-block reads by simply dividing the time by 12ms which happens to be the SREADTIM value for default NOWORKLOAD System Statistics with above configuration - this value will be close to the cost shown (minor variations are depending on the CPU speed determined).

Cost / time estimate for a FTS of a 10,000 block segment with 8KB block size, default NOWORKLOAD System Statistics and default MBRC of 8 used for cost calculation (_db_file_optimizer_read_count = 8):

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2716 (1)| 00:00:33 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 10000 | 2716 (1)| 00:00:33 |
-------------------------------------------------------------------


Now if you happen to have a value of MAX_PMBPS of 4MB/sec as I/O Resource Calibration result (I chose this very conservative value deliberately because it happens to be the same transfer rate that the default NOWORKLOAD System Statistics assumes (4096 bytes per millisec), the following new time calculation will happen instead:

2,710 will be multiplied with the 8KB default block size to arrive at a data volume, in this case approx. 21 MB

This approx. 21 MB is now divided by the 4MB/sec, to arrive at a new time estimate of approx. 5.3 seconds, rounded up to 6 seconds. Note that the original time estimate was 32.5 seconds.

Cost / time estimate for a FTS of a 10,000 block segment with 8KB block size, default NOWORKLOAD System Statistics and default MBRC of 8 used for cost calculation (_db_file_optimizer_read_count = 8) but MAX_PMBPS set to 4MB/sec:

-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2716 (1)| 00:00:06 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| T | 10000 | 2716 (1)| 00:00:06 |
-------------------------------------------------------------------


You can see this also happening in the 10053 CBO trace file:


kkeCostToTime: using io calibrate stats
maxmbps=0(MB/s) maxpmbps=4(MB/s)
block_size=8192 mb_io_count=1 mb_io_size=8192 (bytes)
tot_io_size=21(MB) time=5304(ms)


Now this approach strikes me as odd for several reasons:

- A sophisticated time estimate (remember that it even includes a CPU time component that has nothing to do with an I/O volume) is turned into data volume to arrive at a new time estimate using a rather simplistic approach

- As you can see from above example the "data volume" calculated does not correspond to the actual I/O volume that we know from the System Statistics cost/time calculation - remember that the actual segment size in this case was 80MB, not 20MB. This is of course caused by the underlying calculation of the original time estimate based on multi-block reads. So why we would turn the cost/time into some data volume that has nothing to do with the actual data volume used for the original cost/time calculation is above me

- There is already an I/O calibration routine available as part of the System Statistics functionality that can be used to arrive at more realistic time estimates based on the gathered System Statistics information - why a second one has been introduced? Furthermore this raises the question: If I'm required to run the I/O calibration to enable Auto-DOP - shouldn't I then also "calibrate" my System Statistics to arrive at a "calibrated" cost estimate? After all the new "Cost Is Time" approach uses the cost estimate for the new time estimate.

- As already outlined there is no officially documented way to properly deal with the I/O calibration results - manually poking into SYS-owned tables doesn't really count

Implications

So you probably think, why bother? The cost estimate is left untouched, only the TIME column is affected. So execution plans shouldn't change since they are still chosen based on the lowest cost estimate - and the lower the cost, the lower the new time estimate.

You'll appreciate however that the previous simple correlation between the cost and the time estimate is no longer true with 11.2.0.2 and resource calibration results available: So far you could simply divide the time estimate by the SREADTIM value to arrive at the cost, or the other way around, you could multiply the cost by the SREADTIM value to arrive at the time estimate - or use both values to arrive at the SREADTIM value - since the time divided by the cost should give you the approximate value of SREADTIM.

The point with 11.2.0.2 and the I/O resource calibration is that the new time estimate is obviously used for the Auto-DOP feature to drive two crucial decisions:

- Is the statement a candidate for parallel execution? This is controlled via the parameter PARALLEL_MIN_TIME_THRESHOLD that defaults to 10 seconds in 11.2.0.2

- If it is a candidate for parallel execution what is the optimal DOP? This is of course depending on a lot of different inputs but also seems to be based on the new time estimate - that, as just explained, arrives at a (wrong) data volume estimate in a questionable way

As a side note, Oracle at present recommends to set the value of MAX_PMBPS to 200 for Exadata environments rather than relying on the results of the actual I/O calibration - another indication that the I/O calibration results as of now are questionable.

Summary

Oracle introduced with the 11.2.0.2 patch set a new model for the estimated "Execution Time" if I/O resource calibration results are available. As outlined above the new approach seems to be questionable (at least), but will be used for crucial decisions regarding the new Auto-DOP feature. It will be interesting to see the further development in this area, whether for example the new time algorithm will be changed in upcoming releases or the influence of the I/O calibration on the CBO calculations will be extended.

If you want to make use of the new Auto-DOP feature in 11.2.0.2 you should be aware of these relationship - the MAX_PMBPS parameter drives the new time estimation and the Auto-DOP calculations.

2 comments:

Charles Hooper said...

Randolf,

Very interesting findings in your article.

I wonder if the time calculation change was due to the implementation of a work-around for the bug that causes the SREADTIM and MREADTIM statistics to be incorrectly calculated on 11.2.0.1 and 11.2.0.2? The bug is explained in the following two links:
http://antognini.ch/2010/11/workload-system-statistics-in-11g/
http://orastory.wordpress.com/2011/03/22/high-system-statistics-11gr2/

Randolf said...

Hi Charles,

I don't think that this change was driven by this bug, that can be fixed by installing patch 9842771, by the way.

The majority of systems is very likely using the default NOWORKLOAD System Statistics introduced with 10.2 - and are therefore unaffected by the mentioned bug.

I personally find it a bit odd that we now have two calibration routines, but both seem to immature:

System Statistics gathering still seems to produce sometimes results that are not really helpful in arriving at better execution plans - and I don't refer to this specific bug.

I/O Resource Calibration also seems to be subject to questionable results.

Why not have a single routine that produces reliable results?

Randolf