## Saturday, January 24, 2009

### Correlation, nocorrelation and extended stats - amendments

Riyaj Shamsudeen recently published a very interesting blog post about the 11g extended statistics feature and correlated column values.

Because I think that some of his findings were significantly influenced by the fact that he generated frequency histograms on all columns, I've repeated his test case with some more variations, in particular with and without histograms and in case of histograms, with and without the new feature "Bug 5483301: Cardinality of 1 when predicate value non-existent in frequency histogram" which was introduced in 10.2.0.4 and 11.1.0.6. More details about this change can be found in Metalink Doc ID 5483301.8.

The following results were obtained using 11.1.0.7 Enterprise Edition on Windows 32bit.

create table y1 (a number, b number, c number);

begin
for i in 1..1000 loop
for j in 1..10 loop
insert into y1 values (j,mod(j,5), mod(j,2) );
end loop;
end loop;
end;
/

commit;

REM Distribution of these column values given below.
select a, b, count(*) from y1 group by a,b order by a,b
/
A B COUNT(*)
---------- ---------- ----------
1 1 1000
2 2 1000
3 3 1000
4 4 1000
5 0 1000
6 1 1000
7 2 1000
8 3 1000
9 4 1000
10 0 1000

10 rows selected.

REM Let's also add an index to this table
create index y1_i1 on y1 (a, b);

REM The number of distinct keys is used
REM to determine selectivity if
REM a) an all-equal operation on the entire index is used
REM and obviously
REM b) NO histogram is present
select distinct_keys from user_indexes where index_name = 'Y1_I1';

DISTINCT_KEYS
-------------
10

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 254');
end;
/

REM this shows that although the index is in place
REM the DISTINCT_KEYS information is ignored
REM but the histogram is used instead
REM and we fall back to the default selectivity formula
REM see at the end of the script for a demonstration
REM without index which comes to the same result
alter session set tracefile_identifier = 'correlated1';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt shows:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
ColGroup (#1, Index) Y1_I1
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 200 Computed: 200.00 Non Adjusted: 200.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2396429
Resp_io: 12.00 Resp_cpu: 2396429
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 5.00 resc_cpu: 114847
ix_sel: 0.020000 ix_sel_with_filters: 0.020000
Cost: 5.05 Resp: 5.05 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I1
Cost: 5.05 Degree: 1 Resp: 5.05 Card: 200.00 Bytes: 0

In contrast to Riyah's interpretation I would say that it shows that the index information is not used to derive the selectivity ("Matches Full:"). Rather it looks like the presence of the frequency histogram causes a fall back to the default formula used for non-correlated columns: selectivity(A) * selectivity(B) which leads to the selectivity of 0.02 and hence cardinality of 200.

Now repeating the same without any histograms:

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 1');
end;
/

REM without histograms
REM the all-equal on entire index rule is used
REM hence the selectivity is 0.1 => 1/DISTINCT_KEYS of index
alter session set tracefile_identifier = 'correlated2';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I1
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 22.00 resc_cpu: 547522
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 22.22 Resp: 22.22 Degree: 1
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 1000.00 Bytes: 0

Now it can be seen that the index information actually gets used ("Matches Full: #1 Partial: Sel: 0.1000") and we come to the correct cardinality estimate.

Unfortunately this approach has some shortcomings, as can be seen when using out-of-range attributes:

REM unfortunately this approach
REM doesn't cope very well with "out-of-range" predicates
REM although the 10053 trace shows a detection of the "out-of-range"
REM attribute, the final index selectivity is left unchanged
REM => Room for improvement?
alter session set tracefile_identifier = 'correlated3';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

REM switching off the "new frequency histogram unpopular value"
REM bug fix doesn't change anything here obviously
REM as we don't have a histogram in place
alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'correlated4';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I1
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 22.00 resc_cpu: 547522
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 22.22 Resp: 22.22 Degree: 1
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 1000.00 Bytes: 0

Although there is a "out-of-range" detection, the final selectivity is left unchanged, which looks a bit odd, and the default formula that adjusts the selectivity based on the distance from the known low/high values doesn't apply either. Switching off above mentioned "bug fix" doesn't change this behaviour which seems to be consistent since we don't have a histogram in place.

Now again adding histograms and checking "out-of-range" predicates:

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 254');
end;
/

REM this corresponds to having histogram
REM but no index in place, since the histograms
REM take obviously precedence
REM not sure how we arrive at the cardinality of 50
REM it looks like the NewDensity is 0.5 / NUM_DISTINCT
REM and gets divided by 2 according to the formula
REM for out-of-range predicates (see case 14 below)
alter session set "_fix_control"='5483301:on';

alter session set tracefile_identifier = 'correlated5';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
ColGroup (#1, Index) Y1_I1
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Using prorated density: 0.050000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 50 Computed: 50.00 Non Adjusted: 50.00
Access Path: TableScan
Cost: 12.95 Resp: 12.95 Degree: 0
Cost_io: 12.00 Cost_cpu: 2368429
Resp_io: 12.00 Resp_cpu: 2368429
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Using prorated density: 0.050000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 2.00 resc_cpu: 34983
ix_sel: 0.005000 ix_sel_with_filters: 0.005000
Cost: 2.01 Resp: 2.01 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I1
Cost: 2.01 Degree: 1 Resp: 2.01 Card: 50.00 Bytes: 0

I'm not sure about how the optimizer arrives at the selectivity of 0.005 resp. cardinality of 50 in this particular case.

Switching off the new behaviour regarding non-existing values in frequency histograms outlined above shows that we get a cardinality of 1 in this case:

REM switch to the old behaviour
REM interestingly this influences the "NewDensity"
REM calculation and sets it effectively to the "OldDensity"
REM Now we get a cardinality of 1
REM when querying for a non-existent value
alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'correlated6';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
ColGroup (#1, Index) Y1_I1
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1 Computed: 0.05 Non Adjusted: 0.05
Access Path: TableScan
Cost: 12.94 Resp: 12.94 Degree: 0
Cost_io: 12.00 Cost_cpu: 2342474
Resp_io: 12.00 Resp_cpu: 2342474
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 2.00 resc_cpu: 15483
ix_sel: 0.000005 ix_sel_with_filters: 0.000005
Cost: 2.01 Resp: 2.01 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I1
Cost: 2.01 Degree: 1 Resp: 2.01 Card: 0.05 Bytes: 0

Now we are going to introduce the extended statistics available in 11g, and generate a frequency histogram on the new virtual column (which is unnecessary since we have uniform distribution).

REM Adding extended statistics to educate CBO about correlation between columns a and b.

SELECT
dbms_stats.create_extended_stats(
ownname=>user,
tabname => 'Y1',
extension => '(a, b)' ) AS a_b_correlation
FROM dual;

REM Extended stats depicted as above, adds a virtual column to the table with an internal function call.
REM Collecting stats on the table with histograms.

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 254');
end;
/

REM using the histogram on the newly
REM generated virtual column
REM gets us the correct cardinality
alter session set "_fix_control"='5483301:on';

alter session set tracefile_identifier = 'correlated7';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

REM no change with old "out-of-range" behaviour
alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'correlated8';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
Column (#4):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
ColGroup (#1, VC) SYS_STUNA\$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 22.00 resc_cpu: 547522
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 22.22 Resp: 22.22 Degree: 1
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 1000.00 Bytes: 0

The column group gets used and the correct cardinality is estimated.

The "bug fix" switch for the non-existent values in frequency histograms doesn't change the result - which is consistent since we're looking for a existent value.

But since we have a frequency histogram in place, once we start to query for non-existent values, the bug fix becomes relevant:

REM new rule introduced with 10.2.0.4
REM and 11.1.0.6, see bug 5483301 resp.
REM Instead of reporting a cardinality of 1
REM for a non-existent value in a frequency
REM histogram, the cardinality will be
REM 0.5 times the cardinality of the least
REM popular value in the histogram
REM => hence 1,000 * 0.5 in our case => 500

alter session set "_fix_control"='5483301:on';

alter session set tracefile_identifier = 'correlated9';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
Column (#4):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
ColGroup (#1, VC) SYS_STUNA\$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: Using prorated density: 0.049909 of col #4 as selectvity of out-of-range/non-existent value pred
#1 Partial: Sel: 0.0499
Using prorated density: 0.050000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 499 Computed: 499.09 Non Adjusted: 499.09
Access Path: TableScan
Cost: 12.95 Resp: 12.95 Degree: 0
Cost_io: 12.00 Cost_cpu: 2377409
Resp_io: 12.00 Resp_cpu: 2377409
ColGroup Usage:: PredCnt: 2 Matches Full: Using prorated density: 0.049909 of col #4 as selectvity of out-of-range/non-existent value pred
#1 Partial: Sel: 0.0499
ColGroup Usage:: PredCnt: 2 Matches Full: Using prorated density: 0.049909 of col #4 as selectvity of out-of-range/non-existent value pred
#1 Partial: Sel: 0.0499
Using prorated density: 0.050000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 12.00 resc_cpu: 281307
ix_sel: 0.049909 ix_sel_with_filters: 0.049909
Cost: 12.11 Resp: 12.11 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I1
Cost: 12.11 Degree: 1 Resp: 12.11 Card: 499.09 Bytes: 0

The new rule applies that returns us 0.5 times the cardinality of the least popular value in the histogram. Since we have uniform distribution, every popular value has a cardinality of 1,000, hence the adjusted cardinality according to the new formula is 500 (or lets say, close to 500).

Switching off the bug fix returns us to the old behaviour, which seems to be reasonable: If the value can not be found in the frequency histogram, a cardinality of 1 should be estimated:

REM switch to the old behaviour
REM interestingly this influences the "NewDensity"
REM calculation and sets it effectively to the "OldDensity"
REM Now we get a cardinality of 1
REM when querying for a non-existent value
alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'correlated10';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5
Column (#4):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
ColGroup (#1, VC) SYS_STUNA\$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0001
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
Access Path: TableScan
Cost: 12.94 Resp: 12.94 Degree: 0
Cost_io: 12.00 Cost_cpu: 2342474
Resp_io: 12.00 Resp_cpu: 2342474
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0001
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0001
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 2.00 resc_cpu: 15483
ix_sel: 0.000050 ix_sel_with_filters: 0.000050
Cost: 2.01 Resp: 2.01 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I1
Cost: 2.01 Degree: 1 Resp: 2.01 Card: 0.50 Bytes: 0

As we can see in both cases the column group is used but they get different selectivities applied, with the bug fix switched off we come to the cardinality of 1.

If we use the extended statistic without a histogram, we get the same behaviour as we got when we were using the composite index without a histogram:

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 1');
end;
/

REM extended statistics
REM without histogram
REM shows correct cardinality
REM since this is a uniform distribution
REM no histogram required
alter session set "_fix_control"='5483301:on';

alter session set tracefile_identifier = 'correlated11';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, VC) SYS_STUNA\$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 22.00 resc_cpu: 547522
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 22.22 Resp: 22.22 Degree: 1
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 1000.00 Bytes: 0

So the cardinality estimate is correct, but see what happens again when using "out-of-range" predicates:

REM But same behaviour as case 3 (index without histogram):
REM Unfortunately this approach
REM doesn't cope very well with "out-of-range" predicates
REM although the 10053 trace shows a detection of the "out-of-range"
REM attribute, the final index selectivity is left unchanged
REM => Room for improvement?
alter session set tracefile_identifier = 'correlated12';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, VC) SYS_STUNA\$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (AllEqRange)
Index: Y1_I1
resc_io: 22.00 resc_cpu: 547522
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 22.22 Resp: 22.22 Degree: 1
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 1000.00 Bytes: 0

Again the adjustment is without any consequence, similar to and consistent with the "index without histogram" case above.

Finally let's see what happens without index and without extended statistics:

begin dbms_stats.drop_extended_stats(
ownname=>user,
tabname => 'Y1',
extension => '(a, b)' );
end;
/

drop index y1_i1;

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 1');
end;
/

REM without extended stats
REM and without index
REM and without histogram
REM the standard formula for non-correlated data
REM applies, which gives us a cardinality of 200
alter session set tracefile_identifier = 'correlated13';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]

Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 200 Computed: 200.00 Non Adjusted: 200.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2396429
Resp_io: 12.00 Resp_cpu: 2396429
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 200.00 Bytes: 0

The standard formula for uncorrelated column values applies which gives us a (wrong) cardinality of 200.

If we use "out-of-range" predicates, the default formula applies:

REM out-of-range predicates correspond to the standard formula
REM outside range / high-low value, for column b this means:
REM (6 - 4) / (4 - 0) => 0.5,
REM adjust the cardinality by 50% => 200 - (200 * 0.5) => 100
alter session set tracefile_identifier = 'correlated14';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]

Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 100 Computed: 100.00 Non Adjusted: 100.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2394429
Resp_io: 12.00 Resp_cpu: 2394429
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 100.00 Bytes: 0

If we are outside the range of high value - low value, then we get a cardinality of 0 resp. 1:

REM We'll get zero if are at least outside the range by range
alter session set tracefile_identifier = 'correlated15';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=100 and b=100;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]

Using prorated density: 0.000050 of col #1 as selectvity of out-of-range/non-existent value pred
Using prorated density: 0.000050 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1 Computed: 0.00 Non Adjusted: 0.00
Access Path: TableScan
Cost: 12.94 Resp: 12.94 Degree: 0
Cost_io: 12.00 Cost_cpu: 2342474
Resp_io: 12.00 Resp_cpu: 2342474
Best:: AccessPath: TableScan
Cost: 12.94 Degree: 1 Resp: 12.94 Card: 0.00 Bytes: 0

Finally we'll explore what happens if we have a histogram in place:

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 254');
end;
/

REM finally generating histograms
REM but still without extended statistics
REM and without index
REM compare this to the case "index" + "histogram"
REM which was case 1
alter session set tracefile_identifier = 'correlated16';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5

Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 200 Computed: 200.00 Non Adjusted: 200.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2396429
Resp_io: 12.00 Resp_cpu: 2396429
Best:: AccessPath: TableScan
Cost: 12.96 Degree: 1 Resp: 12.96 Card: 200.00 Bytes: 0

Note that this case corresponds to the case "index with histogram" and shows that having the histogram in place effectively makes the index useless for selectivity/cardinality estimates. Same applies when querying for non-existent values:

REM corresponds to case 2
alter session set tracefile_identifier = 'correlated17';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

REM corresponds to case 15
alter session set tracefile_identifier = 'correlated18';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=100 and b=100;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.050000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.100000, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5

Using prorated density: 0.050000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 50 Computed: 50.00 Non Adjusted: 50.00
Access Path: TableScan
Cost: 12.95 Resp: 12.95 Degree: 0
Cost_io: 12.00 Cost_cpu: 2368429
Resp_io: 12.00 Resp_cpu: 2368429
Best:: AccessPath: TableScan
Cost: 12.95 Degree: 1 Resp: 12.95 Card: 50.00 Bytes: 0

Finally we can see that the switching off the "bug fix" for non-existent values in frequency histograms shows us a cardinality of 1 if we query for a non-existent value:

REM with old behaviour for non-existent values
REM in frequency histograms we get a cardinality of 1
alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'correlated19';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

alter session set "_fix_control"='5483301:on';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
Column (#1):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:10, NDV:10
Column (#2):
NewDensity:0.000050, OldDensity:0.000050 BktCnt:10000, PopBktCnt:10000, PopValCnt:5, NDV:5

Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1 Computed: 0.05 Non Adjusted: 0.05
Access Path: TableScan
Cost: 12.94 Resp: 12.94 Degree: 0
Cost_io: 12.00 Cost_cpu: 2342474
Resp_io: 12.00 Resp_cpu: 2342474
Best:: AccessPath: TableScan
Cost: 12.94 Degree: 1 Resp: 12.94 Card: 0.05 Bytes: 0

I'm now going to simulate what is going to happen if one decides to merge indexes, e.g. two indexes on a,b and a,b,c will render the one on a,b redundant, or if it is decided to add column c to the existing index since some queries might benefit from this additional column:

create index y1_i2 on y1 (a, b, c);

REM Distribution of these column values given below.
select a, b, c, count(*) from y1 group by a,b,c order by a,b,c
/
A B C COUNT(*)
---------- ---------- ---------- ----------
1 1 1 1000
2 2 0 1000
3 3 1 1000
4 4 0 1000
5 0 1 1000
6 1 0 1000
7 2 1 1000
8 3 0 1000
9 4 1 1000
10 0 0 1000

10 rows selected.

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 1');
end;
/

alter session set tracefile_identifier = 'correlated21';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I2
Col#: 1 2 3 CorStregth: 10.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
Access Path: index (index (FFS))
Index: Y1_I2
resc_io: 23.00 resc_cpu: 2056222
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 23.82 Resp: 23.82 Degree: 1
Cost_io: 23.00 Cost_cpu: 2056222
Resp_io: 23.00 Resp_cpu: 2056222
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Access Path: index (IndexOnly)
Index: Y1_I2
resc_io: 6.00 resc_cpu: 242729
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 6.10 Resp: 6.10 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I2
Cost: 6.10 Degree: 1 Resp: 6.10 Card: 1000.00 Bytes: 0

It's interesting to note that the optimizer using a higher "CorStregth" value of 10 and decides on the partial match ("Partial: #1 (1 2 )") to use the information from the index, although we don't use the entire index.

For the remaining test cases with out-of-range predicates we get consistent behaviour:

alter session set tracefile_identifier = 'correlated22';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=6 and b=6;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I2
Col#: 1 2 3 CorStregth: 10.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
Access Path: index (index (FFS))
Index: Y1_I2
resc_io: 23.00 resc_cpu: 2056222
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 23.82 Resp: 23.82 Degree: 1
Cost_io: 23.00 Cost_cpu: 2056222
Resp_io: 23.00 Resp_cpu: 2056222
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (IndexOnly)
Index: Y1_I2
resc_io: 6.00 resc_cpu: 242729
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 6.10 Resp: 6.10 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I2
Cost: 6.10 Degree: 1 Resp: 6.10 Card: 1000.00 Bytes: 0

And completely "out-of-range":

alter session set tracefile_identifier = 'correlated23';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=100 and b=100;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I2
Col#: 1 2 3 CorStregth: 10.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 1000 Computed: 1000.00 Non Adjusted: 1000.00
Access Path: TableScan
Cost: 12.96 Resp: 12.96 Degree: 0
Cost_io: 12.00 Cost_cpu: 2412429
Resp_io: 12.00 Resp_cpu: 2412429
Access Path: index (index (FFS))
Index: Y1_I2
resc_io: 23.00 resc_cpu: 2056222
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 23.82 Resp: 23.82 Degree: 1
Cost_io: 23.00 Cost_cpu: 2056222
Resp_io: 23.00 Resp_cpu: 2056222
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
ColGroup Usage:: PredCnt: 2 Matches Full: Partial: #1 (1 2 ) Sel: 0.1000
Using prorated density: 0.100000 of col #2 as selectvity of out-of-range/non-existent value pred
Access Path: index (IndexOnly)
Index: Y1_I2
resc_io: 6.00 resc_cpu: 242729
ix_sel: 0.100000 ix_sel_with_filters: 0.100000
Cost: 6.10 Resp: 6.10 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I2
Cost: 6.10 Degree: 1 Resp: 6.10 Card: 1000.00 Bytes: 0

Again the prorated density is left unchanged, and therefore the final selectivity/cardinality is still the same.

Now let's see what happens if the index on all three columns shows a different pattern:

truncate table y1;

begin
for i in 1..1000 loop
for j in 1..10 loop
insert into y1 values (j,mod(j,5), i);
end loop;
end loop;
end;
/

commit;

select count(*) from (
select distinct a, b, c from y1
);

COUNT(*)
----------
10000

begin
dbms_stats.delete_table_stats (
ownname =>null,
tabname=>'y1',
end;
/

begin
dbms_stats.gather_table_stats (
ownname =>null,
tabname=>'y1',
estimate_percent=>null,
method_opt =>'for all columns size 1');
end;
/

alter session set tracefile_identifier = 'correlated24';

alter session set events '10053 trace name context forever, level 1';

explain plan for select c from y1 where a=1 and b=1;

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for Y1[Y1]
ColGroup (#1, Index) Y1_I2
Col#: 1 2 3 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Table: Y1 Alias: Y1
Card: Original: 10000.000000 Rounded: 200 Computed: 200.00 Non Adjusted: 200.00
Access Path: TableScan
Cost: 14.97 Resp: 14.97 Degree: 0
Cost_io: 14.00 Cost_cpu: 2432036
Resp_io: 14.00 Resp_cpu: 2432036
Access Path: index (index (FFS))
Index: Y1_I2
resc_io: 25.00 resc_cpu: 2077586
ix_sel: 0.000000 ix_sel_with_filters: 1.000000
Access Path: index (FFS)
Cost: 25.83 Resp: 25.83 Degree: 1
Cost_io: 25.00 Cost_cpu: 2077586
Resp_io: 25.00 Resp_cpu: 2077586
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
ColGroup Usage:: PredCnt: 2 Matches Full: Partial:
Access Path: index (IndexOnly)
Index: Y1_I2
resc_io: 2.00 resc_cpu: 54443
ix_sel: 0.020000 ix_sel_with_filters: 0.020000
Cost: 2.02 Resp: 2.02 Degree: 1
Best:: AccessPath: IndexRange
Index: Y1_I2
Cost: 2.02 Degree: 1 Resp: 2.02 Card: 200.00 Bytes: 0

The "CorStregth" value reverts to 5 and the index information is ignored. The selectivity/cardinality corresponds to the default formula used for uncorrelated column values, so in this case we lose the additional information about the correlation and fall back to default statistics.

Finally the "ZODIAC" example which actually represents correlated and skewed data is influenced by the bug fix for the non-existent values in frequency histograms as well.

Repeating the test case with the fix enabled:

select min(cnt)
from (
select a.*,
endpoint_number - lag(endpoint_number, 1, 0) over (order by endpoint_number) as cnt
from user_tab_histograms a
where table_name = 'CALENDAR' and column_name like 'SYS%'
);

MIN(CNT)
----------
7

alter session set "_fix_control"='5483301:on';

alter session set tracefile_identifier = 'zodiac1';

alter session set events '10053 trace name context forever, level 1';

explain plan for select count(*) from calendar c where zodiac = 'TAURUS' and month = 'JAN';

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for CALENDAR[C]
Column (#3):
NewDensity:0.039726, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:12, NDV:12
Column (#2):
NewDensity:0.038356, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:12, NDV:12
Column (#4):
NewDensity:0.009589, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:24, NDV:24
ColGroup (#1, VC) SYS_STUWHPY_ZSVI_W3#C\$I3EUUYB4
Col#: 2 3 CorStregth: 6.00
ColGroup Usage:: PredCnt: 2 Matches Full: Using density: 0.009589 of col #4 as selectivity of unpopular value pred
#1 Partial: Sel: 0.0096
Using density: 0.038356 of col #2 as selectivity of unpopular value pred
Table: CALENDAR Alias: C
Card: Original: 365.000000 Rounded: 4 Computed: 3.50 Non Adjusted: 3.50
Access Path: TableScan
Cost: 3.04 Resp: 3.04 Degree: 0
Cost_io: 3.00 Cost_cpu: 102543
Resp_io: 3.00 Resp_cpu: 102543
Best:: AccessPath: TableScan
Cost: 3.04 Degree: 1 Resp: 3.04 Card: 3.50 Bytes: 0

We can see that we get an estimated cardinality of 3.5 (which is again consistent with the rule introduced by the bug fix: 0.5 times the cardinality of the least popular value in the frequency histogram, the query returns 7 => 7 * 0.5 => 3.5) due to a "unpopular" value, but if you think about it, there no such thing like a "unpopular" value in a frequency histogram, basically the value should be treated as "non-existent", which is prevented by the bug fix.

Enabling previous behaviour shows the following:

alter session set "_fix_control"='5483301:off';

alter session set tracefile_identifier = 'zodiac2';

alter session set events '10053 trace name context forever, level 1';

explain plan for select count(*) from calendar c where zodiac = 'TAURUS' and month = 'JAN';

alter session set events '10053 trace name context off';

The corresponding 10053 trace file excerpt:

SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for CALENDAR[C]
Column (#3):
NewDensity:0.001370, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:12, NDV:12
Column (#2):
NewDensity:0.001370, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:12, NDV:12
Column (#4):
NewDensity:0.001370, OldDensity:0.001370 BktCnt:365, PopBktCnt:365, PopValCnt:24, NDV:24
ColGroup (#1, VC) SYS_STUWHPY_ZSVI_W3#C\$I3EUUYB4
Col#: 2 3 CorStregth: 6.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0014
Table: CALENDAR Alias: C
Card: Original: 365.000000 Rounded: 1 Computed: 0.50 Non Adjusted: 0.50
Access Path: TableScan
Cost: 3.04 Resp: 3.04 Degree: 0
Cost_io: 3.00 Cost_cpu: 101868
Resp_io: 3.00 Resp_cpu: 101868
Best:: AccessPath: TableScan
Cost: 3.04 Degree: 1 Resp: 3.04 Card: 0.50 Bytes: 0

In this case we get the expected output of 0 resp. 1 cardinality.

So in summary this is again a very interesting test case that shows that generating histograms can have a significant impact on the estimates of the optimizer - even in the particular case where the column or expression was distributed uniformly.