Tuesday, April 10, 2012

Column Groups - Edge Cases

Oracle 11g added Extended Statistics support for column groups in order to detect correlated columns in filter predicates that use equality comparisons.
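Such a column group can be created explicitly via DBMS_STATS.CREATE_EXTENDED_STATS. Here is a minimal sketch, using the table and column names of the test case below:

-- create a column group on (ATTR1, ATTR2); returns the generated extension name
select dbms_stats.create_extended_stats(null, 't', '(attr1, attr2)') from dual;

exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

-- the generated extension (virtual column) can be checked like this
select extension_name, extension from user_stat_extensions where table_name = 'T';

The test case below deliberately uses the second variant instead: a matching composite index.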

Note that Oracle 11g also added the ability to use the number of distinct keys of a composite index as an upper limit for the cardinality estimates of matching column predicates, which means that the optimizer is now capable of detecting correlated columns without the explicit addition of Extended Statistics / Column Groups.

The "composite index" feature was also backported to 10.2.0.4 and 10.2.0.5 but needs to be activated explicitly via a FIX_CONTROL. You can read more about that on Jonathan Lewis' blog.

The "Column Group" feature (either by explicit Extended Statistics or via a matching composite index) however has a weakness in a very special case: If there is only a single distinct value in the statistics then the "out-of-range" detection of the optimizer is not working correctly.

This can be easily demonstrated with a very simple test case:


set echo on linesize 200 pagesize 999 tab off trimspool on trimout on

drop table t;

purge table t;

create table t
as
select
  rownum as id
, mod(rownum, 1) + 1 as attr1
, mod(rownum, 1) + 1 as attr2
, rpad('x', 100) as filler
from
  dual
connect by
  level <= 1000000
;

select
  attr1
, attr2
, count(*) as cnt
from
  t
group by
  attr1
, attr2
;

exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

create index t_idx on t (attr1, attr2);

explain plan for
select * from t where attr1 = 1 and attr2 = 1;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 0 and attr2 = 0;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 2 and attr2 = 2;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = 100000 and attr2 = 100000;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));

explain plan for
select * from t where attr1 = -100000 and attr2 = 100000;

select * from table(dbms_xplan.display(null, null, 'BASIC PREDICATE ROWS'));
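As an optional addition to the script - not part of the output shown below - you could verify that the composite index indeed records only a single distinct key, which is what the optimizer uses as the upper limit:

-- optional check: DISTINCT_KEYS of the composite index
select index_name, distinct_keys, num_rows
from user_indexes
where index_name = 'T_IDX';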


The relevant output is:


SQL> select
  2    attr1
  3  , attr2
  4  , count(*) as cnt
  5  from
  6    t
  7  group by
  8    attr1
  9  , attr2
 10  ;

     ATTR1      ATTR2        CNT
---------- ---------- ----------
         1          1    1000000

select * from t where attr1 = 1 and attr2 = 1;

------------------------------------------
| Id  | Operation         | Name | Rows  |
------------------------------------------
|   0 | SELECT STATEMENT  |      | 1000K |
|*  1 |  TABLE ACCESS FULL| T    | 1000K |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=1 AND "ATTR2"=1)

select * from t where attr1 = 0 and attr2 = 0;

------------------------------------------
| Id  | Operation         | Name | Rows  |
------------------------------------------
|   0 | SELECT STATEMENT  |      | 1000K |
|*  1 |  TABLE ACCESS FULL| T    | 1000K |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=0 AND "ATTR2"=0)

select * from t where attr1 = 2 and attr2 = 2;

------------------------------------------
| Id  | Operation         | Name | Rows  |
------------------------------------------
|   0 | SELECT STATEMENT  |      | 1000K |
|*  1 |  TABLE ACCESS FULL| T    | 1000K |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=2 AND "ATTR2"=2)

select * from t where attr1 = 100000 and attr2 = 100000;

------------------------------------------
| Id  | Operation         | Name | Rows  |
------------------------------------------
|   0 | SELECT STATEMENT  |      | 1000K |
|*  1 |  TABLE ACCESS FULL| T    | 1000K |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=100000 AND "ATTR2"=100000)

select * from t where attr1 = -100000 and attr2 = 100000;

------------------------------------------
| Id  | Operation         | Name | Rows  |
------------------------------------------
|   0 | SELECT STATEMENT  |      | 1000K |
|*  1 |  TABLE ACCESS FULL| T    | 1000K |
------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ATTR1"=(-100000) AND "ATTR2"=100000)


As you can see, the optimizer still estimates that "all" rows will be returned, even in obvious cases where no rows can match the predicates.

In 11g the problem does not show up if there is more than a single distinct value; the 10g backport, however, seems to produce similarly bad estimates even with more than a single distinct value.

One possible workaround is to generate a histogram on the column group in 11g. This way the out-of-range detection works again as expected for the single distinct value case.
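In 11g the column group and the histogram on it can be generated in a single call, since a column group expression in METHOD_OPT implicitly creates the extension. A minimal sketch based on the test case above:

-- create the column group (if not yet existing) and gather a histogram on it
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns (attr1, attr2) size 254')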

It's also possible to generate a histogram on one or more of the affected columns individually to avoid the problem (a sketch follows below), but this "breaks" the column group / correlation detection and hence might lead to worse cardinality estimates in the case of more than a single distinct value.
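A sketch of that variant, generating a histogram on just one of the affected columns:

-- histogram on a single affected column only - this "breaks" the column group
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns attr1 size 254')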

Note that when generating histograms you need to check carefully whether you run into the problems described here: if you end up with a histogram containing only a single bucket, but the underlying basic column statistics show more than a single distinct value, then in principle the same issue will unfortunately show up again, as demonstrated in the other blog post.
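A simple way to check for that inconsistency is to compare the number of distinct values to the number of histogram buckets per column, for example:

-- columns where a single-bucket histogram contradicts a NDV greater than one
select column_name, num_distinct, num_buckets, histogram
from user_tab_col_statistics
where table_name = 'T'
and histogram != 'NONE'
and num_buckets = 1
and num_distinct > 1;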

Note that this inconsistency between basic column statistics (Number Of Distinct Values, NDV) and the corresponding histogram is a consequence of how 11g gathers statistics when using the AUTO_SAMPLE_SIZE default.

Maria Colgan has just published a note about that.

2 comments:

  1. Randolf,

    that's a very interesting observation. Playing a little bit with your example, I see that NULL values also seem to have a strange effect on the cardinality estimates:

    -- after using your create script

    -- case 1: a single NULL value for attr1
    update t set attr1 = null where id = 1;
    -- gather statistics again
    exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

    explain plan for
    select * from t where attr1 = 1 and attr2 = 1;

    ------------------------------------------
    | Id  | Operation         | Name | Rows  |
    ------------------------------------------
    |   0 | SELECT STATEMENT  |      |  500K |
    |*  1 |  TABLE ACCESS FULL| T    |  500K |
    ------------------------------------------


    -- case 2: 100,000 NULL values for attr1
    update t set attr1 = null where id <= 100000;
    -- gather statistics again
    exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

    explain plan for
    select * from t where attr1 = 1 and attr2 = 1;

    ------------------------------------------
    | Id  | Operation         | Name | Rows  |
    ------------------------------------------
    |   0 | SELECT STATEMENT  |      |  450K |
    |*  1 |  TABLE ACCESS FULL| T    |  450K |
    ------------------------------------------

    -- case 3: 100,000 NULL values for attr1 and attr2
    update t set attr1 = null, attr2 = null where id <= 100000;
    -- gather statistics again
    exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1')

    explain plan for
    select * from t where attr1 = 1 and attr2 = 1;

    ------------------------------------------
    | Id  | Operation         | Name | Rows  |
    ------------------------------------------
    |   0 | SELECT STATEMENT  |      |  900K |
    |*  1 |  TABLE ACCESS FULL| T    |  900K |
    ------------------------------------------

    So there is clearly a pattern - but I don't see the purpose ...

    Martin

  2. Hi Martin,

    thanks for posting your experiments. I didn't have time yet to look into this, but certainly something to investigate further.

    Thanks,
    Randolf
