## Monday, November 14, 2011

### Star Transformation And Cardinality Estimates

If you want to make use of Oracle's cunning Star Transformation feature then you need to be aware of the fact that the star transformation logic - as the name implies - assumes that you are using a proper star schema.

Here is a nice example of what can happen if you attempt to use star transformation but your model obviously doesn't really correspond to what Oracle expects:

drop table d;

purge table d;

drop table t;

purge table t;

create table t
as
select
rownum as id
, mod(rownum, 100) + 1 as fk1
, 1000 + mod(rownum, 10) + 1 as fk2
, 2000 + mod(rownum, 100) + 1 as fk3
, rpad('x', 100) as filler
from
dual
connect by
level <= 1000000
;

exec dbms_stats.gather_table_stats(null, 't')

create bitmap index t_fk1 on t (fk1);

create bitmap index t_fk2 on t (fk2);

create bitmap index t_fk3 on t (fk3);

create table d
as
select
rownum as id
, case when rownum between 1 and 100 then 'Y' else 'N' end as is_flag_d1
, case when rownum between 1001 and 1010 then 'Y' else 'N' end as is_flag_d2
, case when rownum between 2001 and 2100 then 'Y' else 'N' end as is_flag_d3
, rpad('x', 100) as vc1
from
dual
connect by
level <= 10000
;

exec dbms_stats.gather_table_stats(null, 'd', method_opt => 'FOR ALL COLUMNS SIZE 1 FOR COLUMNS SIZE 254 IS_FLAG_D1, IS_FLAG_D2, IS_FLAG_D3');

This is a simplified example of a model where multiple, potentially small, dimensions are stored in a single physical table and the separate dimensions are represented by views that filter the corresponding dimension data from the base table.

So we have a fact table with one million rows and a "collection" dimension table that holds three dimensions, among others.

In order to enable the star transformation bitmap indexes on the foreign keys of the fact table are created.

The dimension table has histograms on the flag columns to tell the optimizer about the non-uniform distribution of the column data.

Now imagine a query where we query the fact table (and possibly do some filtering on the fact table by other means like other dimensions or direct filtering on the fact table) but need to join these three dimensions just for displaying purpose - the dimensions itself are not filtered so the join will not filter out any data.

Let's first have a look at an execution plan of such a simply query with star transformation disabled:

select /*+ no_star_transformation */
count(*)
from
t f
, (select * from d where is_flag_d1 = 'Y') d1
, (select * from d where is_flag_d2 = 'Y') d2
, (select * from d where is_flag_d3 = 'Y') d3
where
f.fk1 = d1.id
and f.fk2 = d2.id
and f.fk3 = d3.id
;

SQL> explain plan for
2 select /*+ no_star_transformation */
3 count(*)
4 from
5 t f
6 , (select * from d where is_flag_d1 = 'Y') d1
7 , (select * from d where is_flag_d2 = 'Y') d2
8 , (select * from d where is_flag_d3 = 'Y') d3
9 where
10 f.fk1 = d1.id
11 and f.fk2 = d2.id
12 and f.fk3 = d3.id
13 ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display(format => 'BASIC +ROWS +PREDICATE'));
Plan hash value: 77569906

----------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | HASH JOIN | | 940K|
|* 3 | TABLE ACCESS FULL | D | 100 |
|* 4 | HASH JOIN | | 945K|
|* 5 | TABLE ACCESS FULL | D | 100 |
|* 6 | HASH JOIN | | 950K|
|* 7 | TABLE ACCESS FULL| D | 10 |
| 8 | TABLE ACCESS FULL| T | 1000K|
----------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("F"."FK3"="D"."ID")
3 - filter("IS_FLAG_D3"='Y')
4 - access("F"."FK1"="D"."ID")
5 - filter("IS_FLAG_D1"='Y')
6 - access("F"."FK2"="D"."ID")
7 - filter("IS_FLAG_D2"='Y')

So clearly the optimizer got it quite right - the join to the dimensions is not going to filter out significantly - the slight reduction in rows comes from the calculations based on the histograms generated.

But now try the same again with star transformation enabled:

SQL> explain plan for
2 select /*+ star_transformation opt_param('star_transformation_enabled', 'temp_disable') */
3 count(*)
4 from
5 t f
6 , (select * from d where is_flag_d1 = 'Y') d1
7 , (select * from d where is_flag_d2 = 'Y') d2
8 , (select * from d where is_flag_d3 = 'Y') d3
9 where
10 f.fk1 = d1.id
11 and f.fk2 = d2.id
12 and f.fk3 = d3.id
13 ;

Explained.

SQL>
SQL> select * from table(dbms_xplan.display(format => 'BASIC +ROWS +PREDICATE'));
Plan hash value: 459231705

----------------------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 |
| 1 | SORT AGGREGATE | | 1 |
|* 2 | HASH JOIN | | 9 |
|* 3 | HASH JOIN | | 9 |
|* 4 | HASH JOIN | | 10 |
|* 5 | TABLE ACCESS FULL | D | 10 |
| 6 | TABLE ACCESS BY INDEX ROWID | T | 10 |
| 7 | BITMAP CONVERSION TO ROWIDS| | |
| 8 | BITMAP AND | | |
| 9 | BITMAP MERGE | | |
| 10 | BITMAP KEY ITERATION | | |
|* 11 | TABLE ACCESS FULL | D | 100 |
|* 12 | BITMAP INDEX RANGE SCAN| T_FK1 | |
| 13 | BITMAP MERGE | | |
| 14 | BITMAP KEY ITERATION | | |
|* 15 | TABLE ACCESS FULL | D | 100 |
|* 16 | BITMAP INDEX RANGE SCAN| T_FK3 | |
| 17 | BITMAP MERGE | | |
| 18 | BITMAP KEY ITERATION | | |
|* 19 | TABLE ACCESS FULL | D | 10 |
|* 20 | BITMAP INDEX RANGE SCAN| T_FK2 | |
|* 21 | TABLE ACCESS FULL | D | 100 |
|* 22 | TABLE ACCESS FULL | D | 100 |
----------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("F"."FK3"="D"."ID")
3 - access("F"."FK1"="D"."ID")
4 - access("F"."FK2"="D"."ID")
5 - filter("IS_FLAG_D2"='Y')
11 - filter("IS_FLAG_D1"='Y')
12 - access("F"."FK1"="D"."ID")
15 - filter("IS_FLAG_D3"='Y')
16 - access("F"."FK3"="D"."ID")
19 - filter("IS_FLAG_D2"='Y')
20 - access("F"."FK2"="D"."ID")
21 - filter("IS_FLAG_D1"='Y')
22 - filter("IS_FLAG_D3"='Y')

What an astonishing result: Not only Oracle will try now to access all rows of the fact table by single-block random I/O, which by itself can be a disaster for larger real-life fact tables, in particular when dealing with Exadata features like Smart Scans which are only possible with multi-block direct-path reads, but furthermore if this was part of a more complex execution plan look at the cardinality estimates: They are off by five orders of magnitude - very likely a receipt for disaster for any step following afterwards.

The point here is simple: The Star Transformation calculation model obviously doesn't cope with the "collection" of dimensions in a single table very well, but assumes a dimensional model where each dimension is stored in separate table(s). If you don't adhere to that model the calculation will be badly wrong and the results possibly disastrous.

Here the Star Transformation assumes a filtering on dimension tables that are effectively no filter but this is something the current calculation model is not aware of. If you put the three dimensions in separate tables no "artificial" filter is required and hence the calculation won't be mislead.

Of course one could argue that the star transformation optimization seems to do a poor job since the normal optimization based on the same input data produces a much better estimate, but at least for the time being that's the way this transformation works and the model chosen better reflects this.

goran said...

Hi Randolf,

is there any particular reason you collected statistics for dimension like:
FOR COLUMNS SIZE 254 IS_FLAG_D1, IS_FLAG_D2, IS_FLAG_D3'

thanks,
goran

Randolf said...

Hi Goran,

it is explained in the post itself:

> The dimension table has histograms on the flag columns to tell the optimizer about the non-uniform distribution of the column data

The point here is that the column value distribution of the IS_FLAG... columns is highly skewed (Y and Ns are not uniformly distributed, there are more Ns than Ys), and therefore the optimizer should know about this.

Randolf

goran said...

Hi Randolf,

thanks for answer ... right, the column 'is_flag' is highly skewed so my first question wasn't formulated precise enough ... in this test case you know in advance that for this particular column there are two values and hence usage of 'FOR COLUMNS SIZE 254' is fine ... I wonder, do you prefer to use this approach rather than e.g. 'for columns size skewonly' or 'for columns size auto'?

thanks,
goran

Randolf said...

Hi Goran,

there is no simple answer to that question.

At least what I can say is: Do not use SIZE SKEWONLY, because you'll end up with far too many histograms, and some of these histograms will even be a threat, which means that the optimizer's estimates can be influenced negatively.

So SIZE AUTO is not too bad, but even with SIZE AUTO you can end up with histograms that are not helpful.

Ideally you would know your application and queries, and generate only histograms where you know that it is necessary and beneficial.

Depending on your data this might also include crafting histograms manually using DBMS_STATS.PREPARE_COLUMN_VALUES / SET_COLUMN_STATS, because of the inherent limits how Oracle generates histograms. This is particularly true for columns with many distinct values that are skewed.

Randolf

goran said...

Hi Randolf,

thanks for answer ... I liked your webcast on CBO you gave this week.
Any plans to give a little bit 'advanced' version of presentation?

Regarding star transformation, on some DWH projects I saw following star schema design:
time dimension had primary key in form of date transformed into number i.e. 11.11.2011 presented like number 20111111.
fact table have date column, is partitioned on date column (range partitions) and contains not real dates but time dimension primary key i.e. numbers.
This design has at some point lead to performance problems (~15% of all queries) as for some queries CBO was not able to do partition prunning properly whereas other problematic group hasn't been executed with 'star transformation' leading to expensive cartesian joins.

Independent of this, in majority of star schema designs I saw date column in fact table was defined as 'date' datatype ... so I am wondering in your experience just generally what could be the reasons and advantages using number instead of date datatype for fact table date column?

regards,
goran

Randolf said...

Hi Goran,

I would put this quite simply: Storing a DATE as a non-date in the database is a severe design error, full stop.

There are various reasons, like you usually save one byte when storing it as number etc., but all of them are not really valid ones given the side effects of such decisions, in particular to the optimizer.

To my surprise I come across such issues, in particular the DATE issue, quite frequently. It is quite often that those people that make decisions like that don't have a clue about the impact, which is a pity.

And yes, there are plans for more Webinars on the Cost-Based Optimizer, so stay tuned.

Randolf

Larry Dooley said...

Interesting comment on the date. Kimball, I believe is pretty clear on using a synthetic key, a number not the date. I haven't quite made up my mind on this. The one clear thing I see for the date dimension using a synthetic key is when a date column in the fact table is null. I don't want outer joins. So what is your take on this?

Randolf said...

Hi Larry,

my comments above were specifically about storing DATEs as NUMBERs, but then performing date related filtering and arithmetic on those NUMBER columns, e.g. something like:

col > 20110512

to express you only want to see data later than May 12th 2011.

This is a different question than whether to use synthetic keys or not. If it is a truly synthetic key you won't perform comparisons like the above mentioned.

It is this kind of data type misuse that leads to problems with the estimates of the optimizer.

Randolf

Martin Shaw said...

Hi Randolf, Just found your column - interesting stuff. The argument on dates/numbers is easily solved. My DW mentor always told me to ensure the fact table date column values have no business value - it prevents users accessing and filtering fact tables directly as opposed through the supporting dimension tables. I have tended to use a Julian date as the key (violating the strictly no business value), but viewing a Julian date, a user is probably not going to use it to directly filter the fact table. However, it is available to the designers and developers for quick references when fixing or changing the application, or unit testing code.

Larry mentioned about nulls in the date columns, no problems as the bitmaps accommodate these quite happily. However, I presume that Larry was actually meaning a date column in the fact table that is not actually a FK column to a dimension - otherwise surely there is no problem. And if that is the case, why would a designer ever allow a design to pass a fit for purpose test where a user can actually filter on a fact table directly? Surely that will break just about every major optimization strategy Oracle has evolved over the past thirty-some years?