Thursday, January 31, 2013

Exadata Smart Scan Projection Limitation

Here is an interesting limitation to Exadata Smart Scans - if more than 254 columns from a table (not HCC compressed, more on that in a moment) need to be projected, Smart Scans for that particular segment will be disabled and Exadata will fall back to conventional I/O. This means that the number of columns in the projection clause can make a significant difference to performance, since only Smart Scans allow taking advantage of offloading, and in particular of avoiding I/O via Storage Indexes.

Now the expression "254 columns" might ring a bell, since it is the maximum number of columns that Oracle can store in a single row piece - tables consisting of more than 254 columns will have to be stored in multiple row pieces.
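
As a quick reminder of that row piece behaviour, here is a minimal sketch (assuming a table T255 with more than 254 columns already exists): unlike DBMS_STATS, the old ANALYZE command populates CHAIN_CNT, which also counts rows that consist of multiple row pieces within the same block.

analyze table t255 compute statistics;

-- CHAIN_CNT > 0 even though all row pieces may reside in the same block
-- ("intra-block" chaining)
select table_name, num_rows, chain_cnt
from   user_tables
where  table_name = 'T255';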

However, what I'm talking about here is not related to such columns residing in different row pieces of a row - Smart Scans still happily work even if columns from different row pieces are projected (which was subject to several bugs in the past). You might, however, end up with additional "cell single block reads" in the case of truly chained rows, rather than just the additional logical I/O for picking up the different row pieces from the same block, which is sometimes called "intra-block" chaining.

No, the limitation simply seems to be that Smart Scans - broadly speaking and ignoring edge cases - can only transport a maximum of 254 columns from a single (non-HCC) segment. Requesting more columns will simply disable Smart Scans for that segment.

Now you might say that offloading, and in particular offloaded column projection, isn't that relevant if you select that many columns from a table anyway, but the point is that you lose the ability to benefit from Storage Indexes and from transporting only the relevant rows back to the compute nodes.

Both features can speed up processing significantly, in particular if the number of rows selected is only a fraction of the total number of rows, and/or the cells can avoid a significant amount of I/O via Storage Indexes.

To demonstrate the point I've put together a simple test case that generates a test table with more than 254 columns - the script below generates a table of approx. 40GB uncompressed size so that a significant difference in performance could be measured.

set echo on timing on time on

-- MAX is 999, there is an ID column as first col
define num_cols = 300

define compression = nocompress
--define compression = "compress for query low"

drop table many_cols_rg;

purge table many_cols_rg;

declare
  s_sql varchar2(32767);
begin
  s_sql := q'!
create table many_cols_rg pctfree 0
&compression
parallel nologging
as
with generator1 as
(
select  /*+ materialize cardinality(1000) */
        rownum as id
        -- this makes the rowsource wider, otherwise PX BLOCK ITERATOR has problems properly spreading work among the slaves
      , rpad('x', 4000) as padding
from
        dual
connect by
        level <= 1e3
),
generator2 as
(
select  /*+ materialize cardinality(10000) */
        rownum as id
        -- this makes the rowsource wider, otherwise PX BLOCK ITERATOR has problems properly spreading work among the slaves
      , rpad('x', 4000) as padding
from
        dual
connect by
        level <= 1e4
)
select
        num_id as id
!';
  for i in 1..&num_cols loop
    s_sql := s_sql || ', char_id as col' || to_char(i, 'FM000');
  end loop;
  s_sql := s_sql || q'!
from
        (
        select  /*+ no_merge */
                a.id + (b.id - 1) * 1e3 as num_id
              , cast(to_char(a.id + (b.id - 1) * 1e3, 'FM0000000000') as varchar2(10)) as char_id
        from
                generator1 a
              , generator2 b
        )
  !';
  execute immediate s_sql;
end;
/

exec dbms_stats.gather_table_stats(null, 'many_cols_rg')

Assuming a Storage Index was generated on the ID column, a query like the following (using Parallel Query at a DOP of 32 in the test runs here) can benefit from offloading, since in principle all I/O could be avoided via the Storage Index, and virtually no data needs to be transported from the cells to the compute nodes.

Note that it projects exactly 254 columns from different row pieces.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:28:22 SQL>
11:28:22 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:28:22 SQL>
11:28:22 SQL> select
11:28:22   2  col001,
11:28:22   3  col002,
.
.
.
11:28:23 254  col253,
11:28:23 255  col300/*,
11:28:23 256  col254*/
11:28:23 257  from many_cols_rg where id between -2 and -1;

no rows selected

Elapsed: 00:00:02.40
11:28:25 SQL>
11:28:25 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                    2,1195E+10
cell physical IO interconnect bytes returned by smart scan          3000784

Elapsed: 00:00:00.01
11:28:25 SQL>

As you can see from the snippet, it took less than 2.5 seconds to run the query against the 40GB segment, and the session statistics report 20GB of I/O avoided via the Storage Index (which seems to be an instrumentation bug, as this statistic always reports at most 50% of the total segment size - this output was taken from 11.2.0.2 Exadata BP14). Furthermore, only a couple of MB were exchanged between the cells and the compute nodes.
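
For reference, the actual segment size can be checked like this, to put the "bytes saved" figure above into perspective:

select round(bytes / 1024 / 1024 / 1024, 1) as size_gb
from   user_segments
where  segment_name = 'MANY_COLS_RG';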

The corresponding Real-Time SQL Monitoring report confirms the "Smart Scan":


Increasing the number of columns projected from the segment above 254 (and as outlined above it doesn't matter which row pieces these columns come from) disables the Smart Scan, and it takes more than 9 seconds to run essentially the same query, pumping all 40GB through the compute nodes just to filter out all rows.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:29:14 SQL>
11:29:14 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:29:14 SQL>
11:29:14 SQL> select
11:29:14   2  col001,
11:29:14   3  col002,
.
.
.
11:29:15 254  col253,
11:29:15 255  col300,
11:29:15 256  col254
11:29:15 257  from many_cols_rg where id between -2 and -1;

no rows selected

Elapsed: 00:00:09.22
11:29:24 SQL>
11:29:24 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.01
11:29:24 SQL>

The corresponding Real-Time SQL Monitoring report confirms the fallback to "direct path reads":


Funnily enough, in this deliberately crafted, extreme case it is much faster to access the segment twice and pick up the remaining columns via a self-join, in order to benefit from the offloading features - it only takes 4.7 seconds to run the self-join, and the session statistics confirm that both segment scans could leverage offloading and in particular avoid I/O via Storage Indexes:

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:29:37 SQL>
11:29:37 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:29:37 SQL>
11:29:37 SQL> select
11:29:37   2  a.col001,
11:29:37   3  a.col002,
.
.
.
11:29:37 254  a.col253,
11:29:37 255  a.col300,
11:29:37 256  b.col254
11:29:37 257  from many_cols_rg a, many_cols_rg b
11:29:37 258  where a.id between -2 and -1 and b.id between -2 and -1
11:29:37 259  and a.id = b.id;

no rows selected

Elapsed: 00:00:04.77
11:29:42 SQL>
11:29:42 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                    4,2390E+10
cell physical IO interconnect bytes returned by smart scan          6001568

Elapsed: 00:00:00.01
11:29:42 SQL>
11:29:42 SQL>

Note: This is one of the cases where you don't want the optimizer to eliminate a self-join via a query transformation based on corresponding constraints on the join columns :-)
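
Just as a sketch (hypothetical, since in this test case there is no unique constraint on ID that would enable join elimination in the first place), the NO_ELIMINATE_JOIN hint could be used to keep the self-join in place:

select /*+ no_eliminate_join(b) */
        a.col001
        -- ... col002 to col253 and col300 as in the query above ...
      , b.col254
from
        many_cols_rg a
      , many_cols_rg b
where
        a.id between -2 and -1 and b.id between -2 and -1
and     a.id = b.id;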

HCC Compression


Now the funny thing is that if you repeat the table creation script, but uncomment the HCC compression, Smart Scans happily work with up to 1,000 columns projected from such a compressed segment.

So obviously the general code implementation supports transporting rows with more than 254 columns from the cells to the compute nodes - the question is why it only does so for HCC compressed segments. That is probably a question my client will raise with Oracle Support to find out the answer.
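
By the way, whether the segment really ended up HCC compressed can be double-checked in the dictionary:

select compression, compress_for
from   user_tables
where  table_name = 'MANY_COLS_RG';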

Footnote


At least only the number of "raw" columns projected counts towards the limitation - any expressions based on columns don't count. Therefore you can actually project more than 254 expressions from a segment and still benefit from Smart Scans, as long as the expressions refer to a maximum of 254 "base" columns.
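
To illustrate the point (just a sketch, the expression names are made up): a projection like the following contains 300 expressions but refers to only two "base" columns, so it should still qualify for a Smart Scan.

select
        upper(col001)          as expr001
      , lower(col001)          as expr002
      , substr(col002, 1, 5)   as expr003
      , col001 || col002       as expr004
        -- ... and so on up to expr300, all based on COL001 / COL002 only ...
from
        many_cols_rg
where
        id between -2 and -1;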

The same limitation could also be reproduced using (almost) the latest available Exadata version as of writing this (11.2.0.3 BP12 - almost, because BP14 just came out as far as I know).

Sunday, January 13, 2013

HAVING Cardinality

When performing aggregate GROUP BY operations an additional filter on the aggregates can be applied using the HAVING clause.

Usually aggregates are one of the last steps executed before the final result set is returned to the client.

However, there are various reasons why a GROUP BY operation might end up somewhere in the middle of the execution plan: for example, it might be part of a view that cannot be merged (or was hinted not to be merged using the NO_MERGE hint), or in more recent releases (11g+) the optimizer might have decided to use the GROUP BY PLACEMENT transformation, which deliberately moves the GROUP BY operation to a different step of the plan.
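
As a side note, the GROUP BY PLACEMENT transformation can be controlled per query via the PLACE_GROUP_BY / NO_PLACE_GROUP_BY hints - a hedged sketch with made-up tables T1 / T2:

select  /*+ no_place_group_by */
        t1.grp_col
      , sum(t2.amount)
from
        t1
      , t2
where
        t1.id = t2.t1_id
group by
        t1.grp_col;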

In such cases, when the GROUP BY operation serves as input to some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as they will influence the choice of related execution steps like join orders and join methods, or simply the decision between index-based access and a full table scan.

While the optimizer, based on the statistics, can come up with a reasonable estimate of the cardinality of the GROUP BY expression (the emphasis here is on *can* - it might also be wrong), it is important to understand that an additional filter on the aggregates using the HAVING clause is in principle treated like an "unknown" expression, and therefore the estimates are based on built-in defaults that might not have much to do with the actual filter selectivity of that HAVING expression.

Here is a simple example to demonstrate the point:

create table t
as
select
        rownum as id
      , date '2000-01-01' + mod(rownum - 1, 100) as a_date1
      , date '2000-01-01' + mod(rownum - 1, 100) as a_date2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1e5
;

exec dbms_stats.gather_table_stats(null, 't')

create unique index t_idx on t (id);

There is a table of 100K rows with two dates in it that each have 100 distinct values (we can ignore here that the values generated are actually correlated) - the ID column is unique.

If I ask for an estimate for queries similar to the following on this table:

set echo on linesize 200 pagesize 0

explain plan for
select
        id
      , max(a_date1) as max_a_date1
      , min(a_date2) as min_a_date2
from
        t
-- 50% of data
where
        id > 50000
group by
        id
;

select * from table(dbms_xplan.display);

explain plan for
select
        id
      , max(a_date1) as max_a_date1
      , min(a_date2) as min_a_date2
from
        t
-- 50% of data
where
        id > 50000
group by
        id
-- No actual filtering
having
        max(a_date1) >= date '2000-01-01'
and     min(a_date2) <= date '2000-01-01' + 100
;

select * from table(dbms_xplan.display);

I'll get these estimates:

-- Simple GROUP BY without HAVING
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      | 50001 |
|   1 |  HASH GROUP BY     |      | 50001 |
|*  2 |   TABLE ACCESS FULL| T    | 50001 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("ID">50000)

-- GROUP BY plus HAVING
--------------------------------------------
| Id  | Operation           | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT    |      |   126 |
|*  1 |  FILTER             |      |       |
|   2 |   HASH GROUP BY     |      |   126 |
|*  3 |    TABLE ACCESS FULL| T    | 50001 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(MAX("A_DATE1")>=TO_DATE(' 2000-01-01 00:00:00',
              'syyyy-mm-dd hh24:mi:ss') AND MIN("A_DATE2")<=TO_DATE(' 2000-04-10
              00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
   3 - filter("ID">50000)

Note how the optimizer gets the cardinality right for the first statement. The aggregation doesn't reduce the cardinality due to the uniqueness of the ID column.

Notice how the filter predicates on the aggregates in the second case actually do not filter anything at all, as they select the whole date range available. But the optimizer simply assumes 5% selectivity per unknown range predicate, resulting in 0.05 times 0.05 = 0.0025 selectivity.
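
The arithmetic behind the 126 rows (the optimizer seems to round up here):

-- 50001 rows after the ID filter, times 5% per unknown range predicate:
-- 50001 * 0.05 * 0.05 = 125.0025
select ceil(50001 * 0.05 * 0.05) as having_card from dual;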

I don't think that this is a bug - the optimizer simply cannot assume anything about the possible MIN and MAX values, given arbitrary other potential filters.

If I now wrap the aggregate query as a view into a more complex statement, it becomes obvious that a HAVING clause can also be derived implicitly, by pushing corresponding filter predicates into the view:

select /*+ no_merge(x) */ * from (
select  /*+ gather_plan_statistics */
        a.*, b.id as b_id, b.filler as b_filler
from    (
          /* Aggregate inline view without having */
          select
                  id
                , max(a_date1) as max_a_date1
                , min(a_date2) as min_a_date2
          from
                  t
          group by
                  id
        ) a
      , t b
where
        /* These filter predicates will be pushed into the view */
        max_a_date1 >= date '2000-01-01'
and     min_a_date2 <= date '2000-01-01' + 100
and     a.id > 50000
        /* A join between the view and something else */
and     a.id = b.id
) x
where
        rownum > 1
;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

----------------------------------------------------------------------------
| Id  | Operation                       | Name  | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |       |      1 |        |      0 |
|   1 |  COUNT                          |       |      1 |        |      0 |
|*  2 |   FILTER                        |       |      1 |        |      0 |
|   3 |    VIEW                         |       |      1 |    126 |  50000 |
|   4 |     NESTED LOOPS                |       |      1 |        |  50000 |
|   5 |      NESTED LOOPS               |       |      1 |    126 |  50000 |
|   6 |       VIEW                      |       |      1 |    126 |  50000 |
|*  7 |        FILTER                   |       |      1 |        |  50000 |
|   8 |         HASH GROUP BY           |       |      1 |    126 |  50000 |
|*  9 |          TABLE ACCESS FULL      | T     |      1 |  50001 |  50000 |
|* 10 |       INDEX UNIQUE SCAN         | T_IDX |  50000 |      1 |  50000 |
|  11 |      TABLE ACCESS BY INDEX ROWID| T     |  50000 |      1 |  50000 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM>1)
   7 - filter((MAX("A_DATE1")>=TO_DATE(' 2000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              MIN("A_DATE2")<=TO_DATE(' 2000-04-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
   9 - filter("ID">50000)
  10 - access("A"."ID"="B"."ID")
       filter("B"."ID">50000)

Notice how the incorrect cardinality estimate leads to a NESTED LOOPS join, which is very likely a bad idea in such cases where the actual number of loop iterations is much higher than estimated.

In general such bad cardinality estimates can echo through the whole execution plan with potentially devastating results.

For my particular case here, one potential workaround - besides using the undocumented CARDINALITY or OPT_ESTIMATE hints (a sketch of that alternative follows the list of drawbacks below) - is to prevent the optimizer from pushing the filters into the view (thereby avoiding the implicit generation of the HAVING clause) by wrapping the view in another view that includes a reference to ROWNUM:

select /*+ no_merge(x) */ * from (
select  /*+ gather_plan_statistics */
        a.*, b.id as b_id, b.filler as b_filler
from    (
          /* ROWNUM in projection prevents pushing of filters */
          select  id
                , max_a_date1
                , min_a_date2
                , rownum as rn
          from
                  (
                    /* The unchanged aggregate view */
                    select
                            id
                          , max(a_date1) as max_a_date1
                          , min(a_date2) as min_a_date2
                    from
                            t
                    group by
                            id
                  )
        ) a
      , t b
where
        max_a_date1 >= date '2000-01-01'
and     min_a_date2 <= date '2000-01-01' + 100
and     a.id > 50000
and     a.id = b.id
) x
where
        rownum > 1
;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));

---------------------------------------------------------------------
| Id  | Operation                 | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |      1 |        |      0 |
|   1 |  COUNT                    |      |      1 |        |      0 |
|*  2 |   FILTER                  |      |      1 |        |      0 |
|   3 |    VIEW                   |      |      1 |  50001 |  50000 |
|*  4 |     HASH JOIN             |      |      1 |  50001 |  50000 |
|*  5 |      VIEW                 |      |      1 |    100K|  50000 |
|   6 |       COUNT               |      |      1 |        |    100K|
|   7 |        VIEW               |      |      1 |    100K|    100K|
|   8 |         HASH GROUP BY     |      |      1 |    100K|    100K|
|   9 |          TABLE ACCESS FULL| T    |      1 |    100K|    100K|
|* 10 |      TABLE ACCESS FULL    | T    |      1 |  50001 |  50000 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(ROWNUM>1)
   4 - access("A"."ID"="B"."ID")
   5 - filter(("MAX_A_DATE1">=TO_DATE(' 2000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "MIN_A_DATE2"<=TO_DATE(' 2000-04-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "A"."ID">50000))
  10 - filter("B"."ID">50000)

This way the cardinality estimate is much better for my particular case here and a more suitable HASH JOIN gets used instead.

There are however a couple of noticeable drawbacks to this workaround:

- The optimizer is still clueless: if, for example, the filter on the aggregates actually filtered out data, it would still assume NO filtering at all (still 100K rows in this case here), which might be just as bad, but in the opposite direction (over-estimate instead of under-estimate)

- Any other potentially useful filters (the "ID > 50000" in my case here) are also prevented from being pushed into the view, so the GROUP BY has to operate on a much larger data set than actually necessary (100K rows instead of 50K rows) - not good

- The ROWNUM evaluation causes overhead and will be a problem when trying to run this statement using Parallel Execution, as it causes the plan to be split into multiple DFOs (you'll find multiple PX COORDINATOR operations in such plans, which can have nasty side effects - for more info read my OTN mini-series on Parallel Execution) with a serialized operation in between to determine the row number
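
Given these drawbacks, the undocumented CARDINALITY hint mentioned above might actually be the lesser evil in some situations. A hedged sketch (not verified here) that simply tells the optimizer what to expect from the aggregate view:

select /*+ no_merge(x) */ * from (
select  /*+ cardinality(a 50000) */ -- undocumented: claim approx. 50K rows for view A
        a.*, b.id as b_id, b.filler as b_filler
from    (
          /* The unchanged aggregate view, filters get pushed as before */
          select
                  id
                , max(a_date1) as max_a_date1
                , min(a_date2) as min_a_date2
          from
                  t
          group by
                  id
        ) a
      , t b
where
        max_a_date1 >= date '2000-01-01'
and     min_a_date2 <= date '2000-01-01' + 100
and     a.id > 50000
and     a.id = b.id
) x
where
        rownum > 1
;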

Summary


Be careful whenever the cardinality estimates for HAVING clauses become relevant to your execution plan, as the optimizer simply applies default selectivities that can be way off.

If the aggregation is one of the final steps of execution, the cardinality estimate is probably not that relevant, but as shown there are other cases where it matters a lot.

Footnote


From 11.2.0.3 on there is a new undocumented parameter "_optimizer_filter_pushdown" that defaults to "true".

When setting it to "false", the "Filter Pushdown" (FPD) transformation used above (and prevented via ROWNUM) will be disabled - however on a global statement / session level, and not just for a particular view / query block as with the ROWNUM workaround.
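
So on a test system the same effect as the ROWNUM trick could be achieved like this (undocumented parameter - handle with care):

-- Testing only: disable the Filter Pushdown transformation for the session
alter session set "_optimizer_filter_pushdown" = false;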