## Sunday, January 13, 2013

### HAVING Cardinality

When performing aggregate GROUP BY operations, an additional filter on the aggregates can be applied using the HAVING clause.

Usually, aggregation is one of the last steps executed before the final result set is returned to the client.

However, there are various reasons why a GROUP BY operation might end up somewhere in the middle of the execution plan: for example, it might be part of a view that cannot be merged (or was hinted not to be merged via the NO_MERGE hint), or in more recent releases (11g and later) the optimizer might have applied the GROUP BY PLACEMENT transformation, which deliberately moves the GROUP BY operation to a different step of the plan.

In such cases, where the GROUP BY operation feeds into some other operation, it becomes essential for the overall efficiency of the execution plan preferred by the optimizer that the cardinality estimates are in the right ballpark, as they influence related decisions like join orders and join methods, or simply the choice between index-based access and a full table scan.

While the optimizer can, based on the statistics, come up with a reasonable estimate of the cardinality of the GROUP BY expression (the emphasis is on *can*; it might also be wrong), it is important to understand that an additional filter on the aggregates via the HAVING clause is in principle treated like an "unknown" expression, so the estimates are based on built-in defaults that might have little to do with the actual selectivity of that HAVING expression.

Here is a simple example to demonstrate the point:

```sql
create table t
as
select
        rownum as id
      , date '2000-01-01' + mod(rownum - 1, 100) as a_date1
      , date '2000-01-01' + mod(rownum - 1, 100) as a_date2
      , rpad('x', 100) as filler
from
        dual
connect by
        level <= 1e5
;

exec dbms_stats.gather_table_stats(null, 't')

create unique index t_idx on t (id);
```

The table has 100K rows with two date columns that each have 100 distinct values (we can ignore here that the generated values are correlated), and the ID column is unique.

If I ask for an estimate for queries similar to the following on this table:

```sql
set echo on linesize 200 pagesize 0

explain plan for
select
        id
      , max(a_date1) as max_a_date1
      , min(a_date2) as min_a_date2
from
        t
-- 50% of data
where
        id > 50000
group by
        id
;

select * from table(dbms_xplan.display);

explain plan for
select
        id
      , max(a_date1) as max_a_date1
      , min(a_date2) as min_a_date2
from
        t
-- 50% of data
where
        id > 50000
group by
        id
-- No actual filtering
having
        max(a_date1) >= date '2000-01-01'
and     min(a_date2) <= date '2000-01-01' + 100
;

select * from table(dbms_xplan.display);
```

I'll get these estimates:

```
-- Simple GROUP BY without HAVING
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      | 50001 |
|   1 |  HASH GROUP BY     |      | 50001 |
|*  2 |   TABLE ACCESS FULL| T    | 50001 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("ID">50000)

-- GROUP BY plus HAVING
--------------------------------------------
| Id  | Operation           | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT    |      |   126 |
|*  1 |  FILTER             |      |       |
|   2 |   HASH GROUP BY     |      |   126 |
|*  3 |    TABLE ACCESS FULL| T    | 50001 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(MAX("A_DATE1")>=TO_DATE(' 2000-01-01 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND MIN("A_DATE2")<=TO_DATE(' 2000-04-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
3 - filter("ID">50000)
```

Note how the optimizer gets the cardinality right for the first statement. The aggregation doesn't reduce the cardinality due to the uniqueness of the ID column.
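The 50001-row figure for the "ID > 50000" predicate can be reproduced with the textbook range selectivity formula (a sketch in Python, assuming uniform column statistics with no histogram; this mirrors, but is not, Oracle's internal computation):

```python
# Column statistics for T.ID: 100K rows, values 1..100000 (assumed uniform)
num_rows, low, high = 100_000, 1, 100_000
val = 50_000

# Textbook selectivity for "col > val" on a uniformly distributed column
selectivity = (high - val) / (high - low)

estimate = round(num_rows * selectivity)
print(estimate)  # 50001
```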

Notice how the filter predicates on the aggregates in the second case actually do not filter at all, as they cover the whole available date range. Nevertheless, the optimizer simply assumes a 5% selectivity per unknown range predicate, resulting in a combined selectivity of 0.05 * 0.05 = 0.0025.
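The 126-row figure follows directly from those defaults (a sketch in Python; the 5% default per unknown range predicate and the rounding up of the computed estimate are assumptions based on commonly documented optimizer behavior):

```python
import math

group_by_rows = 50_001      # estimated rows after the GROUP BY (ID is unique)
default_range_sel = 0.05    # assumed default selectivity per unknown range predicate

# Two range predicates in the HAVING clause, treated as independent
having_sel = default_range_sel * default_range_sel   # 0.0025

# Row estimates are assumed to be rounded up to the next integer
estimate = math.ceil(group_by_rows * having_sel)
print(estimate)  # 126
```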

I don't think this is a bug: given arbitrary other filters, the optimizer simply cannot assume anything about the possible MIN and MAX values.

If I now wrap the aggregate query as a view inside a more complex statement, it becomes obvious that a HAVING clause can also arise implicitly, by pushing corresponding filters on the aggregates into the view:

```sql
select /*+ no_merge(x) */ * from (
select  /*+ gather_plan_statistics */
        a.*, b.id as b_id, b.filler as b_filler
from    (
          /* Aggregate inline view without having */
          select
                  id
                , max(a_date1) as max_a_date1
                , min(a_date2) as min_a_date2
          from
                  t
          group by
                  id
        ) a
      , t b
where
        /* These filter predicates will be pushed into the view */
        max_a_date1 >= date '2000-01-01'
and     min_a_date2 <= date '2000-01-01' + 100
and     a.id > 50000
        /* A join between the view and something else */
and     a.id = b.id
) x
where
        rownum > 1
;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```

```

----------------------------------------------------------------------------
| Id  | Operation                       | Name  | Starts | E-Rows | A-Rows |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |       |      1 |        |      0 |
|   1 |  COUNT                          |       |      1 |        |      0 |
|*  2 |   FILTER                        |       |      1 |        |      0 |
|   3 |    VIEW                         |       |      1 |    126 |  50000 |
|   4 |     NESTED LOOPS                |       |      1 |        |  50000 |
|   5 |      NESTED LOOPS               |       |      1 |    126 |  50000 |
|   6 |       VIEW                      |       |      1 |    126 |  50000 |
|*  7 |        FILTER                   |       |      1 |        |  50000 |
|   8 |         HASH GROUP BY           |       |      1 |    126 |  50000 |
|*  9 |          TABLE ACCESS FULL      | T     |      1 |  50001 |  50000 |
|* 10 |       INDEX UNIQUE SCAN         | T_IDX |  50000 |      1 |  50000 |
|  11 |      TABLE ACCESS BY INDEX ROWID| T     |  50000 |      1 |  50000 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
7 - filter((MAX("A_DATE1")>=TO_DATE(' 2000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
MIN("A_DATE2")<=TO_DATE(' 2000-04-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
9 - filter("ID">50000)
10 - access("A"."ID"="B"."ID")
filter("B"."ID">50000)
```

Notice how the incorrect cardinality estimate leads to a NESTED LOOPS join, which is very likely a bad idea in cases like this, where the actual number of loop iterations is much higher than estimated.
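To put the misestimate into numbers (plain arithmetic, for illustration only): the inner row source of the NESTED LOOPS is started once per driving row, so the index probe is executed a few hundred times more often than the optimizer costed for.

```python
estimated_probes = 126     # E-Rows of the driving view in the plan above
actual_probes = 50_000     # Starts / A-Rows of the INDEX UNIQUE SCAN
factor = actual_probes / estimated_probes
print(f"index probes underestimated by a factor of ~{factor:.0f}x")
```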

In general such bad cardinality estimates can echo through the whole execution plan with potentially devastating results.

For my particular case here, one potential workaround (besides using the undocumented CARDINALITY or OPT_ESTIMATE hints) is to prevent the optimizer from pushing the filters into the view, and thereby avoid the implicit generation of the HAVING clause, by wrapping the view in another view on top that includes a reference to ROWNUM:

```sql
select /*+ no_merge(x) */ * from (
select  /*+ gather_plan_statistics */
        a.*, b.id as b_id, b.filler as b_filler
from    (
          /* ROWNUM in projection prevents pushing of filters */
          select
                  id
                , max_a_date1
                , min_a_date2
                , rownum as rn
          from
                  (
                    /* The unchanged aggregate view */
                    select
                            id
                          , max(a_date1) as max_a_date1
                          , min(a_date2) as min_a_date2
                    from
                            t
                    group by
                            id
                  )
        ) a
      , t b
where
        max_a_date1 >= date '2000-01-01'
and     min_a_date2 <= date '2000-01-01' + 100
and     a.id > 50000
and     a.id = b.id
) x
where
        rownum > 1
;

select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```

```

---------------------------------------------------------------------
| Id  | Operation                 | Name | Starts | E-Rows | A-Rows |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      |      1 |        |      0 |
|   1 |  COUNT                    |      |      1 |        |      0 |
|*  2 |   FILTER                  |      |      1 |        |      0 |
|   3 |    VIEW                   |      |      1 |  50001 |  50000 |
|*  4 |     HASH JOIN             |      |      1 |  50001 |  50000 |
|*  5 |      VIEW                 |      |      1 |    100K|  50000 |
|   6 |       COUNT               |      |      1 |        |    100K|
|   7 |        VIEW               |      |      1 |    100K|    100K|
|   8 |         HASH GROUP BY     |      |      1 |    100K|    100K|
|   9 |          TABLE ACCESS FULL| T    |      1 |    100K|    100K|
|* 10 |      TABLE ACCESS FULL    | T    |      1 |  50001 |  50000 |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter(ROWNUM>1)
4 - access("A"."ID"="B"."ID")
5 - filter(("MAX_A_DATE1">=TO_DATE(' 2000-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"MIN_A_DATE2"<=TO_DATE(' 2000-04-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "A"."ID">50000))
10 - filter("B"."ID">50000)
```

This way the cardinality estimate is much better for my particular case here, and a more suitable HASH JOIN is used instead.

There are, however, a couple of noticeable drawbacks to that workaround:

- The optimizer is still clueless: if, for example, the filter on the aggregates actually did filter data, the optimizer would still assume no filtering at all (still 100K rows in this case), which might be just as bad, only in the opposite direction (an over-estimate instead of an under-estimate)

- Any other potentially useful filters (the "ID > 50000" in my case here) are also prevented from being pushed into the view, so the GROUP BY has to operate on a much larger data set than necessary (100K rows instead of 50K rows) - not good

- The ROWNUM evaluation causes overhead and becomes a problem when running the statement using Parallel Execution, as it splits the plan into multiple DFOs (you'll find multiple PX COORDINATOR operations in such plans, which can have nasty side effects; for more information see my OTN mini-series on Parallel Execution), with a serialized operation in between to determine the row numbers

### Summary

Be careful whenever the cardinality estimates for HAVING clauses become relevant to your execution plan, as the optimizer simply applies default selectivities that can be way off.

If the aggregation is one of the final steps of execution, the cardinality estimate is probably not that relevant, but there are other cases possible where it matters a lot.

### Footnote

From 11.2.0.3 on there is a new undocumented parameter "_optimizer_filter_pushdown" that defaults to "true".

When it is set to "false", the "Filter Pushdown" (FPD) transformation used above, and prevented via ROWNUM, is disabled. However, this applies at the global statement / session level, not just to a particular view / query block as with the ROWNUM workaround.