Wednesday, April 24, 2019

Bloom Filter Efficiency And Cardinality Estimates

I've recently came across an interesting observation I've not seen documented yet, so I'm publishing a simple example here to demonstrate the issue.

In principle it looks like that the efficiency of Bloom Filter operations are dependent on the cardinality estimates. This means that in particular cardinality under-estimates of the optimizer can make a dramatic difference how efficient a corresponding Bloom Filter operation based on such a cardinality estimate will work at runtime. Since Bloom Filters are crucial for efficient processing in particular when using Exadata or In Memory column store this can have significant impact on the performance of affected operations.

While other operations based on SQL workareas like hash joins for example can be affected by such cardinality mis-estimates, too, these seem to be capable of adapting at runtime - at least to a certain degree. However I haven't seen such an adaptive behaviour of Bloom Filter operations at runtime (not even when executing the same statement multiple times and statistics feedback not kicking in).

To demonstrate the issue I'll create two simple tables that get joined and one of them gets a filter applied:

create table t1 parallel 4 nologging compress
as
with
generator1 as
(
  select /*+
              cardinality(1e3)
              materialize
          */
          rownum as id
        , rpad('x', 100) as filler
  from
          dual
  connect by
          level <= 1e3
),
generator2 as
(
  select /*+
              cardinality(1e4)
              materialize
          */
          rownum as id
        , rpad('x', 100) as filler
  from
          dual
  connect by
          level <= 1e4
)
select
       id
     , id as id2
     , rpad('x', 100) as filler
from   (
          select /*+ leading(b a) */
                  (a.id - 1) * 1e4 + b.id as id
          from
                  generator1 a
                , generator2 b
       )
;

alter table t1 noparallel;

create table t2 parallel 4 nologging compress as select * from t1;

alter table t2 noparallel;

All I did here is create two tables with 10 million rows each, and I'll look at the runtime statistics of the following query:

select /*+ no_merge(x) */ * from (
select /*+
           leading(t1)
           use_hash(t2)
           px_join_filter(t2)
           opt_estimate(table t1 rows=1)
           --opt_estimate(table t1 rows=250000)
           monitor
       */
       t1.id
     , t2.id2
from
       t1
     , t2
where
       mod(t1.id2, 40) = 0
--     t1.id2 between 1 and 250000
and    t1.id = t2.id
) x
where rownum > 1;

Note: If you try to reproduce make sure you get actually a Bloom Filter operation - in an unpatched version 12.1.0.2 I had to add a PARALLEL(2) hint to actually get the Bloom Filter operation.

The query filters on T1 so that 250K rows will be returned and then joins to T2. The first interesting observation regarding the efficiency of the Bloom Filter is that the actual data pattern makes a significant difference: When using the commented filter "T1.ID2 BETWEEN 1 and 250000" the resulting cardinality will be same as when using the "MOD(T1.ID2, 40) = 0", but the former will result in a perfect filtering of the Bloom Filter regardless of the OPT_ESTIMATE hint used, whereas when using the latter the efficiency will be dramatically different.

This is what I get when using version 18.3 (12.1.0.2 showed very similar results) and force the under-estimate using the OPT_ESTIMATE ROWS=1 hint - the output is from my XPLAN_ASH script and edited for brevity:

------------------------------------------------------------------------------------
| Id  | Operation              | Name    | Rows  | Bytes | Execs | A-Rows | PGA    |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |       |       |    1  |     0  |        |
|   1 |  COUNT                 |         |       |       |    1  |     0  |        |
|*  2 |   FILTER               |         |       |       |    1  |     0  |        |
|   3 |    VIEW                |         |     1 |    26 |    1  |   250K |        |
|*  4 |     HASH JOIN          |         |     1 |    24 |    1  |   250K | 12556K |
|   5 |      JOIN FILTER CREATE| :BF0000 |     1 |    12 |    1  |   250K |        |
|*  6 |       TABLE ACCESS FULL| T1      |     1 |    12 |    1  |   250K |        |
|   7 |      JOIN FILTER USE   | :BF0000 |    10M|   114M|    1  | 10000K |        |
|*  8 |       TABLE ACCESS FULL| T2      |    10M|   114M|    1  | 10000K |        |
------------------------------------------------------------------------------------

The Bloom Filter didn't help much, only a few rows were actually filtered (otherwise my XPLAN_ASH script would have shown "10M" as actually cardinality instead of "10000K", which is something slightly less than 10M rounded up).

Repeat the same but this time using the OPT_ESTIMATE ROWS=250000 hint:

-------------------------------------------------------------------------------------------
| Id  | Operation              | Name    | Rows  | Bytes |TempSpc| Execs | A-Rows| PGA    |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |       |       |       |    1  |    0  |        |
|   1 |  COUNT                 |         |       |       |       |    1  |    0  |        |
|*  2 |   FILTER               |         |       |       |       |    1  |    0  |        |
|   3 |    VIEW                |         |   252K|  6402K|       |    1  |  250K |        |
|*  4 |     HASH JOIN          |         |   252K|  5909K|  5864K|    1  |  250K | 12877K |
|   5 |      JOIN FILTER CREATE| :BF0000 |   250K|  2929K|       |    1  |  250K |        |
|*  6 |       TABLE ACCESS FULL| T1      |   250K|  2929K|       |    1  |  250K |        |
|   7 |      JOIN FILTER USE   | :BF0000 |    10M|   114M|       |    1  |  815K |        |
|*  8 |       TABLE ACCESS FULL| T2      |    10M|   114M|       |    1  |  815K |        |
-------------------------------------------------------------------------------------------

So we end up with exactly the same execution plan but the efficiency of the Bloom Filter at runtime has changed dramatically due to the different cardinality estimate the Bloom Filter is based on.

I haven't spent much time yet with the corresponding undocumented parameters that might influence the Bloom Filter behaviour, but when I repeated the same and used the following settings in the session (and ensuring an adequate PGA_AGGREGATE_TARGET setting otherwise the hash join might be starting spilling to disk, which means the Bloom Filter size is considered when calculating SQL workarea sizes):

alter session set "_bloom_filter_size" = 1000000;

I got the following result:

-----------------------------------------------------------------------------------
| Id  | Operation              | Name    | Rows  | Bytes | Execs | A-Rows| PGA    |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |         |       |       |    1  |    0  |        |
|   1 |  COUNT                 |         |       |       |    1  |    0  |        |
|*  2 |   FILTER               |         |       |       |    1  |    0  |        |
|   3 |    VIEW                |         |     1 |    26 |    1  |  250K |        |
|*  4 |     HASH JOIN          |         |     1 |    24 |    1  |  250K | 12568K |
|   5 |      JOIN FILTER CREATE| :BF0000 |     1 |    12 |    1  |  250K |        |
|*  6 |       TABLE ACCESS FULL| T1      |     1 |    12 |    1  |  250K |        |
|   7 |      JOIN FILTER USE   | :BF0000 |    10M|   114M|    1  |  815K |        |
|*  8 |       TABLE ACCESS FULL| T2      |    10M|   114M|    1  |  815K |        |
-----------------------------------------------------------------------------------

which shows a slightly increased PGA usage compared to the first output but the same efficiency as when having the better cardinality estimate in place.

Increasing the size I couldn't however convince Oracle to make the Bloom Filter even more efficient, even when the better cardinality estimate was in place.

Summary

Obviously the efficiency / internal sizing of the Bloom Filter vector at runtime depends on the cardinality estimates of the optimizer. Depending on the actual data pattern this can make a significant difference in terms of efficiency. Yet another reason why having good cardinality estimates is a good thing and yet sometimes so hard to achieve, in particular for join cardinalities.

Footnote

On MyOracleSupport I've found the following note regarding Bloom Filter efficiency:

Bug 8932139 - Bloom filtering efficiency is inversely proportional to DOP (Doc ID 8932139.8)

Another interesting behaviour - the bug is only fixed in version 19.1 but also included in the latest RU(R)s of 18c and 12.2 from January 2019 on.

Chinar Aliyev's Blog

Chinar Aliyev has recently started to pick up on several of my blog posts regarding Parallel Execution and the corresponding new features introduced in Oracle 12c.

It is good to see that obviously Oracle has since then improved some of these and added new ones as well.

Here are some links to the corresponding posts:

New automatic Parallel Outer Join Null Handling in 18c

Improvements regarding automatic parallel distribution skew handling in 18c

Chinar has also put some more thoughts on the HASH JOIN BUFFERED operation:

New thoughts about the HASH JOIN BUFFERED operation

There are also a number of posts on his blog regarding histograms and in particular how to properly calculate the join cardinality in the presence of additional filters and resulting skew, which is a very interesting topic and yet to be handled properly by the optimizer even in the latest versions.