Showing posts with label Projection. Show all posts
Showing posts with label Projection. Show all posts

Sunday, August 16, 2015

Parallel Projection

A recent case at a client reminded me of something that isn't really new but not so well known - Oracle by default performs evaluation at the latest possible point in the execution plan.

So if you happen to have expressions in the projection of a simple SQL statement that runs parallel it might be counter-intuitive that by default Oracle won't evaluate the projection in the Parallel Slaves but in the Query Coordinator - even if it was technically possible - because the latest possible point is the SELECT operation with the ID = 0 of the plan, which is always performed by the Query Coordinator.

Of course, if you make use of expressions that can't be evaluated in parallel or aren't implemented for parallel evaluation, then there is no other choice than doing this in the Query Coordinator.

The specific case in question was a generic export functionality that allowed exporting report results to some CSV or Excel like format, and some of these reports had a lot of rows and complex - in that case CPU intensive - expressions in their projection clause.

When looking at the run time profile of such an export query it became obvious that although it was a (very simple) parallel plan, all of the time was spent in the Query Coordinator, effectively turning this at runtime into a serial execution.

This effect can be reproduced very easily:
create table t_1
compress
as
select  /*+ use_nl(a b) */
        rownum as id
      , rpad('x', 100) as filler
from
        (select /*+ cardinality(1e5) */ * from dual
connect by
        level <= 1e5) a, (select /*+ cardinality(20) */ * from dual connect by level <= 20) b
;

exec dbms_stats.gather_table_stats(null, 't_1', method_opt=>'for all columns size 1')

alter table t_1 parallel cache;

-- Run some CPU intensive expressions in the projection 
-- of a simple parallel Full Table Scan
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
;

-- The plan is clearly parallel
--------------------------------------------------------------------------------------------------------------
| Id  | Operation            | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |          |  2000K|   192M|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR      |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)| :TQ10000 |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   4 |     TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------

-- But the runtime profile looks more serial
-- although the Parallel Slaves get used to run the Full Table Scan
-- All time spent in the operation ID = 0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation            | Name     | Execs | A-Rows| ReadB | ReadReq | Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                     | Parallel Execution Skew ASH      | Activity Graph ASH          | Top 5 Activity ASH  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT     |          |    5  | 2000K |       |         |     3 |   136 |   120 | #################### |   1:sqlplus.exe(120)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]       | ################################ | @@@@@@@@@@@@@@@@@@@ ( 98%)  | ON CPU(120)         |
|   1 |   0 |  PX COORDINATOR      |          |    5  | 2000K |       |         |   119 |     1 |     1 |                 #    |   1:sqlplus.exe(1)[2000K],P008(0)[0],P009(0)[0],P00A(0)[0],P00B(0)[0]         |                                  |                     ( .8%)  | ON CPU(1)           |
|   2 |   1 |   PX SEND QC (RANDOM)| :TQ10000 |    4  | 2000K |       |         |    66 |    11 |     2 |          ##          |   2:P00B(1)[508K],P00A(1)[490K],P008(0)[505K],P009(0)[497K],sqlplus.exe(0)[0] |                                  |                     (1.6%)  | PX qref latch(2)    |
|   3 |   2 |    PX BLOCK ITERATOR |          |    4  | 2000K |       |         |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
|*  4 |   3 |     TABLE ACCESS FULL| T_1      |   52  | 2000K |   23M |     74  |       |       |       |                      |   0:P00B(0)[508K],P008(0)[505K],P009(0)[497K],P00A(0)[490K],sqlplus.exe(0)[0] |                                  |                             |                     |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Fortunately there is a simple and straightforward way to make use of the Parallel Slaves for evaluation of projection expressions that can be evaluated in parallel - simply add a suitable NO_MERGE hint for the query block that you want the projection to be evaluated for in the Parallel Slaves.

If you don't want to have side effects on the overall plan shape by not merging views you could always wrap the original query in an outer SELECT and not merging the now inner query block. There seems to be a rule that the projection of a view always get evaluated at the VIEW operator, and if we check the execution plan we can see that the VIEW operator is marked parallel:
set echo on timing on time on

set autotrace traceonly statistics

set arraysize 500

select /*+ no_merge(x) */ * from (
select
        regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') as some_cpu_intensive_exp1
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') as some_cpu_intensive_exp2
      , regexp_replace(filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') as some_cpu_intensive_exp3
from t_1
) x
;

-- View operator is marked parallel
-- This is were the projection clause of the VIEW will be evaluated
---------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |  2000K|    11G|   221   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM) | :TQ10000 |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | P->S | QC (RAND)  |
|   3 |    VIEW               |          |  2000K|    11G|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|   4 |     PX BLOCK ITERATOR |          |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|   5 |      TABLE ACCESS FULL| T_1      |  2000K|   192M|   221   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------

-- Runtime profile now shows effective usage of Parallel Slaves 
-- for doing the CPU intensive work
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Pid | Operation             | Name     | Execs | A-Rows| Start | Dur(T)| Dur(A)| Time Active Graph    | Parallel Distribution ASH                                                         | Parallel Execution Skew ASH| Activity Graph ASH          | Top 5 Activity ASH               |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 |     | SELECT STATEMENT      |          |    5  | 2000K |       |       |       |                      |   0:sqlplus.exe(0)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]             |                            |                             |                                  |
|   1 |   0 |  PX COORDINATOR       |          |    5  | 2000K |    17 |    63 |    10 |    # ##   #    ####  |   1:sqlplus.exe(10)[2000K],P000(0)[0],P001(0)[0],P002(0)[0],P003(0)[0]            | ####                       | *                   (5.6%)  | resmgr:cpu quantum(10)           |
|   2 |   1 |   PX SEND QC (RANDOM) | :TQ10000 |    4  | 2000K |     5 |    61 |    10 | ## #  ## ## ## #     |   3:P002(5)[544K],P001(4)[487K],P000(1)[535K],P003(0)[434K],sqlplus.exe(0)[0]     | #                          |                     (5.6%)  | ON CPU(7),resmgr:cpu quantum(3)  |
|   3 |   2 |    VIEW               |          |    4  | 2000K |     2 |    82 |    69 | #################### |   4:P003(42)[434K],P001(35)[487K],P000(26)[535K],P002(22)[544K],sqlplus.exe(0)[0] | ############               | @@@@@@@@@@@@@@@@@@@ ( 70%)  | ON CPU(125)                      |
|   4 |   3 |     PX BLOCK ITERATOR |          |    4  | 2000K |       |       |       |                      |   0:P002(0)[544K],P000(0)[535K],P001(0)[487K],P003(0)[434K],sqlplus.exe(0)[0]     |                            |                             |                                  |
|*  5 |   4 |      TABLE ACCESS FULL| T_1      |   52  | 2000K |     3 |    78 |    29 | ###### ####### # ### |   4:P000(11)[535K],P002(8)[544K],P001(8)[487K],P003(7)[434K],sqlplus.exe(0)[0]    | ###                        | *****               ( 19%)  | resmgr:cpu quantum(30),ON CPU(4) |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
At runtime the duration of the query now gets reduced significantly and we can see the Parallel Slaves getting used when the VIEW operator gets evaluated. Although the overall CPU time used is similar to the previous example, the duration of the query execution is less since this CPU time is now spent in parallel in the slaves instead in the Query Coordinator.

Summary


By default Oracle performs evaluation at the latest possible point of the execution plan. Sometimes you can improve runtime by actively influencing when the projection will be evaluated by preventing view merging and introducing a VIEW operator that will be used to evaluate the projection clause.

The optimizer so far doesn't seem to incorporate such possibilities in its evaluations of possible plan shapes, so this is something you need to do manually up to and including Oracle 12c (version 12.1.0.2 as of time of writing this).

Thursday, January 31, 2013

Exadata Smart Scan Projection Limitation

Here is an interesting limitation to Exadata Smart Scans - if more than 254 columns from a table (not HCC compressed, more on that in moment) need to be projected, Smart Scans for that particular segment will be disabled and Exadata will fall back to conventional I/O. This means that the number of columns in the projection clause can make a significant difference to performance, since only Smart Scans allow taking advantage of offloading and particularly avoiding I/O via Storage Indexes.

Now the expression "254 columns" might ring a bell, since it is the maximum number of columns that Oracle can store in a single row piece - tables consisting of more than 254 columns will have to be stored in multiple row pieces.

However, what I'm talking about here is not related to such columns residing in different row pieces of a row - Smart Scans still happily work even if columns from different row pieces are projected (which was subject to several bugs in the past), although you might end up with additional "cell single block reads" in case of truly chained rows rather than just additional logical I/O for picking up the different row pieces from the same block, also sometimes called "intra-block" chaining.

No, the limitation simply seems to be that Smart Scans - broadly speaking and ignoring edge cases - can only transport a maximum of 254 columns from a single (non-HCC) segment. Requesting more columns will simply disable Smart Scans for that segment.

Now you might say, offloading and in particular offloading column projection isn't that much relevant if you select that many columns from a table anyway, but the point is that you loose the ability to benefit from Storage Indexes and only transporting the relevant rows back to the compute nodes.

Both features can speed up the processing significantly, in particular if the number of rows selected is only a fraction of the total number of rows, and/or the cells could avoid a significant amount of I/O via Storage Indexes.

To demonstrate the point I've put together a simple test case that generates a test table with more than 254 columns - the script below generates a table of approx. 40GB uncompressed size so that a significant difference in performance could be measured.

set echo on timing on time on

-- MAX is 999, there is a ID column as first col
define num_cols = 300

define compression = nocompress
--define compression = "compress for query low"

drop table many_cols_rg;

purge table many_cols_rg;

declare
  s_sql varchar2(32767);
begin
  s_sql := q'!
create table many_cols_rg pctfree 0
&compression
parallel nologging
as
with generator1 as
(
select  /*+ materialize cardinality(1000) */
        rownum as id
        -- this makes the rowsource wider otherwise PX BLOCK ITERATOR has problems properly spreading work among the slaves
      , rpad('x', 4000) as padding
from
        dual
connect by
        level <= 1e3
),
generator2 as
(
select  /*+ materialize cardinality(10000) */
        rownum as id
        -- this makes the rowsource wider otherwise PX BLOCK ITERATOR has problems properly spreading work among the slaves
      , rpad('x', 4000) as padding
from
        dual
connect by
        level <= 1e4
)
select
        num_id as id
!';
  for i in 1..&num_cols loop
    s_sql := s_sql || ', char_id as col' || to_char(i, 'FM000');
  end loop;
  s_sql := s_sql || q'!
from
        (
        select  /*+ no_merge */
                a.id + (b.id - 1) * 1e3 as num_id
              , cast(to_char(a.id + (b.id - 1) * 1e3, 'FM0000000000') as varchar2(10)) as char_id
        from
                generator1 a
              , generator2 b
        )
  !';
  execute immediate s_sql;
end;
/

exec dbms_stats.gather_table_stats(null, 'many_cols_rg')

Assuming a Storage Index was generated on the ID column a query like the following (using Parallel Query at a DOP of 32 in the test runs here) can benefit from offloading, since in principle all I/O could be avoided via the Storage Index, and virtually no data needs to be transported from the cells to the compute nodes.

Note that it projects exactly 254 columns from different row pieces.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:28:22 SQL>
11:28:22 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:28:22 SQL>
11:28:22 SQL> select
11:28:22   2  col001,
11:28:22   3  col002,
.
.
.
11:28:23 254  col253,
11:28:23 255  col300/*,
11:28:23 256  col254*/
11:28:23 257  from many_cols_rg where id between -2 and -1;

no rows selected

Elapsed: 00:00:02.40
11:28:25 SQL>
11:28:25 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                    2,1195E+10
cell physical IO interconnect bytes returned by smart scan          3000784

Elapsed: 00:00:00.01
11:28:25 SQL>

As you can see from the snippet, it took less than 2.5 seconds to run the query on the 40GB segment, and the session statistics report 20GB avoided via the Storage Index (which seems to be an instrumentation bug as it always reports only 50% of the total segment size as a maximum, this output was taken from 11.2.0.2 Exadata BP14). Furthermore only a couple of MB were exchanged between the cells and the compute nodes.

The corresponding Real-Time SQL Monitoring report confirms the "Smart Scan":


Increasing the number of columns projected from the segment above 254 (and as outlined above it doesn't matter from which row pieces these columns come from) disables the Smart Scan, and it takes more than 9 seconds to run essentially the same query, pumping all 40GB through the compute nodes to filter all rows.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:29:14 SQL>
11:29:14 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:29:14 SQL>
11:29:14 SQL> select
11:29:14   2  col001,
11:29:14   3  col002,
.
.
.
11:29:15 254  col253,
11:29:15 255  col300,
11:29:15 256  col254
11:29:15 257  from many_cols_rg where id between -2 and -1;

no rows selected

Elapsed: 00:00:09.22
11:29:24 SQL>
11:29:24 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.01
11:29:24 SQL>

The corresponding Real-Time SQL Monitoring report confirms the fallback to "direct path reads":


Funnily, in this deliberately crafted, extreme case here, it is much faster to access the segment twice and get the remaining columns via a self-join in order to benefit from the offloading features - it only takes 4.7 seconds to run the self-join, and the session statistics confirm that both segment scans could leverage offloading and in particular avoid I/O via Storage Indexes:

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options

11:29:37 SQL>
11:29:37 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan                0

Elapsed: 00:00:00.00
11:29:37 SQL>
11:29:37 SQL> select
11:29:37   2  a.col001,
11:29:37   3  a.col002,
.
.
.
11:29:37 254  a.col253,
11:29:37 255  a.col300,
11:29:37 256  b.col254
11:29:37 257  from many_cols_rg a, many_cols_rg b
11:29:37 258  where a.id between -2 and -1 and b.id between -2 and -1
11:29:37 259  and a.id = b.id;

no rows selected

Elapsed: 00:00:04.77
11:29:42 SQL>
11:29:42 SQL> select a.name, b.value from v$mystat b, v$statname a where a.name in ('cell physical IO bytes saved by storage index', 'cell physical IO interconnect bytes returned by smart scan') and a.statistic# = b.statistic#;

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                    4,2390E+10
cell physical IO interconnect bytes returned by smart scan          6001568

Elapsed: 00:00:00.01
11:29:42 SQL>
11:29:42 SQL>

Note: This is one of the cases where you don't want the optimizer to eliminate a self-join via a query transformation based on corresponding constraints on the join columns :-)

HCC Compression


Now the funny thing is that if you repeat the table creation script but uncomment the HCC compression, the Smart Scan happily works with up to 1,000 columns of such a compressed segment.

So obviously the general code implementation supports transporting rows with more than 254 columns from the cell to the compute nodes, but the question is why does it only do so with HCC compressed segments. It's probably a question that my client will raise with Oracle Support to find out the answer.

Footnote


At least only the number of "raw" columns projected count towards the limitation - any expressions based on columns don't count, therefore you can project actually more than 254 expressions from a segment and still benefit from Smart Scans as long as the expressions refer to a maximum of 254 "base" columns.

The same limitation could also be reproduced when using (almost) the latest available Exadata version as of writing this (11.2.0.3 BP12, almost because BP14 just came out as far as I know).

Wednesday, January 27, 2010

When your projection is not cost-free

First of all I have to admit that the title of this post is a bit misleading - the projection is almost never "cost-free". Indirectly cost will be generated by accessing rows from row sources to obtain any columns selected, and with the introduction of System Statistics, Oracle even assigns a direct CPU cost to accessing a particular column of a row - as it has been pointed out by Joze Senegacnik and is demonstrated e.g. in Christian Antognini's "Troubleshooting Oracle Performance" book on page 117.

However, Oracle obviously treats work that has to be performed as part of the projection differently than work that has to be performed as part of the selection part.

In particular user-defined functions or scalar subqueries will not be accounted for costing when calculating the overall query cost.

This holds even true for user-defined functions that have a cost assigned via the Extensible Optimizer framework. For more information regarding the "Extensible Optimizer" framework, read Adrian Billington's article on oracle-developer.net or refer e.g. to our latest "Expert Oracle Practices" OakTable book where Joze Senegacnik dedicated a whole chapter to this topic.

Of course the best way to deal with this situation is to avoid using user-defined functions or scalar subqueries and replace them with appropriate joins/subqueries. In most cases this yields the best performance, provided that the optimizer comes up with a reasonable execution plan, and also solves the issue of non-contributed work, because regular join constructs will be considered in the cost calculation.

However, what to do of you're in the situation that you can't simply change the query (e.g. third party vendor application)?

If a user-defined function or scalar subquery is used as part of the selection clause the cost-based optimizer will make use of any cost associated with the function, or evaluated for the scalar subquery, whereas the same construct used as part of the projection will not be taken into account for the cost.

In particular in the case of user-defined functions that perform costly operations, for example recursive SQL, and are not declared as deterministic (or are really of undeterministic nature) this can make a significant difference.

The problem with functions that are not declared as deterministic is that the built-in caching feature of Oracle that can help with scalar subqueries or functions declared as deterministic (since Oracle 10.2) can not be used to alleviate the potentially resource-intensive numerous calls to the function.

In case of more complex queries that make use of views, the view merging transformations applied to the query therefore can lead to quite different work performed by a query.

Consider the following example setup:


drop function generate_lio;

drop table t1;

create table t1 (
run_id integer not null, /* identify the process inserting the data */
batch_id integer not null, /* represents clustered data, could also be a (arriving) date */
a_value number null, /* represents sequence based data */
a_random number null, /* represents randomly scattered data */
a_date timestamp default systimestamp not null, /* represents the insert timestamp */
filler char(1) default 'x' not null /* can be used to size the row as required */
);

create index t1_idx1 on
t1 (
a_random
);

insert into t1 (
run_id
, batch_id
, a_value
, a_random
)
select
1 as run_id
, trunc(id - 1 / 400) + 1 as batch_id
, id as a_value
, trunc(dbms_random.value(1, 40000.999999999)) as a_random
from
(
select
level as id
from
dual
connect by
level <= 40000
);

commit;

exec dbms_stats.gather_table_stats(null, 't1')

create or replace function generate_lio(in_lio in number default 1)
return number
--deterministic
as
n_val number;
begin
select /*+ first_rows(1) */
run_id
into
n_val
from
(
select
*
from
(
select
rownum as rn
, run_id
, substrb(rowid, 1, 15) as block
, a_random
from
t1
where
a_random is not null
order by
a_random
)
where rownum <= in_lio
)
where
rn = in_lio;
return trunc(dbms_random.value(0, 1000)) * n_val;
end generate_lio;
/

associate statistics with functions generate_lio default cost (100000, 3, 0) default selectivity 100;


The code creates a table holding 40,000 rows and a simple index with a bad clustering factor. A user-defined function that allows to generate logical I/O based on that index is created and associated a default cost (and selectivity) using ASSOCIATE STATISTICS. The function deliberately uses the DBMS_RANDOM package to simulate a non-deterministic behaviour. Each BATCH_ID in the table covers 400 rows, with 100 batch_ids in total.

Now consider the following query:


/* Hint is required from 11g on to prevent GROUP BY placement */
select /*+ no_place_group_by */
sum(val), t1.filler
from
t1
, (
/* See the difference in the runtime statistics
between merging this view and not */
select /* no_merge */
generate_lio(1) as val
, batch_id
from
t1
) t2
where
t1.batch_id = t2.batch_id
and t1.batch_id = 42
group by
t1.filler;


This query will generate 160,000 rows by combining 400 rows from each row source. Here is the execution plan (all tested on 11.1.0.7) when merging the view T2 by default:


----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 8 | 146 (28)| 00:00:02 |
| 1 | HASH GROUP BY | | 1 | 8 | 146 (28)| 00:00:02 |
|* 2 | HASH JOIN | | 160K| 1250K| 115 (8)| 00:00:02 |
|* 3 | TABLE ACCESS FULL| T1 | 400 | 2000 | 55 (4)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| T1 | 400 | 1200 | 55 (4)| 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."BATCH_ID"="BATCH_ID")
3 - filter("T1"."BATCH_ID"=42)
4 - filter("BATCH_ID"=42)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=1) "T1"."FILLER"[CHARACTER,1], SUM("GENERATE_LIO"(1))[22]
2 - (#keys=1) "T1"."FILLER"[CHARACTER,1]
3 - "T1"."BATCH_ID"[NUMBER,22], "T1"."FILLER"[CHARACTER,1]
4 - "BATCH_ID"[NUMBER,22]


Notice in particular where the evaluation of the function takes place according to the "Projection" information ("+PROJECTION" option of DBMS_XPLAN.DISPLAY).

And here is the execution plan when explicitly requesting to not merge the view T2:


-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 21 | 146 (28)| 00:00:02 |
| 1 | HASH GROUP BY | | 1 | 21 | 146 (28)| 00:00:02 |
|* 2 | HASH JOIN | | 160K| 3281K| 115 (8)| 00:00:02 |
|* 3 | TABLE ACCESS FULL | T1 | 400 | 2000 | 55 (4)| 00:00:01 |
| 4 | VIEW | | 400 | 6400 | 55 (4)| 00:00:01 |
|* 5 | TABLE ACCESS FULL| T1 | 400 | 2400 | 55 (4)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."BATCH_ID"="T2"."BATCH_ID")
3 - filter("T1"."BATCH_ID"=42)
5 - filter("BATCH_ID"=42)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=1) "T1"."FILLER"[CHARACTER,1], SUM("VAL")[22]
2 - (#keys=1) "T1"."FILLER"[CHARACTER,1], "VAL"[NUMBER,22]
3 - "T1"."BATCH_ID"[NUMBER,22], "T1"."FILLER"[CHARACTER,1]
4 - "VAL"[NUMBER,22], "T2"."BATCH_ID"[NUMBER,22]
5 - "BATCH_ID"[NUMBER,22]


It is not that obvious from the "Projection" information, but in this case the function (the "VAL" of the "Projection" in operation id 4) is evaluated before the join takes place.

Notice that the (still undocumented) "NO_PLACE_GROUP_BY" hint is required from 11g on to prevent the optimizer from getting too clever with this kind of statement. The GROUP BY is used in this case to simplify the result set processing aggregating it into a single row, but in 11g by default the new GROUP BY placement pushes the GROUP BY into the view, effectively solving the issue of excessive function calls by simply reducing the row source sizes that subsequently get joined. Since this is not supposed to be point of this demonstration, the clever trick of pushing the group by into the view is prevented. However it is interesting to note how new features of the optimizer can help to solve problems by side-effects.

The problem described here can still be seen in 11g without any hints when not using a GROUP BY clause to aggregate the result set.

If you run this with statistics_level set to ALL and check the runtime statistics (DBMS_XPLAN.DISPLAY_CURSOR), you'll notice a significant difference between merging the view T2 or not. By default Oracle will merge the view, and obviously perform the "projection" as part of the HASH GROUP BY operation after joining the data, leading to 160,000 calls to the function, each generating three logical I/Os (when using 1 as parameter to the function).


-----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:14.58 | 480K| | | |
| 1 | HASH GROUP BY | | 1 | 1 | 1 |00:00:14.58 | 480K| 805K| 805K| 602K (0)|
|* 2 | HASH JOIN | | 1 | 160K| 160K|00:00:00.01 | 390 | 988K| 988K| 317K (0)|
|* 3 | TABLE ACCESS FULL| T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
|* 4 | TABLE ACCESS FULL| T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."BATCH_ID"="BATCH_ID")
3 - filter("T1"."BATCH_ID"=42)
4 - filter("BATCH_ID"=42)


Compare the difference in runtime and the number of logical I/Os performed ("Buffers") to this one, when not merging the view:


------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.34 | 1590 | | | |
| 1 | HASH GROUP BY | | 1 | 1 | 1 |00:00:00.34 | 1590 | 805K| 805K| 369K (0)|
|* 2 | HASH JOIN | | 1 | 160K| 160K|00:00:00.02 | 1590 | 988K| 988K| 354K (0)|
|* 3 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.02 | 195 | | | |
| 4 | VIEW | | 1 | 400 | 400 |00:00:00.04 | 1395 | | | |
|* 5 | TABLE ACCESS FULL| T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."BATCH_ID"="T2"."BATCH_ID")
3 - filter("T1"."BATCH_ID"=42)
5 - filter("BATCH_ID"=42)


So obviously it is a good idea in this particular case to not merge the view, but although a I/O cost has been explicitly assigned to the function, you can see that both execution plans have exactly the same cost and Oracle happily merges the view.

The costing looks different when using the function as part of the projection clause:


select /*+ no_place_group_by */
sum(val), t1.filler
from
t1
, (
select
1 as val
, batch_id
, a_date
, run_id
from
t1
)t2
where
t1.batch_id = t2.batch_id
and t1.batch_id = 42
and generate_lio(t2.run_id) >= 0
group by
t1.filler;


Note that I had to change the function parameter to something that refers to a column expression, otherwise the optimizer treats "generate_lio(1)" as "independent" and adds a FILTER operation that evaluates "generate_lio" exactly once.

It is now obvious from the execution plan that the cost of executing the function is considered:


----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 11 | 1356 (4)| 00:00:17 |
| 1 | HASH GROUP BY | | 1 | 11 | 1356 (4)| 00:00:17 |
|* 2 | HASH JOIN | | 160K| 1718K| 1325 (2)| 00:00:16 |
|* 3 | TABLE ACCESS FULL| T1 | 400 | 2000 | 55 (4)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| T1 | 400 | 2400 | 1265 (1)| 00:00:16 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."BATCH_ID"="BATCH_ID")
3 - filter("T1"."BATCH_ID"=42)
4 - filter("BATCH_ID"=42 AND "GENERATE_LIO"("RUN_ID")>=0)


Notice the increased cost of the operation id 4 - calling the function 400 times with an associated I/O cost of 3 and adding on top the associated CPU cost.

A similar behaviour can be seen when using scalar subqueries as part of the selection or projection.

Consider this query:


select /*+ no_place_group_by */
sum(val), t1.filler
from
t1
, (
select /* no_merge */
(
select
min(run_id)
from
t1 a
where
a.a_random = b.a_random
) as val
, batch_id
from
t1 b
) t2
where
t1.batch_id = t2.batch_id
and t1.batch_id = 42
group by
t1.filler;


Checking the execution plan you'll notice again the same cost as above when using the function in the projection:


----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 146 (28)| 00:00:02 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| T1 | 2 | 16 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T1_IDX1 | 2 | | 1 (0)| 00:00:01 |
| 4 | HASH GROUP BY | | 1 | 13 | 146 (28)| 00:00:02 |
|* 5 | HASH JOIN | | 160K| 2031K| 115 (8)| 00:00:02 |
|* 6 | TABLE ACCESS FULL | T1 | 400 | 2000 | 55 (4)| 00:00:01 |
|* 7 | TABLE ACCESS FULL | T1 | 400 | 3200 | 55 (4)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("A"."A_RANDOM"=:B1)
5 - access("T1"."BATCH_ID"="BATCH_ID")
6 - filter("T1"."BATCH_ID"=42)
7 - filter("BATCH_ID"=42)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=0) MIN("RUN_ID")[22]
2 - "RUN_ID"[NUMBER,22]
3 - "A".ROWID[ROWID,10]
4 - (#keys=1) "T1"."FILLER"[CHARACTER,1], SUM( (SELECT MIN("RUN_ID") FROM
"T1" "A" WHERE "A"."A_RANDOM"=:B1))[22]
5 - (#keys=1) "T1"."FILLER"[CHARACTER,1], "B"."A_RANDOM"[NUMBER,22]
6 - "T1"."BATCH_ID"[NUMBER,22], "T1"."FILLER"[CHARACTER,1]
7 - "BATCH_ID"[NUMBER,22], "B"."A_RANDOM"[NUMBER,22]


Notice again the subtle difference in the projection when preventing the view merge:


----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 21 | 146 (28)| 00:00:02 |
| 1 | SORT AGGREGATE | | 1 | 8 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| T1 | 2 | 16 | 3 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | T1_IDX1 | 2 | | 1 (0)| 00:00:01 |
| 4 | HASH GROUP BY | | 1 | 21 | 146 (28)| 00:00:02 |
|* 5 | HASH JOIN | | 160K| 3281K| 115 (8)| 00:00:02 |
|* 6 | TABLE ACCESS FULL | T1 | 400 | 2000 | 55 (4)| 00:00:01 |
| 7 | VIEW | | 400 | 6400 | 55 (4)| 00:00:01 |
|* 8 | TABLE ACCESS FULL | T1 | 400 | 4400 | 55 (4)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("A"."A_RANDOM"=:B1)
5 - access("T1"."BATCH_ID"="T2"."BATCH_ID")
6 - filter("T1"."BATCH_ID"=42)
8 - filter("BATCH_ID"=42)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

1 - (#keys=0) MIN("RUN_ID")[22]
2 - "RUN_ID"[NUMBER,22]
3 - "A".ROWID[ROWID,10]
4 - (#keys=1) "T1"."FILLER"[CHARACTER,1], SUM("VAL")[22]
5 - (#keys=1) "T1"."FILLER"[CHARACTER,1], "VAL"[NUMBER,22]
6 - "T1"."BATCH_ID"[NUMBER,22], "T1"."FILLER"[CHARACTER,1]
7 - "VAL"[NUMBER,22], "T2"."BATCH_ID"[NUMBER,22]
8 - "BATCH_ID"[NUMBER,22], "B"."A_RANDOM"[NUMBER,22]


It is also interesting to note that although the subquery is executed as part of the VIEW projection step in operation id 7, the scalar subquery is still shown at top level of the query starting with operation id 1. It would be more accurate to show it as child of operation id 7 in this particular case, but this is probably not supported by EXPLAIN PLAN at present.

At runtime however, the outcome is different from the function case, mainly due to the filter optimization / subquery caching feature, which also makes the subquery implicitly deterministic - it will only get executed as many times as there are distinct number of input values, which is the A_RANDOM column in this case.

In both cases the subquery will be executed only approx. 400 times, because there are only 400 distinct values in the generated row source.

View merged:


-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.32 | 1418 | | | |
| 1 | SORT AGGREGATE | | 398 | 1 | 398 |00:00:00.01 | 1028 | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| T1 | 398 | 2 | 781 |00:00:00.01 | 1028 | | | |
|* 3 | INDEX RANGE SCAN | T1_IDX1 | 398 | 2 | 781 |00:00:00.01 | 402 | | | |
| 4 | HASH GROUP BY | | 1 | 1 | 1 |00:00:00.32 | 1418 | 805K| 805K| 570K (0)|
|* 5 | HASH JOIN | | 1 | 160K| 160K|00:00:00.02 | 390 | 988K| 988K| 390K (0)|
|* 6 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.02 | 195 | | | |
|* 7 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
-----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("A"."A_RANDOM"=:B1)
5 - access("T1"."BATCH_ID"="BATCH_ID")
6 - filter("T1"."BATCH_ID"=42)
7 - filter("BATCH_ID"=42)


View not merged:


-----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.24 | 1418 | | | |
| 1 | SORT AGGREGATE | | 398 | 1 | 398 |00:00:00.02 | 1028 | | | |
| 2 | TABLE ACCESS BY INDEX ROWID| T1 | 398 | 2 | 781 |00:00:00.02 | 1028 | | | |
|* 3 | INDEX RANGE SCAN | T1_IDX1 | 398 | 2 | 781 |00:00:00.02 | 402 | | | |
| 4 | HASH GROUP BY | | 1 | 1 | 1 |00:00:00.24 | 1418 | 805K| 805K| 362K (0)|
|* 5 | HASH JOIN | | 1 | 160K| 160K|00:00:00.02 | 1418 | 988K| 988K| 377K (0)|
|* 6 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.02 | 195 | | | |
| 7 | VIEW | | 1 | 400 | 400 |00:00:00.02 | 1223 | | | |
|* 8 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
-----------------------------------------------------------------------------------------------------------------------------


It is obvious from the "Buffers" column that the scalar subquery has been executed at different steps of the execution plan.

However in cases where you're not that lucky and the filter optimization doesn't work that efficiently, there still might be a significant difference between the merged and unmerged view variant of the query.

Again, when moving the subquery to the selection, the cost calculation looks quite different:


select /*+ no_place_group_by */
sum(val), t1.filler
from
t1
, (
select
run_id as val
, batch_id
from
t1
) t2
where
t1.batch_id = t2.batch_id
and t1.batch_id = 42
and t1.run_id >= (
select /*+ no_unnest */
min(run_id)
from
t1 a
where
a.a_random = t1.a_random
)
group by
t1.filler;



------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 19 | 349K (1)| 01:10:00 |
| 1 | HASH GROUP BY | | 1 | 19 | 349K (1)| 01:10:00 |
|* 2 | FILTER | | | | | |
|* 3 | HASH JOIN | | 160K| 2968K| 116 (9)| 00:00:02 |
|* 4 | TABLE ACCESS FULL | T1 | 400 | 2400 | 55 (4)| 00:00:01 |
|* 5 | TABLE ACCESS FULL | T1 | 400 | 5200 | 56 (6)| 00:00:01 |
| 6 | SORT AGGREGATE | | 1 | 8 | | |
| 7 | TABLE ACCESS BY INDEX ROWID| T1 | 2 | 16 | 3 (0)| 00:00:01 |
|* 8 | INDEX RANGE SCAN | T1_IDX1 | 2 | | 1 (0)| 00:00:01 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("T1"."RUN_ID">= (SELECT /*+ NO_UNNEST */ MIN("RUN_ID") FROM "T1"
"A" WHERE "A"."A_RANDOM"=:B1))
3 - access("T1"."BATCH_ID"="BATCH_ID")
4 - filter("BATCH_ID"=42)
5 - filter("T1"."BATCH_ID"=42)
8 - access("A"."A_RANDOM"=:B1)



It is interesting that the FILTER operation is executed "late" (and therefore potentially more often) - since it is only depending on the T1.RUN_ID and T1.A_RANDOM column it could be executed "earlier" while processing the T1 row source, and in fact this can be achieved by adding the PUSH_SUBQ hint to the subquery. I haven't investigated this further, but may be the optimizer doesn't cost the different subquery pushing options when explicitly requesting to not unnest it (the NO_UNNEST hint) - without the NO_UNNEST hint the subquery is transformed into a join in this particular case.

It can be seen that in this case the optimizer used a "worst case" approach estimating that the scalar subquery gets executed many, many times. Very likely the cost increase can be explained by 160,000 times the cost of the scalar subquery (which might be less than 3 and gets rounded up in the EXPLAIN PLAN output) due to the "late" execution.

At runtime the filter optimization again works very efficiently:


-------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.38 | 1418 | | | |
| 1 | HASH GROUP BY | | 1 | 1 | 1 |00:00:00.38 | 1418 | 805K| 805K| 362K (0)|
|* 2 | FILTER | | 1 | | 160K|00:00:00.16 | 1418 | | | |
|* 3 | HASH JOIN | | 1 | 160K| 160K|00:00:00.01 | 390 | 968K| 968K| 377K (0)|
|* 4 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
|* 5 | TABLE ACCESS FULL | T1 | 1 | 400 | 400 |00:00:00.01 | 195 | | | |
| 6 | SORT AGGREGATE | | 398 | 1 | 398 |00:00:00.04 | 1028 | | | |
| 7 | TABLE ACCESS BY INDEX ROWID| T1 | 398 | 2 | 781 |00:00:00.02 | 1028 | | | |
|* 8 | INDEX RANGE SCAN | T1_IDX1 | 398 | 2 | 781 |00:00:00.02 | 402 | | | |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("T1"."RUN_ID">=)
3 - access("T1"."BATCH_ID"="BATCH_ID")
4 - filter("BATCH_ID"=42)
5 - filter("T1"."BATCH_ID"=42)
8 - access("A"."A_RANDOM"=:B1)


So it is not obvious to me why the optimizer treats these two cases differently - in case of the selection it uses a "worst-case" approach, but why this is not used in case of the projection is not clear to me.

In summary, you need to be careful in particular when using functions as part of the projection clause and complex queries - the order of evaluation might make a significant difference to the overall query performance.

As already mentioned, the best way to deal with such constructs is to avoid them. If you can not, you first should evaluate if the function can be declared as deterministic (will be cached from 10.2 on), or if you can use the workaround of wrapping the function call into a scalar subquery (select f(x) from dual) to take advantage of the subquery caching feature, which makes the function also implicitly deterministic.