Sunday, August 28, 2016

Oracle Database Cloud (DBaaS) Performance Consistency - Part 3

This is the third part of this series, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time, instead of burning CPU using logical I/O only (see part 2), the test harness executes a SQL statement that has to perform mostly physical I/O. To be more precise, this round is a read-only test, so it doesn't include any modification of data (apart from logging and other background, system-generated write activity, like maintaining ASH / AWR data).

In order to maximize the physical I/O part, the database instance was configured with a minimum-sized buffer cache (16K block size in this case) and the following script was executed concurrently as many times as there were CPUs available:

declare
  n number;
begin
  loop
    select /*+
              leading(t_o)
              use_nl(t_i)
              index(t_o)
              index(t_i)
          */
          sum(t_i.n)
          into n
    from
          t_o
        , t_i&tabname t_i
    where
          t_o.id_fk = t_i.id;
    insert into timings(testtype, thread_id, ts) values ('&testtype', &thread_id, systimestamp);
    commit;
  end loop;
end;
/

The outer table T_O was created like this:

create table t_o (id primary key, id_fk)
organization index
as
select
       rownum as id
     , case mod(rownum, 2) + 1
       when 1
       then mod(rownum, &tab_size / 20)
       else &tab_size / 10 - mod(rownum, &tab_size / 20) + 1
       end as id_fk
from dual
connect by level <= &tab_size;

Each thread got its dedicated inner table of the Nested Loop join, created like this:

create table t_i' || i || ' (id not null, n, filler)
pctfree 99 pctused 1
tablespace &tbs
as
select cast(rownum as integer) as id,
cast(rownum as number) as n,
cast(rpad('x', 200) as varchar2(200)) as filler
from dual
connect by level <= &tab_size / 10;

create index t_i' || i || '_idx on t_i' || i || ' (id, filler) pctfree 99 tablespace &tbs;
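
The snippets above are fragments of dynamic SQL (note the ' || i || ' concatenation). Here is a minimal sketch of how the per-thread objects might have been created in a loop - the loop wrapper, the cpu_count constant and the use of EXECUTE IMMEDIATE are assumptions for illustration, not taken from the original harness:

-- Hypothetical wrapper: one inner table plus index per thread.
-- cpu_count and the substitution variables &tbs / &tab_size are assumptions.
declare
  cpu_count constant pls_integer := 8;
begin
  for i in 1..cpu_count loop
    execute immediate
      'create table t_i' || i || ' (id not null, n, filler)
       pctfree 99 pctused 1
       tablespace &tbs
       as
       select cast(rownum as integer) as id,
              cast(rownum as number) as n,
              cast(rpad(''x'', 200) as varchar2(200)) as filler
       from dual
       connect by level <= &tab_size / 10';
    execute immediate
      'create index t_i' || i || '_idx on t_i' || i ||
      ' (id, filler) pctfree 99 tablespace &tbs';
  end loop;
end;
/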


Due to the way the data was crafted, and thanks to the Nested Loop join batching performed by Oracle, this ensured that the sessions were performing "db file parallel read" I/O as much and as hard as possible, submitting multiple I/O requests in a single call and taking advantage of asynchronous I/O where possible.
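
To verify that the sessions were indeed spending their time on these multi-block vectored reads, the session wait events can be checked while the test is running - a minimal sketch, where the username filter is a placeholder for the actual test schema:

-- Sketch: wait events of the test sessions (username is a placeholder).
select se.sid, se.event, se.total_waits,
       round(se.time_waited_micro / 1e6, 1) as seconds_waited
from   v$session_event se
where  se.event in ('db file parallel read', 'db file sequential read')
and    se.sid in (select sid from v$session where username = 'TEST_USER')
order by se.sid, se.time_waited_micro desc;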

When running with 8 threads on those 8 CPUs on the DBaaS service this resulted in approx. 30.000 IOPS for those 16K blocks (approx. 480 MB / sec throughput). Note that I got the same 30.000 IOPS for 16K blocks on a different instance with just 4 CPUs (2 OCPUs) and 4 threads, so this looks like a common 16K-block IOPS limit for a DBaaS instance, independent of the CPU count, without further tweaking (I didn't attempt to optimize this in any way but used the service as provided / configured by the Oracle Cloud). It also looks more like a throughput limitation at 480 MB / sec than an IOPS limitation, because when using a 4K block size I was able to get 60.000 IOPS (as a middle value), but that IOPS rate fluctuated much more (between 20.000 and 90.000), whereas the 30.000 IOPS using the 16K block size were pretty much stable.

On the physical host I got approx. half of these IOPS (roughly 15.500) for those 16K blocks (approx. 250 MB / sec throughput), which is very likely the throughput limit of this rather outdated hardware configuration, albeit one using a rather modern Samsung SSD drive. It also meant that, running at 4 threads, I got a comparable number of IOPS per thread as in the DBaaS setup. Using a 4K block size the physical host maxed out at a very stable 52.000 IOPS.
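
For reference, IOPS and throughput figures like the ones quoted above can be cross-checked while the load is running, for example via V$SYSMETRIC - a minimal sketch, not necessarily the way the numbers above were obtained:

-- Sketch: current read IOPS and read throughput as reported by the database.
select metric_name, round(value) as value, metric_unit
from   v$sysmetric
where  metric_name in ('Physical Read Total IO Requests Per Sec',
                       'Physical Read Total Bytes Per Sec')
and    group_id = 2  -- 60 second interval metrics
order by metric_name;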

The overall results look like this:

DBaaS:

[chart: distribution of runtime deviation from the median]

Physical host:

[chart: distribution of runtime deviation from the median]

Again the graph is supposed to show how many of the runs deviated by how much from the overall median runtime. Similar to the previous, logical-I/O-only test, the DBaaS service this time shows a significantly different profile, with an even larger spread of deviation of up to almost 9 percent from the median runtime, whereas the physical host only shows significant deviation of up to 2.5 percent.

Again, like in the previous test, the physical host shows more extreme outliers than the DBaaS service.
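
For clarity, the data behind these deviation graphs can be derived from the TIMINGS table populated by the test script - a minimal sketch of the idea, where the 'SQLPIO' test type label is an assumption and the exact post-processing used for the charts isn't shown here:

-- Sketch: per-iteration runtime as delta between consecutive timestamps per thread,
-- expressed as percentage deviation from the overall median runtime.
with runtimes as (
  select thread_id,
         ts - lag(ts) over (partition by thread_id order by ts) as runtime_intv
  from   timings
  where  testtype = 'SQLPIO'   -- assumed label for this test round
),
secs as (
  select extract(hour from runtime_intv) * 3600
       + extract(minute from runtime_intv) * 60
       + extract(second from runtime_intv) as runtime_secs
  from   runtimes
  where  runtime_intv is not null
),
dev as (
  select round(100 * (runtime_secs / median(runtime_secs) over () - 1), 1) as pct_deviation
  from   secs
)
select pct_deviation, count(*) as runs
from   dev
group by pct_deviation
order by pct_deviation;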

The same graph on a per-day basis this time shows significant differences between the days for the DBaaS service:
The physical host shows a very similar pattern across the days, except for the first day:
Looking at the individual performance of each thread, the DBaaS shows a significant variation in performance per day:
The physical host shows a pretty consistent performance pattern, but interestingly the different threads show different, yet consistent, runtimes:
The next test round will include physical writes.

Sunday, August 21, 2016

Adaptive Cursor Sharing Fail

Here is another example, on top of the known limitations that Adaptive Cursor Sharing only gets evaluated during a PARSE call (still valid in 12c) and supports a maximum of 14 bind variables, that I've recently come across at a client site, where the default implementation of Adaptive Cursor Sharing fails to come up with a more suitable execution plan for different bind variable values.

Broken down to a bare minimum, the query was sometimes executed using non-existing values for a particular bind variable, while at other times these values were existing and very popular. There were two suitable candidate indexes, and one of them appeared more attractive to the optimizer in the "non-existing value" case. Unfortunately this index was a very bad choice for the "existing and very popular" case.

The critical point of the case is that, due to other, additional filters on the same table / index, the final number of rows produced by the row source operation was (almost) the same for the "non-existing" and the "existing and very popular" case, but in the latter case a large proportion of the bad index had to be scanned, because the popular bind value was the only predicate that could be used as an ACCESS predicate and the additional ones could only be applied as FILTER predicates on the index.

So although there was a suitable histogram available on the column in question, and the optimizer in principle was aware of the different popularity of values and marked the cursor as "bind sensitive", it looks like the executions of such a monitored cursor simply get assigned to different buckets, as exposed via V$SQL_CS_HISTOGRAM, based on the number of rows produced by the row source. Since the operation in question didn't differ much in the final number of rows produced between the two cases (either 0 rows or 1 row), the different executions were all assigned to the same bucket, although the amount of work required to produce this single row was huge, as reflected in the execution time and the physical / logical I/Os performed per execution.

Obviously this vastly different amount of work performed to produce a similar number of rows is not used as a criterion for marking a cursor as "bind aware" and evaluating different execution plans based on ranges of bind values.

There is a view V$SQL_CS_STATISTICS that includes information about CPU_TIME and BUFFER_GETS, but in 11.2.0.4 it only reflects the initial execution when starting with the "non-existing value" case, not the slow executions afterwards. In 12.1.0.2 the view is no longer populated at all, which suggests that this information is simply not used for deciding the "bind aware" status of a cursor.
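
For illustration, the cursor sharing metadata discussed here can be inspected directly - a minimal sketch, with &sql_id as a placeholder for the SQL_ID of the statement in question:

-- Sketch: bind sensitivity / awareness and the execution count histogram per child cursor.
select s.child_number, s.is_bind_sensitive, s.is_bind_aware, s.executions
from   v$sql s
where  s.sql_id = '&sql_id';

select h.child_number, h.bucket_id, h."COUNT" as executions_in_bucket
from   v$sql_cs_histogram h
where  h.sql_id = '&sql_id'
order by h.child_number, h.bucket_id;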

Discussing the case with Mohamed Houri, who has spent significant time investigating the "Adaptive Cursor Sharing" feature, seemed to confirm this assumption.

Here is a simple test case that allows reproducing the issue:
-- FILTER1 is highly skewed (here one very popular value -1)
create table t
as
select
        rownum as id
      , rownum as n
      , case when rownum <= 100000 then rownum else -1 end as filter1
      , rownum as filter2
      , rpad('x', 200) as filler1
      , rownum as filter3
      , rpad('x', 200) as filler2
from
        dual
connect by level <= 1e6;

-- Histogram on FILTER1 should make the sample query "bind sensitive"
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns size 254 filter1')

-- The "good" index, ACCESS on all three columns
-- But higher BLEVEL (3) because FILLER1 has only a single, wide value that needs to be part of every branch block
create index t_idx1 on t (filter1, filler1, filter2) compress 2 /*advanced low*/;

-- The "bad" index, ACCESS only on first column, FILTER on subsequent columns
-- But lower BLEVEL (2) because FILTER3 comes before FILLER1, which leads to much more compact branch blocks
create index t_idx2 on t (filter1, filter3, filler1, filler2, filter2) compress 1 /*advanced low*/;
The critical point is that the good index has a higher BLEVEL than the bad index. So in case a non-existing value for FILTER1 gets used, the optimizer will favor the index with the lower BLEVEL, which for that case is the key cost component:
set echo on

-- Non-existing value
-- T_IDX2 gets preferred
explain plan for
select sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX2 |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-999999) AND "FILLER1"='x

                                                                                         ' AND
              "FILTER2"=999999)
       filter("FILTER2"=999999 AND "FILLER1"='x

                                                                                      ')
                                                                                     
-- Non-existing value
-- T_IDX1 has a higher cost
explain plan for
select /*+ index(t t_idx1) */ sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX1 |     1 |       |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-999999) AND "FILLER1"='x

                                                                                         ' AND
              "FILTER2"=999999)
              
-- Existing and very popular value
-- T_IDX1 gets preferred
explain plan for
select sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX1 |     1 |       |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-1) AND "FILLER1"='x

                                                                                    ' AND
              "FILTER2"=999999)


-- Existing and very popular value
-- T_IDX2 is a bad idea, correctly reflected in the cost estimate
explain plan for
select /*+ index(t t_idx2) */ sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 | 58269   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 | 58269   (1)| 00:00:03 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX2 |     1 |       | 58268   (1)| 00:00:03 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-1) AND "FILLER1"='x

                                                                                    ' AND
              "FILTER2"=999999)
       filter("FILTER2"=999999 AND "FILLER1"='x

                                                                                      ')
So when optimizing for the two different cases the optimizer does the right thing and understands that for the "existing and very popular" case T_IDX2 is a bad choice. It's also obvious from the "Predicate Information" section that the index T_IDX2 can only use FILTER1 as an ACCESS predicate.

But when using bind variables the different cases are not recognized and the bad index is used for both cases when the optimization is based on the "non-existing value" case:
-- Default behaviour, although being BIND_SENSITIVE thanks to the histogram, no ACS kicks in
-- V$SQL_CS_HISTOGRAM shows all executions being in bucket 0, so according to the bind-sensitive monitoring there is no need for action
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.28
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> declare
  2    res number;
  3  begin
  4    for i in 1..1000 loop
  5      select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  6    end loop;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:04:48.64
SQL>

-- With BIND_AWARE a second child cursor gets generated and used on the second parse, interesting
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    for i in 1..1000 loop
  5      select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  6    end loop;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.20
SQL>

-- Or hinting for a fixed plan helps, too
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/
Executing the unhinted, default case leads to a very inefficient execution of the thousand runs using the popular value (and the actual table / index at the client site was much larger than in this example, and the query got executed very frequently) - compare that to the execution time of the other two variants, either using the BIND_AWARE hint or simply requesting the good index.

So when the cursor is marked BIND_AWARE immediately via the corresponding hint (only possible if the cursor qualifies as bind sensitive), Oracle happily generates a second child cursor at the second PARSE call with a different, more efficient execution plan for the popular value. Once a cursor is marked bind aware, a completely different algorithm gets used that is based on ranges of bind value selectivities, as exposed via V$SQL_CS_SELECTIVITY, and since the different values fall into different selectivity ranges, a new plan gets generated that is different from the previous one and hence gets used from then on for those (ranges of) values.
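
The selectivity ranges this bind-aware algorithm works with can be observed as well - a minimal sketch, again with &sql_id as a placeholder:

-- Sketch: selectivity ranges recorded per child cursor once the cursor is bind aware.
select c.child_number, c.predicate, c.range_id, c.low, c.high
from   v$sql_cs_selectivity c
where  c.sql_id = '&sql_id'
order by c.child_number, c.predicate, c.range_id;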

Monday, August 15, 2016

Nested Loop Join Physical I/O Optimizations

Having done my mini-series on Nested Loop join logical I/O optimizations a while ago, I unfortunately never managed to publish anything regarding the Nested Loop join physical I/O optimizations, which are certainly much more relevant to real-life performance.

Therefore the main purpose of this blog post is to point you to Nikolay Savvinov's great mini-series (I can recommend his blog in general) covering various aspects of these optimizations:

Part 1
Part 2
Part 3
Summary

One point that - at least to me - isn't entirely clear when reading Nikolay's series is which specific plan shape he refers to, in particular since in 12c even more plan shapes for a Nested Loop join are possible.

Hence here is an attempt to summarize the various two table Nested Loop join plan shapes as of 12c and what kind of physical I/O optimizations one can expect from them:

1. Nested Loop Join Batching, the default in most cases since 11g and also in 12c
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    NESTED LOOPS               |                   |
|   4 |     INDEX FULL SCAN           | SYS_IOT_TOP_97632 |
|*  5 |     INDEX RANGE SCAN          | T_I6_IDX          |
|   6 |    TABLE ACCESS BY INDEX ROWID| T_I6              |
-----------------------------------------------------------
This is the plan shape that Nikolay calls "batch" or "batching" in his posts. It can provide batched I/O (mainly "db file parallel read", issuing multiple I/O requests in a single call, potentially as asynchronous I/O) on both the INDEX RANGE SCAN and the TABLE ACCESS BY INDEX ROWID operation, as described in Nikolay's blog posts.

Please note that I deliberately write "can provide batched I/O", because, as Nikolay also points out in his posts, the runtime engine monitors the execution and can dynamically adjust the behaviour, which also means that it might decide to use conventional single block reads ("db file sequential read").

Please also note that in addition to the "db file parallel read" operation that submits multiple I/O requests in a single I/O submit call (see Frits Hoogland's blog post from some time ago about Linux internals of this I/O operation) the runtime engine might use "db file scattered read" multi-block I/Os under certain circumstances, in particular when performing "cache warmup prefetching".

Also note that, as Nikolay points out, when enabling SQL Trace or "rowsource statistics" the "Batched I/O" optimization for some reason gets disabled.

Also, this plan shape, at least in 12c, seems to lead to inconsistencies in Real-Time SQL Monitoring in several ways. First, the number of "Executions" for the INDEX RANGE SCAN and TABLE ACCESS operations of the inner row source is not necessarily consistent with the number of rows in the driving row source; second, almost all activity and I/O volume seems to be attributed to the INDEX RANGE SCAN plan operation and not the TABLE ACCESS operation, even if it's the TABLE ACCESS that causes the physical I/O.

2. Nested Loop Join Prefetch plan shape including BATCHED ROWID table access (only from 12c on)
------------------------------------------------------------------
| Id  | Operation                            | Name              |
------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                   |
|   1 |  SORT AGGREGATE                      |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|   3 |    NESTED LOOPS                      |                   |
|   4 |     INDEX FULL SCAN                  | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN                 | T_I5_IDX          |
------------------------------------------------------------------
This is the Nested Loop prefetching plan shape introduced in Oracle 9i combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c. It's the plan shape Nikolay refers to as "BATCHED ROWID" in his posts. In my (and Nikolay's) tests this provides the most aggressive batching of I/O (highest number of I/O requests submitted per call in "db file parallel read") for the TABLE ACCESS BY INDEX ROWID BATCHED, but it didn't do the same on the index ("db file sequential read", single block reads on the index segment), which doesn't mean that different test and data setups might not show that, too.
Note that you should get this plan shape only with explicit hinting or when disabling Nested Loop Join Batching via parameter.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
--batch_table_access_by_rowid(b)
The "batch_table_access_by_rowid" isn't strictly necessary since it's the default behaviour in 12c.

Other variants are possible, like "use_nl_with_index" instead of "use_nl" and "index" separately or "opt_param('_nlj_batching_enabled', 0)" instead of "no_nlj_batching" to disable the batching (but then batching is disabled for the whole execution plan, not just for a particular join).
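
Put together, a hinted two-table query requesting this plan shape could look like the following - a sketch using the aliases A and B from above, with table and column names as placeholders:

-- Sketch: the hints from above embedded in a query (tables / columns are placeholders).
select /*+
          leading(a b)
          use_nl(b)
          index(b)
          no_nlj_batching(b)
       */
       sum(b.n)
from   t_a a
     , t_b b
where  b.id = a.b_fk;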

3. Classic Nested Loop Join Prefetch introduced in 9i
----------------------------------------------------------
| Id  | Operation                    | Name              |
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |
|   1 |  SORT AGGREGATE              |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_I5              |
|   3 |    NESTED LOOPS              |                   |
|   4 |     INDEX FULL SCAN          | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN         | T_I5_IDX          |
----------------------------------------------------------
This is what you get in pre-12c when preventing Nested Loop Join Batching; in 12c the BATCHED table access needs to be disabled in addition.

This plan shape provides the less aggressive table prefetching described in Nikolay's posts (he refers to this plan shape as "Prefetch"); the maximum number of requests submitted per call in a "db file parallel read" operation seems to be 39.

As mentioned above, it didn't provide index prefetching in my tests.

The session statistics don't mention "Batched I/O", so although the "db file parallel read" wait event is the same the internal implementation obviously is a different code path.
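
Whether these "Batched I/O" statistics increase for the current session can be checked with a simple query - a minimal sketch:

-- Sketch: "Batched IO" related statistics for the current session.
select sn.name, st.value
from   v$mystat st
       join v$statname sn on sn.statistic# = st.statistic#
where  sn.name like 'Batched IO%'
and    st.value > 0
order by sn.name;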

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

4. Classic Nested Loop Join plan shape with BATCHED ROWID table access (only from 12c on)
-------------------------------------------------------------------
| Id  | Operation                             | Name              |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                   |
|   1 |  SORT AGGREGATE                       |                   |
|   2 |   NESTED LOOPS                        |                   |
|   3 |    INDEX FULL SCAN                    | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|*  5 |     INDEX RANGE SCAN                  | T_I5_IDX          |
-------------------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i) combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c.

In my tests it provides only "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID BATCHED can make use of "db file parallel read" to batch these I/O requests to the table blocks, and the session statistics show "Batched I/O" counters increasing.

However, it doesn't attempt to optimize "inter-loop" / multiple loop iterations, so if each loop iteration via the INDEX RANGE SCAN only points to a single table block, only single block reads ("db file sequential read") on the TABLE ACCESS BY INDEX ROWID BATCHED can be seen.

In my tests this plan shape didn't provide index prefetching / I/O optimizations and performed single block reads ("db file sequential read") on the INDEX RANGE SCAN operation.

For a two table join it needs explicit hinting in 12c to arrive at this plan shape, but it is the default plan shape for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
--batch_table_access_by_rowid(b)
See my older posts regarding Nested Loop join logical I/O optimizations for why the combination of OPT_PARAM and NO_NLJ_PREFETCH is required to arrive at this plan shape - in short, specifying the obvious NO_NLJ_PREFETCH plus NO_NLJ_BATCHING doesn't work. Other variants as above.

5. Classic Nested Loop Join plan shape
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    INDEX FULL SCAN            | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T_I5              |
|*  5 |     INDEX RANGE SCAN          | T_I5_IDX          |
-----------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i).

Interestingly, in my tests, at least in 12c, it also provides "intra-loop" table prefetching: if the INDEX RANGE SCAN of a single loop iteration points to different table blocks, the TABLE ACCESS BY INDEX ROWID can make use of "db file parallel read" to submit multiple of these I/O requests to the table blocks in a single call. However, the session statistics don't show the "Batched I/O" counters increasing and the maximum number of requests seems to be 39, so it is less aggressive and looks very similar to the internal implementation used for the classic "table prefetching" plan shape above.

There is no sign of "inter-loop" optimizations and the INDEX RANGE SCAN seems to make use of single block reads only ("db file sequential read").

For pre-12c this is the default plan shape used for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

Multiple, consecutive Nested Loop Joins


Another point that I think is important to mention when describing these "inter-loop" prefetching and batching Nested Loop join optimization techniques is that they only apply to the outer-most Nested Loop join in case of multiple, consecutive Nested Loop joins, which to me makes their "game changing" character that Nikolay mentions in his posts less strong.

For example, this is the 12c default shape of a four table join using three Nested Loop joins:
----------------------------------------------------------------------
| Id  | Operation                                | Name              |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                   |
|   1 |  SORT AGGREGATE                          |                   |
|   2 |   NESTED LOOPS                           |                   |
|   3 |    NESTED LOOPS                          |                   |
|   4 |     NESTED LOOPS                         |                   |
|   5 |      NESTED LOOPS                        |                   |
|   6 |       INDEX FULL SCAN                    | SYS_IOT_TOP_97632 |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T_I1              |
|*  8 |        INDEX RANGE SCAN                  | T_I1_IDX          |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED | T_I2              |
|* 10 |       INDEX RANGE SCAN                   | T_I2_IDX          |
|* 11 |     INDEX RANGE SCAN                     | T_I3_IDX          |
|  12 |    TABLE ACCESS BY INDEX ROWID           | T_I3              |
----------------------------------------------------------------------
As can be seen, only the outer-most Nested Loop join gets the "batching" plan shape and can benefit from the optimizations described above. The inner Nested Loop joins show the classic plan shape, in case of 12c with the ROWID BATCHED option, so they can only benefit from the "intra-loop" optimizations described above, if the INDEX RANGE SCAN of a single loop iteration points to several table blocks.

Note that even when using explicit, additional NLJ_BATCHING hints for the inner tables joined I wasn't able to enforce any other plan shape.

If there are multiple blocks of (consecutive) Nested Loop joins (for example a HASH JOIN in between), then in each block the outer-most Nested Loop join gets the optimization, so multiple of those are possible per execution plan:
---------------------------------------------------------------------
| Id  | Operation                               | Name              |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                   |
|   1 |  SORT AGGREGATE                         |                   |
|   2 |   NESTED LOOPS                          |                   |
|   3 |    NESTED LOOPS                         |                   |
|*  4 |     HASH JOIN                           |                   |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_I2              |
|   6 |       INDEX FULL SCAN                   | T_I2_IDX          |
|   7 |      NESTED LOOPS                       |                   |
|   8 |       NESTED LOOPS                      |                   |
|   9 |        INDEX FULL SCAN                  | SYS_IOT_TOP_97632 |
|* 10 |        INDEX RANGE SCAN                 | T_I1_IDX          |
|  11 |       TABLE ACCESS BY INDEX ROWID       | T_I1              |
|* 12 |     INDEX RANGE SCAN                    | T_I3_IDX          |
|  13 |    TABLE ACCESS BY INDEX ROWID          | T_I3              |
---------------------------------------------------------------------
As can be seen, now that T_I2 is joined via a HASH JOIN, both Nested Loop joins to T_I1 and T_I3 get the "batched" optimization plan shape.

I don't know why only the outer-most join in a block of Nested Loop joins gets the optimized plan shape, but this limitation certainly can reduce the possible benefit.

Since I don't think there is a costing difference at CBO level for the different Nested Loop join plan shapes (so the decision which shape to use isn't cost-driven), it might be possible to see a significant performance difference depending on which join gets the optimization - there might be room for improvement by manually influencing the join order in case of multiple, consecutive Nested Loop joins, as sketched below.
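
A sketch of what such an experiment could look like - table and column names are placeholders, the idea being to make the join that benefits most from batched physical I/O the last join in the block and hence the outer-most Nested Loop join:

-- Sketch: force a join order so that the I/O-heavy join (here to T3) becomes
-- the outer-most Nested Loop join and gets the "batching" plan shape.
select /*+
          leading(t1 t2 t3)
          use_nl(t2) use_nl(t3)
          index(t2) index(t3)
       */
       sum(t3.n)
from   t1, t2, t3
where  t2.id = t1.t2_fk
and    t3.id = t2.t3_fk;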

Sunday, August 7, 2016

Oracle Database Cloud (DBaaS) Performance Consistency - Part 2

This is the second part of this series, comparing the performance consistency of the DBaaS cloud offering with a dedicated physical host. This time, instead of burning CPU using a trivial PL/SQL loop (see part 1), the test harness executes a SQL statement that performs logical I/O only, so no physical I/O is involved.

In order to achieve that a variation of Jonathan Lewis' good old "kill_cpu" script got executed. In principle each thread performed the following:

define tabname = &1

define thread_id = &1;

alter session set "_old_connect_by_enabled" = true;

declare
  n number;
begin
  loop
    select
           count(*) X
    into   n
    from
           kill_cpu&tabname
    connect by
           n > prior n
    start with
           n = 1;
    insert into timings(testtype, thread_id, ts) values ('SQLLIO', &thread_id, systimestamp);
    commit;
  end loop;
end;
/

Each thread got its own, exclusive "kill_cpu<n>" table, so this setup attempted to avoid competition for buffer cache latches. The tables (in fact IOTs) were created like the following:

create table kill_cpu<n>(n primary key)
organization index
as
select
rownum n
from
all_objects
where
rownum <= 25
;

Again there were as many threads started as CPUs available, which meant 8 threads for the DBaaS environment and 4 threads for the physical host, and again this was left running for several days.
The overall results look like this:
DBaaS:
Physical host:
Again the graph is supposed to show how many of the runs deviated by how much from the overall median runtime. In contrast to the previous, simple PL/SQL test, this time the DBaaS service shows a significantly different profile, with a larger spread of deviation of up to 3 percent from the median runtime, whereas the physical host only shows significant deviation of up to 1.5 percent.
It's interesting to note, however, that the physical host this time shows more extreme outliers than the DBaaS service, which was the other way around in the previous PL/SQL test.

The same graph on a per day basis doesn't show too significant differences between the days for either environment (here DBaaS):
Physical host:
Looking at the individual performance of each thread, the DBaaS again shows a similar behaviour to last time - the different threads show a slightly different performance, and they also get slightly slower towards the end of the measurement period:
And again the physical host shows a more consistent performance across threads:
The next test round will be a physical I/O bound setup.

Wednesday, August 3, 2016

DBMS_STATS - Gather statistics on tables with many columns - 12c update

This is just a short 12c update on my post about gathering statistics on tables with many columns from some time ago.

I'm currently investigating the "Incremental Statistics" feature in 12.1.0.2 for a client, which probably will be worth one or more other posts, but since we're still in the process of evaluating and installing various patches it's too early to write about that.

As part of the investigation I've noticed a significant change in behaviour in 12.1.0.2 compared to previous versions when it comes to gathering statistics on tables with many columns, hence this post here.

The key message of the original post was that DBMS_STATS potentially needs several passes when gathering statistics on tables with many columns, which meant a significant increase in overall work and resource consumption, aggravated by the fact that tables with that many columns consist of multiple row pieces.

Along with other significant improvements in 12c (among others, new types of histograms and the capability to derive some of these histogram types from a single pass, with no need to run separate passes for each histogram), multiple passes are obviously no longer required for such tables - Oracle can now cope with up to and including 1000 columns in a single pass.
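
For reference, a minimal way to reproduce and verify this behaviour is simply to trace the statistics gathering call and check how many passes show up in the trace - the table name is taken from the trace below, the trace identifier is an assumption:

-- Sketch: trace the DBMS_STATS call to see the number of passes performed.
alter session set tracefile_identifier = 'many_cols_stats';
alter session set events '10046 trace name context forever, level 8';

exec dbms_stats.gather_table_stats(null, 'MANY_X_COLS', method_opt => 'for all columns size auto')

alter session set events '10046 trace name context off';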

Repeating the test case from the original post for a table with 1.000 columns / 10.000 blocks and using a default METHOD_OPT setting of "FOR ALL COLUMNS SIZE AUTO" results in the following single query executed by the DBMS_STATS call:
SQL ID: 98828wcrfyn0c Plan Hash: 1962185829

select /*+  full(t)    no_parallel(t) no_parallel_index(t) dbms_stats 
  cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) no_monitoring 
  xmlindex_sel_idx_tbl no_substrb_pad  */to_char(count("COL968")),
  substrb(dump(min("COL968"),16,0,64),1,240),substrb(dump(max("COL968"),16,0,
  64),1,240),to_char(count("COL969")),substrb(dump(min("COL969"),16,0,64),1,
  240),substrb(dump(max("COL969"),16,0,64),1,240),to_char(count("COL970")),
  substrb(dump(min("COL970"),16,0,64),1,240),substrb(dump(max("COL970"),16,0,
  64),1,240),to_char(count("COL971")),substrb(dump(min("COL971"),16,0,64),1,
  240),substrb(dump(max("COL971"),16,0,64),1,240),to_char(count("COL972")),
.
.
.
from
 "CBO_TEST"."MANY_X_COLS" t  /* NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
  NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,
  NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,
  NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,
  NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,
.
.
.
  NDV,NIL,NIL,NDV,NIL,NIL*/


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1      1.95       1.97          0      40011          0           1
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3      1.95       1.98          0      40011          0           1

Misses in library cache during parse: 0
Optimizer mode: ALL_ROWS
Parsing user id: 111     (recursive depth: 1)
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         1          1          1  SORT AGGREGATE (cr=40011 pr=0 pw=0 time=1969578 us)
      9995       9995       9995   OPTIMIZER STATISTICS GATHERING  (cr=40011 pr=0 pw=0 time=751494 us cost=2717 size=240000 card=10000)
     10000      10000      10000    TABLE ACCESS FULL MANY_X_COLS (cr=40011 pr=0 pw=0 time=413062 us cost=2717 size=240000 card=10000)
As can be seen, the "APPROXIMATE NDV AGGREGATE" operation introduced in 11g for the new statistics gathering mode has been renamed in 12c to "OPTIMIZER STATISTICS GATHERING".

Apart from that, this is good news, as it minimizes the work required to gather statistics for such tables - it takes the minimum of logical / physical I/Os to process the data. And as already mentioned, 12c is also capable of deriving more from such a single pass, like generating synopses for incremental statistics and histograms.