Sunday, February 7, 2016

Big Nodes, Concurrent Parallel Execution And High System/Kernel Time

This probably only is relevant for customers that run Oracle on big servers with lots of cores, like some of my clients that make use of the Exadata Xn-8 servers, like a X4-8 with 120 cores / 240 CPUs per node.

They recently came up with a re-write of a core application functionality. Part of this code did start the same code path for different data sets potentially several times concurrently ending up with many sessions making use of Parallel Execution. In addition a significant part of the queries used by this code did make questionable use of Parallel Execution, in that sense that queries of very short duration used Parallel Execution, hence ending up with several Parallel Execution starts per second. You could see this pattern from the AWR reports like this:



When the new code was tested with production-like data volumes and patterns, in the beginning the CPU profile of such a big node (running in single instance mode) looked like this, when nothing else was running on that box:



The AWR reports showed some pretty unusual PX wait events as significant:



"PX Deq Slave session stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly.

Also some of the more complex Parallel Execution queries performing many joins and ending up with a significant number of data redistributions ran like in slow motion, although claiming to spend 100% of their time on CPU, but 85%+ of that time was spent on the redistribution operations:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|


Running the same query with the same execution plan on the same data and the same box during idle times showed a almost 20 times better performance, and less than 40% time spent on redistribution:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|

So it looked like those queries ran into some kind of contention that wasn't instrumented in Oracle but happened outside on O/S level, showing up as CPU Kernel time - similar to what could be seen in previous versions of Oracle when spinning on mutexes.

Reducing the excessive usage of Parallel Execution showed a significant reduction in CPU time, but still the high System/Kernel time was rather questionable:



So the big question was - where was that time spent in the kernel to see whether this gives further clues.

Analysis


Running "perf top" on the node during such a run showed this profile:

  PerfTop:  129074 irqs/sec  kernel:76.4%  exact:  0.0% [1000Hz cycles],  (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ___________________________________________________________

          1889395.00 67.8% __ticket_spin_lock       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            27746.00  1.0% ktime_get                /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            24622.00  0.9% weighted_cpuload         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            23169.00  0.8% find_busiest_group       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            17243.00  0.6% pfrfd1_init_locals       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            16961.00  0.6% sxorchk                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            15434.00  0.6% kafger                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            11531.00  0.4% try_atomic_semop         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            11006.00  0.4% __intel_new_memcpy       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10557.00  0.4% kaf_typed_stuff          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10380.00  0.4% idle_cpu                 /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             9977.00  0.4% kxfqfprFastPackRow       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             9070.00  0.3% pfrinstr_INHFA1          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8905.00  0.3% kcbgtcr                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8757.00  0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             8641.00  0.3% kgxSharedExamine         /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             7487.00  0.3% update_queue             /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             7233.00  0.3% kxhrPack                 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6809.00  0.2% rworofprFastUnpackRow    /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6581.00  0.2% ksliwat                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6242.00  0.2% kdiss_fetch              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6126.00  0.2% audit_filter_syscall     /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5860.00  0.2% cpumask_next_and         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5618.00  0.2% kaf4reasrp1km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5482.00  0.2% kaf4reasrp0km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5314.00  0.2% kopp2upic                /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5129.00  0.2% find_next_bit            /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             4991.00  0.2% kdstf01001000000km       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4842.00  0.2% ktrgcm                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4762.00  0.2% evadcd                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4580.00  0.2% kdiss_mf_sc              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle

Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:

     0.36%     ora_xxxx  [kernel.kallsyms]             [k] 

__ticket_spin_lock
               |
               --- __ticket_spin_lock
                  |          
                  |--99.98%-- _raw_spin_lock
                  |          |          
                  |          |--100.00%-- ipc_lock
                  |          |          ipc_lock_check
                  |          |          |          
                  |          |          |--99.83%-- semctl_main
                  |          |          |          sys_semctl
                  |          |          |          system_call
                  |          |          |          __semctl
                  |          |          |          |          
                  |          |          |           --100.00%-- skgpwpost
                  |          |          |                     kslpsprns
                  |          |          |                     kskpthr
                  |          |          |                     ksl_post_proc
                  |          |          |                     kxfprienq
                  |          |          |                     kxfpqrenq
                  |          |          |                     |          
                  |          |          |                     |--98.41%-- kxfqeqb
                  |          |          |                     |          kxfqfprFastPackRow
                  |          |          |                     |          kxfqenq
                  |          |          |                     |          qertqoRop
                  |          |          |                     |          kdstf01001010000100km
                  |          |          |                     |          kdsttgr
                  |          |          |                     |          qertbFetch
                  |          |          |                     |          qergiFetch
                  |          |          |                     |          rwsfcd
                  |          |          |                     |          qertqoFetch
                  |          |          |                     |          qerpxSlaveFetch
                  |          |          |                     |          qerpxFetch
                  |          |          |                     |          opiexe
                  |          |          |                     |          kpoal8

Running "strace" on those Parallel Slaves showed this:

.
.
.
semctl(1347842, 397, SETVAL, 0x1)       = 0
semctl(1347842, 388, SETVAL, 0x1)       = 0
semctl(1347842, 347, SETVAL, 0x1)       = 0
semctl(1347842, 394, SETVAL, 0x1)       = 0
semctl(1347842, 393, SETVAL, 0x1)       = 0
semctl(1347842, 392, SETVAL, 0x1)       = 0
semctl(1347842, 383, SETVAL, 0x1)       = 0
semctl(1347842, 406, SETVAL, 0x1)       = 0
semctl(1347842, 389, SETVAL, 0x1)       = 0
semctl(1347842, 380, SETVAL, 0x1)       = 0
semctl(1347842, 395, SETVAL, 0x1)       = 0
semctl(1347842, 386, SETVAL, 0x1)       = 0
semctl(1347842, 385, SETVAL, 0x1)       = 0
semctl(1347842, 384, SETVAL, 0x1)       = 0
semctl(1347842, 375, SETVAL, 0x1)       = 0
semctl(1347842, 398, SETVAL, 0x1)       = 0
semctl(1347842, 381, SETVAL, 0x1)       = 0
semctl(1347842, 372, SETVAL, 0x1)       = 0
semctl(1347842, 387, SETVAL, 0x1)       = 0
semctl(1347842, 378, SETVAL, 0x1)       = 0
semctl(1347842, 377, SETVAL, 0x1)       = 0
semctl(1347842, 376, SETVAL, 0x1)       = 0
semctl(1347842, 367, SETVAL, 0x1)       = 0
semctl(1347842, 390, SETVAL, 0x1)       = 0
semctl(1347842, 373, SETVAL, 0x1)       = 0
semctl(1347842, 332, SETVAL, 0x1)       = 0
semctl(1347842, 379, SETVAL, 0x1)       = 0
semctl(1347842, 346, SETVAL, 0x1)       = 0
semctl(1347842, 369, SETVAL, 0x1)       = 0
semctl(1347842, 368, SETVAL, 0x1)       = 0
semctl(1347842, 359, SETVAL, 0x1)       = 0
.
.
.

So the conclusion was: A lot of CPU time is spent spinning on the "spin lock" (critical code section) - caused by calls to "semctl" (semaphores), which are part of the PX code path and come from "ipc_lock"->"raw_lock". "strace" shows that all of the calls to "semctl" make use of the same semaphore set (first parameter), which explains the contention on that particular semaphore set (indicating that the locking granule is the semaphore set, not the semaphore).

Solution


Based on the "perf" results an Oracle engineer found a suitable, unfortunately unpublished and closed bug from 2013 for 12.1.0.2 that comes up with three different ways how to address the problem:

- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time.

- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:

kernel.sem = 2048 262144 256 256

"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 12941057   oracle    640        1502     
0xd87a8935 12973826   oracle    640        1502     
0xd87a8936 12006595   oracle    640        1502    

By reducing the number of semaphores per set and increasing the number of sets, like this:

kernel.sem = 100 262144 256 4096

the following "ipcs" output could be seen:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 13137665   oracle    640        93       
0xd87a8935 13170434   oracle    640        93       
0xd87a8936 13203203   oracle    640        93       
0xd87a8937 13235972   oracle    640        93       
0xd87a8938 13268741   oracle    640        93       
0xd87a8939 13301510   oracle    640        93       
0xd87a893a 13334279   oracle    640        93       
0xd87a893b 13367048   oracle    640        93       
0xd87a893c 13399817   oracle    640        93       
0xd87a893d 13432586   oracle    640        93       
0xd87a893e 13465355   oracle    640        93       
0xd87a893f 13498124   oracle    640        93       
0xd87a8940 13530893   oracle    640        93       
0xd87a8941 13563662   oracle    640        93       
0xd87a8942 13596431   oracle    640        93       
0xd87a8943 13629200   oracle    640        93       
0xd87a8944 13661969   oracle    640        93       
0xd87a8945 13694738   oracle    640        93       
0xd87a8946 13727507   oracle    640        93       
0xd87a8947 13760276   oracle    640        93       
0xd87a8948 13793045   oracle    640        93       
0xd87a8949 13825814   oracle    640        93       
0xd87a894a 13858583   oracle    640        93       
0xd87a894b 13891352   oracle    640        93       
0xd87a894c 13924121   oracle    640        93       
0xd87a894d 13956890   oracle    640        93       
0xd87a894e 13989659   oracle    640        93       
0xd87a894f 14022428   oracle    640        93       
0xd87a8950 14055197   oracle    640        93       
0xd87a8951 14087966   oracle    640        93       
0xd87a8952 14120735   oracle    640        93       
0xd87a8953 14153504   oracle    640        93       
0xd87a8954 14186273   oracle    640        93       
0xd87a8955 14219042   oracle    640        93


So Oracle now allocated a lot more sets with less semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times

- The bug comes up with a third option how fix this, which has the advantage that the host configuration doesn't need to be changed, and the configuration can be done per instance: There is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to some comparable values like 100 or 128 the net result ought to be the same - Oracle allocates more sets with less semaphores per set.

Conclusion


So the bottom line was this: Certain usage patterns of the Oracle instance lead to contention on spin locks on Linux O/S level if Oracle used the so far recommended semaphore settings, which resulted in all semaphore calls going for the same semaphore set. By having Oracle allocate more semaphore sets the calls were spread over more sets hence significantly reducing the contention.

This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:



Also in the AWR report the unusual PX related wait events went away and performance improved significantly.

Sunday, January 17, 2016

DML Operations On Partitioned Tables Can Restart On Invalidation

It's probably not that well known that Oracle can actually rollback / re-start the execution of a DML statement should the cursor become invalidated. By rollback / re-start I mean that Oracle actually performs a statement level rollback (so any modification already performed by that statement until that point gets rolled back), performs another optimization phase of the statement on re-start (due to the invalidation) and begins the execution of the statement from scratch. Note that this can happen multiple times - actually it's possible to end up in a kind of infinite loop when this happens, leading to statements that can run for very, very long (I've seen statements on Production environments executing for several days although a single execution would only take minutes).

The pre-requisites to meet for this to happen are not that complex or exotic:

- The target table to manipulate needs to be partitioned

- The cursor currently executing gets invalidated - either by running DDL (typically think of partition related operations) - or simply by gathering statistics on one of the objects involved in the statement

- The DML statement hasn't touched yet one of the partitions of the target table but attempts to do so after the cursor got invalidated

When the last condition is met, the statement performs a rollback, and since it got invalidated - which is one of the conditions to be met - another optimization phase happens, meaning that it's also possible to get different execution plans for the different execution attempts. When the execution plan is ready the execution begins from scratch.

According to my tests the issue described here applies to both conventional and direct-path inserts, merge statements (insert / update / delete) as well as serial and parallel execution. I haven't explicitly tested UPDATE and DELETE statements, but the assumption is that they are affected, too.

The behaviour is documented in the following note on MOS: "Insert Statement On Partitioned Tables Is RE-Started After Invalidation (Doc ID 1462003.1)" which links to Bug "14102209 : INSERT STATEMENT' IS RESTARTING BY ITSELF AFTER INVALIDATION" where you can also find some more comments on this behaviour. The issue seems to be that Oracle at that point is no longer sure if the partition information compiled into the cursor for the partitioned target table is still correct or not (and internally raises and catches a corresponding error, like "ORA-14403: Cursor invalidation detected after getting DML partition lock", leading to the re-try), so it needs to refresh that information, hence the re-optimization and re-start of the cursor.

Note that this also means that the DML statement might already have performed modifications to other partitions but after being invalidated attempts to modify another partition it hasn't touched yet - it just needs an attempt to modify a partition not touched into yet by that statement.

It's also kind of nasty that the statement keeps running the potentially lengthy query part after being invalidated only to find out it needs to re-start after the first row is attempted to be applied to a target table partition not touched yet.

Note that applications typically run into this problem, when they behave like the following:

- There are longer running DML statements that take typically several seconds / minutes until they attempt to actually perform an modification to a partitioned target table

- They either use DBMS_STATS to gather stats on one of the involved tables, typically using NO_INVALIDATE=>FALSE, which leads to an immediate invalidation of all affected cursors

- And/Or they perform partition related operations on one of the tables involved, like truncating, creating or exchanging partitions. Note that it is important to point out that it doesn't matter which objects gets DDL / stats applied, so it's not limited to activity on the partitioned target table being modified - any object involved in the query can cause the cursor invalidation

In principle this is another variation of the general theme "Don't mix concurrent DDL with DML/queries on the same objects". Doing so is something that leads to all kinds of side effects, and the way the Oracle engine is designed means that it doesn't cope very well with doing so.

Here is a simple test case for reproducing the issue, using INSERTs in this case here (either via INSERT or MERGE statement):
create table t_target (
id number(*, 0) not null,
pkey number(*, 0) not null,
filler varchar2(500)
)
--segment creation immediate
partition by range (pkey) --interval (1)
(
  partition pkey_0 values less than (1)
, partition pkey_1 values less than (2)
, partition pkey_2 values less than (3)
, partition pkey_3 values less than (4)
);

create table t_source
compress
as
select 1 as id, rpad('x', 100) as filler
from
(select /*+ cardinality(1e3) */ null from dual connect by level <= 1e3),
(select /*+ cardinality(1e0) */ null from dual connect by level <= 1e0)
union all
select 1 as id, rpad('y', 100) as filler from dual;

-- Run this again once the DML statement below got started
exec dbms_stats.gather_table_stats(null, 't_source', no_invalidate=>false)

exec dbms_stats.gather_table_stats(null, 't_target', no_invalidate=>false)

----------------------------------------------------------------------------------------------------------------------------------
-- INSERT example                                                                                                               --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

-- alter session set tracefile_identifier = 'insert_restart';

-- alter session set events '10046 trace name context forever, level 12';

-- exec sys.dbms_monitor.session_trace_enable(waits => true, binds => true/*, plan_stat => 'all_executions'*/)

insert /* append */ into t_target (id, pkey, filler)
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
;

-- exec sys.dbms_monitor.session_trace_disable

----------------------------------------------------------------------------------------------------------------------------------
-- MERGE example                                                                                                                --
-- Run above DBMS_STATS calls or any other command that invalidates the cursor during execution to force re-start of the cursor --
----------------------------------------------------------------------------------------------------------------------------------

set echo on timing on time on

merge /* append */ into t_target t
using (
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 1 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 2 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
union all
select * from (
select /*+
           use_hash(a b)
           no_eliminate_oby
       */
       a.id, 3 as pkey, a.filler
from t_source a, t_source b
where a.id = b.id
and (
regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'c')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'i')
--or regexp_replace(a.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm') != regexp_replace(b.filler, '^\s+([[:alnum:]]+)\s+$', lpad('\1', 10), 1, 1, 'm')
or (b.filler = rpad('y', 100) and a.filler = rpad('y', 100))
)
order by a.id
)
) s
on (s.id = t.id)
when not matched then
insert (id, pkey, filler) values (s.id, s.pkey, s.filler)
;
The idea of the test case is to maximise the time until each UNION ALL branch produces data to insert by performing an inefficient HASH JOIN (that in fact generates a Cartesian product and needs to apply a costly REGEXP filter on that huge intermediate result) and forcing a sort on the join result, so rows will only be handed over to the parent operations until all rows were processed in the join operation - and each branch generates data for a different partition of the target table. Typically it should take several seconds per branch to execute (if you need more time just un-comment the additional REGEXP_REPLACE filters), so you should have plenty of time to cause the invalidation from another session.

This means during the execution of each branch invalidating the cursor (for example by executing either of the two DBMS_STATS calls on the source or target table using NO_INVALIDATE=>FALSE) will lead to a re-start of the statement at the next attempt to write into a new target partition, possibly rolling back rows already inserted into other partitions.

Diagnostics


If you run the provided INSERT or MERGE statement on newer versions of Oracle that include the SQL_EXEC_START and SQL_EXEC_ID in V$ACTIVE_SESSION_HISTORY (or V$SESSION for that matter) and invalidate the cursor during execution and before a partition of the target table gets inserted for the first time then you can see that these entries change as the statement restarts.

In such cases the INVALIDATIONS and LOADS increase in V$SQL accordingly and the OBJECT_STATUS changes from INVALID_UNAUTH to VALID again with each re-start attempt. In newer versions where you can configure the "plan_stat" information for SQL trace to "all_executions" you'll find STAT lines for each execution attempt dumped to the trace file, but only a single final EXEC line, where the elapsed time covers all execution attempts.

The oldest version I've tested was 10.2.0.4, and that one showed already the re-start behaviour, although I would be inclined to think that this wasn't the case with older versions. So if anybody still runs older versions than 10.2.0.4 I would be interested to hear whether the behaviour reproduces or not.