Sunday, February 7, 2016

Big Nodes, Concurrent Parallel Execution And High System/Kernel Time

The following is probably only relevant for customers that run Oracle on big servers with lots of cores in single instance mode (this specific problem here doesn't reproduce in a RAC environment, see below for an explanation why), like one of my clients that makes use of the Exadata Xn-8 servers, for example a X4-8 with 120 cores / 240 CPUs per node (but also reproduced on older and smaller boxes with 64 cores / 128 CPUs per node).

They recently came up with a re-write of a core application functionality. Part of this code did start the same code path for different data sets potentially several times concurrently ending up with many sessions making use of Parallel Execution. In addition a significant part of the queries used by this code did make questionable use of Parallel Execution, in that sense that queries of very short duration used Parallel Execution, hence ending up with several Parallel Execution starts per second. You could see this pattern from the AWR reports like this, showing several "DFO trees" parallelized per second on average over an hour period:



When the new code was tested with production-like data volumes and patterns, in the beginning the CPU profile of such a big node (running in single instance mode) looked like this, when nothing else was running on that box:



As you can see, the node was completely CPU bound, spending most of the time in System/Kernel time. The AWR reports showed some pretty unusual PX wait events as significant:



"PX Deq: Slave Session Stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly (and the excessive usage of Parallel Execution was required to see this happen).

Also some of the more complex Parallel Execution queries performing many joins and ending up with a significant number of data redistributions ran like in slow motion, although claiming to spend 100% of their time on CPU, but according to Active Session History almost 90% of that time was spent on the redistribution operations:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|


Running the same query with the same execution plan on the same data and the same box during idle times showed a almost 20 times better performance, and less than 40% time spent on redistribution:

SQL statement execution ASH Summary
-----------------------------------------------

              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|

So it looked like those queries ran into some kind of contention that wasn't instrumented in Oracle but happened outside on O/S level, showing up as CPU Kernel time - similar to what could be seen in previous versions of Oracle when spinning on mutexes.

Reducing the excessive usage of Parallel Execution showed a significant reduction in CPU time, but still the high System/Kernel time was rather questionable:



So the big question was - where was that time spent in the kernel to see whether this gives further clues.

Analysis


Running "perf top" on the node during such a run showed this profile:

  PerfTop:  129074 irqs/sec  kernel:76.4%  exact:  0.0% [1000Hz cycles],  (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                 DSO
             _______ _____ ________________________ ___________________________________________________________

          1889395.00 67.8% __ticket_spin_lock       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            27746.00  1.0% ktime_get                /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            24622.00  0.9% weighted_cpuload         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            23169.00  0.8% find_busiest_group       /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            17243.00  0.6% pfrfd1_init_locals       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            16961.00  0.6% sxorchk                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            15434.00  0.6% kafger                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            11531.00  0.4% try_atomic_semop         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
            11006.00  0.4% __intel_new_memcpy       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10557.00  0.4% kaf_typed_stuff          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
            10380.00  0.4% idle_cpu                 /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             9977.00  0.4% kxfqfprFastPackRow       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             9070.00  0.3% pfrinstr_INHFA1          /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8905.00  0.3% kcbgtcr                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             8757.00  0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             8641.00  0.3% kgxSharedExamine         /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             7487.00  0.3% update_queue             /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             7233.00  0.3% kxhrPack                 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6809.00  0.2% rworofprFastUnpackRow    /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6581.00  0.2% ksliwat                  /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6242.00  0.2% kdiss_fetch              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             6126.00  0.2% audit_filter_syscall     /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5860.00  0.2% cpumask_next_and         /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             5618.00  0.2% kaf4reasrp1km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5482.00  0.2% kaf4reasrp0km            /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5314.00  0.2% kopp2upic                /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             5129.00  0.2% find_next_bit            /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
             4991.00  0.2% kdstf01001000000km       /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4842.00  0.2% ktrgcm                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4762.00  0.2% evadcd                   /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
             4580.00  0.2% kdiss_mf_sc              /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle

Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:

     0.36%     ora_xxxx  [kernel.kallsyms]             [k] 

__ticket_spin_lock
               |
               --- __ticket_spin_lock
                  |          
                  |--99.98%-- _raw_spin_lock
                  |          |          
                  |          |--100.00%-- ipc_lock
                  |          |          ipc_lock_check
                  |          |          |          
                  |          |          |--99.83%-- semctl_main
                  |          |          |          sys_semctl
                  |          |          |          system_call
                  |          |          |          __semctl
                  |          |          |          |          
                  |          |          |           --100.00%-- skgpwpost
                  |          |          |                     kslpsprns
                  |          |          |                     kskpthr
                  |          |          |                     ksl_post_proc
                  |          |          |                     kxfprienq
                  |          |          |                     kxfpqrenq
                  |          |          |                     |          
                  |          |          |                     |--98.41%-- kxfqeqb
                  |          |          |                     |          kxfqfprFastPackRow
                  |          |          |                     |          kxfqenq
                  |          |          |                     |          qertqoRop
                  |          |          |                     |          kdstf01001010000100km
                  |          |          |                     |          kdsttgr
                  |          |          |                     |          qertbFetch
                  |          |          |                     |          qergiFetch
                  |          |          |                     |          rwsfcd
                  |          |          |                     |          qertqoFetch
                  |          |          |                     |          qerpxSlaveFetch
                  |          |          |                     |          qerpxFetch
                  |          |          |                     |          opiexe
                  |          |          |                     |          kpoal8

Running "strace" on those Parallel Slaves showed this:

.
.
.
semctl(1347842, 397, SETVAL, 0x1)       = 0
semctl(1347842, 388, SETVAL, 0x1)       = 0
semctl(1347842, 347, SETVAL, 0x1)       = 0
semctl(1347842, 394, SETVAL, 0x1)       = 0
semctl(1347842, 393, SETVAL, 0x1)       = 0
semctl(1347842, 392, SETVAL, 0x1)       = 0
semctl(1347842, 383, SETVAL, 0x1)       = 0
semctl(1347842, 406, SETVAL, 0x1)       = 0
semctl(1347842, 389, SETVAL, 0x1)       = 0
semctl(1347842, 380, SETVAL, 0x1)       = 0
semctl(1347842, 395, SETVAL, 0x1)       = 0
semctl(1347842, 386, SETVAL, 0x1)       = 0
semctl(1347842, 385, SETVAL, 0x1)       = 0
semctl(1347842, 384, SETVAL, 0x1)       = 0
semctl(1347842, 375, SETVAL, 0x1)       = 0
semctl(1347842, 398, SETVAL, 0x1)       = 0
semctl(1347842, 381, SETVAL, 0x1)       = 0
semctl(1347842, 372, SETVAL, 0x1)       = 0
semctl(1347842, 387, SETVAL, 0x1)       = 0
semctl(1347842, 378, SETVAL, 0x1)       = 0
semctl(1347842, 377, SETVAL, 0x1)       = 0
semctl(1347842, 376, SETVAL, 0x1)       = 0
semctl(1347842, 367, SETVAL, 0x1)       = 0
semctl(1347842, 390, SETVAL, 0x1)       = 0
semctl(1347842, 373, SETVAL, 0x1)       = 0
semctl(1347842, 332, SETVAL, 0x1)       = 0
semctl(1347842, 379, SETVAL, 0x1)       = 0
semctl(1347842, 346, SETVAL, 0x1)       = 0
semctl(1347842, 369, SETVAL, 0x1)       = 0
semctl(1347842, 368, SETVAL, 0x1)       = 0
semctl(1347842, 359, SETVAL, 0x1)       = 0
.
.
.

So the conclusion was: A lot of CPU time is spent spinning on the "spin lock" (critical code section) - caused by calls to "semctl" (semaphores), which are part of the PX code path and come from "ipc_lock"->"raw_lock". "strace" shows that all of the calls to "semctl" make use of the same semaphore set (first parameter), which explains the contention on that particular semaphore set (indicating that the locking granule is the semaphore set, not the semaphore).

Solution


Based on the "perf" results an Oracle engineer found a suitable, unfortunately unpublished and closed bug from 2013 for 12.1.0.2 that comes up with three different ways how to address the problem:

- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time - that is the explanation why in a RAC environment this specific issue doesn't reproduce.

- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:

kernel.sem = 2048 262144 256 256

"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 12941057   oracle    640        1502     
0xd87a8935 12973826   oracle    640        1502     
0xd87a8936 12006595   oracle    640        1502    

By reducing the number of semaphores per set and increasing the number of sets, like this:

kernel.sem = 100 262144 256 4096

the following "ipcs" output could be seen:

------ Semaphore Arrays --------
key        semid      owner     perms      nsems    
.
.
.
0xd87a8934 13137665   oracle    640        93       
0xd87a8935 13170434   oracle    640        93       
0xd87a8936 13203203   oracle    640        93       
0xd87a8937 13235972   oracle    640        93       
0xd87a8938 13268741   oracle    640        93       
0xd87a8939 13301510   oracle    640        93       
0xd87a893a 13334279   oracle    640        93       
0xd87a893b 13367048   oracle    640        93       
0xd87a893c 13399817   oracle    640        93       
0xd87a893d 13432586   oracle    640        93       
0xd87a893e 13465355   oracle    640        93       
0xd87a893f 13498124   oracle    640        93       
0xd87a8940 13530893   oracle    640        93       
0xd87a8941 13563662   oracle    640        93       
0xd87a8942 13596431   oracle    640        93       
0xd87a8943 13629200   oracle    640        93       
0xd87a8944 13661969   oracle    640        93       
0xd87a8945 13694738   oracle    640        93       
0xd87a8946 13727507   oracle    640        93       
0xd87a8947 13760276   oracle    640        93       
0xd87a8948 13793045   oracle    640        93       
0xd87a8949 13825814   oracle    640        93       
0xd87a894a 13858583   oracle    640        93       
0xd87a894b 13891352   oracle    640        93       
0xd87a894c 13924121   oracle    640        93       
0xd87a894d 13956890   oracle    640        93       
0xd87a894e 13989659   oracle    640        93       
0xd87a894f 14022428   oracle    640        93       
0xd87a8950 14055197   oracle    640        93       
0xd87a8951 14087966   oracle    640        93       
0xd87a8952 14120735   oracle    640        93       
0xd87a8953 14153504   oracle    640        93       
0xd87a8954 14186273   oracle    640        93       
0xd87a8955 14219042   oracle    640        93


So Oracle now allocated a lot more sets with less semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times

- The bug comes up with a third option how fix this, which has the advantage that the host configuration doesn't need to be changed, and the configuration can be done per instance: There is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to some comparable values like 100 or 128 the net result ought to be the same - Oracle allocates more sets with less semaphores per set, but we haven't tested this option.

Conclusion


So the bottom line was this: Certain usage patterns of the Oracle instance lead to contention on spin locks on Linux O/S level if Oracle runs in single instance mode and used the so far recommended semaphore settings, which resulted in all semaphore calls going for the same semaphore set. By having Oracle allocate more semaphore sets the calls were spread over more sets hence significantly reducing the contention.

There is probably some internal note available at Oracle that indicates that the default semaphore settings recommended for big nodes are not optimal for running single instance mode under certain circumstances, but I don't know if there is a definitive, official guide available yet.

This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:



Also in the AWR report the unusual PX related wait events went away and performance improved significantly, in particular also for those complex queries mentioned above.

1 comment:

Thomas Teske said...

Dear Randolf,

thanks for sharing the resolution of the problem observed. I'm glad to see, that a relatively simple configuration change improves the situation a lot.

My 2 cents in such situations: anyone running into such situation (PX seems to deliver very little). Please check MOS for known problems. Without knowing the root cause of a problem, it is not possible to judge, if the applied change lets you land on an island of stability.

Generally speaking: parallel execution in multi-user environments require great care. Make sure, everyone gets the resources without blocking the others. I've seen too often an optimistic approach ie parallel degrees far too high or PX contention.
So watch out for any events in the "contention" class and/or being "PX" related.

It would be great to see, how your use case works in 12c. Have you had a chance to test that too?

Kind regards
Thomas

p.s. I'm an Oracle employee. These are my personal thoughts.