They recently came up with a rewrite of a core piece of application functionality. Part of this code started the same code path for different data sets, potentially several times concurrently, ending up with many sessions making use of Parallel Execution. In addition, a significant part of the queries used by this code made questionable use of Parallel Execution, in the sense that queries of very short duration were parallelized, resulting in several Parallel Execution starts per second. You could see this pattern in the AWR reports, which showed several "DFO trees" parallelized per second on average over an hour period:
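As a rough sketch of where such a number comes from, the rate can also be derived directly from the AWR repository, assuming access to DBA_HIST_SYSSTAT and DBA_HIST_SNAPSHOT (the snapshot ids below are just placeholders for the hour in question):

# Sketch only: "DFO trees parallelized" per second between two AWR snapshots
sqlplus -s / as sysdba <<'EOF'
select (e.value - b.value) /
       ((cast(es.end_interval_time as date) -
         cast(bs.end_interval_time as date)) * 86400) as dfo_trees_per_sec
from   dba_hist_sysstat  b, dba_hist_sysstat  e,
       dba_hist_snapshot bs, dba_hist_snapshot es
where  b.stat_name       = 'DFO trees parallelized'
and    e.stat_name       = b.stat_name
and    bs.snap_id        = 1234   -- placeholder begin snapshot
and    es.snap_id        = 1235   -- placeholder end snapshot
and    b.snap_id         = bs.snap_id
and    e.snap_id         = es.snap_id
and    b.dbid            = bs.dbid
and    e.dbid            = es.dbid
and    b.instance_number = bs.instance_number
and    e.instance_number = es.instance_number;
EOF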
When the new code was initially tested with production-like data volumes and patterns, the CPU profile of such a big node (running in single instance mode) looked like this, with nothing else running on that box:
As you can see, the node was completely CPU bound, spending most of its time in System/Kernel mode. The AWR reports showed some pretty unusual PX wait events as significant:
"PX Deq: Slave Session Stats" shouldn't be a relevant wait event since it is about the PX slaves at the end of a PX execution passing an array of session statistics to the PX coordinator for aggregating the statistics on coordinator level. So obviously something was slowing down this PX communication significantly (and the excessive usage of Parallel Execution was required to see this happen).
Also, some of the more complex Parallel Execution queries, performing many joins and therefore a significant number of data redistributions, ran as if in slow motion. They claimed to spend almost 100% of their time on CPU, but according to Active Session History almost 90% of that time was spent on the redistribution operations:
SQL statement execution ASH Summary
-----------------------------------------------
              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            98|             86|             87|
Running the same query with the same execution plan on the same data and the same box during idle times showed almost 20 times better performance, with less than 40% of the time spent on redistribution:
SQL statement execution ASH Summary
-----------------------------------------------
              |               |               |
              |PX SEND/RECEIVE|PX SEND/RECEIVE|
PERCENTAGE_CPU|        PERCENT|    CPU PERCENT|
--------------|---------------|---------------|
            96|             38|             37|
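These percentages were derived from Active Session History. A minimal sketch of the idea follows; it is not necessarily the exact query used above, the sql_id and time window are placeholders, and the column definitions are simplified ones of my own:

# Sketch only: share of ASH samples on CPU, on PX SEND/RECEIVE plan lines,
# and share of the on-CPU samples sitting on PX SEND/RECEIVE plan lines
sqlplus -s / as sysdba <<'EOF'
select round(100 * sum(case when session_state = 'ON CPU' then 1 end)
             / count(*))                                   as percentage_cpu,
       round(100 * sum(case when sql_plan_operation in ('PX SEND', 'PX RECEIVE')
                            then 1 end)
             / count(*))                                   as px_send_receive_percent,
       round(100 * sum(case when session_state = 'ON CPU'
                             and sql_plan_operation in ('PX SEND', 'PX RECEIVE')
                            then 1 end)
             / nullif(sum(case when session_state = 'ON CPU' then 1 end), 0))
                                                           as px_send_receive_cpu_percent
from   v$active_session_history
where  sql_id = 'abcd123xyz456'          -- placeholder sql_id
and    sample_time > sysdate - 1/24;     -- placeholder time window
EOF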
Reducing the excessive use of Parallel Execution brought a significant reduction in CPU time, but the high System/Kernel time remained rather questionable:
So the big question was: where exactly was that time spent in the kernel? Answering this should provide further clues.
Analysis
Running "perf top" on the node during such a run showed this profile:
PerfTop: 129074 irqs/sec kernel:76.4% exact: 0.0% [1000Hz cycles], (all, 128 CPUs)
-------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ________________________ ___________________________________________________________
1889395.00 67.8% __ticket_spin_lock /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
27746.00 1.0% ktime_get /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
24622.00 0.9% weighted_cpuload /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
23169.00 0.8% find_busiest_group /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
17243.00 0.6% pfrfd1_init_locals /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
16961.00 0.6% sxorchk /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
15434.00 0.6% kafger /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
11531.00 0.4% try_atomic_semop /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
11006.00 0.4% __intel_new_memcpy /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
10557.00 0.4% kaf_typed_stuff /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
10380.00 0.4% idle_cpu /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
9977.00 0.4% kxfqfprFastPackRow /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
9070.00 0.3% pfrinstr_INHFA1 /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
8905.00 0.3% kcbgtcr /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
8757.00 0.3% ktime_get_update_offsets /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
8641.00 0.3% kgxSharedExamine /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
7487.00 0.3% update_queue /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
7233.00 0.3% kxhrPack /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
6809.00 0.2% rworofprFastUnpackRow /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
6581.00 0.2% ksliwat /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
6242.00 0.2% kdiss_fetch /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
6126.00 0.2% audit_filter_syscall /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
5860.00 0.2% cpumask_next_and /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
5618.00 0.2% kaf4reasrp1km /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
5482.00 0.2% kaf4reasrp0km /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
5314.00 0.2% kopp2upic /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
5129.00 0.2% find_next_bit /usr/lib/debug/lib/modules/2.6.39-400.128.17.el5uek/vmlinux
4991.00 0.2% kdstf01001000000km /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
4842.00 0.2% ktrgcm /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
4762.00 0.2% evadcd /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
4580.00 0.2% kdiss_mf_sc /data/oracle/XXXXXXX/product/11.2.0.4/bin/oracle
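For reference, a minimal sketch of such an invocation, assuming the perf tooling and matching kernel debuginfo packages are installed (options can vary between perf versions):

# Sketch only: system-wide CPU profile across all CPUs
perf top
# a single process could be profiled with something like: perf top -p <pid>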
Running "perf" on a number of Parallel Slaves being busy on CPU showed this profile:
0.36% ora_xxxx [kernel.kallsyms] [k]
__ticket_spin_lock
|
--- __ticket_spin_lock
|
|--99.98%-- _raw_spin_lock
| |
| |--100.00%-- ipc_lock
| | ipc_lock_check
| | |
| | |--99.83%-- semctl_main
| | | sys_semctl
| | | system_call
| | | __semctl
| | | |
| | | --100.00%-- skgpwpost
| | | kslpsprns
| | | kskpthr
| | | ksl_post_proc
| | | kxfprienq
| | | kxfpqrenq
| | | |
| | | |--98.41%-- kxfqeqb
| | | | kxfqfprFastPackRow
| | | | kxfqenq
| | | | qertqoRop
| | | | kdstf01001010000100km
| | | | kdsttgr
| | | | qertbFetch
| | | | qergiFetch
| | | | rwsfcd
| | | | qertqoFetch
| | | | qerpxSlaveFetch
| | | | qerpxFetch
| | | | opiexe
| | | | kpoal8
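The call stacks above come from sampling with call graphs enabled; a hedged sketch of such a capture for a few busy PX slave processes (the PIDs are placeholders):

# Sketch only: sample the given processes with call graphs for 30 seconds,
# then summarize the captured stacks as a text report
perf record -g -p <pid1>,<pid2>,<pid3> -- sleep 30
perf report --stdio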
Running "strace" on those Parallel Slaves showed this:
.
.
.
semctl(1347842, 397, SETVAL, 0x1) = 0
semctl(1347842, 388, SETVAL, 0x1) = 0
semctl(1347842, 347, SETVAL, 0x1) = 0
semctl(1347842, 394, SETVAL, 0x1) = 0
semctl(1347842, 393, SETVAL, 0x1) = 0
semctl(1347842, 392, SETVAL, 0x1) = 0
semctl(1347842, 383, SETVAL, 0x1) = 0
semctl(1347842, 406, SETVAL, 0x1) = 0
semctl(1347842, 389, SETVAL, 0x1) = 0
semctl(1347842, 380, SETVAL, 0x1) = 0
semctl(1347842, 395, SETVAL, 0x1) = 0
semctl(1347842, 386, SETVAL, 0x1) = 0
semctl(1347842, 385, SETVAL, 0x1) = 0
semctl(1347842, 384, SETVAL, 0x1) = 0
semctl(1347842, 375, SETVAL, 0x1) = 0
semctl(1347842, 398, SETVAL, 0x1) = 0
semctl(1347842, 381, SETVAL, 0x1) = 0
semctl(1347842, 372, SETVAL, 0x1) = 0
semctl(1347842, 387, SETVAL, 0x1) = 0
semctl(1347842, 378, SETVAL, 0x1) = 0
semctl(1347842, 377, SETVAL, 0x1) = 0
semctl(1347842, 376, SETVAL, 0x1) = 0
semctl(1347842, 367, SETVAL, 0x1) = 0
semctl(1347842, 390, SETVAL, 0x1) = 0
semctl(1347842, 373, SETVAL, 0x1) = 0
semctl(1347842, 332, SETVAL, 0x1) = 0
semctl(1347842, 379, SETVAL, 0x1) = 0
semctl(1347842, 346, SETVAL, 0x1) = 0
semctl(1347842, 369, SETVAL, 0x1) = 0
semctl(1347842, 368, SETVAL, 0x1) = 0
semctl(1347842, 359, SETVAL, 0x1) = 0
.
.
.
So the conclusion was: a lot of CPU time was spent spinning on a "spin lock" (protecting a critical code section), caused by the "semctl" (semaphore) calls that are part of the PX code path and reach the spin lock via "ipc_lock" -> "_raw_spin_lock". "strace" showed that all of the "semctl" calls used the same semaphore set (the first parameter), which explains the contention on that particular semaphore set and indicates that the locking granule is the semaphore set, not the individual semaphore.
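A simple way to confirm this from the "strace" output is to count the "semctl" calls per semaphore set id (the first argument); a minimal sketch, with the PID and output file as placeholders:

# Sketch only: capture semctl calls of one busy PX slave (stop with Ctrl-C),
# then count the calls per semaphore set id
strace -e trace=semctl -p <px_slave_pid> -o semctl.trc
awk -F'[(,]' '/semctl\(/ { print $2 }' semctl.trc | sort | uniq -c | sort -rn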
Solution
Based on the "perf" results an Oracle engineer found a suitable, unfortunately unpublished and closed bug from 2013 for 12.1.0.2 that comes up with three different ways how to address the problem:
- Run with "cluster_database" = true: This will take a different code path which simply reduces the number of semaphore calls by two orders of magnitude. We tested this approach and it showed immediate relief on kernel time - that is the explanation why in a RAC environment this specific issue doesn't reproduce.
- Run with different "kernel.sem" settings: The Exadata boxes came with the following predefined semaphore configuration:
kernel.sem = 2048 262144 256 256
"ipcs" showed the following semaphore arrays with this configuration when starting the Oracle instance:
------ Semaphore Arrays --------
key semid owner perms nsems
.
.
.
0xd87a8934 12941057 oracle 640 1502
0xd87a8935 12973826 oracle 640 1502
0xd87a8936 12006595 oracle 640 1502
By reducing the number of semaphores per set and increasing the number of sets, like this:
kernel.sem = 100 262144 256 4096
the following "ipcs" output could be seen:
------ Semaphore Arrays --------
key semid owner perms nsems
.
.
.
0xd87a8934 13137665 oracle 640 93
0xd87a8935 13170434 oracle 640 93
0xd87a8936 13203203 oracle 640 93
0xd87a8937 13235972 oracle 640 93
0xd87a8938 13268741 oracle 640 93
0xd87a8939 13301510 oracle 640 93
0xd87a893a 13334279 oracle 640 93
0xd87a893b 13367048 oracle 640 93
0xd87a893c 13399817 oracle 640 93
0xd87a893d 13432586 oracle 640 93
0xd87a893e 13465355 oracle 640 93
0xd87a893f 13498124 oracle 640 93
0xd87a8940 13530893 oracle 640 93
0xd87a8941 13563662 oracle 640 93
0xd87a8942 13596431 oracle 640 93
0xd87a8943 13629200 oracle 640 93
0xd87a8944 13661969 oracle 640 93
0xd87a8945 13694738 oracle 640 93
0xd87a8946 13727507 oracle 640 93
0xd87a8947 13760276 oracle 640 93
0xd87a8948 13793045 oracle 640 93
0xd87a8949 13825814 oracle 640 93
0xd87a894a 13858583 oracle 640 93
0xd87a894b 13891352 oracle 640 93
0xd87a894c 13924121 oracle 640 93
0xd87a894d 13956890 oracle 640 93
0xd87a894e 13989659 oracle 640 93
0xd87a894f 14022428 oracle 640 93
0xd87a8950 14055197 oracle 640 93
0xd87a8951 14087966 oracle 640 93
0xd87a8952 14120735 oracle 640 93
0xd87a8953 14153504 oracle 640 93
0xd87a8954 14186273 oracle 640 93
0xd87a8955 14219042 oracle 640 93
So Oracle now allocated a lot more sets with fewer semaphores per set. We tested this configuration instead of using "cluster_database = TRUE" and got the same low kernel CPU times (a sketch of how to apply such a change follows after this list).
- The bug describes a third option for how to fix this, which has the advantage that the host configuration doesn't need to be changed and that the configuration can be done per instance: there is an undocumented parameter "_sem_per_sem_id" that defines the upper limit of semaphores to allocate per set. By setting this parameter to a comparable value like 100 or 128, the net result ought to be the same: Oracle allocates more sets with fewer semaphores per set. However, we haven't tested this option.
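As referenced in the second option above, a hedged sketch of how the changed "kernel.sem" settings could be applied on the host (as root, followed by an Oracle instance restart):

# Sketch only: check the current limits (SEMMSL SEMMNS SEMOPM SEMMNI)
cat /proc/sys/kernel/sem
# add or adjust the line "kernel.sem = 100 262144 256 4096" in /etc/sysctl.conf,
# then reload the settings and restart the Oracle instance
sysctl -p
# verify the newly allocated, smaller semaphore arrays after the restart
ipcs -s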
Conclusion
So the bottom line was this: certain usage patterns of the Oracle instance lead to contention on spin locks at the Linux O/S level if Oracle runs in single instance mode with the semaphore settings recommended so far, because those settings result in all semaphore calls going against the same semaphore set. By having Oracle allocate more semaphore sets, the calls were spread over more sets, which significantly reduced the contention.
There is probably some internal note available at Oracle indicating that the default semaphore settings recommended for big nodes are not optimal for running in single instance mode under certain circumstances, but I don't know whether a definitive, official guide is available yet.
This is the CPU profile of exactly the same test workload as before using the changed "kernel.sem" settings:
Also, in the AWR report the unusual PX related wait events went away and performance improved significantly, in particular for the complex queries mentioned above.
Dear Randolf,
Thanks for sharing the resolution of the problem observed. I'm glad to see that a relatively simple configuration change improves the situation a lot.
My 2 cents for anyone running into such a situation (PX seems to deliver very little): please check MOS for known problems. Without knowing the root cause of a problem, it is not possible to judge whether the applied change lets you land on an island of stability.
Generally speaking, parallel execution in multi-user environments requires great care. Make sure everyone gets the required resources without blocking others. Too often I've seen an overly optimistic approach, i.e. parallel degrees set far too high, or PX contention.
So watch out for any events in the "contention" class and/or "PX" related events.
It would be great to see how your use case works in 12c. Have you had a chance to test that too?
Kind regards
Thomas
p.s. I'm an Oracle employee. These are my personal thoughts.