Wednesday, February 4, 2015

Parallel Execution 12c New Features Overview

Oracle 12c is the first release since a couple of years that adds significant new functionality in the area of Parallel Execution operators, plan shapes and runtime features. Although 11gR2 added the new Auto DOP feature along with In-Memory Parallel Execution and Statement Queueing, the 12c features are more significant because they introduce new operators that can change both execution plan shape and runtime behaviour.

Here is a list of new features that are worth to note (and not necessarily mentioned in the official documentation and white papers by Oracle):

- The new HYBRID HASH adaptive distribution method, that serves two purposes for parallel HASH and MERGE JOINs:

First, it allows the PX SEND / RECEIVE operation for the left (smaller estimated row source) of the hash join to decide dynamically at runtime, actually for each execution, if it should use either a BROADCAST or HASH distribution, and correspondingly for the other row source to use then either a ROUND-ROBIN or a HASH distribution, too. This is described for example in the corresponding white paper by Maria Colgan here.

Second, to some degree it allows to address data distribution skew in case of HASH distributions (and only for parallel hash joins, not merge joins), which I've described in detail already in the past. A summary post that links to all other relevant articles regarding Parallel Execution Skew can be found here, an overview of the relevant feature can be found here and a detailed description can be found here.

I'll cover some aspects of this adaptive distribution that I haven't mentioned in the existing articles in a separate post.

- The new concurrent UNION ALL operator. This is officially documented here. It comes with a new operator PX SELECTOR that is a generic functionality to pick one of the available PX slaves to perform the child operations of this operator. Since the official documentation leaves a lot details unclear how this concurrent operation will actually behave at run time I'll cover some examples with runtime profiles in a separate post.

- The new PQ_REPLICATE feature that for simple parallel FULL TABLE SCAN row sources (I haven't tested yet if a parallel INDEX FAST FULL SCAN is eligible, too, but I assume so) can decide to run the scan entirely in each PX slave instead of running a distributed scan across the PX slaves in granules and distributing by BROADCAST afterwards. It's not entirely clear to me why this was implemented. Although it reduces the number of redistributions, and in some cases where no other parallel redistributions are required can reduce the number of parallel slave sets to one instead of two, BROADCAST distributions are typically used for smaller row sources, so eliminating this distribution doesn't sound like a huge improvement to justify the development effort. Jonathan Lewis describes the feature here along with some of his ideas, why this feature might be useful.

- The new parallel FILTER operator, an important and potentially huge improvement over the previous only available serial FILTER operator. In the past when a FILTER subquery was part of a parallel plan the data of the "driving" row source of the FILTER (the first child operation) had to be passed to the Query Coordinator and only then the second to nth children could be executed as many times as indicated by the first row source (and depending on the efficiency of filter/subquery caching). Now the FILTER operator can run in the PX slaves and there are a number of distribution variants possible with this new parallel operator. I'll cover that in a separate post.

- The new PX SELECTOR operator that I already mention above as part of the new concurrent UNION ALL operator. As described above, the generic functionality of this operator is to pick one of the available PX slaves to perform the child operations of this operator. It will be used in 12c for serial access operators that are part of a parallel plan (like a serial table or index scan). In the past these parts were performed by the Query Coordinator itself, but now one slave out of a slave set will be selected to perform such operations. This has a number of implications and I'll cover that in a separate post

- The new 1 SLAVE distribution method that is a bit similar to the PX SELECTOR operator in that it will use just one slave of the slave set but gets used for serial parts of the execution plan when the data is redistributed from a parallel part of the plan to a part of the plan that needs to be run in serial, because Oracle cannot parallelize the functionality, for example when evaluating ROWNUM or certain analytic function variants (for example LAG or LEAD with no partition clause). This new 1 SLAVE distribution seems to have two purposes: First avoid activity of the query coordinator (like the PX SELECTOR above) and second avoid the decomposition of the parallel plan into multiple DFO trees. I'll cover that in a separate post

- 12c changes also the way some operations in the plan are marked as PARALLEL or not, which in my opinion can be pretty confusing (and is partly inconsistent with runtime behaviour in my opinion) when just looking at the execution plan, since the runtime activity then might look different from what the execution plan suggests. I'll cover that in a separate post and it will also be picked up in the context of other new functionality mentioned above as appropriate.

- The new EXPRESSION EVALUATION operator (which I originally thought not to be specific to Parallel Execution, but at least in the currently available 12c versions up to and including seems to be) that sometimes gets used (see an example recently published by Jonathan Lewis) to evaluate scalar subqueries as part of the projection in Parallel Execution plans and at present comes with some odd behaviour which I will cover in a separate series of posts about projection and Parallel Queries

There is probably more that I haven't come across yet, but as you can see from the number of times I've mentioned "separate post" in this overview this is already enough material for a whole series of posts to follow.


Stefan Koehler said...

Hi Randolf,
awesome, looking forward to the upcoming blog posts. However just a short question about the adaptive distribution method.

>> First, it allows the PX SEND / RECEIVE operation for the left (smaller estimated row source) of the hash join to decide dynamically at runtime, actually for each execution

Is it really for each execution in case of PX? Afaik the statistics collector is only enabled at the first execution and then becomes a a pass-through operation.

Thank you and looking forward to the upcoming series.


Randolf said...

Hi Stefan,

yes in case of PX the STATISTICS COLLECTOR is evaluated at each execution, so it is truly adaptive (that's why I mention it explicitly, good catch!).

You think of adaptive plans, but there after resolving the final plan there's no point in further evaluating it, so that's the key difference here.


Chris Antognini said...

Hi Randolf

> after resolving the final plan there's no point in further evaluating it

There are cases where it would be sensible to reevaluate the plan. The most common case is in presence of bind variables....


Randolf said...

Hi Chris,

> There are cases where it would be sensible to reevaluate the plan

I absolutely agree - my description is based on the current implementation, not on what I would like the feature to behave like.

I think their current take on this is that ACS is there for the bind variable part, but we all know that it's not the appropriate solution to this problem in general - there are just too many cases where ACS doesn't prevent poor performing executions, at least in my experience.


Anonymous said...

Hi Randolf;

I have a query in and query has 3 dfo. So my query go beyond the parallel_degree_limit described in resource manager.

Is there any parameter or hint to enforce SLAVE distribution method?



Randolf said...

Hi Adem,

no matter how many DFO trees your query has, it shouldn't exceed the max. PARALLEL degree defined in the Resource Manager directives at execution time.

Depending on the plan shape what might happen with multiple DFO trees is that your query allocates more PX slaves than you would expect from a simple PX plan (so more than two sets of PX slaves at max. PARALLEL degree, which might look like exceeding the max. degree, if more than one DFO tree is active at the same time).

If you want to change that you would have to understand why your query arrives at that plan shape and see what you can do about it, which in most cases will include some kind of re-writing of the query.

I don't think you can force the 1 SLAVE distribution, if you refer to that with your last comment. Depending on the SQL features used / combined, 12c might still revert to the old decomposition of the plan into multiple DFO trees, so it doesn't always use the new 1 SLAVE / PX SELECTOR features.