An oft-used (and subsequently oft-abused) execution path is parallel execution, usually ‘instigated’ by some sort of parallel hint. Developers, albeit with the best intentions, misuse and abuse this mechanism because of faulty logic, that ‘logic’ being that if one process executes in X amount of time then Y parallel processes will execute in X/Y amount of time, and nothing could be further from the truth in many cases. Yes, there are opportunities to use parallelism to speed up processing but in most of the cases I’ve seen it’s doing more harm than good.
To illustrate this in a ‘real world’ situation say you’re in a restaurant and the service appears to be slow; you think ‘if I had three waitresses instead of one I’d get my food faster’, but let’s look at that from a different point of view. Three waitresses for one table means that all three waitresses need to communicate with each other to avoid repeating work done by the others (this, of course, takes time). Each waitress needs to be assigned specific duties but also must be available to take over for another if something happens (one broke her toe, for instance); such an occurrence requires reassignment of duties and adjusting the schedule to accommodate the change. Eventually you get your order but it will take MORE time than if a single waitress performed all of the necessary tasks.
Parallel processing does more than simply ‘divide and conquer’ as it requires several steps most developers may not know about or may ignore entirely. DSS and DW systems, running with large numbers of CPUs, can benefit from parallel processing as the load can be distributed among the CPUs reducing the load on a single processor. OLTP systems, on the other hand, usually involve operations that are quick to begin with and the overhead of implementing parallel processing is quite large compared to the overall execution time; additionally it may take longer to complete the parallel execution than it would to properly tune the query for single-threaded processing and eliminate the parallelism altogether.
So what does parallel processing bring to the table? Obviously the possibility, but not the guarantee, of reduced execution time. What does it cost? That depends on the system but for OLTP systems it usually creates more work than it accomplishes as a parallel query coordinator process is spawned, along with X number of parallel slaves all ‘talking’ to that coordinator as well as a parallel-to-serial operation by the coordinator to assemble the final results. Many OLTP systems aren’t designed for parallel processing (and rightly so) as parallel execution is designed to utilize resources that may remain idle (CPU, memory) and are in sufficient quantity to warrant generating the additional overhead required. Queries and statements should be tuned properly (there, I’ve said it again) for speedy execution and slapping a parallel hint on them isn’t the correct way to go about the tuning process. Looking at an example of the differences between parallel and serial execution might help clear the air.
SQL> select id, txtval, status
2 from para_tst pt
3 where id between 9001 and 34001;
ID TXTVAL STATUS
---------- ------------------------------ -------
9389 ALL_SCHEDULER_RUNNING_JOBS VALID
9390 USER_SCHEDULER_RUNNING_JOBS VALID
9391 USER_SCHEDULER_RUNNING_JOBS VALID
[...]
20772 /130d52e2_JDK2Sorter VALID
20773 /130d52e2_JDK2Sorter VALID
20774 /4c28cb16_ToolLogOptions VALID
20775 /4c28cb16_ToolLogOptions VALID
20776 /cbd9a55f_AbortException VALID
20777 /cbd9a55f_AbortException VALID
591960 rows selected.
Elapsed: 00:02:03.68
Execution Plan
----------------------------------------------------------
Plan hash value: 350193380
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 593K| 20M| 1698 (2)| 00:00:21 | | |
| 1 | PARTITION RANGE ITERATOR| | 593K| 20M| 1698 (2)| 00:00:21 | 19 | 21 |
|* 2 | TABLE ACCESS FULL | PARA_TST | 593K| 20M| 1698 (2)| 00:00:21 | 19 | 21 |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("ID"
SQL> select /*+ parallel(pt 2) pq_distribute(pt partition) */
2 id, txtval, status
3 from para_tst pt
4 where id between 9001 and 34001;
ID TXTVAL STATUS
---------- ------------------------------- -------
18404 /79bc64f9_JvmMemoryMeta VALID
18405 /b80019c2_EnumJvmThreadContent VALID
18406 /b80019c2_EnumJvmThreadContent VALID
18407 /bcfa29b5_EnumJvmThreadCpuTime VALID
18408 /bcfa29b5_EnumJvmThreadCpuTime VALID
17997 /b3bd73eb_CommonClassObject VALID
17998 /b3bd73eb_CommonClassObject VALID
17999 /c5a69e17_ServerSchemaObject1 VALID
[...]
20724 /4bd3ef8d_KnownOptions4 VALID
20725 /4bd3ef8d_KnownOptions4 VALID
20726 /75d2b0ba_KnownOptions5 VALID
20727 /75d2b0ba_KnownOptions5 VALID
20728 /75e9b2d4_KnownOptions VALID
20729 /75e9b2d4_KnownOptions VALID
591960 rows selected.
Elapsed: 00:02:23.59
Execution Plan
----------------------------------------------------------
Plan hash value: 1393746857
-------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distr |
-------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 593K| 20M| 942 (1)| 00:00:12 | | | | | |
| 1 | PX COORDINATOR | | | | | | | | | | |
| 2 | PX SEND QC (RANDOM)| :TQ10000 | 593K| 20M| 942 (1)| 00:00:12 | | | Q1,00 | P->S | QC (RAND)|
| 3 | PX BLOCK ITERATOR | | 593K| 20M| 942 (1)| 00:00:12 | 19 | 21 | Q1,00 | PCWC | |
|* 4 | TABLE ACCESS FULL| PARA_TST | 593K| 20M| 942 (1)| 00:00:12 | 19 | 21 | Q1,00 | PCWP | |
-------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - filter("ID"
On a system running 11.2.0.2 with absolutely no load the parallel execution takes almost twenty seconds longer; on a heavily used OLTP system this difference will likely be greater as the overhead to run in parallel is greater than that for the ‘straight’ query. Resources allocated in OLTP configurations are intended for multiple users performing concurrent processing thus there usually aren’t ‘extra’ CPU cycles or memory to toss at parallel execution. Since the parallel query slaves will be waiting for resources (just like their user session counterparts) this will increase the response time and delay processing until such resources are freed. Notice also the execution path, which includes the parallel-to-serial operation that consolidates the outputs from the parallel query slaves into the final result set. Depending upon the server configuration there could be many more parallel slaves than shown here, and ‘funneling’ all of that data into a single ‘pipe’ consumes time and resources you most likely won’t have in the OLTP environment.
To be fair about this Oracle 11.2.0.2 and later releases offer a possibly better mechanism for implementing parallel execution, comprised of auto degree of parallelism (auto DOP), in-memory parallel execution and parallel statement queuing. A few bits and pieces need to be configured, first, but these are available on any system running Oracle 11.2.0.2 or later. What does auto DOP do? Oracle computes the degree of parallelism ‘on the fly’ before the statement executes. It determines the available resources and whether or not the statement will run faster, overall, in parallel. Once the DOP has been set it cannot be changed while the statement is running. It can, though, manage resources better to avoid overloading the system with parallel statements.
That’s great, I hear you say, but what if the statement would run better in parallel but can’t at the time the execution begins? This is where parallel statement queuing takes over; the statement is placed in the parallel run queue until the required resources are available at which time the statement starts execution. Confusing? Possibly, so let’s set up an 11.2.0.3 database to enable auto DOP and parallel statement queuing and run an example.
First and foremost it is absolutely necessary to calibrate the I/O subsystem the database uses. This is accomplished with a packaged procedure, DBMS_RESOURCE_MANAGER.CALIBRATE_IO, that takes two input values and three output variables into which to collect the calculated results. As inputs you need to supply the total number of disks associated with the database and a reasonable estimate of the disk latency; the three output parameters will hold the maximum calculated I/O per second (IOPS), the maximum MB per second and the calculated latency. The script, a calibration run and the generated output follow:
$ cat calibrate_io.sql
Set serveroutput on size 1000000
Declare
Calc_lat number;
Calc_iops number;
Calc_mbps number;
Begin
Dbms_resource_manager.calibrate_io(&dsks,&maxlat, Calc_iops, Calc_mbps, Calc_lat);
Dbms_output.put_line('Max IOPS : '||Calc_iops);
Dbms_output.put_line('Max mbytes-per-sec: '||Calc_mbps);
Dbms_output.put_line('Calc. latency : '||Calc_lat);
End;
/
$
...
SQL> @calibrate_io
Enter value for dsks: 36
Enter value for maxlat: 10
old 7: dbms_resource_manager.calibrate_io (&dsks, &maxlat, Calc_iops, Calc_mbps, Calc_lat);
new 7: dbms_resource_manager.calibrate_io (36, 10, Calc_iops, Calc_mbps, Calc_lat);
Max IOPS : 1022
Max mbytes-per-sec: 2885
Calc. latency : 11
PL/SQL procedure successfully completed.
Elapsed: 00:07:10.57
SQL>
SQL> select num_physical_disks, max_iops, max_mbps, max_pmbps, latency
2 from dba_rsrc_io_calibrate;
NUM_PHYSICAL_DISKS MAX_IOPS MAX_MBPS MAX_PMBPS LATENCY
------------------ ---------- ---------- ---------- ----------
36 1022 2885 289 11
SQL>
[For 36 total disks on an Exadata system this calibration took over 7 minutes to complete, with no load on the system. This is a resource-intensive operation and should never be performed during the normal work day.]
I/O calibration isn’t the only factor in enabling Auto DOP as the parameter parallel_degree_policy must be set to AUTO or LIMITED. By default it is set to MANUAL, which disables the three new parallel features of Oracle Release 11.2: Auto DOP, Parallel Statement Queuing and In-memory Parallel Execution. The three available settings, and their effects on these features, are listed below:
Setting Effect
--------------- ---------------------------------------------------------
MANUAL Disables all three new parallel query features. Causes
Parallel processing reverts to the behavior of prior
releases, parallelizing statements only if they are
hinted or the object being queried is created with or
altered to have a DEGREE greater than the default of 1.
LIMITED Only Auto DOP is enabled with this setting; the remaining
two new parallel features are disabled. With this
setting only queries against objects associated with a
DEGREE of DEFAULT will be considered.
AUTO All three new parallel features are enabled. Queries
and statements will be evaluated for parallel execution
regardless of the DEGREE setting on the object or objects
accessed.
Another parameter that affects one of the new features, this one ‘hidden’, is _parallel_statement_queueing. When set to TRUE, the default, queuing is enabled, which allows Oracle to decide if a parallel statement can be executed at run time or if it needs to wait in the queue until sufficient resources are available. And yet another ‘hidden’ parameter, _parallel_cluster_cache_policy, controls whether In-memory parallel execution is available or not. A setting of cached enables this feature.
So you have all of this configured and set, what does it do for you? Let’s look a parallel statement queuing first. After building a huge version of the EMP table twelve simultaneous sessions were used to select the average salary. Querying V$SQL_MONITOR to see how many of those sessions ended up in the queue produced the following output:
SQL> select sid, sql_id, sql_exec_id, sql_text
2 from v$sql_monitor
3 where status = 'QUEUED'
4 order by 3;
SID SQL_ID SQL_EXEC_ID SQL_TEXT
---------- ------------- ----------- ------------------------------------
1532 5du23va3p3ad0 16777216 select avg(sal) from emp
1059 5du23va3p3ad0 16777217 select avg(sal) from emp
1628 5du23va3p3ad0 16777218 select avg(sal) from emp
865 5du23va3p3ad0 16777219 select avg(sal) from emp
205 5du23va3p3ad0 16777220 select avg(sal) from emp
2199 5du23va3p3ad0 16777221 select avg(sal) from emp
1542 5du23va3p3ad0 16777222 select avg(sal) from emp
159 5du23va3p3ad0 16777223 select avg(sal) from emp
1888 5du23va3p3ad0 16777224 select avg(sal) from emp
1234 5du23va3p3ad0 16777225 select avg(sal) from emp
705 5du23va3p3ad0 16777226 select avg(sal) from emp
11 rows selected.
SQL>
The sql_exec_id is, basically, a run number indicating the statement’s position in the queue. It is also possible for the sql_exec_id to be the same for all queued statements (indicating simultaneous execution) if the sum of the parallel resources required for execution doesn’t exceed the maximum available for the database.
Having auto DOP configured can help performance by parallelizing statements that can benefit from it. Unlike the earlier attempts by Oracle to implement parallelism this mechanism takes the guesswork out of parallel processing. Yes, parameters like parallel_min_servers, parallel_max_servers and parallel_servers_target need to be set by you and if those values are set to very large values you could make performance worse instead of better. Careful consideration in setting those parameters is key to successfully implementing these parallel features. To help in setting parallel_servers_target the following formula can be used:
((4*cpu_count)*parallel_threads_per_cpu)*active number of instances
Given a system with the following settings:
cpu_count 24
parallel_threads_per_cpu 4
and two active instances (this works just as well for RAC as it does stand-alone databases) the default setting would be computed as follows:
((4*24)*4)*2 = 768
You can also use this to provide a guide for setting parallel_max_servers. At that point you can then determine how best to set parallel_min_servers. Even with these improved features it’s likely that Oracle won’t execute most statements in parallel, which is a good thing.
When is it good practice to use parallel processing? One situation which comes to mind is in creating/populating tables using complex queries that may return faster in parallel than with standard serial processing. An application on which I worked was populating a table with a rather benign join but it was taking far too long to complete the load — the elapsed time exceeded the batch processing window. Using parallelism (along with regularly updated statistics) dramatically reduced the response time from over an hour to less than two minutes, well within the batch window allowing the rest of the processing to continue. [Note that this was a batch process, run outside of the normal business day, which freed resources normally allocated to user sessions.] There are others, outside of data warehousing applications, but they’re exceptions and not the rule.
So, we’ve learned that even though parallelism would appear to make things go faster, in reality that’s not often the case due to the extra overhead in managing additional processes and consolidating the individual results into the final result set. We’ve also learned that simply slapping a /*+ parallel */ hint into a query doesn’t constitute tuning and doing so can make performance worse instead of better. The correct choice is to properly tune the query or queries in question using resources such as AWR and ASH reports, execution plans and wait statistics to pinpoint the problem area or areas to address. Parallelism isn’t a silver bullet and wasn’t intended to be and should be used sparingly, if at all.