Analysing and troubleshooting Parallel Execution IT Tage 2015
2. Independent consultant
Performance Troubleshooting
In-house workshops
Cost-Based Optimizer
Performance By Design
Oracle ACE Director
Member of OakTable Network
3. Parallel Execution introduction
Major challenges
Parallel unfriendly examples
Distribution skew examples
How to measure distribution of work
Fixing work distribution skew
4. Oracle Database Enterprise Edition includes the powerful Parallel Execution feature that allows spreading the processing of a single SQL statement execution across multiple worker processes
The feature is fully integrated into the Cost-Based Optimizer as well as the execution runtime engine and automatically distributes the work across the so-called Parallel Workers
5. Simple generic parallelization example
Task: Compute sum of 8 numbers
1+8=9, 9+7=16, 16+9=25,...
1+8+7+9+6+2+6+3= ???
n=8 numbers, 7 computation steps required
Serial execution: 7 time units
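In Oracle terms this generic example corresponds to a parallel aggregation: each worker sums its share of the rows and the Query Coordinator adds up the partial sums. A minimal sketch, assuming a hypothetical table T with a numeric column VAL:

```sql
-- Serial: one process scans all rows and adds them up
select sum(val) from t;

-- Parallel: e.g. two workers each sum roughly half the rows;
-- the Query Coordinator combines the two partial sums
select /*+ parallel(t 2) */ sum(val) from t;
```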
7. Parallel Execution doesn’t mean “work smarter”
It means you actually accept to “work harder”
Could also be called a “brute force” approach
8. So with Parallel Execution there
might be the problem that it
doesn’t work “hard enough”
9. Two major challenges
Can the given task be divided into sub-tasks that can
efficiently and independently be processed by the
workers? (“Parallel Unfriendly”)
Can all assigned workers be kept busy all the time?
10. More reasons why Oracle Parallel Execution
might not reduce runtime as expected:
Parallel DML/DDL gotchas
“Downgrade” at execution time (fewer workers
assigned than expected)
Overhead of Parallel Execution implementation
Limitations of Parallel Execution implementation
11. Parallel DML / DDL gotchas
DML / DDL part can run parallel or serial
Query part can run parallel or serial
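A sketch of the classic gotcha: unless parallel DML is enabled at the session level, only the query part of the following statement runs parallel while the insert part runs serially (table names TARGET_T and SOURCE_T are hypothetical):

```sql
-- Without this, the DML (insert) part runs serially even if
-- the query part is parallelized
alter session enable parallel dml;

insert /*+ append parallel(t 4) */ into target_t t
select /*+ parallel(s 4) */ * from source_t s;

commit;
```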
14. Other reasons why Oracle Parallel Execution
might not scale as expected:
Parallel DML/DDL gotchas
“Downgrade” at execution time (fewer workers
assigned than expected)
Overhead of Parallel Execution implementation
Limitations of Parallel Execution implementation
16. Two major challenges
Can the given task be divided into sub-tasks that can
efficiently and independently be processed by the
workers? (“Parallel Unfriendly”)
Can all assigned workers be kept busy all the time?
20. All these examples have one thing in common:
If the Query Coordinator (non-parallel part)
needs to perform a significant part of the overall
work, Parallel Execution won’t reduce the
runtime as expected
21. Two major challenges
Can the given task be divided into sub-tasks that can
efficiently and independently be processed by the
workers? (“Parallel Unfriendly”)
Can all assigned workers be kept busy all the time?
22. [Image: even work distribution – Parallel Execution Servers P000 to P003 all spend a similar amount of time and finish around the same time]
23. [Image: skewed work distribution – P002 gets assigned much more data and runs far longer than the other, idle workers]
24. Parallel Execution introduction
Major challenges
Parallel unfriendly examples
Distribution skew examples
How to measure distribution of work
Fixing work distribution skew
25. Extended SQL tracing
Generates one trace file per Parallel Worker process
in database server trace directory
For cross-instance Parallel Execution this means files
spread across more than one server
Each Parallel Worker trace file lists, among others,
the number of rows produced per plan line and
elapsed time
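A common way to enable extended SQL tracing for a session is event 10046 at level 8 (wait events included); each Parallel Worker then writes its own trace file:

```sql
-- Enable extended SQL trace including wait events
alter session set events '10046 trace name context forever, level 8';

-- ... run the parallel statement to analyze here ...

-- Disable tracing again
alter session set events '10046 trace name context off';
```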
26. Extended SQL tracing
Allows detecting data distribution skew
Usually requires reproducing the issue
Tedious to collect and skim through dozens of trace
files (no tool known that automates that job)
27. DBMS_XPLAN.DISPLAY_CURSOR +
Rowsource statistics
Based on same internal code instrumentation as
extended SQL trace
Feeds back rowsource statistics (rows produced on
execution plan line level, timing, I/O stats) into
Library Cache
Very useful for analyzing serial executions
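A typical usage sketch: enable rowsource statistics for a single execution via the GATHER_PLAN_STATISTICS hint, then display actual rows and timing per plan line (the traced query itself is hypothetical):

```sql
-- Collect rowsource statistics for this execution only
select /*+ gather_plan_statistics */ count(*) from t;

-- Show the plan of the last cursor executed in this session,
-- including actual rows (A-Rows), elapsed time and I/O per line
select *
from   table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));
```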
28. DBMS_XPLAN.DISPLAY_CURSOR +
Rowsource statistics
Not enabled by default, need to reproduce
Doesn’t cope well with multiple Parallel Execution
Servers / more complex parallel plans or cross-
instance RAC execution
No information per Parallel Execution Server
=> Therefore not able to detect distribution skew
29. Analyzing Data Distribution Skew
Oracle has for a long time offered a special view called
V$PQ_TQSTAT
It is populated after a successful Parallel Execution
in the session of the Query Coordinator
It lists the amount of data sent via the “Table
Queues” / “Virtual Tables”
In theory this allows exact analysis of which operations
caused an uneven data distribution
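A simple query against the view, run in the same session right after the parallel statement completes, might look like this:

```sql
-- Rows and bytes sent per Table Queue, producer and consumer;
-- a very uneven NUM_ROWS per PROCESS indicates distribution skew
select dfo_number, tq_id, server_type, process, num_rows, bytes
from   v$pq_tqstat
order  by dfo_number, tq_id, server_type desc, process;
```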
31. V$PQ_TQSTAT
Allows detecting data distribution skew
Can only be queried from inside session
Doesn’t cope well with more complex parallel plans
No information about relevance of skew
32. Easy to identify whether all workers are kept
busy all the time or not
Easy to identify if there was a problem with
work distribution
Shows actual parallel degree used (“Parallel
Downgrade”)
Supports RAC
33. Reports are not persisted and will be flushed
from memory quite quickly on busy systems
No easy identification and therefore no
systematic troubleshooting which plan
operations cause a work distribution problem
Lacks some precision regarding Parallel
Execution details
34. Real-Time SQL Monitoring allows detecting
Parallel Execution skew in the following way:
In the “Activity” tab
The average active sessions will be less than the DOP of
the operation, which allows detecting both temporal and
data distribution skew
In the “Parallel” tab
The DB Time recorded per Parallel Slave will show an
uneven distribution of active work time
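A report for a given execution can also be generated from SQL via DBMS_SQLTUNE (the SQL_ID shown is a placeholder; a Tuning Pack license is required):

```sql
-- Generate an active (HTML) Real-Time SQL Monitoring report
select dbms_sqltune.report_sql_monitor(
         sql_id => 'abcd1234efgh5',   -- hypothetical SQL_ID
         type   => 'ACTIVE') as report
from   dual;
```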
37. One very useful approach is using Active
Session History (ASH)
ASH samples active sessions once a second
Activity of Parallel Workers over time can
easily be analyzed
From 11g on the ASH data even contains a
reference to the execution plan line, so a
relation between Parallel Worker activity and
execution plan line based on ASH is possible
38. Custom queries on ASH data required for
detailed analysis
XPLAN_ASH tool runs these queries for a
given SQL_ID
Advantage of ASH is the availability of
retained historic ASH data via AWR on disk
Information can be extracted even for SQL
executions as far back as the configured
AWR retention
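A hand-rolled ASH query along the lines of what the XPLAN_ASH tool automates could look like this sketch (the SQL_ID is a placeholder):

```sql
-- Samples per execution plan line and sampled session; an uneven
-- sample count across sessions for the same plan line hints at skew
select sql_plan_line_id, session_id, count(*) as samples
from   gv$active_session_history
where  sql_id = 'abcd1234efgh5'       -- hypothetical SQL_ID
group  by sql_plan_line_id, session_id
order  by sql_plan_line_id, session_id;
```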
39. Parallel Execution introduction
Major challenges
Parallel unfriendly examples
Distribution skew examples
How to measure distribution of work
Fixing work distribution skew
40. Fixing Work Distribution Skew
Influence Parallel Distribution: Data volume
estimates, PQ_DISTRIBUTE / PQ_[NO]MAP /
PQ_SKEW (12c+) hint
Limit impact by influencing join order
Rewrite queries trying to limit or avoid skew
Remap skewed data changing the value distribution
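As an illustration of influencing the distribution via hints, the following sketch forces a BROADCAST distribution of the build row source instead of a skew-prone HASH/HASH distribution (tables, aliases and join columns are hypothetical):

```sql
-- Broadcast the smaller build row source T1 to all workers, so the
-- skewed join key values of T2 do not need to be HASH-distributed
select /*+ parallel(4)
           leading(t1 t2) use_hash(t2)
           pq_distribute(t2 broadcast none) */
       t1.id, t2.payload
from   t1, t2
where  t1.id = t2.fk_id;
```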
41. Fixing Work Distribution Skew
Automatic join skew detection in 12c (PQ_SKEW)
Supports only parallel HASH JOINs
Supports at present only simple probe row sources
(no result of other joins supported)
Histogram showing popular values required (or
forced via PQ_SKEW hint)
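A sketch of forcing the 12c join skew handling: the PQ_SKEW hint marks the probe row source of a parallel hash join as skewed on the join key (tables, aliases and columns are hypothetical):

```sql
-- Force skew handling for the probe row source T2 of the hash join;
-- popular join key values are then distributed differently
select /*+ parallel(4) pq_skew(t2) */
       t1.id, t2.payload
from   t1, t2
where  t1.id = t2.fk_id;
```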
Co-author of “Expert Oracle Practices”
Oracle ACE Director: Acknowledged by Oracle for community contributions
OakTable Network: Informal and independent group of people believing in a scientific approach towards Oracle
Simple generic parallelization example
Possibly additional startup cost: find available / instruct / coordinate workers
Major challenge: Divide task into chunks that can be efficiently and independently processed by workers
Wall clock elapsed time for parallel execution can/should be lower than serial execution
But potentially more worker units required than serial execution => work harder!
Number of worker units assigned matters
Too few can be bad
Too many can be bad, too
Communication between worker units required – data needs to be (re-) distributed (overhead!)
Major challenge: Keep all workers busy all the time
Parallelization might require different approach
Parallel Execution can only reduce runtime as expected if all workers are kept busy
Possibly only a few or a single worker will be active and have to do all the work
In this case Parallel Execution can actually be slower than serial execution
There is a need to measure how busy the workers are kept
Note that this measure doesn’t tell you anything about the efficiency of the actual operation / execution plan
But an otherwise efficient Parallel Execution plan can only scale if the expected number of workers is kept busy ideally all the time
Note that it says “can scale” – if your system cannot scale the required resources (like I/O) you just end up with more workers waiting
Ideally the data to be processed by the Parallel Execution Servers should be equally distributed, so that all slaves have the same amount of work, start around the same time and complete around the same time, which is illustrated by above image. The four Parallel Execution Servers P000 to P003 all spend similar amount of time processing the data and end around the same time.
But if the data is not evenly distributed, which can happen with almost all kinds of distribution methods (most commonly HASH or RANGE distributions), then a picture like the one above could result. P002 gets assigned a lot more data than the other slaves and hence takes much longer to complete. Meanwhile the other slaves have long finished processing their assigned data and sit idle, as the overall operation in many cases can only progress after all slaves are done. This means that the duration of such an operation is determined by the slave taking the longest, and during that period the effective degree of parallelism is potentially far lower than in the ideal case. In this example, for a significant amount of time only P002 will be active, so only a single process instead of the four assigned.
More information: https://www.youtube.com/watch?v=vJUE1gz8Cgk
If V$PQ_TQSTAT isn’t that useful, how can we measure it then?
One of the challenges of Parallel Execution is the data distribution – if we don’t distribute the work well among the Parallel Slaves, we won’t scale.
How can we measure the data distribution (but not necessarily work distribution)?
Straightforward using the special view V$PQ_TQSTAT, which shows the amount of data distributed for each re-distribution.
V$PQ_TQSTAT drawbacks:
Only populated in the Query Coordinator session, cannot be accessed from outside
Only populated after successful completion of queries, cancelling a long-running query or queries failing due to errors (for example, out of TEMP space) prevents the population of the view
More complex execution plans using Parallel DML / DDL (for example Parallel Insert or CTAS) and / or involving multiple DFOs often result in an incomplete / useless population of the view
For Table Queue ID = 3, the data was produced uniformly but all re-directed to the same Parallel Slave. You can find the corresponding operation in the execution plan.
If V$PQ_TQSTAT isn’t that useful, how can we measure it then?
However, Real-Time SQL Monitoring has the following shortcomings:
Monitoring information is not persisted and on busy systems might be flushed from memory pretty quickly (memory size configurable via undocumented parameter)
It doesn’t monitor complex execution plans (> 300 lines, configurable via undocumented parameter)
It doesn’t indicate which operations of the execution plan caused the skew
Besides that, it requires a Tuning Pack license
Funnily enough, the row distribution per execution plan line operation is at least available
But not from the generated report, only by directly querying GV$SQL_PLAN_MONITOR
Raw data is not part of the generated report
You need to run your own custom query
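Such a custom query against GV$SQL_PLAN_MONITOR might look like the following sketch (the SQL_ID is a placeholder):

```sql
-- Rows produced per plan line and Parallel Execution Server;
-- an uneven distribution across the PXXX processes indicates skew
select plan_line_id, process_name, sum(output_rows) as rows_produced
from   gv$sql_plan_monitor
where  sql_id = 'abcd1234efgh5'       -- hypothetical SQL_ID
group  by plan_line_id, process_name
order  by plan_line_id, process_name;
```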