Concept of optimization
Our approach to optimizing a query is quite similar to the way we find the good candidate queries. Instead of deeply understanding the whole picture, we try to find the weak points and explore their reason(s). For this we need the appropriate information prepared, with good highlighting that helps us focus on the point. We do not want to understand each query’s meaning and SQL text unless it is unavoidable for the tuning. This saves a lot of time in many cases.
Similar to the “selection” phase, we will use the PRISE Tuning Assistant (PTA) for the individual query tunings. Let’s see how.
Save the ID(s) of the queries
When tuning the selected queries, we will return to each one several times: to validate an idea, check things, etc. PTA can save a selected list of IDs to CSV, which can be loaded into the search window at any time, so the selected queries appear on demand.
Check typical patterns
We have a standard way of examining a query, testing it against about half a dozen basic patterns – this part is discussed below. If one of them matches, we can skip the deep understanding of the query. Otherwise PTA will help us understand what exactly is happening.
Sometimes it is worth checking the idea on different instances of the SQL (other days’ executions).
Document the idea and solution
It is very important to write down what we’ve found and what the proposed solution is. I typically do this iteratively: first I write down my first impressions for each target query, then I go back and refine them. At the end it is important to have an exact, detailed specification of what to do with the SQLs. I use a Word document with a standard format. Each query has a chapter with a standard structure:
|Query ID:|ID in DBQL to find the query|
|Job/Step name:|Place where the SQL is issued|
|RequestNum:|Sequence within the session, for identification|
|Reason:|Bad performance case|
|Desc:|Detailed description of the bad performance reason|
|Action:|Detailed description of what to do. This is the specification of the development.|
I usually check the concept in a test environment before I finalize the “Action”. It is important to have a validation environment similar to production, though the same result is not guaranteed.
Implement the modifications – in drops
When we know what to do with each of the SQLs, we can assign them to delivery drops, negotiated with operations and developers. I typically consider these aspects:
- Big gains first
- Same job in same drop
- Big operations (patches) spread evenly
- Number of queries spread evenly
Query analysis – Typical patterns
The first general rule when optimizing a query is:
We have only high-probability rules, but they are good enough in most cases.
My initial checklist for a query
I check whether there is a “peak step” that runs far longer than the others, or whether the total runtime / resource consumption is divided among several dominant steps. For this I check the “Duration%” field of the steps. If the step with the yellow background (the longest running) has 50%+, we have a peak-step query.
Resource skewness check
Let’s look at the query’s resource totals and its skewness. If it is >50-70%, it is most often a skewness problem. This is caused by an uneven distribution of the data, or of the tasks to be done.
Object skewness check
I browse through the objects that participate in the query and check whether there are big ones with significant skewness (>10-20%).
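The same check can be done without PTA, against the data dictionary. The sketch below is a hedged example (database name and thresholds are illustrative assumptions); it compares each table’s largest per-AMP share with the average share:

```sql
/* Illustrative skew check against DBC.TableSizeV: one row per AMP,
   so MAX vs. AVG of CurrentPerm shows the distribution quality. */
SELECT  DatabaseName
     ,  TableName
     ,  SUM(CurrentPerm)                                 AS TotalPerm
     ,  100 * (1 - AVG(CurrentPerm) / MAX(CurrentPerm))  AS SkewPct
FROM    DBC.TableSizeV
WHERE   DatabaseName = 'MyDatabase'                      -- assumption: target DB
GROUP BY 1, 2
HAVING  SUM(CurrentPerm) > 1E9                           -- only "big" tables (>1 GB)
   AND  100 * (1 - AVG(CurrentPerm) / MAX(CurrentPerm)) > 10
ORDER BY TotalPerm DESC;
```

Tables surfacing here with a high SkewPct are the candidates for a PI revision.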
“Peak step” examination
The goal is to learn what causes the long run:
- Are the sources skewed?
- Is the operation skewed?
- Does the result contain much data?
- Is the result placement skewed?
- Is the big result “duplicated-to-all-AMPs”?
“Peak step” drilldown
I start from the longest running step (typically the most resource-hungry), catch an overview, and browse through its predecessors, looking for bad Fidelity*, high skew, or other suspicious things. All of them are vividly highlighted by PTA; I just have to catch sight of them.
- *Fidelity: the ratio of the estimated and actual row count of a spool. Bad values indicate missing/stale statistics or too many join steps.
Once I have run through the checklist above (about 1 minute for a very complex query, with some experience), I have a good impression of the case:
If peaked, it is probably easy to get rid of the problem. If not, it may be better to move on to another query, if possible, because we would have to understand deeply what is happening.
If skewed, then the data is skewed or an evil hash collision exists.
Skewed source table(s) often imply skewed execution; the tables’ PI may need revising.
What type of operation runs long?
This is one of the most important ones.
Most often patterns
Hash collision at Insert / Create_table_as_select
- Symptom: peak step is MRG; target table is of SET type with a NUPI; often high skew
- Solution: change the table to MULTISET, or choose a different PI
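A quick way to gauge hash-collision pressure is to compare the row count with the number of distinct PI values. This is a hedged sketch; table and column names are illustrative:

```sql
/* On a SET table, each insert must be checked against all rows sharing
   the same PI value, so a high rows-per-PI-value ratio means trouble. */
SELECT COUNT(*)                     AS TotalRows
     , COUNT(DISTINCT NUPI_Col)    AS DistinctPIValues
     , CAST(COUNT(*) AS FLOAT)
       / COUNT(DISTINCT NUPI_Col)  AS AvgRowsPerPIValue
FROM   MyDatabase.TargetTable;
```

A high AvgRowsPerPIValue on a SET table is the signature of this pattern.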
Big table duplication
- Symptom: peak step is RET; Placing: Dupl; NO confidence; SpoolSize: huge
- Symptom: peak step is RET; Placing: Dupl; HI confidence; SpoolSize: huge; the spool is used later in a JOIN where the other source has bad Fidelity (>10..100)
Some statistics are missing or stale. Find the NO confidence or the bad-Fidelity branch, and check the “filter” condition in the explain (PTA has a condition visualization). If “NO confidence”, the source table lacks some statistics (on the filter condition columns); also check the other branch of the join step, since some statistics will be missing there.
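Teradata itself can suggest which statistics the optimizer is missing. A sketch of the hunt (table and column names are illustrative):

```sql
/* DIAGNOSTIC HELPSTATS makes the optimizer append a list of the
   statistics it would like to have to the end of each EXPLAIN output. */
DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT ...;   /* the problematic query; read the recommended statistics
                 at the end of the plan */

/* Collect the ones relevant to the filter/join condition, e.g.: */
COLLECT STATISTICS ON MyDatabase.MyTable COLUMN (Filter_Col);
```

Not every recommendation is worth collecting; pick the ones on the columns that appear in the suspicious condition.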
Redistribution resulting skew
- Symptom: the step target is highly skewed (>70-80%)
- Solution: it depends. Check the source tables, exclude peak values, or break up the query; this typically needs deeper analysis
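To find out whether a single peak value dominates the redistribution column, a simple frequency check helps. A hedged sketch with illustrative names:

```sql
/* The top values by frequency reveal whether one value gets hashed
   to a single AMP, causing the redistribution skew. */
SELECT TOP 10
       Join_Col
     , COUNT(*) AS Cnt
FROM   MyDatabase.BigSourceTable
GROUP BY 1
ORDER BY Cnt DESC;
```

If the first value is orders of magnitude more frequent than the rest, the “exclude peak value” technique (see the case example below) may apply.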
Sliding-window join for many partitions
- Symptom: merge join of the “sliding-window” type, with more than a couple of partitions involved
- Solution: apply exactly the same partitioning definition to the joined tables
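A sketch of the solution, with illustrative table names and ranges: if both tables share the PI column and the identical RANGE_N expression, the optimizer can match partitions directly instead of sliding a window over them.

```sql
CREATE MULTISET TABLE Sales_A
( Sale_Date DATE
, Amount    DECIMAL(12,2)
)
PRIMARY INDEX (Sale_Date)
PARTITION BY RANGE_N (Sale_Date BETWEEN DATE '2020-01-01'
                                AND     DATE '2025-12-31'
                                EACH INTERVAL '1' MONTH);

/* The joined table gets the very same partitioning definition: */
CREATE MULTISET TABLE Sales_B
( Sale_Date DATE
, Qty       INTEGER
)
PRIMARY INDEX (Sale_Date)
PARTITION BY RANGE_N (Sale_Date BETWEEN DATE '2020-01-01'
                                AND     DATE '2025-12-31'
                                EACH INTERVAL '1' MONTH);
```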
Case examples in the project
Hash collision insert
This is a very simple case.
- Peak step is a MRG (merge the result into the target table)
- Target table has NUPI, is SET type
- PI cardinality is low (9M distinct compared to 169M rows)
This means that we have a serious number of hash collisions in the table, which slows down the insert dramatically. We have two options:
- change PI or
- change to MULTISET
Both have side effects; in our case we changed the table to MULTISET.
Method: since the table is not re-created at every script run, I created a “patch” script to convert the table to the new structure:
- Create new structure as empty
- Insert data into
- Copy statistics
- Drop old table
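The steps above can be sketched as follows. This is a hedged example: names are illustrative, and the `COLLECT STATISTICS ... FROM` form for copying statistics between tables is available only on newer Teradata releases, so verify it on your version.

```sql
/* 1. Create the new (MULTISET) structure, empty: */
CREATE MULTISET TABLE MyDB.Target_New AS MyDB.Target WITH NO DATA;

/* 2. Insert the data: */
INSERT INTO MyDB.Target_New SELECT * FROM MyDB.Target;

/* 3. Copy the statistics from the old table: */
COLLECT STATISTICS ON MyDB.Target_New FROM MyDB.Target;

/* 4. Drop the old table and take over its name: */
DROP TABLE MyDB.Target;
RENAME TABLE MyDB.Target_New TO MyDB.Target;
```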
Exclude peak value
In this case the symptom is very simple: a RET (retrieve records into a spool) creates extremely high skew (98%). The redistribution itself is fine; it has a good reason: it supports the next “merge join” step.
However, we have one peak value in the join column, ‘X X X X X’, which has no corresponding record in the joined table (I checked manually).
Solution: I simply excluded this value in the “WHERE” condition. This eliminated the high skew, and the query ran fine, with the same result.
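A sketch of the fix; table and column names are illustrative, the excluded literal is the one from the case above. Excluding the single peak value is safe only because it has no match in the joined table, so the result set is unchanged:

```sql
SELECT a.*
     , b.Extra_Col
FROM   MyDB.BigTable   a
JOIN   MyDB.OtherTable b
  ON   a.Join_Col = b.Join_Col
WHERE  a.Join_Col <> 'X X X X X';  /* peak value with no matching row */
```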
Materialize a joined view
This SQL was very complex (12 tables involved), and all the tables had the appropriate statistics. PTA showed that the optimizer performed a complicated query rewrite, and Fidelity went rather bad as the data flowed from spool to spool:
There was a view that involved 5 tables, some of them big, yet the view resulted in only ~100 records.
I materialized the view and collected statistics on it. This resulted in a brand new execution plan, which was now accurate.
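The materialization can be sketched like this (view, table, and column names are illustrative assumptions):

```sql
/* Materialize the ~100-record view into a real table: */
CREATE MULTISET TABLE MyDB.ComplexView_Mat AS
( SELECT * FROM MyDB.ComplexView )
WITH DATA
PRIMARY INDEX (Key_Col);

/* Statistics on the small materialized table give the optimizer
   accurate estimates for the rest of the query: */
COLLECT STATISTICS ON MyDB.ComplexView_Mat COLUMN (Key_Col);

/* Finally, replace the view reference with the table in the main query. */
```

The key point is that the optimizer now joins against a small table with known cardinality instead of re-deriving a five-table view on the fly.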
- Work from a saved list
- Follow the workflow
- Find patterns
- Document the concept before implementing
The next post is about: “Testing, Delivery and Evaluation – enjoy the results“