Teradata performance optimization project
III. Optimizing a query

Optimizing a query with PTA

Concept of optimization

Our concept of optimizing a query is quite similar to the way we found the good candidate queries: instead of deeply understanding the whole picture, we try to find the weak points and explore their reason(s). For this we need the appropriate information prepared, and good highlighting that helps us focus on the point. We do not want to understand each query’s meaning and SQL text, unless it is unavoidable for the tuning. In many cases this saves us a lot of time.

Similar to the “selection” phase, we will use the PRISE Tuning Assistant for tuning the individual queries. Let’s see how.

The workflow

Save the ID(s) of the query(ies)

When tuning the selected queries, we will return to a given one several times – to validate an idea, check things, etc. PTA can save a selected list of IDs to a CSV file, which can be loaded into the search window at any time, so the selected queries appear on demand.

Check typical patterns

We have a standard way of examining a query: we test it against about half a dozen basic patterns – discussed below. If one of them matches, we can skip the deep understanding of the query. Otherwise PTA will help us understand what exactly is happening.
Sometimes it is worth checking the idea on different instances of the SQL (other days’ executions).

Document the idea and solution

It is very important to write down what we have found and what the proposed solution is. I typically do this iteratively: first I write down my first impressions for each target query, then I go back and refine them. At the end it is important to have an exact, detailed specification of what to do with the SQLs. I use a Word document with a standard format. Each query has a chapter with a standard structure:

Query ID: ID in DBQL to find the query
Job/Step name: Place where the SQL is issued
RequestNum: Sequence number within the session, to identify the statement
Reason: The bad performance case (pattern)
Desc: Detailed description of the reason for the bad performance
Action: Detailed description of what to do. This is the specification for the development.

I usually verify the concept on a test environment before I finalize the “Action”. It is important that the validation environment be similar to production, though even then the same result is not guaranteed.

Implement the modifications – in drops

When we know what to do with each of the SQLs, we can assign them to delivery drops, negotiated with operations and developers. I typically use these aspects:

  1. Big gains first
  2. Same job in same drop
  3. Spread big operations (patches) evenly across the drops
  4. Spread the number of queries evenly across the drops

Query analysis – Typical patterns

The first general rule when optimizing a query is:

THERE ARE NO ALMIGHTY RULES!

We only have high-probability rules, but they are good enough in most cases.

My initial checklist for a query

  1. “Peakyness” check
    I check whether there is a “peak step” that runs far longer than the others, or whether the total runtime / resource consumption is spread over several dominant steps. For this I check the “Duration%” field of the steps. If the step with the yellow background (the longest running one) has 50%+, we have a peak-step query.

    Fig. 1 – Peakyness check
  2. Resource skewness check
    Let’s look at the query’s resource totals and their skewness. If it is >50-70%, it is most often a skewness problem, caused by an uneven distribution of the data or of the tasks to be done.

    Fig. 2 – Resource skewness check
  3. Object skewness check
    I browse through the objects that participate in the query, and check whether there are big ones with significant skewness (>10-20%). (A query sketch for checking this from the data dictionary follows the checklist.)

    Fig. 3 – Object skewness check
  4. “Peak step” examination
    The goal is to learn what causes the long runtime:

    • Are the sources skewed?
    • Is the operation skewed?
    • Does the result contain a lot of data?
    • Is the result placement skewed?
    • Is a big result “duplicated-to-all-AMPs”?

    Fig. 4 – Peak step examination
  5. “Peak step” drilldown
    I start from the longest running step (typically the most resource hungry one), catch an overview and browse through its predecessors, looking for bad Fidelity*, high skew or other suspicious things. All of these are vividly highlighted by PTA; I just have to catch sight of them.

    Fig. 5 – Peak step drilldown
  • Fidelity: the ratio of the estimated and the actual row count of a spool. Bad values indicate missing/stale statistics or too many join steps.
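
If PTA is not at hand, table-level skew (checklist item 3) can also be approximated from the data dictionary. Below is a minimal sketch, assuming access to DBC.TableSizeV; the database name filter is only a placeholder:

    -- Hedged sketch: approximate per-table skew from the data dictionary.
    -- A value near 0% means even distribution across the AMPs; the 10-20%
    -- threshold above is a rule of thumb, not an absolute limit.
    SELECT  DatabaseName,
            TableName,
            SUM(CurrentPerm)                                              AS total_perm_bytes,
            (1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm))) * 100   AS skew_pct
    FROM    DBC.TableSizeV
    WHERE   DatabaseName = 'ETL_DB'        -- placeholder: the examined database
    GROUP BY 1, 2
    ORDER BY skew_pct DESC;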

Checklist conclusions

When I have run through the checklist above (~1 minute even for a very complex query, with some experience), I have a good impression of the case:

  • Step-peaked?
    If peaked, it is probably easy to get rid of the problem. If not, it may be better to move on to another query, if possible – because we would have to deeply understand what is happening.
  • Skewed processing?
    If skewed, then data is skewed or an evil hash collision exists.
  • Skewed source?
    Skewed source table(s) often imply skewed execution; the tables’ PI may need revision.
  • What type of operation runs long?
    This is one of the most important ones.

Most frequent patterns

  • Hash collision at Insert / Create_table_as_select

    Symptoms
    PeakStep: MRG, Target table: SET type w/NUPI, often high skew
    Solution
    MULTISET, different PI
  • Big table duplication

    Symptoms
    1. PeakStep: RET, Placing: Dupl, NO confidence, SpoolSize: huge
    2. PeakStep: RET, Placing: Dupl, HI confidence, SpoolSize: huge

    and the spool is used later in a JOIN, where the other source has bad Fidelity (>10..100)

    Solution

    Some statistics are missing or stale. Find the NO confidence or the bad Fidelity branch, and check the “filter” condition in the explain (PTA has a condition visualization).
    If “NO confidence” (symptom 1), the source table lacks some statistics on the filter condition columns. In case of HI confidence (symptom 2), check the other branch of the join step: some statistics will be missing there. (A sketch of collecting such statistics follows this pattern list.)

  • Redistribution resulting skew

    Symptoms
    Step target is highly skewed (>70-80%)
    Solution
    It depends. Check the source tables, exclude peak values, or break up the query. This typically needs deeper analysis.
  • Sliding-window join for many partitions

    Symptoms
    Merge join of “sliding-window” type, more than a couple of partitions involved
    Solution
    Apply exactly the same partitioning definition to the joined tables
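
For the “big table duplication” pattern, the fix usually means giving the optimizer statistics on the columns of the poorly estimated condition. A minimal sketch with hypothetical table and column names (the real columns come from the condition shown in the explain):

    -- Hedged example: collect statistics on the filter / join condition
    -- columns of the branch that shows "no confidence" or bad Fidelity.
    -- Table and column names are placeholders.
    COLLECT STATISTICS ON ETL_DB.Sales_Tx COLUMN (Tx_Date);
    COLLECT STATISTICS ON ETL_DB.Sales_Tx COLUMN (Store_Id, Tx_Date);  -- multi-column stats for a combined condition

    -- Re-check the plan afterwards: the huge "duplicated to all AMPs" spool
    -- should disappear or shrink once the estimates become realistic.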

Case examples in the project

Hash collision insert

This is a very simple case.

  • Peak step is a MRG (merge the result into the target table)
  • Target table has NUPI, is SET type
  • PI cardinality is low (9M distinct compared to 169M rows)

Fig. 6 – PI cardinality check
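
To reproduce this check without PTA, one can simply compare the number of distinct PI values to the row count. A hedged sketch with placeholder names:

    -- Hedged sketch: PI cardinality vs. row count of the target table
    -- (names are placeholders). ~9M distinct values against ~169M rows
    -- means roughly 19 rows per PI value, i.e. many duplicate-row checks
    -- on a SET table at insert time.
    SELECT  COUNT(*)                                              AS total_rows,
            COUNT(DISTINCT PI_Column)                             AS distinct_pi_values,
            CAST(COUNT(*) AS FLOAT) / COUNT(DISTINCT PI_Column)   AS rows_per_pi_value
    FROM    ETL_DB.Target_Table;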

This means that we have a serious number of hash collisions in the table, which will slow down the insert extremely. We have two options:

  • change PI or
  • change to MULTISET

Both have their side effects; in this case we change the table to MULTISET.

Method: since the table is not re-created at every script run, I created a “patch” script to convert the table to the new structure (a sketch of the script follows the steps):

  1. Create the new structure as an empty table
  2. Insert the data into it
  3. Copy the statistics
  4. Drop the old table
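
A hedged sketch of such a patch script follows. All object names are placeholders, the statistics-copy statement assumes Teradata 14.0+ syntax, and the final rename (so the loading scripts keep working unchanged) is my assumption on top of the four steps above:

    -- 1. Create the new structure as an empty MULTISET copy of the old table
    CREATE MULTISET TABLE ETL_DB.Target_Table_N AS ETL_DB.Target_Table WITH NO DATA;

    -- 2. Insert the data
    INSERT INTO ETL_DB.Target_Table_N
    SELECT * FROM ETL_DB.Target_Table;

    -- 3. Copy the statistics from the old table (Teradata 14.0+ form)
    COLLECT STATISTICS ON ETL_DB.Target_Table_N FROM ETL_DB.Target_Table;

    -- 4. Drop the old table, then put the new one in its place
    DROP TABLE ETL_DB.Target_Table;
    RENAME TABLE ETL_DB.Target_Table_N TO ETL_DB.Target_Table;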

Exclude peak value

Fig. 7 – High skew after redistribution

In this case the symptom is very simple: a RET step (retrieve records into a spool) creates extremely high skew (98%). The redistribution itself is fine – it has a good reason: it supports the next “merge join” step.

However, we have one peak value in the join column, ‘X X X X X’, which – as I checked manually – has no corresponding record in the joined table.

Solution: I simply excluded this value in the “WHERE” condition. This eliminated the high skew, and the query ran fine, with the same result.
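
A hedged illustration of the kind of change (table and column names are placeholders; only the peak value ‘X X X X X’ comes from the real case):

    -- The single peak value would send almost all rows to one AMP after
    -- redistribution. It has no matching row in the joined table, so
    -- excluding it up front leaves the result set unchanged.
    SELECT  t.*
    FROM    ETL_DB.Big_Source  t
    JOIN    ETL_DB.Lookup_Dim  d
      ON    d.Join_Col = t.Join_Col
    WHERE   t.Join_Col <> 'X X X X X';   -- exclude the peak value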

Materialize a joined view

Fig. 8 – Fidelity goes worse

This SQL was very complex (12 tables involved), and all tables had the appropriate statistics. PTA showed that the optimizer performed a complicated query rewrite, and Fidelity got rather bad as the data flowed from spool to spool:

There was a view involving 5 tables, some of them big, but the view returned only ~100 records.
I materialized the view and collected statistics on it. This resulted in a brand new execution plan, which was now accurate.
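
A minimal sketch of one way to do the materialization, assuming placeholder names for the view, the work table and the join column:

    -- Hedged sketch: materialize the small (~100-row) view result into a
    -- work table, give the optimizer exact statistics on it, then reference
    -- the work table instead of the view in the big query.
    CREATE MULTISET TABLE WORK_DB.Mat_View_Rslt AS
    ( SELECT * FROM APP_DB.Complex_View )
    WITH DATA
    PRIMARY INDEX (Join_Col);

    COLLECT STATISTICS ON WORK_DB.Mat_View_Rslt COLUMN (Join_Col);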

Summary

  • Work from a saved list
  • Follow the workflow
  • Find patterns
  • Document the concept before implementing

The next post is about: “Testing, Delivery and Evaluation – enjoy the results”.

Stay tuned!
