Teradata performance optimization project
I. Basics and prerequisites

Teradata performance optimization

Introduction

A Teradata database is a pretty expensive box that serves a lot of people in the company – directly or indirectly. Efficient utilization of its resources is therefore essential, from both the financial and the user satisfaction point of view.


This series of posts will follow a real-life Teradata optimization project’s methodology, steps and results, with this achievement:

System level gain: 31.6%

Levels of performance optimization

We can typically apply 3 levels of Teradata performance optimization:

Application

Meaning: Analyze user requests and system activity, then refine them by eliminating unnecessary activities.
This level checks whether the executed processes are really needed, and whether they do what the user actually needs. E.g.: eliminate unnecessary data loads or reports; limit a report to a shorter period.

  • Big gains are possible without any trade-off
  • Requires comprehensive analysis (incl. finding and interviewing the appropriate persons)

SQL

Meaning: Pick individual SQL statements and optimize their execution.
This approach finds the top resource consuming SQL statements and reduces their resource consumption while producing the same results.
E.g.: change the PI; add/refresh statistics; rewrite the SQL.

  • Easy analysis
  • Low risk (small impact, good testing possibilities)
  • Needs SQL level analysis

Infrastructure

Meaning: Tune low level system (RDBMS, OS, hardware) parameters.
Tuning at the infrastructure level is a global intervention, regardless of users/business areas or importance.

  • Can work miracles in some special cases
  • High risk (affects the whole system)
  • Complex cross-dependencies

I prefer and recommend starting with SQL level optimization, because it delivers high value at low risk, and the project scales well. The other levels can come into focus later on.

Goal of SQL optimization

The goals of an optimization project can be different. Some examples:
“Finish load procedure until 6am”
“Present my report 2 hours earlier”
“Reduce CPU load by 20%”
“Let the hardware extension be postponed by 6 months”

The goal of our project was to speed up the “load batch” process, so that a specific, identified data area is loaded and available for further usage 1 hour earlier than before.

Prerequisites

The key input for an SQL optimization is the information about what exactly happens:

What, when, how:

Which SQL statements ran, when, what steps were executed, how many resources did they consume, with what skewness, etc.
All these can be found in the DBQL tables of Teradata (dbc.dbql*).
Switching them on will not cause significant extra load, but the tables must be archived/purged regularly (see the PDCR tool in the Teradata package).
Teradata Administrator lets you switch logging on (Tools/Query logging…), and you can check the actual settings:

select * from dbc.dbqlruletbl;
The Explain/Obj/Sql/Step flags should be set to ‘T’; the others are not necessary. Global logging (1 record with UserID: 00000000) is recommended.
If continuous logging is not possible, logging should be switched on at least 1 week before starting the optimization project.
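
If logging is not switched on yet, a minimal sketch of enabling global logging from SQL (instead of the Teradata Administrator GUI) could look like this; the exact option list and the SQLTEXT limit are assumptions, so verify them against the DBQL documentation of your Teradata version:

/* Sketch: switch on comprehensive, global DBQL logging with the detail levels mentioned above */
BEGIN QUERY LOGGING WITH EXPLAIN, OBJECTS, SQL, STEPINFO LIMIT SQLTEXT=2000 ON ALL;

/* Switching it off later (e.g. after the project) is the symmetric statement: */
/* END QUERY LOGGING ON ALL; */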

What / Who runs the statements?

Teradata provides a great feature for this: Query Banding
All executed SQL statements can be flagged with several information elements regarding the environment, job, script, version, etc., which helps to identify a specific one.
The queryband data will appear in the DBQL logs – if the running environment is configured to flag the queries it runs.
Query banding is not essential for an optimization, but it helps a lot to identify the exact point in the data warehouse ecosystem where the modification should be done (which job/job step/script).
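
For illustration, this is how a load script or session can tag its statements; the variable names below are just an assumed convention, not a standard:

/* Sketch: every subsequent statement of this session will carry these labels in dbc.dbqlogtbl.QueryBand */
SET QUERY_BAND = 'ApplName=DWH_Load;JobName=Daily_Customer;ScriptName=load_customer.sql;' FOR SESSION;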

What are the environmental conditions?

E.g. data volumes, statistics, etc.
This information is available in the data dictionary (DBC tables and views), but only as a current snapshot – not as of the time the statement ran.
In most cases this is OK for us, but this property needs to be kept in mind.

When all these are in place, we are ready to run a successful performance optimization project.

The next post is about how to identify the good candidate SQL statements for optimization.

Stay tuned!

QUICKTIPS – DELETE+INSERT – together or separately?

Shall I run DELETE and INSERT separately?

We often have permanent or temporary tables in our data warehouse which must be purged before they are populated again. We can run the two operations within one transaction or separately. It is not the same…

Options – (example)

The examples assume Teradata (BTET) mode, which is auto-commit by default.

  A. Together: both statements run in the same transaction
    1. Explicit transaction
      begin transaction;
      delete from TABLE1;
      insert into TABLE1 SELECT …..;
      end transaction;

    2. Multi-statement
      delete from TABLE1
      ;insert into TABLE1 SELECT …..;
  B. Separately: the statements run in separate transactions
    delete from TABLE1 [ALL];
    insert into TABLE1 SELECT …..;

     (Please note: if you run this in Teradata Administrator, aka WinDDI, it will run in one explicit transaction, since the application places a BT; before the script and an ET; after it.)

    In this case the ALL clause is only cosmetic: a DELETE without a WHERE condition is handled internally as “ALL” anyway.

What is the difference?

The result

If everything is OK, we will not experience any difference in the end – from the result aspect. The same data will be found in TABLE1 whichever way we finish. The difference shows when the INSERT step fails, e.g. for running out of perm space:

  • In the “A.” cases the whole transaction will be rolled back, therefore we will see all the data in the table that we had before the delete.
  • In the “B.” case the data is deleted from the table independently of the success of the INSERT phase. After a failed INSERT the table will be empty.

We have another additional transient difference:

  • In the A. case we will never “see” an empty table from another session at any point in time: up to a point the table appears as the “before” version, and when the transaction finishes it switches to the “after” image for other sessions in no time.
  • In the B. case, once the DELETE has finished, the table appears empty up to the moment the INSERT statement is committed. So if the INSERT lasts 2 hours, others will face an empty TABLE1 for 2 hours.

So I do not recommend using option B. if the transient “empty table” state is not acceptable for other users.

The performance

The A. and B. options are executed in significantly different manners. The DELETE operation knows that all the data will be dropped, so internally it just marks the data blocks as free instead of clearing their content record by record – BUT only if the statement is the only statement in the transaction! In this case only the data block administration must be logged (to be able to roll back for some reason) instead of every block change. This results in far less Transient Journal activity, besides fewer block I/Os.

Let’s see some measurements.
I’ve put 10M records into a table, and then executed the same DELETE + INSERT scenario on it in both ways. The results are the following:

  A. (together):   CPU: 9 sec      I/O: 39K
  B. (separately): CPU: <0.1 sec   I/O:  4K

It is worth minding the difference if you have to do it with billions of records.

PRISE Tuning Assistant use cases

What is PRISE Tuning Assistant (PTA) good for?

PRISE developed this tool to help DBAs and developers solve performance issues quickly and easily. In this post I describe a couple of typical situations in which it can assist you.
A fully functioning trial version of PRISE Tuning Assistant is available for download.

In general, the PTA application consists of two main functions:

  • Query finder
    Enables applying lots of search criteria like: thresholds, skewness, queryband, SQL text, Explain text, used objects, time window, etc.
  • Query analyzer
    Unfolds the execution process of one selected query and helps in understanding the query and its performance bottlenecks. It collects all the required supplementary information (table sizes, DDLs, statistics, etc.) in one comprehensive dashboard, and provides a “walking tour” feature with which you can roam the execution plan.
All this enables even an averagely skilled Teradata expert to find which queries need to be optimized, and to figure out the weak points of even a complex query in minutes.

PTA’s key architectural properties:

  • Supports active DBQL (dbc.dbqlogtbl and friends…) and also archived DBQL tables.
  • Optimized DBQL querying (fast response times and low resource usage)
  • Requires read-only access to the database
  • Portable, multi-platform Java application; requires no installation

Let’s see typical use cases of the application.

Find the TOP consuming queries

The “low hanging fruits” of an optimization are the top consuming queries – the SQL statements that gobble the most CPU or I/O. They are often few in number, yet consume a significant percentage of the total system resources, as I mentioned in this post.

How to find them?
Simply set the time interval you focus on, select “CPU”, “I/O” or “Duration” (for long running ones) as the TOP criterion, and click the “Get DBQL data” button. You will get the list of the qualifying queries in descending order of your TOP criterion. Afterwards you can sort it by the other KPIs, like:

  • Skewness
  • Spool usage
  • LHR/RHL (Larry Higa ratio)
  • QueryBand (and specific variables of it)
  • Etc.

Skewed CPU or I/O usage

Teradata is a massively parallel system. It can utilize its full power only if the work is spread evenly across the AMPs; otherwise (skewed queries) the response time increases dramatically, while the reported (virtual) resource consumption can remain low.

How to find the bad guys?
Start the same way as in the “TOP queries” case, but choose “CPU Impact” or “I/O Impact” as the TOP criterion. This will list those queries that have the highest “Impact” (skew-normalized resource consumption) on the system.

“No more spool space”

Teradata in fact uses spool limiting to cancel queries that have bad execution plans. Therefore “Failure 2646: No more spool space” usually indicates a bad query execution plan, and the solution is usually not raising the spool limit. Yet those spool limits are often raised until the query is able to finish, and the bad queries throttle the system… Why does a query usually run out of spool?

  • It duplicates a large spool/table to all AMPs
    (the optimizer thinks it is small)
  • It redistributes a spool/table by a skewed value
    (the optimizer thinks it is non-skewed)
  • It processes (copies to spool) an unreasonably large number of records

The first two cases are usually caused by missing or stale statistics, which can easily be found and supplied.
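
For illustration, supplying a missing statistic is a single statement per column (or column group); the table and column names below are made up:

/* Sketch: collect statistics on the join/filter column the optimizer mis-estimated */
collect statistics on TransactionT column (Customer_ID);

/* Re-collecting all already defined statistics of the table later: */
collect statistics on TransactionT;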

How to select?

Start the same way as in the “TOP queries” case, but choose “Spool” as the TOP criterion. This will list the queries with the highest spool consumption, and the SQL dashboard will help you figure out which statistics should be collected. In addition you can filter for “Error 2646” with the Error code option, which will list those queries that have reached the spool limit.

Find a specific query

The DBA gets a question: “My query runs slow, please help me!” – occasionally with the SQL attached in the email. PTA helps to figure out which query it was in the DBQL. You can filter on:

  • Username
  • Session ID
  • Full query text (with LIKE)
  • QueryID (internal DBQL ID)
  • QueryBand (with LIKE)
  • Error code

Search disliked plan elements

PTA enables text search in the Explain text of the queries. E.g. quickly find:

  • Product joins
  • Sliding-window joins
  • “Redistributed by”

If this is combined with a CPU and/or I/O limitation (the CPU >= and I/O >= criteria), we get the queries where these plan elements really cause a problem.

Object usage

We have a table or index in the database, and we look for all the queries that utilize that object.
It is very useful for:

  • Preparing for a PI change
    A PI change is usually triggered by skewed data distribution, but the more important factor is the access path it provides. Before you change a PI, you need to examine the workload accessing that table. What are the typical filter/join criteria, and does the current/new PI support a good access path? PTA enables you to do this quickly and easily.
  • Determine usage of a table
    How many times is a table accessed a day? Is it used extensively, or just kept loaded for nothing?
  • Determine usage of a hash/join index
    A hash/join index requires processing and storage resources. Is it well utilized or not?
How?
Simply set the “Object used” filter criteria. In addition you can use the “Thru join/hash index” option to include the join/hash index based data accesses.

Recurring queries

Which query should be optimized? Maybe there are medium consuming queries that run many times a day, consuming a lot of resources in total. Use the “Recurring query finder” function of PTA to see aggregated information: total and average Duration, CPU and I/O.

What was a query run by?

If your ETL and OLAP systems use QueryBand properly, you can select specific queryband variables (a maximum of 6) to be parsed into separate columns of the result query list. Sort the result by them, and find which queries of a load job should be optimized.

DBQL analysis II. – Skewness – The big resource robbery detection

The big resource robbery detection

Skewness

Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.

 

About Skew

If you are not comfortably familiar with skewness or DBQL, please read the corresponding posts first.

As you can see, the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.

Skewness is a good indicator for highlighting the workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.
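
A quick worked example of the metrics used in the queries below (the numbers are made up): on a 100-AMP system, a query with AMPCPUTime = 100 CPU seconds and MaxAMPCPUTime = 10 CPU seconds covers 1000 CPU seconds of the system, of which 900 are lost to bad parallelism:

select
      10 * 100                        CPUImpact     /* MaxAMPCPUTime * number of AMPs = 1000 "covered" CPU seconds */
    , 100 - (100/100.0) * 100 / 10    "CPUSkew%"    /* = 90: only 10% of the covered CPU did evenly distributed work */
;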

Find “Top skewed queries” in DBQL

With this SQL you can list the top impacting queries (replace the date value, and maybe adjust the date filtering according to your local settings):

select top 50
      ProcID
    , QueryID
    , AMPCPUTime
    , MaxAMPCPUTime * (hashamp () + 1) CPUImpact
    , CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
    , TotalIOCount
    , MaxAMPIO * (hashamp () + 1) IOImpact
    , CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
    , AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
    , TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
    , ParserCPUTime
    , Queryband
    , Substr(QueryText,1,2000) QueryText
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst 
where
    logdate=1131201
    and ampcputime > 0
*/
/* For online DBQL*/
    dbc.dbqlogtbl 
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and ampcputime>0
order by 
    CPUImpact desc

Explanation of the extra fields:

  • ParserCPUTime: the CPU time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random AMP sampling had to be done.
  • LHR/RHL: Larry Higa ratio (and its inverse). An empirical index that shows the CPU vs I/O rate. By experience it should usually be around one (the exact value depends on your system configuration, but it is roughly constant for a given system). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption – but that is a different dimension than skew.
  • QueryBand: labels that sessions use to identify themselves within the DBQL logs.
  • QueryText: the first 200 characters of the query by default (depending on the DBQL log settings).

OK, we’ve listed the terrible top consumers, but what’s next?
We have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post…), you can find out which job or report issued that SQL; in any case, you can look at the QueryText field.

If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:

select 
    SQLTextInfo 
from 
    dbc.dbqlsqltbl 
where 
    procid = <procid> 
    and queryid = <queryid> 
order by 
    SQLRowNo asc

You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will have a very ugly layout; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

System level Skewness

Totals

We have found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how many resources were wasted due to bad parallel efficiency?

The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log per-AMP information for the query execution, only the most important aggregated metrics.

We also cannot account for the situation when several skewed queries run at the same time but have their peaks on different AMPs. This reduces the system level resource waste, but it is hard to calculate with, and its probability and effect are negligible here.

select
      sum(AMPCPUTime) AMPCPUTimeSum
    , sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
    , sum(TotalIOCount) TotalIOCountSum
    , sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
    , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
    , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst
where 
    logdate = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)
*/
/* For online DBQL*/
    dbc.dbqlogtbl
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)

Look at the last two columns. That percentage of your CPU and I/O goes down the sink…

Top bad guys

OK, let’s check how many queries accumulate 5%, 10%, 25%, 50%, 75% and 90% of this loss.
Here you are (CPU version; the I/O version is analogous):

select 
      'How many queries?' as "_"
    , min(limit5) "TOP5%Loss"
    , min(limit10) "TOP10%Loss"
    , min(limit25) "TOP25%Loss"
    , min(limit50) "TOP50%Loss"
    , min(limit75) "TOP75%Loss"
    , min(limit90) "TOP90%Loss"
    , max(rnk) TotalQueries
    , sum(ResourceTotal) "TotalResource"
    , sum(ResourceImpact) "ImpactResource"
from
(
    select
         case when ResRatio < 5.00 then null else rnk end limit5
        ,case when ResRatio < 10.00 then null else rnk end limit10
        ,case when ResRatio < 25.00 then null else rnk end limit25
        ,case when ResRatio < 50.00 then null else rnk end limit50
        ,case when ResRatio < 75.00 then null else rnk end limit75
        ,case when ResRatio < 90.00 then null else rnk end limit90
        ,rnk
        , ResourceTotal
        , ResourceImpact
    from
    (
        select
              sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
            , sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
            , subtotalRes *100.00 / totalRes Resratio
            , sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
            , ResourceTotal
            , ResourceImpact
        from
        (
            select
                AMPCPUTime ResourceTotal
                , (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
                , ResourceImpact - ResourceTotal ResourceLoss
            /* For archived DBQL
            from 
                dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131207
                and ampcputime > 0
            */
            /* For online DBQL*/
            from 
                dbc.dbqlogtbl
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
                and ampcputime > 0
        ) x
    ) y
) z
group by 1

I expect you are a bit shocked now at how few queries waste how many precious resources.

I think we will agree that it is worth tuning those dozen queries,
and you can save on the order of $100K..M USD for your company annually, am I right?

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

Typical reasons for skewness – in a nutshell

  • Skewed tables: bad choice of PI, skewed data
  • Bad execution plans (typically skewed redistributions)
    • Bad data model (normalization, data types, PI, etc.)
    • Missing or stale statistics
    • Too many joins (break up the query!)
  • Hash collision (a load time problem)

DBQL analysis I. – Monitor “Top CPU consumers”

DBQL analysis

About DBQL

What is it?

DataBase Query Logging.

It is a nice feature of the Teradata RDBMS, which comprehensively logs the execution of the issued queries – if it is switched on.
The configuration can be checked/administered e.g. in the Teradata tools or from DBC.DBQLRuleTbl.
Logging can be set at global/user level, and with respect to the level of detail (see the DBQL tables below).

For detailed information please refer to the Teradata documentation of your version.

DBQL tables

Table            Content
DBQLogTbl        Central table, 1 record for each query.
DBQLSQLTbl       The whole SQL command, broken up into 30K blocks.
DBQLStepTbl      Execution steps of the query, one row for each step.
DBQLObjTbl       Objects participating in the query; logged on different levels (db, table, column, index, etc.).
DBQLExplainTbl   English Explain text, broken up into 30K blocks.
DBQLXMLTbl       Explain in XML format, broken up into 30K blocks.
DBQLSummaryTbl   The PEs’ aggregated accounting table, which summarizes at the configured level.

The DBQL tables are logically organized into a 1:N structure, where DBQLogTbl is the master entity and the others (except DBQLSummaryTbl) are the children.
The join fields are ProcID and QueryID together, e.g.:

...
from 
    DBQLogTbl a
    join DBQLStepTbl b on 
        a.ProcID = b.ProcID 
        and 
        a.QueryID = b.QueryID
...

Unfortunately the PIs of the DBQL tables are not in sync with the logical PK-FK relation (not even in the latest V14.10), therefore JOIN-ed selects against the online DBQL tables are not optimal.

Cost of using DBQL

DBQL basically consumes a negligible amount of processing resources, since it uses cached, batched writes and generates data proportional to the issued queries (the flush rate is a DBSControl parameter).
It is important to regularly purge/archive the data from the DBC tables; Teradata has a recommendation for it. This ensures that the PERM space consumption of DBQL remains low.
In an environment where ~1M SQLs are issued a day, comprehensive logging generates ~8..10 GB of DBQL data daily without XML and Summary. Fewer SQLs generate proportionally less data.

It is worth switching on all options except XML and Summary, since the first generates a huge data volume (roughly doubling it), and the second is similar to the accounting (Acctg) info. If you want to utilize them, they should be switched on, of course.

What is it good for?

It contains:

  • Query run time, duration
  • Consumed resources
  • Environment info (user, default db, etc)
  • SQL text
  • Explain
  • Step resource info
  • Objects involved
  • Etc.

One can get a lot of useful aggregated and query-specific tuning information from it; I will share some of these in the blog.

CPU usage distribution info

(Everything applies to I/O also, just replace CPU with I/O, AMPCPUTime with TotalIOCount…)

Do you think Query optimization is rewarding?

Yes, I know it is hard work to find out why ONE query runs sub-optimally, and what to do with it.

But guess how many queries consume how large a percentage of the processing resources (CPU) within a whole day’s workload.
Take a guess and write it down for these CPU% levels: 5%, 10%, 25% and 50%.

And now run the query below, which will give you the answer. (Replace the date value, and maybe adjust the date filtering according to your local settings.)

select 
    'How many queries?' as "_"
    ,min(limit5) as "TOP5%CPU"
    ,min(limit10) as "TOP10%CPU"
    ,min(limit25) as "TOP25%CPU"
    ,min(limit50) as "TOP50%CPU" 
    ,max(rnk) as TotalQueries
from
(
    select
        case when CPURatio < 5.00 then null else rnk end as limit5
        ,case when CPURatio < 10.00 then null else rnk end as limit10
        ,case when CPURatio < 25.00 then null else rnk end as limit25
        ,case when CPURatio < 50.00 then null else rnk end as limit50
        ,rnk
    from
    (
        select
            sum(ampcputime) over (order by ampcputime desc) as totalCPU
            ,sum(ampcputime) over (
                order by ampcputime desc  
                rows unbounded preceding
             ) as subtotalCPU
            ,subtotalCPU * 100.00 / totalCPU as CPUratio
            ,sum(1) over (
                order by ampcputime desc 
                rows unbounded preceding
             ) as rnk
        from
        (
            select *
            /* For archived DBQL
            from dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131201 
                and 
                ampcputime > 0 
            */
            /* For online DBQL*/
            from dbc.dbqlogtbl 
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-10' (date)
                and 
                ampcputime > 0
        ) x
    ) y
) z
group by 1;

Are you surprised?
I bet:

  • Less than 10 queries will consume 5% of the CPU
  • Less than 1% of the queries will consume 50% of the CPU

Let’s calculate.
How much does your Teradata system cost a year? That price is all about storage and processing capacity.
If you can save e.g. X% of CPU & I/O and X% of storage using MVC optimization, you have saved X% of the price of your Teradata system, in the form of:

  • Improved user experience (earlier loads, faster responses)
  • Resources freed up for additional reports and applications
  • The possibility to postpone a very expensive Teradata hardware upgrade

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

 
 

Accelerate skewed joins

How to “re-parallelize” skewed joins

Case description

Assume that we have 1M customers and 4M transactions, and our top customer produces 2.5% of all transactions. The others produce the remaining 97.5% of the transactions approximately evenly.
Scroll down to the bottom of the post for sample table and data generator SQL.

Our task is to join the “Customer” and the “Transact” tables on Customer_id.

The join

SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
GROUP BY 1;


We experience pretty slow execution.
In ViewPoint we see that only one AMP is working, while the others are not.

What is the problem?
There are two separate subsets of the Transact table from a “joinability” point of view:

  • “Peak” part (records of the top customer(s))
    Very few customers have very many Transact records. A product join would be cost effective here.
  • “Even” part (records of the other customers)
    Many customers have, individually, evenly few Transact records. A merge join would be ideal here.

Unfortunately the Optimizer has to decide: only one operation type can be chosen. It will choose the merge join, which consumes far less CPU time.

Execution plan looks like this:

 This query is optimized using type 2 profile T2_Linux64, profileid 21.
  1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.t.
  2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.c.
  3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
  4) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
     all-rows scan with a condition of (“NOT (D_DB_TMP.t.Customer_ID IS
     NULL)”) into Spool 4 (all_amps), which is redistributed by the
     hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  Then we do a
     SORT to order Spool 4 by row hash.  The size of Spool 4 is
     estimated with low confidence to be 125 rows (2,125 bytes).  The
     estimated time for this step is 0.01 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
     RowHash match scan, which is joined to D_DB_TMP.c by way of a
     RowHash match scan.  Spool 4 and D_DB_TMP.c are joined using a
     merge join, with a join condition of (“D_DB_TMP.c.Customer_ID =
     Customer_ID”).  The result goes into Spool 3 (all_amps), which is
     built locally on the AMPs.  The size of Spool 3 is estimated with
     index join confidence to be 125 rows (10,375 bytes).  The
     estimated time for this step is 0.02 seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 5.  The size of Spool 5 is
     estimated with no confidence to be 94 rows (14,758 bytes).  The
     estimated time for this step is 0.02 seconds.
  7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
     an all-rows scan into Spool 1 (all_amps), which is built locally
     on the AMPs.  The size of Spool 1 is estimated with no confidence
     to be 94 rows (8,742 bytes).  The estimated time for this step is
     0.02 seconds.
  8) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.07 seconds.

How to identify

If you experience extremely asymmetric AMP load, you can suspect this case.
Find highly skewed JOIN steps in DBQL (with all logging options switched on):

select top 50
      a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.CPUTime) Skw
    , a.CPUTime
    , a.MaxAMPCPUTime * (hashamp()+1) CoveringCPUTime
    , b.*
from
    dbc.dbqlsteptbl a
    join dbc.dbqlogtbl b on 
        a.procid = b.procid 
        and a.queryid = b.queryid
where
    StepName = 'JIN'
    and CPUTime > 100
    and Skw > 2
order by 
    CoveringCPUTime desc;




(Note: Covering CPU time is <No-of-AMPs> * <Max AMP’s CPU time>. Virtually this amount of CPU is “occupied” because of the asymmetric load of the system.)

Or if you suspect a specific query, check the demography of the join field(s) in the “big” table:

SELECT TOP 100 <Join_field>, count(*) Nbr
FROM <Big_table> GROUP BY 1 ORDER BY 2 DESC;


If the top occurrences are spectacularly larger than the others (or than the average), this case likely applies.

Solution

Break the query into two parts: join the top customer(s) separately, and then all the others. Finally, union the results. (Sometimes additional modification is also required if the embedding operation(s) – the GROUP BY here – is/are not decomposable on the same parameter.)
First we have to identify the top customer(s):

SELECT TOP 5 Customer_id, count(*) Nbr
FROM Transact GROUP BY 1 ORDER BY 2 DESC;

Customer_id          Nbr
------------------------------
        345       100004
     499873            4
     677423            4
     187236            4
      23482            4
     
Replace the original query with this one:

SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
where t.Customer_id in (345)  

/*
   ID of the top Customer(s). 
   If more customers are salient, list them, but max ~5
*/
GROUP BY 1
UNION ALL
SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
where t.Customer_id not in (345)  -- Same customer(s)
GROUP BY 1
;

Be sure that Customer.Customer_id, Transact.Transaction_id and Transact.Customer_id have statistics!

This query is more complex, has more steps and scans the Transact table twice, but it runs much faster – you can check it.
But why? And how do we determine which “top” customers are worth handling separately?
Read ahead.

Explanation

Calculation

Let’s do some maths:
Assume that we are on a 125-AMP system.
The Customer table contains 1M records with unique IDs.
We have ~4.1M records in the Transact table: 100k for the top customer (ID=345), and 4 for each of the other customers. This matches the 2.5% we assumed above.

If the Transact table is redistributed on hash(Customer_id), then we get ~33k records on each AMP, except AMP(hash(345)), where we get ~133k (33k + 100k).
That means this AMP has to process ~4x more data than the others, therefore it runs ~4x longer.
In other words, for 75% of this JOIN step’s time, 124 AMPs DO NOTHING with the query.

Moreover, the preparation and subsequent steps are problematic as well: the JOIN is prepared by a redistribution which produces a strongly skewed spool, and the JOIN’s result, built locally on the AMPs, is skewed, too.

Optimized version

This query will consume moderately more CPU, but the load is distributed evenly across the AMPs, utilizing Teradata’s full parallel capability.
It contains a product join as well, but that is no problem: it joins 1 record to the selected ~100k records of Transact, which is lightning fast.


Look at the execution plan of the broken-up query:

 This query is optimized using type 2 profile T2_Linux64, profileid 21.
  1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.t.
  2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.c.
  3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
  4) We do a single-AMP RETRIEVE step from D_DB_TMP.c by way of the
     unique primary index “D_DB_TMP.c.Customer_ID = 345” with no
     residual conditions into Spool 4 (all_amps), which is duplicated
     on all AMPs.  The size of Spool 4 is estimated with high
     confidence to be 125 rows (10,625 bytes).  The estimated time for
     this step is 0.01 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an
     all-rows scan, which is joined to D_DB_TMP.t by way of an all-rows
     scan with a condition of (“D_DB_TMP.t.Customer_ID = 345”).  Spool
     4 and D_DB_TMP.t are joined using a product join, with a join
     condition of (“Customer_ID = D_DB_TMP.t.Customer_ID”).  The result
     goes into Spool 3 (all_amps), which is built locally on the AMPs.
     The size of Spool 3 is estimated with low confidence to be 99,670
     rows (8,272,610 bytes).  The estimated time for this step is 0.09
     seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 5.  The size of Spool 5 is
     estimated with no confidence to be 74,753 rows (11,736,221 bytes).
     The estimated time for this step is 0.20 seconds.
  7) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by
          way of an all-rows scan into Spool 1 (all_amps), which is
          built locally on the AMPs.  The size of Spool 1 is estimated
          with no confidence to be 74,753 rows (22,052,135 bytes).  The
          estimated time for this step is 0.02 seconds.
       2) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
          all-rows scan with a condition of (“D_DB_TMP.t.Customer_ID <>
          3454″) into Spool 9 (all_amps), which is redistributed by the
          hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  The size
          of Spool 9 is estimated with high confidence to be 4,294,230
          rows (73,001,910 bytes).  The estimated time for this step is
          1.80 seconds.
  8) We do an all-AMPs JOIN step from D_DB_TMP.c by way of an all-rows
     scan with a condition of (“D_DB_TMP.c.Customer_ID <> 3454”), which
     is joined to Spool 9 (Last Use) by way of an all-rows scan.
     D_DB_TMP.c and Spool 9 are joined using a single partition hash
     join, with a join condition of (“D_DB_TMP.c.Customer_ID =
     Customer_ID”).  The result goes into Spool 8 (all_amps), which is
     built locally on the AMPs.  The size of Spool 8 is estimated with
     low confidence to be 4,294,230 rows (356,421,090 bytes).  The
     estimated time for this step is 0.72 seconds.
  9) We do an all-AMPs SUM step to aggregate from Spool 8 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 10.  The size of Spool 10
     is estimated with no confidence to be 3,220,673 rows (505,645,661
     bytes).  The estimated time for this step is 8.46 seconds.
 10) We do an all-AMPs RETRIEVE step from Spool 10 (Last Use) by way of
     an all-rows scan into Spool 1 (all_amps), which is built locally
     on the AMPs.  The size of Spool 1 is estimated with no confidence
     to be 3,295,426 rows (972,150,670 bytes).  The estimated time for
     this step is 0.32 seconds.
 11) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 11.60 seconds.

Sample structures

The table structures (simplified for the example):

CREATE TABLE Customer
(
  Customer_ID   INTEGER
, Customer_name VARCHAR(200)
)
UNIQUE PRIMARY INDEX (Customer_id)
;

insert into Customer values (1,'Cust-1');
Run 20x: 

insert into Customer select mx + sum(1) over (order by Customer_id rows unbounded preceding) id, 'Cust-' || trim(id) from Customer cross join (select max(Customer_id) mx from Customer) x;

collect statistics using sample on customer column (Customer_id);

CREATE TABLE Transact
(
  Transaction_ID   INTEGER
, Customer_ID      INTEGER
)
UNIQUE PRIMARY INDEX (Transaction_id)
;

insert into Transact values (1,1);
Run 22x: 

insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, id mod 1000000 from Transact cross join (select max(Transaction_id) mx from Transact) x;

insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, 345 from Transact t cross join (select max(Transaction_id) mx from Transact) x where t.Transaction_id < 100000;

collect statistics using sample on Transact column (Customer_id);
collect statistics using sample on Transact column (Transaction_id) ;

Teradata Compress optimization

Teradata Compress optimization
Techniques and effects

What is Multi Value Compression (MVC)?

Teradata RDBMS supports a nice feature: multi value compression (aka Teradata Compress). It enables reducing the storage space allocated by the tables in the database, while – incredibly – processing compressed data usually requires fewer resources (CPU and I/O) than processing the uncompressed data.
The feature needs no additional licence or hardware components.

How does Teradata Compress work?

I give a short summary here; if you are interested in the details, please refer to the Teradata documentation.

MVC can be defined in the CREATE TABLE DDL, or added/modified later by ALTER TABLE statements. The user must define a 1..255 element list of values for each compressible column. Those values will be stored compressed, while all others will be stored uncompressed.
If a column is compressed, each row has an additional area of 1..8 bits allocated (if N values are listed, ceiling(log2(N+1)) bits are allocated). One of the bit combinations means that the value is uncompressed (and allocates its corresponding space within the row layout), while all the others mean a compressed value, which does not allocate the value’s place in the row.
The compress bits are allocated in every row, regardless of whether the actual value is compressed or not.
Compress bits are “compacted”, e.g. 3 + 8 + 4 = 15 compress bits will allocate 2 bytes with only 1 wasted bit, instead of 3 byte-aligned values.
The values belonging to the bit combinations are stored in the table header.
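
For illustration (the table, columns and value lists are made up; check the exact syntax and compressibility rules of your Teradata version), defining and later modifying MVC looks roughly like this:

create table Customer_addr
(
  Customer_ID   INTEGER
, Country_code  CHAR(2)      COMPRESS ('HU','DE','UK','US')           /* listed values are stored compressed */
, City_name     VARCHAR(100) COMPRESS ('Budapest','London','Berlin')
)
UNIQUE PRIMARY INDEX (Customer_ID);

/* Changing the value list of an existing column later (assumed ALTER form, verify it for your version): */
alter table Customer_addr add Country_code COMPRESS ('HU','DE','UK','US','FR');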

Multi Value Compression is:

  • Column level
    It has to be defined on each applicable column of a table separately
  • “Manual”
    You have to calculate which values are worth compressing – Teradata gives no automatism
  • Static
    Once you have defined the values, it will not adapt to changing conditions by itself

It is obvious that the currently optimal compression settings depend on the data demography and the applied data types. The optimal settings may be different later, when the data demography may have changed.

A summary of the most important properties of MVC once again:

  • Can be defined in the CREATE TABLE statement
  • Can be applied or modified later in an ALTER TABLE statement
  • Must be set on COLUMN level
  • Optimal value list must be calculated by you
  • Optimal settings may change in time. Optimize regularly.

Storage effects

With MVC the tables allocate less PERM space, as can be calculated – simple.
What about the increment?
Table sizes usually grow over time as more and more data is generated. The change in growth speed depends on the volatility of the data demography. If it is stable, then the growth speed will drop by the rate of the compression. If the typical values change over time, then the growth will not drop, or may even speed up in extreme cases. However, these are exactly the cases where regular optimization is necessary.

In the stable demography case the growth simply continues at a rate reduced by the compression ratio.

Performance effects

It is a key question – what has to be paid for the smaller storage?

It is obvious that the compression process requires resources during both the compress and the decompress phases. However, there are processing gains as well, which usually dominate the costs. How?

A compressed table will reside in proportionally fewer data blocks, therefore data fetching requires fewer I/O operations. In addition, moving the data in memory (during processing) requires fewer CPU cycles.
When SELECTing table data, usually only a small fragment of the row is used, and unused columns will not be decompressed.
Caching is a CPU intensive operation as well, which is more effective if fewer data blocks are processed.
Compression helps tables to be treated as “small enough to cache 100% into memory”, which results in more effective execution plans.

Summary:

  • INSERT into a compressed table usually consumes 10..50% more CPU (in the final step only!)
  • SELECT usually costs no more, or even less, CPU than on uncompressed tables
  • SELECT and INSERT usually cost proportionally less I/O, in line with the compression ratio
  • System level CPU and I/O usage usually drops by 5..10% (!) when the medium and big tables of the system are compressed (thanks to more effective caching)

How to set MVC?

Setting up the MVC compression on a single table should consist of the following 4 steps:

  1. Analyze the data demography of each compressible column of the table *1.
  2. Calculate the optimal compress settings for the columns. Note that
    •   The optimum should be calculated not on separate columns but at table level, since compress bits are packed into whole bytes.
    •   The more values are listed as compressed, the more overhead compression carries. A proper mathematical formula should be used to calculate the optimum. *2
    •   Take care of the exceptions: PI / FK / etc. columns and some data types are not compressible (this varies between Teradata versions).
  3. Assemble the corresponding scripts
    (CREATE TABLE DDL + INSERT SELECT + RENAME, or ALTER TABLE DDL; see the script sketch below)
  4. Implement the compression by running the script
    Take good care of data protection: backups, locking, documentation.

*1 Simplified sample (run for each column): 
     select top 256 <columnX>, count(*), avg(<length of columnX>) from <table> group by 1 order by 2 desc;
*2 About the algorithm: it is a maximum-seeking function over n, based on the expression of the gains when the TOP {(2^n)-1} most frequent values are compressed. The expression is far too complex to discuss here, because of the different data types, exceptions and internal storage constructs.
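
A minimal sketch of the step 3 script (the CREATE TABLE + INSERT SELECT + RENAME variant), with a made-up table MyDB.Customer and an already calculated value list; backup, locking and checks are omitted here:

create table MyDB.Customer_cmp
(
  Customer_ID   INTEGER
, Customer_name VARCHAR(200) COMPRESS ('N/A','UNKNOWN')   /* value list coming from the demography analysis */
)
UNIQUE PRIMARY INDEX (Customer_ID);

insert into MyDB.Customer_cmp select * from MyDB.Customer;

rename table MyDB.Customer     to MyDB.Customer_old;
rename table MyDB.Customer_cmp to MyDB.Customer;

/* drop table MyDB.Customer_old;  -- only after verification and backup */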

One time or regular?

An optimal MVC setting is valid for a specific point in time, since your data changes along with your business. The daily change is usually negligible, but it accumulates.
Practice shows that it is worth reviewing the compress settings every 3..6 months, and continually optimizing new tables a couple of weeks after they come into production.

Estimate how much space and processing capacity is lost if compress optimization is neglected!

Solution in practice

There are “magic Excels” on the net which can calculate the optimal settings if you load the data demography into them, but they require lots of manual work in addition (running the calculations, assembling the DDL, writing the transformation script, testing, etc.).

If you want a really simple solution, try PRISE Compress Wizard, which supplies a comprehensive solution:

  • Helps to collect good candidate tables for compression
  • Automatically analyzes the tables and gives feedback:
    • How much space can be saved by compression
    • What the current compress ratio is (if compression is already applied)
    • How many resources were used for the analysis
    • What the optimal structure is
  • Generates the transformation script (+ checks, locking, logging) along with
    • Backup (arcmain)
    • A revert process (for safety and documentation)
    • Reverse engineering (for E/R documentation update)
  • Logs the implementation
    • Reflects the achieved space saving: success measurement
    • Reports the CPU and I/O resources used for the transformation

Curing slow INSERTs and CREATE TABLEs I.

Eliminating hash collisions

Case description

We have an INSERT or CREATE TABLE operation that runs an unreasonably long time compared to the number of affected rows, in spite of the table not being skewed.

What is hash collision?

A hash collision is when two or more records in a table have the same hash value.

SET type tables ensure that there are no two records with exactly the same content within a table. How does Teradata do it?

Teradata stores the records in a hash filesystem, where each record has a hash value calculated from the Primary Index (PI) value. If the PI values are the same in several records, they will surely have the same hash value, too.

When INSERTing a record, Teradata only has to compare the new record to those records of the table that have the same hash value as the new record, since all records with a different hash value will surely differ at least in the PI columns.
If we have to INSERT N records with the same hash value into an empty table, Teradata has to do N*(N-1)/2 – very CPU demanding – full record comparisons (e.g. 10,000 records sharing one hash value already mean ~50 million comparisons).

How to identify

Hash collisions can easily be found with the PRISE Tuning Assistant tool as well, or you can follow this method:

DBQL filtering for qualifying queries:
The Merge (MRG) phase of the INSERT/CREATE TABLE operation consumes a lot of CPU.
Look for high CPU consuming ‘MRG’ steps in dbc.DBQLStepTbl:

sel
      a.CPUTime
    , a.MaxAmpCPUTime * (hashamp()+1) CoveringCPUTime
    , a.StepName
    , a.RowCount
    , b.*
from
    dbc.DBQLStepTbl a
    join dbc.DBQLogTbl b on 
        a.ProcId = b.ProcId 
        and a.QueryId = b.QueryId
where
    a.StepName in ('MRG' /*, 'MRU' for UPDATEs also*/)
    and a.CPUTime > 100 /* Performance boost: eliminates most of the records (small CPU seconds) at low processing cost. Adapt the number to your site. */
qualify sum(1) over (order by a.CPUTime desc rows unbounded preceding) <= 100;

 

For a specific SQL statement (INSERT or CREATE TABLE) you have to check your PI for the level of hash collisions (the number of records where the hash values are the same) in the target table.

How to make sure that hash collision is the reason? Let the target table be TableA, with primary index ColA,ColB,ColC (it can be any number of columns in practice):

select top 100 hashrow(ColA,ColB,ColC), count(*) from TableA group by 1 order by 2 desc;


The top row(s) will show the most frequent hash values. Count values >> 1 mean significant hash collisions: each high frequency hash value forms a hash-collision group causing comparisons in the order of N*N.

If the table does not exist yet, embed the producing “SELECT” statement into the script above, and count those field values that would go into the PI columns.
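
A minimal sketch of that check for the not-yet-existing table case (SourceTableX and the column list stand for your own producing SELECT and planned PI):

select top 100 hashrow(ColA, ColB, ColC), count(*)
from
(
    select ColA, ColB, ColC from SourceTableX   /* the producing SELECT of the planned target table */
) x
group by 1
order by 2 desc;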

Explanation

If we use a “SET” type table (in Teradata session mode this is the default), Teradata ensures that there will be no perfectly identical records in the table. This is ensured by comparing the inserted/updated record with the existing ones.
Teradata’s “hash filesystem” gives a very effective trick: only those records must be compared whose row hash is equal; otherwise at least the PI fields must differ.
If we’ve chosen the Primary Index to be UNIQUE, or non-UNIQUE but on field(s) that are almost unique, then the “SET comparison” is restricted to zero or one record in most cases.

Solution

For a good solution, unfortunately, we have to modify the table structure.

  • Option 1: Change the table type to MULTISET. This eliminates the duplication check, but its drawback is the same: there is no automatic de-duplication any more. If the process relied on the de-duplication of the SET table, you have to replace it with programmed de-duplication (group by, left join… – see the sketch after this list).
  • Option 2: Change the table’s PI to a unique or nearly unique column set. Be prudent, and consider the workload as well (joins, where conditions, group by expressions, etc.).
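
A hedged sketch of such programmed de-duplication when loading a MULTISET table (the table and column names are made up); the ROW_NUMBER/QUALIFY form keeps one row per fully identical group:

insert into Target_multiset (ColA, ColB, ColC)
select ColA, ColB, ColC
from Staging_src
qualify row_number() over (partition by ColA, ColB, ColC order by ColA) = 1;

/* Or the simple GROUP BY form mentioned above, when all columns take part in the de-duplication:
insert into Target_multiset select ColA, ColB, ColC from Staging_src group by 1,2,3;
*/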

Summary

Tables with a strongly non-unique PI are in high danger of hash-collision-slowed INSERTs/CREATE TABLEs (or UPDATEs), even if they are not skewed. Use a “more unique” PI, or a MULTISET table.

A typical mistake is when a CREATE TABLE … AS SELECT… lacks the PRIMARY INDEX () clause. In this case Teradata chooses the first column as PI, which often causes terrible performance.

What’s next

The next post will discuss Multi Value Compression (MVC) optimization.

Boost slow (LEFT/RIGHT) OUTER JOINs

How to optimize slow OUTER JOINs

Case description

We have a (LEFT or RIGHT) OUTER JOIN, and it runs for a long time while causing skewed CPU / spool usage. In practice, during most of the query execution only 1 AMP works while the others have nothing to do, resulting in poor parallel efficiency.

How to identify

The query typically runs for a long time, contains a “MERGE JOIN” step in the Explain description, and that step shows skewed CPU consumption (MaxAMPCPUTime * Number-of-AMPs >> AMPCPUTime) and lasts long.

In DBQL you should find skewed, high CPU usage queries (dbc.DBQLogTbl.MaxAMPCPUTime * (hashamp()+1) / nullifzero(dbc.DBQLogTbl.AMPCPUTime) > 1.2 and dbc.DBQLogTbl.AMPCPUTime > 1000; the thresholds depend on your system size) which also have a “left outer join” expression in the execution plan text (dbc.DBQLExplainTbl.ExplainText like ‘%left outer joined using a merge join%’).
This is only an approximation, since the skewness-causing step may be a different one.
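
Assembled into a single DBQL query, the filter above could look like this sketch (the thresholds are illustrative, and a query whose Explain spans several 30K blocks may appear more than once):

select top 50
      a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.AMPCPUTime) Skw
    , a.AMPCPUTime
    , a.QueryBand
    , substr(a.QueryText,1,200) QueryText
from
    dbc.DBQLogTbl a
    join dbc.DBQLExplainTbl e on 
        a.ProcID = e.ProcID 
        and a.QueryID = e.QueryID
where
    a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.AMPCPUTime) > 1.2
    and a.AMPCPUTime > 1000
    and e.ExplainText like '%left outer joined using a merge join%'
order by 
    a.AMPCPUTime desc;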

PRISE Tuning Assistant supplies an easy-to-use, GUI based search function for this.

Explanation

Let’s assume that we outer join Table1 and Table2 on a condition that causes no product join (merge join instead), eg.:

select Table1.x,Table2.y
from Table1
LEFT JOIN Table2 on Table1.x = Table2.x
...

If Table2 is not a “small” table, the Teradata optimizer will choose to “equi-distribute” (place matchable records on the same AMP) the two tables on the join field(s), in our case Table1.x and Table2.x respectively.
If Table1.x contains a significant percentage of NULLs, then the distribution will be skewed, since all the “x is NULL” records land on the same AMP.
We know that a NULL value never results in a join match, so those records are useless to examine, but they have to appear in the result set, since it is an OUTER JOIN.

Solution

Let’s split Table1 into two separate subsets – NULL(x) and NotNULL(x) – and modify the select this way:

select Table1.x,Table2.y
from Table1
INNER JOIN Table2 on Table1.x = Table2.x
where Table1.x is not null -- This condition is unnecessary, since the Teradata optimizer will apply it implicitly
UNION ALL
select Table1.x,NULL
from Table1
where Table1.x IS NULL;



Practical example:
Some of our transactions are contributed by an operator; in this case OpID is filled, otherwise it is NULL. We would like to query the transactions by operator, including the non-contributed ones. Most of the transactions are non-contributed (OpID is NULL).

select
  a.Transaction_id, b.OperatorName as ContribOpName
from TransactionT a
LEFT JOIN OperatorT b on a.OpID = b.OpID;

Optimized form:

select
  a.Transaction_id, b.OperatorName as ContribOpName
from TransactionT a
INNER JOIN OperatorT b on a.OpID = b.OpID
where a.OpID is not null   /* This can be omitted; the inner join will apply it implicitly */
UNION ALL
select
  a.Transaction_id, NULL as ContribOpName
from TransactionT a
where OpID is null;



The execution will not cause skewed CPU / spool usage, because those records of Table1 that caused the peak (x is NULL) are excluded from the processing of the join.
The second part supplies the “x is NULL” records to the result set without any join processing.

The tradeoff is two full scans and a UNION ALL operation, which cost much less than a strongly skewed redistribution and JOIN processing.

What’s next

The next post will discuss unexpectedly slow INSERTs (hash collisions).