How to optimize a Teradata query?

Teradata SQL optimization techniques

Introduction

The typical goal of SQL optimization is to get the result (data set) with less computing resources consumed and/or a shorter response time. We can follow several methodologies depending on our experience and studies, but in the end we have to answer the following questions:

  • Is the task really that heavy, or is just the execution of the query non-optimal?
  • What is/are the weak point(s) of the query execution?
  • What can I do to make the execution optimal?

Methodologies

The common part of the methodologies is that we have to understand – more or less – what is happening during the execution. The more we understand what goes on behind the scenes, the better we can sense the appropriate point of intervention. One can start with the trivial stuff: collect some statistics, create some indices, and continue with rewriting the query, or even modifying the base table structures.

What is our goal?

First of all we should branch on what we have to do:
  a. Optimize a specific query that has been run before and for which we have detailed execution info
    The step details clearly show where the big resources were burnt
  b. In general, optimize the non-optimal queries: find them, then solve them
    Like a., but first we have to find those queries, and then solve them one by one
  c. Optimize a query that has no detailed execution info, just the SQL (and its “explain”)
    Deeper knowledge of the base data and the “Teradata way of thinking” is required, since no easy and trustworthy resource peak detection is available. You have to imagine what will happen, and what can be done better

Optimization in practice

This section describes case b., and expects detailed DBQL data to be available.
In this post I will not attach example SQLs, because I have switched to using PRISE Tuning Assistant for getting all the required information for performance tuning, instead of writing complex SQL queries and making heaps of paper notes.

Prerequisites

My opinion is that DBQL (DataBase Query Logging) is the fundamental basis of Teradata system performance management – from an SQL optimization point of view. I strongly recommend switching DBQL comprehensively ON (SQL, Step, Explain and Object are important; XML can be excluded, since it is huge and actually adds little extra), and using daily archiving from the online tables – just follow the Teradata recommendation.
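
As an illustration, a comprehensive logging rule might look like the sketch below (the exact WITH/LIMIT options depend on your Teradata version, so treat it as an assumed example and check the DBQL documentation before applying it):

BEGIN QUERY LOGGING WITH SQL, OBJECTS, STEPINFO, EXPLAIN
LIMIT SQLTEXT=0
ON ALL;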

Finding good candidate queries

DBQL is an excellent source for selecting the “low hanging fruits” for performance tuning. The basic rule: we can gain big savings only on expensive items, so let’s focus on the top resource consuming queries first. But what is high resource consumption? I usually check the top queries by one or more of these properties:

  • Absolute CPU (CPU totals used by AMPs)
  • Impact CPU (CPU usage corrected by skewness)
  • Absolute I/O (I/O totals used by AMPs)
  • Impact I/O   (Disk I/O usage corrected by skewness)
  • Spool usage
  • Run duration
PRISE Tuning Assistant supplies an easy to use and quick search function for that.
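
If you prefer raw SQL, a minimal sketch against the online DBQL table (dbc.DBQLogTbl; adjust the date literal and filtering to your environment) that ranks candidates by these measures could look like this:

select top 20
      ProcID, QueryID, UserName
    , AMPCPUTime                                  /* absolute CPU */
    , MaxAMPCPUTime * (hashamp() + 1) CPUImpact   /* CPU corrected by skewness */
    , TotalIOCount                                /* absolute I/O */
    , MaxAMPIO * (hashamp() + 1) IOImpact         /* I/O corrected by skewness */
    , SpoolUsage
    , (FirstRespTime - StartTime) DAY(4) TO SECOND(4) AS RunDuration
from dbc.dbqlogtbl
where cast(cast(starttime as char(10)) as date) = '2014-02-20' (date)
order by CPUImpact desc;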

Finding weak point of a query

Examining a query begins with the following steps:
  • Does it have few or many “peak steps” that consume most of the resources?
    • Which one(s)?
    • What type of operations are they?
  • Does it have high skewness?
    Bad parallel efficiency, very harmful
  • Does it consume an extremely large spool?
    Compared to other queries…
PRISE Tuning Assistant helps here again.
Check the yellow highlights in the middle; those are the top consuming steps.

    Most of the queries will have one “peak step” that consumes most of the total resources. Typical cases:

    • “Retrieve step” with redistribution
      Large number of rows and/or skewed target spool
    • “Retrieve step” with “duplication-to-all-AMPs”
      Large number of rows duplicated to all AMPs
    • Product join
      Huge number of comparisons: N * M
    • Merge or Hash join
      Skewed base or prepared (spool) data
    • OLAP function
      Large data set or skewed operation
    • Merge step
      Skewness and/or many hash collisions
    • Any kind of step
      Non small, but strongly skewed result

    What can we do?

    The Teradata optimizer tries its best when producing the execution plan for a query, however it sometimes lacks proper information or its algorithms are not perfect. We – as humans – may have additional knowledge of either the data or the execution, and we can help the optimizer make better decisions. Let’s see our possibilities.
    • Supplement missing / refresh stale statistics
    • Drop disturbing statistics (sometimes occurs…)
    • Restructure the query
    • Break up the query: place the partial result into a volatile table with a good PI and collect statistics on it
    • Correct primary index of target / source tables
    • Build secondary/join index/indices
    • Add extra components to the query (see the sketch after this list).
      You may know some additional “easy” filter that lightens the work. E.g. if you know that the join will match only the last 3 days of data of a year-covering table, you can add a date filter, which costs pennies compared to the join.
    • Restrict the result requirements to the real information demand.
      Does the end user really need that huge amount of data, or just a small extract of it?
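
    A hedged sketch of the “extra components” idea (the SALES and DAILY_RATE tables and their columns are made up for illustration): if we know from the business side that only the last few days can match, an added date predicate lets the optimizer work on a fraction of the rows:

    select s.Customer_id, sum(s.Amount)
    from SALES s
    join DAILY_RATE r on r.Currency_id = s.Currency_id
                     and r.Rate_dt = s.Sale_dt
    where s.Sale_dt >= date '2014-02-17'   /* cheap extra filter, known from business rules */
    group by 1;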

    What should we do?

    First of all, we have to find the root cause(s). Why does that specific top step consume such a huge amount of resources or execute so skewed? If we find the cause and eliminate it, the problem is usually solved.
    My method is the following:
    1. Find the top consuming step, and determine why it is a high consumer
      • Its result is huge
      • Its result is skewed
      • Its work is huge
      • Its input(s) is/are huge
    2. Track the spool flow backwards from the top step, and find
      • Low fidelity results (row count falls far from estimated row count)
      • NO CONFIDENCE steps, specifically w/low fidelity
      • Skewed spool, specifically non small ones
      • Big duplications, specifically w/NO CONFIDENCE
    3. Find the solution
      • Supplement missing statistics, typically on PI, join fields or filter condition
        NO CONFIDENCE, low fidelity, big duplications
      • Break up the query
        Store that partial result into a volatile table where the fidelity is very bad or the spool is skewed, and choose a better PI for it (see the sketch after this list)
      • Modify PI of the target table
        Slow MERGE step, typical hash-collision problem.
      • Eliminate product joins
      • Decompose large product joins
      • Etc.
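
    For the “break up the query” option, a minimal sketch (object names are assumptions) is to land the problematic partial result in a volatile table with a well chosen PI, collect statistics on it, and join that instead:

    create volatile table VT_PART_RESULT as
    (
      select Customer_id, sum(Amount) Amt
      from SALES
      where Sale_dt >= date '2014-02-01'
      group by 1
    ) with data
    primary index (Customer_id)
    on commit preserve rows;

    collect statistics on VT_PART_RESULT column (Customer_id);

    select c.Customer_name, v.Amt
    from Customer c
    join VT_PART_RESULT v on v.Customer_id = c.Customer_id;
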
    Have a good optimization! 🙂

    Using Join Index

    How to use Join Index in Teradata

    About Join Index (JI)

    A join index is much like a “materialized view”: it is a stored result of an SQL SELECT, similar to a table, and you can define the primary index (PI) of the stored result.

    What are the main differences between a JI and a Secondary Index?

    • Different internal structure and purposes
    • Appears as separate object (not a sub-table) in the DBC tables/views
    • Can reside in different database than the base table(s)
    • Maintenance is logged in separate DBQL STEP, therefore cost can be easily measured

    What is the difference between JI and a regular table in which I can store the same query result? 

    • You can’t access it directly from an SQL query
    • The optimizer automatically decides whether to use it or not:
      • By logic: if the JI can support the query with its data content
      • By cost: if usage of the JI results in a “cheaper” execution (always collect statistics!)
    • Maintained automatically if the content of the base table(s) change
    • “Locks” the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTERing of the indexed column(s).

    JI types

    The Join Index types I list below are not differentiated by an SQL phrase, but by the structure of the SQL SELECT used in the JI definition.

    They can also be combined in reasonable ways, e.g. <single table – aggregate – sparse> or <multi table – aggregate>, etc.

    Let’s take these base tables for our examples:
    CREATE TABLE TBL_A
    (
      Col1 integer
    , Col2 integer
    , Col3 integer
    )
    PRIMARY INDEX (Col1)
    ;
    CREATE TABLE TBL_B
    (
      Col1 integer
    , Col2 integer
    )
    PRIMARY INDEX (Col1)
    ;

    Single table JI

    This is the simplest case of a join index. We include only one table, and typically choose a different PI than the base table has. There are two significantly different kinds of usage:

    • Non-covering
      We select only the filtering column(s) (those will also be the PI of the JI) and the “ROWID” pseudo column (or the base table’s PI column(s)) in the JI definition. This is used when the filter is strongly selective; the row identifiers are put into a spool and joined back to the appropriate records of the base table. The JI can be very small this way, but note that we get an additional “join back” phase.
    • Covering
      The JI selects all columns (or all columns required by the SQLs to be supported). That means the base table is not needed at all to satisfy the query, which results in a very fast operation. This case is typically used for eliminating the frequent redistribution of some “central” tables.
    This example shows a non-covering index for the query below:
    create join index JI_1
    as
    SELECT Col1,Col2
    FROM TBL_A
    PRIMARY INDEX (Col2)
    ;

    select Col3 from TBL_A where Col2=1;

     Multi table JI

    This kind of JI is for accelerating frequent join statements. Technically it stores the result of a join. It can cover all the columns, or store just the key pairs, or something in between.

    create join index JI_2
    as
    SELECT a.Col3, b.Col1
    FROM TBL_A a
    join TBL_B b on a.Col2=b.Col2
    PRIMARY INDEX (Col3)
    ;

    Aggregate JI

    The JI’s SELECT contains a GROUP BY clause. This case is for caching frequent aggregations. It can be very useful for supporting OLAP applications that do not have internal aggregation-caching methods. Teradata’s optimizer is quite clever: it can recognize “intermediate” aggregate JIs and aggregate them further, instead of using the base table. Example:

    create join index JI_3
    as
    SELECT Col1, Col2, sum(Col3) X
    FROM TBL_A
    GROUP BY 1,2
    PRIMARY INDEX (Col2)
    ;

    All three SELECTs below can be served from JI_3:
    SQL1: select Col1      , sum(Col3) from TBL_A group by 1;
    SQL2: select Col2      , sum(Col3) from TBL_A group by 1;
    SQL3: select Col1, Col2, sum(Col3) from TBL_A group by 1,2;

    Sparse JI

    These JIs contain a WHERE clause, i.e. the index does not cover the whole table. If the query’s WHERE condition is a logical subset of the JI’s WHERE condition (every row the query needs is present in the JI), then the JI can support the query.
    Typical usage is on transaction tables: we access a transaction table frequently by Customer_id, but its PI is Trx_id. We can observe that 90% of the SELECTs fall onto the last 30 days.
    We can put a sparse, non-covering, single-table JI on the table with PI: Customer_id, covering only the last 30 days.
    90% of the selects will finish in no time, and 10% of the queries will result in a full table scan, meanwhile our JI size remains pretty small: we index only 30 days instead of e.g. 365 days.
    Please note that the WHERE condition of a sparse JI cannot contain “changing” values like current_date. That means the JI must be regularly redefined if a moving-window sparse JI is to be maintained.

    create join index JI_4
    as
    SELECT Col1, Col2, Col3
    FROM TBL_A
    WHERE Col3 between 1 and 10
    PRIMARY INDEX (Col2)
    ;

    Servable query:
    select Col1 from TBL_A where Col2=20 and Col3=4;
    Non-servable query:
    select Col1 from TBL_A where Col2=20 and Col3=11;
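
    A hedged sketch of the moving-window scenario described above (the TRX table and its Customer_id/Trx_dt columns are assumptions for illustration): the sparse, non-covering JI is simply dropped and re-created periodically with a new literal date window:

    drop join index JI_TRX_RECENT;

    create join index JI_TRX_RECENT
    as
    SELECT Customer_id, ROWID
    FROM TRX
    WHERE Trx_dt between DATE '2014-01-21' and DATE '2014-02-20'
    PRIMARY INDEX (Customer_id)
    ;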

    When will join index help?

    Join indices typically serve two purposes (or the combination of them):

    • Store frequently produced (partial) results
    • Provide alternative access path to the data
    Practical examples:
    • Frequent aggregation or join operation
    • Frequent redistribution of a table
    • Non primary index based filtering on a table (causing full table scan)
    • Non primary index based join of a table (causing redistribution)

    Important things about JIs

    Statistics

    Always collect statistics on the PI of the JIs; it is essential for encouraging the optimizer to use them.
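
    For example, with the JI_1 index defined earlier (the pre-V14 COLLECT STATISTICS form is shown; newer releases accept a slightly different syntax):

    collect statistics on JI_1 column (Col2);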

    Choosing PI

    The PI of the JI will be the most effective access path to the JI’s records. Just like with tables, use the regular PI choosing methodology: mind the frequent access patterns (by which column(s) the data is filtered or joined), the distribution (skewness) and hash collisions.

    DBQL analysis IV – Monitor index usage

    Analyze “Index usage” in DBQL

    Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on. The “Object” option of DBQL must be switched “on” to use the scripts provided.

    About Indices

    Teradata provides the possibility of creating INDEX objects that allow alternative access paths to the data records. They are quite different structures than the good old B*Tree or Bitmap indices common in non-MPP RDBMSs.
    The main goal of the index objects is to improve data access performance in exchange for storage and maintenance processing capacity.
    Having indices is not free (storage and maintenance resources), so those that do not bring enough gain are better dropped.

    Index usage footprint in DBQL

    If a query uses an index for accessing the data, it is declared in the “explain text” and also registered in DBQL: dbqlobjtbl. How an index appears depends on its type. In case of a primary/secondary index only the IndexId and the columns of the index are registered, while join/hash indices appear like a regular table: at database, object and column levels.

    If the join/hash index is covering, the base table may not be listed among the DBQL objects, therefore be careful when analyzing table usage from DBQLobjtbl.
    I recommend using PRISE Tuning Assistant to easily find all types of access to a table’s data.

    Examples

    Primary index access (2 columns):
    Plan:
      1) First, we do a single-AMP RETRIEVE step from d01.z by way of the
         primary index “d01.z.i = 1, d01.z.j = 1” with no residual

    DBQLobjtbl:
    ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
    D01 Z i 1 Idx
    D01 Z j 1 Idx

    Secondary index access (1 column):
    Plan:
      3) We do an all-AMPs RETRIEVE step from d01.x by way of index # 4
         without accessing the base table “d01.x.j = 1” with no residual

    DBQLobjtbl:
    ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
    D01 X 4 Idx

    Join index:
    create join index d01.ji as sel a,b from d01.q primary index (a);
    select b from d01.q where a=1;
    Plan:
      1) First, we do a single-AMP RETRIEVE step from D01.JI by way of the
         primary index “D01.JI.a = 1” with no residual conditions into

    DBQLobjtbl:
    ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
    D01 JI 0 Jix
    D01 JI a 1 Idx
    D01 JI b 1026 Col

    Please note that

    • ObjectNum identifies the index (refers to dbc.indices.IndexNumber)
    • As many rows appear as the index has columns
    • E.g. in V13.10 Teradata Express the single-column secondary index lacks the column name in the logs

    Analyzing DBQL data 

    Prepare data

    CREATE VOLATILE TABLE DBQLIdx_tmp1 AS (
    SELECT databasename,tablename,indexnumber,indextype,uniqueflag,Columnposition,Columnname
    , SUM (1) OVER (partition BY databasename,tablename,indexnumber ORDER BY Columnname ROWS UNBOUNDED PRECEDING) ABCOrder
    FROM dbc.indices WHERE indextype IN ('K','P','Q','S','V','H','O','I')
    ) WITH DATA
    PRIMARY INDEX (databasename,tablename,indexnumber)
    ON COMMIT PRESERVE ROWS
    ;


    CREATE VOLATILE TABLE DBQLIdx_tmp2 AS (
    WITH RECURSIVE idxs (Databasename,Tablename,Indexnumber,Indextype,Uniqueflag,Indexcolumns,DEPTH)
    AS (
    SELECT
    databasename,tablename,indexnumber,indextype,uniqueflag,TRIM (Columnname) (VARCHAR (1000)),ABCorder
    FROM DBQLIdx_tmp1 WHERE ABCorder = 1
    UNION ALL
    SELECT
    b.databasename,b.tablename,b.indexnumber,b.indextype,b.uniqueflag,b.Indexcolumns||','||TRIM (a.Columnname),a.ABCOrder
    FROM DBQLIdx_tmp1 a
    JOIN idxs b ON a.databasename = b.databasename AND a.tablename = b.tablename AND a.indexnumber = b.indexnumber AND a.ABCOrder = b.Depth + 1
    )
    SELECT databasename db_name,tablename table_name,indextype,uniqueflag,indexcolumns
    ,indexnumber
    ,CASE WHEN uniqueflag = 'Y' AND indextype IN ('P','Q','K') THEN 'UPI'
    WHEN uniqueflag = 'N' AND indextype IN ('P','Q') THEN 'NUPI'
    WHEN uniqueflag = 'Y' AND indextype IN ('S','V','H','O') THEN 'USI'
    WHEN uniqueflag = 'N' AND indextype IN ('S','V','H','O') THEN 'NUSI'
    WHEN indextype = 'I' THEN 'O-SI'
    ELSE NULL
    END Index_code
    FROM idxs
    QUALIFY SUM (1) OVER (partition BY db_name,table_name,indexnumber ORDER BY DEPTH DESC ROWS UNBOUNDED PRECEDING) = 1
    ) WITH DATA
    PRIMARY INDEX (db_name,table_name)
    ON COMMIT PRESERVE ROWS
    ;

    UPDATE a
    FROM DBQLIdx_tmp2 a,DBQLIdx_tmp2 b
    SET Index_code = 'PK'
    WHERE a.db_name = b.db_name
    AND a.table_name = b.table_name
    AND a.Index_code = 'UPI'
    AND a.indextype = 'K'
    AND b.Index_code = 'NUPI'
    AND b.indextype <> 'K'
    ;

    Report: How many times have the indices been used?

    You may need to modify the script:
    • Date filtering (use between for interval)
    • Online/archived DBQL: use commented section for archived

    SELECT
      COALESCE(usg.objectdatabasename,idx.db_name) db
    , COALESCE(usg.objecttablename,idx.table_name) tbl
    , COALESCE(usg.ObjectNum,idx.IndexNumber) idxNo
    , idx.Index_code
    , idx.Indexcolumns Index_columns
    , coalesce(usg.drb,0) Nbr_of_usg
    FROM
    (SELECT objectdatabasename,objecttablename,objecttype,ObjectNum,COUNT (*) drb

    --  Archived DBQL
    --  FROM dbql_arch.dbqlobjtbl_hst WHERE logdate = '2014-02-20' (date)

    --  Online DBQL
      FROM dbc.dbqlobjtbl WHERE cast(collecttimestamp as char(10)) = '2014-02-20'
    AND objecttablename IS NOT NULL
    AND ((objecttype IN ('JIx','Hix') AND objectcolumnname IS NULL)
    OR
    (objecttype IN ('Idx'))
    )
    AND objectnum <> 1
    GROUP BY 1,2,3,4
    ) usg
    FULL OUTER JOIN
    ( SELECT db_name,table_name,Indextype,Uniqueflag,indexcolumns,Indexnumber,Index_code
      FROM DBQLIdx_tmp2 a WHERE indextype NOT IN ('P','Q','K')
    union all
    SELECT databasename,tablename,tablekind,cast(null as char(1))
                  ,cast(null as varchar(1000)),cast(null as smallint)
    ,case when tablekind='I' then 'JIX' else 'HIX' end from dbc.tables where tablekind in ('I','N')
    ) idx ON usg.objectdatabasename = idx.db_name
    AND usg.objecttablename = idx.table_name
    AND ((usg.ObjectNum = idx.IndexNumber) or usg.objecttype IN ('JIx','Hix'))
    ORDER BY 6 DESC
    ;

    DBQL analysis III – Monitor “collect statistics”

    Analyze “collect statistics” in DBQL

    Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on. The “Object” and “SQL” options of DBQL must be switched “on” to use the scripts provided.
    This article is applicable up to V13.10 without modifications; statistics handling changed from V14.

    About Statistics

    “Statistics” is a descriptive object in the Teradata database that is used by the optimizer for transforming SQLs into effective execution plans.
    Statistics reflect the key data demographic information of one or more table column(s).
    These objects should be created and maintained explicitly; the RDBMS will not do it by itself.
    Statistics internally contain a value histogram, which requires the table data (or a sample of it) to be analyzed, and that is an expensive task.

    Summarized: appropriate statistics are required for getting good and effective execution plans for SQLs, but statistics consume resources to be collected or refreshed.

    “Statistics” footprint in DBQL

    When a “statistics” is created or refreshed, it is done by an SQL command: COLLECT STATISTICS ...
    This command creates a log entry in DBQL if the logging is switched on.

    One can track when which “statistics” was collected, and how much CPU and I/O it consumed.
    Those statements are very easy to identify in the central table:

    select * from dbc.DBQLogTbl where StatementType='Collect statistics'

    Analyzing DBQL data 

    Prepare data

    You may need to modify the script:

    • Date (interval)
    • Online/archived: use commented section
    • QueryBand: “JOB” variable is used, modify according to your ETL settings

    create volatile table DBQLStat_tmp1
    as
    (
    sel a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount

    ,case when a.querytext like '% sample %' then 'S' else 'F' end Full_Sample
    ,UserName,(FirstRespTime - StartTime) DAY(4) TO SECOND(4) AS RUNINTERVAL      
    ,(EXTRACT(DAY FROM RUNINTERVAL) * 86400 + EXTRACT(HOUR FROM RUNINTERVAL)  * 3600 + EXTRACT(MINUTE FROM RUNINTERVAL)  * 60 + EXTRACT(SECOND FROM RUNINTERVAL) ) (decimal(10,1)) Duration
    ,b.ObjectDatabaseName DatabaseName,b.ObjectTableName TableName,c.ObjectColumnName ColumnName
    ,case when d.SQLTextInfo like any ('%"PARTITION"%', '%,PARTITION %', '%,PARTITION,%', '% PARTITION,%', '% PARTITION %', '%(PARTITION,%', '%(PARTITION %', '%,PARTITION)%', '% PARTITION)%', '%(PARTITION)%') then 'Y' else 'N' end inclPartition
    ,CAST((case when index(queryband,'JOB=') >0 then  substr(queryband,index(queryband,'JOB=') ) else '' end) AS VARCHAR(500)) tmp_Q
    ,case when queryband = '' then 'N/A'
             when tmp_q = '' then '-Other'
    else CAST( (substr(tmp_Q,characters('JOB=')+1, nullifzero(index(tmp_Q,';'))-characters('JOB=')-1)) AS VARCHAR(500)) end QB_info
    ,sum(1) over (partition by a.procid,a.Queryid order by c.ObjectColumnName, a.QueryID rows unbounded preceding) Rnk
    from
    /* For achived tables
         dbql_arch.DBQLogTbl_hst       a
    join dbql_arch.DBQLObjTbl_hst      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID and a.logDate=b.logDate
    left join dbql_arch.DBQLObjTbl_hst c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID and a.logDate=c.logDate
    join dbql_arch.DBQLSQLTbl_hst      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID and a.logDate=d.logDate
    where a.logDate=1140113
    */
    /*end*/
    /* For online tables */
         dbc.DBQLogTbl       a
    join dbc.DBQLObjTbl      b on b.ObjectType='Tab' and a.procid=b.procid and a.QueryID=b.QueryID
    left join dbc.DBQLObjTbl c on c.ObjectType='Col' and a.procid=c.procid and a.QueryID=c.QueryID
    join dbc.DBQLSQLTbl      d on d.SQLRowNo=1       and a.procid=d.procid and a.QueryID=d.QueryID
    where cast(cast(a.starttime as char(10)) as date) = '2014-01-13' (date)
    /*end*/
    and a.StatementType='Collect statistics'
    ) with data
    primary index (procId,QueryId)
    on commit preserve rows
    ;

    create volatile table DBQLStat
    as
    (
    WITH RECURSIVE rec_tbl
    (
     procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,SColumns
    )
    AS
    (
    select
     procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,inclPartition,ColumnName,Rnk,cast(case when ColumnName is null and inclPartition='Y' then '' else '('||ColumnName end as varchar(10000)) SColumns
    from DBQLStat_tmp1 where Rnk=1
    UNION ALL
    select
      a.procId,a.QueryId,a.StartTime,a.AMPCpuTime,a.TotalIOCount,a.Duration,a.Full_Sample,a.UserName,a.DatabaseName,a.TableName,a.QB_info,a.inclPartition,a.ColumnName,a.Rnk,b.SColumns ||','||a.ColumnName
    from DBQLStat_tmp1     a
    join rec_tbl b on a.procId=b.ProcId and a.QueryId=b.QueryID and a.Rnk=b.Rnk+1
    )
    select   procId,QueryId,StartTime,AMPCpuTime,TotalIOCount,Duration,Full_Sample,UserName,DatabaseName,TableName,QB_info,Rnk NumOfColumns
            ,case when SColumns = '' then '(PARTITION)' else SColumns || case when inclPartition='Y' then ',PARTITION)' else ')' end end StatColumns
    from rec_tbl qualify sum(1) over (partition by procid,queryid order by Rnk desc, QueryID rows unbounded preceding) = 1
    ) with data
    primary index (procid,queryid)
    on commit preserve rows
    ;

    Reports

    • How many statistics have been collected, and at what resource cost?

    select
      UserName /*Or: DatabaseName*//*Or: Full_sample*/
    , count(*) Nbr
    , sum(AMPCpuTIme) CPU
    , sum(TotalIOCount) IO
    from DBQLStat
    group by 1
    order by 1
    ;

    • Which statistics have been collected multiple times?
      (If several days are in the prepared data, frequency can be determined; remove the “qualify”)

    select a.*,
    sum(1) over (partition by databasename,tablename,statcolumns)  Repl
    from DBQLStat a

    /* Comment for frequency report*/
    qualify sum(1) over (partition by databasename,tablename,statcolumns) > 1

    /*end*/
    order by repl desc, databasename,tablename,statcolumns
    ;

    PRISE Tuning Assistant use cases

    What is PRISE Tuning Assistant (PTA) good for?

    PRISE developed this tool to help DBAs and developers solve performance issues quickly and easily. In this post I describe a couple of typical situations when it can assist you.
    A full functioning trial version of PRISE Tuning Assistant is available for download.

    In general, the PTA application consists of two main functions:

    • Query finder
      Enables applying lots of search criteria like: thresholds, skewness, queryband, SQL text, Explain text, used objects, time window, etc.
    • Query analyzer
      Unfolds the execution process of one selected query and helps understand the query and its performance bottlenecks. It collects all the required supplementary information like table sizes, DDLs, statistics, etc. in one comprehensive dashboard, and provides a “walking tour” feature with which you can roam the execution plan.
    All this enables even an averagely skilled Teradata professional to find which queries should be optimized and to figure out the weak points of even a complex query in minutes.

    PTA’s key architectural properties:

    • Supports active DBQL (dbc.dbqlogtbl and friends…) and also archived DBQL tables.
    • Optimized DBQL querying (fast response times and low resource usage)
    • Requires read-only access to the database
    • Portable, multi platform Java application, requires no installation

    Let’s see typical use cases of the application.

    Find the TOP consuming queries

    The optimization’s “low hanging fruits” are the top consuming queries, that is, the CPU or I/O guzzling SQL statements. They are often few in number, yet consume a significant percentage of the total system resources, as I mentioned in this post.

    How to find them?
    Simply set your focused time interval and select “CPU”, “I/O” or “Duration” (for the long running ones) as the TOP criteria, then click the “Get DBQL data” button. You will be given the list of the qualifying queries in descending order of your TOP criteria. Afterwards you can sort the list by the other KPIs, like:

    • Skewness
    • Spool usage
    • LHR/RHL (Larry Higa ratio)
    • QueryBand (and specific variables of it)
    • Etc.

    Skewed CPU or I/O usage

    Teradata is a massively parallel system. It can utilize its full power only if the work is spread evenly across the AMPs; otherwise (skewed queries) the response time increases dramatically while the virtual resource consumption can remain low.

    How to find the bad guys?
    Start the same way as in the “TOP queries” case, but choose “CPU Impact” or “I/O Impact” as the TOP criteria. This will return those queries that have the highest “Impact” (skew-normalized resource consumption) on the system.

    “No more spool space”

    In fact, Teradata uses spool limiting for cancelling queries that have bad plans. Therefore “Failure 2646: No more spool space” usually means a bad query execution plan, and the solution is usually not raising the spool barrier. Yet those spool limits are often raised until the query is able to finish, and the bad queries throttle the system… Why does a query usually run out of spool?

    • Duplicates a large spool/table to all AMPs
      The optimizer thinks it is small
    • Redistributes a spool/table by a skewed value
      The optimizer thinks it is not skewed
    • Processes (copies to spool) unreasonably large number of records

    The first two cases are usually caused by missing or stale statistics, which can easily be found and supplied.

    How to select?

    Start the same way as in the “TOP queries” case, but choose “Spool” as the TOP criteria. This will return those queries that have the highest spool consumption, and the SQL dashboard will help you figure out which statistics should be collected. In addition you can filter for “Error 2646” with the Error code option, which will return those queries that actually reached the spool limit.

    Find a specific query

    The DBA gets a question: “My query runs slow, please help me!” – occasionally with the SQL attached in the email. PTA lets you figure out which query it was in the DBQL. You can filter on:

    • Username
    • Session ID
    • Full query text (with LIKE)
    • QueryID (internal DBQL ID)
    • QueryBand (with LIKE)
    • Error code

    Search disliked plan elements

    PTA enables text search in the Explain text of the queries. E.g. quickly find:

    • Product joins
    • Sliding-window joins
    • “Redistributed by”
    If this is combined with a CPU and/or I/O limitation (CPU >=; I/O >= criteria), we can get those queries where these plan elements really cause a problem.

    Object usage

    We have a table or index in the database, and we look for all the queries that utilize that object.
    It is very useful for:

    • Preparing for a PI change
      A PI change is usually triggered by skewed data distribution, but the more important factor is the access path it provides. Before you change a PI, you need to examine the workload accessing that table: what are the typical filter/join criteria, and does the current/new PI support a good access path? PTA enables you to do this quickly and easily.
    • Determine usage of a table
      How many times is the table accessed a day? Is it used extensively, or just kept loaded for nothing?
    • Determine usage of a hash/join index
      A hash/join index requires processing and storage resources. Is it well utilized or not?
    How?
    Simply set the “Object used” filter criteria. In addition you can use the “Thru join/hash index” option to include data accesses through join/hash indices.

    Recurring queries

    Which queries should be optimized? Maybe there are medium consumers that run many times a day and consume a lot of resources in total. Use the “Recurring query finder” function of PTA to see aggregated information: total and average Duration, CPU and I/O.

    What was a query run by?

    If your ETL and OLAP systems use QueryBand properly, you can set specific variables (a maximum of 6) to be parsed into separate columns in the resulting query list. Sort the result by them, and find which queries of a load job should be optimized.
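
    A hedged example of how a load job could label itself so that these variables can be parsed later (the variable names, like JOB and STEP, are just conventions that have to match what you configure in PTA or in your DBQL scripts):

    SET QUERY_BAND = 'JOB=daily_sales_load;STEP=load_order_item;' FOR SESSION;

    /* ... the job's SQL statements run here, all tagged in DBQL ... */

    SET QUERY_BAND = NONE FOR SESSION;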

    DBQL analysis II. – Skewness – The big resource robbery detection

    The big resource robbery detection

    Skewness

    Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.

     

    About Skew

    If you are not comfortably familiar with Skewness or DBQL, please read the corresponding posts first.

    As you can see the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.

    Skewness is a good indicator for highlighting those workloads that waste lots of valuable CPU and I/O resources because of inefficient parallelism.

    Find “Top skewed queries” in DBQL

    With this SQL you can filter top impacting queries (replace the date value or maybe you have to adjust the date filtering according to local settings):

    select top 50
          ProcID
        , QueryID
        , AMPCPUTime
        , MaxAMPCPUTime * (hashamp () + 1) CPUImpact
        , CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
        , TotalIOCount
        , MaxAMPIO * (hashamp () + 1) IOImpact
        , CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
        , AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
        , TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
        , ParserCPUTime
        , Queryband
        , Substr(QueryText,1,2000) QueryText
    from
    /* For archived DBQL
        dbql_arch.dbqlogtbl_hst 
    where
        logdate=1131201
        and ampcputime > 0
    */
    /* For online DBQL*/
        dbc.dbqlogtbl 
    where
        cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
        and ampcputime>0
    order by 
        CPUImpact desc

    Explanation of the extra fields:

    • ParserCPUTime: the time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random AMP sampling has to be done.
    • LHR/RHL: Larry Higa ratio (and its inverse). An empirical index that shows the CPU vs I/O rate. By experience it should usually be around one (the exact value can differ depending on your system configuration, but it is a constant). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but that is a different dimension than skew.
    • QueryBand: labels that sessions use to identify themselves within the DBQL logs
    • QueryText: the first characters of the query (how many depends on the DBQL log settings)

    OK, we’ve listed the terrible top consumers, but what’s next?
    We have to identify those queries. If your ETL and Analytics software is configured to use QueryBand properly (this area deserves a separate post…), you can find which job or report issued that SQL; otherwise you can rely on the QueryText field.

    If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:

    select 
        SQLTextInfo 
    from 
        dbc.dbqlsqltbl 
    where 
        procid = <procid> 
        and queryid = <queryid> 
    order by 
        SQLRowNo asc

    You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will have very ugly formatting; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

    System level Skewness

    Totals

    We have found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?

    The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log AMP-level details of the query execution, only the most important summary metrics.

    We also cannot account for the situation when several skewed queries run at the same time but peak on different AMPs. This reduces the system-level resource waste, but it is hard to include in the calculation; anyway, its probability and effect are negligible.

    select
          sum(AMPCPUTime) AMPCPUTimeSum
        , sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
        , sum(TotalIOCount) TotalIOCountSum
        , sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
        , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
        , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
    from
    /* For archived DBQL
        dbql_arch.dbqlogtbl_hst
    where 
        logdate = '2013-12-18' (date)
        and (ampcputime > 0 or TotalIOCount > 0)
    */
    /* For online DBQL*/
        dbc.dbqlogtbl
    where
        cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
        and (ampcputime > 0 or TotalIOCount > 0)

    Look at the last two columns. That percent of your CPU and I/O goes to the sink…

    Top bad guys

    OK, let’s check how many queries accumulate 5%, 10%, 25%, 50%, 75% and 90% of this loss.
    Here you are (CPU version; transform it to I/O analogously):

    select 
          'How many queries?' as "_"
        , min(limit5) "TOP5%Loss"
        , min(limit10) "TOP10%Loss"
        , min(limit25) "TOP25%Loss"
        , min(limit50) "TOP50%Loss"
        , min(limit75) "TOP75%Loss"
        , min(limit90) "TOP90%Loss"
        , max(rnk) TotalQueries
        , sum(ResourceTotal) "TotalResource"
        , sum(ResourceImpact) "ImpactResource"
    from
    (
        select
             case when ResRatio < 5.00 then null else rnk end limit5
            ,case when ResRatio < 10.00 then null else rnk end limit10
            ,case when ResRatio < 25.00 then null else rnk end limit25
            ,case when ResRatio < 50.00 then null else rnk end limit50
            ,case when ResRatio < 75.00 then null else rnk end limit75
            ,case when ResRatio < 90.00 then null else rnk end limit90
            ,rnk
            , ResourceTotal
            , ResourceImpact
        from
        (
            select
                  sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
                , sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
                , subtotalRes *100.00 / totalRes Resratio
                , sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
                , ResourceTotal
                , ResourceImpact
            from
            (
                select
                    AMPCPUTime ResourceTotal
                    , (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
                    , ResourceImpact - ResourceTotal ResourceLoss
                /* For archived DBQL
                from 
                    dbql_arch.dbqlogtbl_hst 
                where 
                    logdate=1131207
                    and ampcputime > 0
                */
                /* For online DBQL*/
                from 
                    dbc.dbqlogtbl
                where
                    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
                    and ampcputime > 0
            ) x
        ) y
    ) z
    group by 1

    I expect you are a bit shocked now by how few queries waste how much golden resource.

    I think we will agree that it is worth tuning those dozen queries;
    you can save on the order of $100K..M USD for your company annually, am I right?

    PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

    Typical reasons of skewness – in a nutshell

    • Skewed tables: Bad choice of PI, Skewed data 
    • Bad execution plans (typically skewed redistributions)
      • Bad data model (normalization,data types,PI, etc.)
      • Missing or stale statistics
      • Too many joins (break up the query!)
    • Hash collision (load time problem)

    Interpreting Skewness

    What does Skew metric mean?

    Overview

    You can see the words “Skewness” or “Skew factor” in a lot of places regarding Teradata: documents, applications, etc. Skewed table, skewed CPU. It means something is wrong, but what does it explicitly mean? How should we interpret it?

    Let’s do some explanation and a bit of simple maths.
    Teradata is a massively parallel system, where uniform units (AMPs) do the same tasks on the data parcel they are responsible for. In an ideal world all AMPs share the work equally, and no one has to work more than the average. The reality is far colder: it is a rare situation when this equality (called “even distribution”) exists.
    It is obvious that uneven distribution causes poor efficiency in using the parallel infrastructure.
    But how bad is the situation? Exactly that is what Skewness characterizes.

    Definitions

    Let “RESOURCE” mean the amount of resource (CPU, I/O, PERM space) consumed by an AMP.
    Let AMPno be the number of AMPs in the Teradata system.

    Skew factor := 100 – ( AVG ( “RESOURCE” ) / NULLIFZERO ( MAX (“RESOURCE”) ) * 100 )

    Total[Resource] := SUM(“RESOURCE”)

    Impact[Resource] := MAX(“RESOURCE”) * AMPno

    Parallel Efficiency := Total[Resource] / Impact[Resource] * 100

    or with some transformation:

    Parallel Efficiency := 100 – Skew factor
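
    As a quick illustration of these formulas on PERM space (a sketch only; it uses dbc.TableSizeV, which is dbc.TableSize on older releases):

    select DatabaseName, TableName
         , sum(CurrentPerm) TotalPerm
         , max(CurrentPerm) * (hashamp() + 1) ImpactPerm
         , cast(100 - avg(CurrentPerm) * 100.0
                / nullifzero(max(CurrentPerm)) as integer) "Skew factor"
         , cast(sum(CurrentPerm) * 100.0
                / nullifzero(max(CurrentPerm) * (hashamp() + 1)) as integer) "Parallel Eff%"
    from dbc.TableSizeV
    group by 1,2
    order by 5 desc;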

    Analysis

    Codomain

    0 <= “Skew factor” < 100

    “Total[Resource]” <= “Impact[Resource]”

    0 < “Parallel Efficiency” <= 100

    Meaning

    Skew factor: this percent of the consumed real resources is wasted.
    E.g. a 1 Gbyte table with a skew factor of 75 will allocate 4 Gbytes*

    Total[Resource]: virtual resource consumption, the simple sum of the individual resource consumptions, measured on the AMPs as if they were independent systems

    Impact[Resource]: real resource consumption impacted on the parallel infrastructure

    Parallel Efficiency: as the name says. E.g. Skew factor = 80 means 20% parallel efficiency

    * Theoretically, a complementary workload (one that consumes less resources on exactly those AMPs where my load peaks) could compensate for the parallel inefficiency from a system point of view, but its probability tends to zero.

    Illustration

    The “Average” level indicates the mathematical sum of the AMP-level resource consumptions (Total[Resource]), while the “Peak” is the real consumption from the “parallel system view” (Impact[Resource]).
    On finding skewed tables I will write a separate post later.

    PRISE Tuning Assistant helps you to find queries using CPU or I/O and helps to get rid of skewness.

    Using Partitioned Primary Index

    How to use partitioned primary index (PPI)

    This post is about row partitioning and will not discuss columnar.

    What is partitioning?

    To explain it correctly, let’s get back to the basics…
    Each Teradata table (except the NoPI type) has a “Primary Index”, aka PI, which is not a physical index but rather a logical construct: one or more columns of the table whose values are the input of the hashing method. The hash value determines two things:

    • Which AMP will store the record
    • Storing order of the records within the AMPs

    If the PI is non-partitioned then the records are stored in order of hash value of PI.

    If you use (row) partitioning, you define it at the Primary Index.
    In this case Teradata will associate a 2-byte (or 2/8-byte from V14.10) “partition code” with each record*, and the storing order becomes <partition code>,<hash value> (aka RowKey).
    That way partitions are not sub-tables or other physical objects; they only influence the record storing order.

    * This implies that no more than 64k(2bytes)/9Q(8bytes) partitions can exist. For details read the appropriate Teradata version’s documentation.

    What is the difference between PPI and NUSI?

    NUSI (Non Unique Secondary Index) can serve similar purposes, but it is absolutely different.
    NUSI is a separate subtable, distributed the same way as the base table, but with a different (value) ordering.
    For details please read Teradata documentation.

    How to define?

    Non partitioned table:
    create table tablePI
    (
      Trx_id Integer
    , Trx_dt Date
    )
    PRIMARY INDEX (Trx_id)

    Partitioned table:
    create table tablePPI
    (
      Trx_id Integer
    , Trx_dt Date
    )
    PRIMARY INDEX (Trx_id, Trx_dt**)
    PARTITION BY RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

    Highlights

    • **Partitioning key (Trx_dt here) can be part of the PI or not. This is very important, see below.
    • Partitioning can be single or multiple (MLPPI) levels***
    • RANGE_N or CASE_N functions can be used for determining partition code
    • RANGE_N function has constant interval endpoints and partition length.
    • NO RANGE and UNKNOWN partitions will store the out-of-intervals and null value records respectively

    ***MLPPI is a technique where multiple, nested partitioning is defined on the table. Logically it looks like sub-partitions, but in practice it only influences the calculation of the partition code values, which is still a single linear 2/8-byte value over the whole table.
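
    To see the partition codes actually assigned to the records, you can query the system-derived PARTITION column (a small sketch using the tablePPI example above):

    select PARTITION, min(Trx_dt) FirstDay, max(Trx_dt) LastDay, count(*) Cnt
    from tablePPI
    group by 1
    order by 1;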

    Pros – Cons of using PPI

    PPI is a very useful feature, but not a silver bullet to be used everywhere. Look at the trade-offs:

    • (+) Partition elimination
      Only the relevant partitions are scanned while accessing data
    • (+) Interval filtering is supported
    • (+) Accelerates INSERTs
      If we load incremental data into a populated table, very likely fewer data blocks are affected, since only a few partitions are involved (if the date is the partitioning basis)
    • (-) 2 or 8 bytes extra space allocation per record
    • (-) Compression is not allowed on PartKey column
    • (-) PartKey inclusion problem (see below)
    • (-) Partition elimination works only with literals
      Subselects cause full table scans

    Design aspects

    RANGE_N or CASE_N

    These functions are used to define the partitioning. RANGE_N maps (date or integer) intervals onto partitions, while CASE_N is like a CASE-WHEN-THEN expression, where the outcome determines the partition.
    Typically RANGE_N is used when we partition a transaction table by its date or timestamp, while CASE_N is popular in special cases like categorizing. You can use more columns in the logical expression, but take care: all of them must be used in the filter condition to enable partition elimination.

    RANGE_N: what interval size?

    It depends on the granularity of the data, the granularity of the filtering, and how long an interval should be stored in the table. Usually daily partitioning is ideal.

    RANGE_N: interval extension or intervals in advance?

    If we load transactional data into our partitioned table, the date column we use as partition key is populated with later and later dates, while we have a finite partition range definition.
    Partition ranges can be added to the RANGE_N definition periodically (depending on the version), or we can define partitions far in advance (365 partitions are required for a year; 65k partitions cover ~180 years, which is more than enough). Note that empty partitions do not allocate space.
    One of the methods above should be applied, otherwise the NO RANGE partition will grow extensively, which causes performance degradation due to less effective partition elimination.
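
    A hedged sketch of the periodic range extension (the exact ALTER TABLE rules depend on the Teradata version and on the state of the NO RANGE partition, so check the DDL documentation before relying on it):

    ALTER TABLE tablePPI
    MODIFY PRIMARY INDEX
    ADD RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY;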

    Partitioning Key: include in PI or not?

    This is the tricky point.
    The partitioning key is the column (or columns) that determines the partition, i.e. the one used in the RANGE_N/CASE_N definition. We can include it in the Primary Index or not, as we decide.

    Let’s take an example. We have a master-detail pair of tables, nicely “equi-PI”-ed for effective join:

    CREATE TABLE ORDER_HEAD
    (
      ORDER_NO INTEGER
    , ORDER_DT DATE
    ) UNIQUE PRIMARY INDEX (ORDER_NO);

    CREATE TABLE ORDER_ITEM
    (
      ORDER_NO INTEGER
    , ORDER_ITEM_NO INTEGER
    , PROD_NO INTEGER
    ) PRIMARY INDEX (ORDER_NO);

    We modify ORDER_HEAD’s PI:
    UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)

    PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

    Should we include ORDER_DT or not? Which is better, what is the difference?

    • Not included
      ORDER_HEAD and ORDER_ITEM will have a similar AMP distribution, but a different physical order within the AMPs.
      Each join operation requires either sorting the selected ORDER_HEAD records in spool, or merge joining the ORDER_ITEM table against each selected non-empty partition of ORDER_HEAD sequentially (called a sliding-window merge join)
    • Included
      ORDER_HEAD and ORDER_ITEM will have a different AMP distribution, so each join operation requires redistribution. Why do we not use the same PI on ORDER_ITEM? Because we do not have that column there.

    Neither of the above is acceptable in many cases. What should we do? In this case I would copy ORDER_DT to the ORDER_ITEM table as well, and use the same “included” version of the PI. It requires some more space and some logic at load time, but gives a great gain when accessing the data.
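
    A sketch of that “copy the date column” approach, extending the example tables above (ORDER_DT is populated from ORDER_HEAD at load time):

    CREATE TABLE ORDER_ITEM
    (
      ORDER_NO INTEGER
    , ORDER_ITEM_NO INTEGER
    , PROD_NO INTEGER
    , ORDER_DT DATE
    ) PRIMARY INDEX (ORDER_NO, ORDER_DT)
    PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN);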

    Use cases

    Filtering

    This select will eliminate all partitions except those three:
    select * from ORDER_HEAD where order_dt between '2013-12-12' (date) and '2013-12-14' (date);
    This select will generate a full table scan:
    select * from ORDER_HEAD where cast( order_dt as char(7)) = '2013-12';
    This select will generate a full table scan* as well (sub-query):
    select * from ORDER_HEAD  where order_dt in (select max(calendar_date) from sys_calendar.calendar  where year_of_calendar=2013 and month_of_year=5);
    Why? The optimizer has to determine which partitions are to be accessed at the time of generating the execution plan. At that time it cannot know the result of the subquery. That is it.

    * I got a proper comment on this statement, suggesting I double-check it. And indeed, this information is out-of-date. With current versions of Teradata (V13.10..V14.10) I experienced 3 different results:

    • Full scan
      Eg. sub-query contains a “group by”
    • Dynamic partition elimination
      Sub-query is simple, indicates “enhanced by dynamic partition elimination” section in the plan
    • Plan-time partition elimination
      Literal condition or a very simple sub-query. Parse-time evaluation enables the optimizer to determine which partitions are to be scanned. Plan: “…We do an all-AMPs … step from 3 partitions of…”. I do not know exactly what decides between full scan, dynamic or plan-time elimination… Explanations are welcome.

    Join

    We join two tables, T1 and T2, on T1.a=T2.a (and also on b where both tables have it). The combinations below show what happens depending on whether the tables are partitioned and whether the partitioning key is included in the PI:

    • T1 PI:(a) vs T2 PI:(a)
      Join: T1.a=T2.a
      RowHash match
    • T1 PI:(a) PART(b) vs T2 PI:(a)
      Join: T1.a=T2.a
      T1 sorted by hash(a), or Sliding-window MJ
    • T1 PI:(a) PART(b) vs T2 PI:(a) PART(b)
      Join: T1.a=T2.a
      T1 & T2 sorted by hash(a), or Sliding-window MJ (NxM combinations)
    • T1 PI:(a) PART(b) vs T2 PI:(a,b) PART(b)
      Join: T1.a=T2.a and T1.b=T2.b
      T1 & T2 sorted by RowKey, RowKey based MJ
    • T1 PI:(a,b) PART(b) vs T2 PI:(a)
      Join: T1.a=T2.a
      T1 Redistributed & sorted by hash(a)
    • T1 PI:(a,b) PART(b) vs T2 PI:(a) PART(b)
      Join: T1.a=T2.a
      T1 Redistributed by hash(a), T2 sorted by hash(a), MJ
      Join: T1.a=T2.a and T1.b=T2.b
      T2 Redistributed and sorted by RowKey, RowKey based MJ
    • T1 PI:(a,b) PART(b) vs T2 PI:(a,b) PART(b)
      Join: T1.a=T2.a and T1.b=T2.b
      RowKey based MJ

    Insert

    Let’s take a transaction table like ORDERS. In practice we load it periodically (e.g. daily) with the new increment, which is typically focused on a short interval of the transaction date/time. If the ORDERS table is not partitioned, then the excellent hashing algorithm will spread the new rows evenly over the data blocks of the table, therefore Teradata has to modify far more data blocks than the increment itself would occupy.

    But if the ORDERS table is partitioned, then the physical order of the records is primarily determined by the partition key. This means that the increment will reside in very few partitions, close together, and the insert operation requires approximately the same number of blocks to be written as the increment occupies.

    For more details on PPIs please refer to the documentation of the appropriate Teradata version.

    To be continued…

    DBQL analysis I. – Monitor “Top CPU consumers”

    DBQL analysis

    About DBQL

    What is it?

    DataBase Query Logging.

    It is a nice feature of the Teradata RDBMS, which comprehensively logs the execution of the issued queries – if it is switched on.
    The configuration can be checked/administered e.g. in the Teradata tools or from DBC.DBQLRuleTbl.
    Logging can be set at global/user level, and in respect of the details (see DBQL tables).

    For detailed information please refer to the Teradata documentation of your version.

    DBQL tables

    DBQLogTbl: the central table, 1 record for each query
    DBQLSQLTbl: the whole SQL command, broken up into 30k blocks
    DBQLStepTbl: the execution steps of the query, one row for each step
    DBQLObjTbl: the objects participating in the query, logged at different levels (database, table, column, index, etc.)
    DBQLExplainTbl: the English explain text, broken up into 30k blocks
    DBQLXMLTbl: the explain in XML format, broken up into 30k blocks
    DBQLSummaryTbl: the PEs’ aggregated table, which accounts at the desired level

    The DBQL tables are logically organized into a 1:N structure, where DBQLogTbl is the master entity and the others (except DBQLSummaryTbl) are the children.
    The join fields are ProcID and QueryID together, e.g.:

    ...
    from 
        DBQLogTbl a
        join DBQLStepTbl b on 
            a.ProcID = b.ProcID 
            and 
            a.QueryID = b.QueryID
    ...

    Unfortunately the PIs of the DBQL tables are not in sync with the logical PK-FK relation (even in the latest V14.10), therefore JOINed selects against the online DBQL tables are not optimal.

    Cost of using DBQL

    DBQL basically consumes a negligible amount of processing resources, since it uses cached, batched writes and generates data proportional to the issued queries (the flush rate is a DBSControl parameter).
    It is important to regularly purge/archive the logs from the DBC tables; Teradata has a recommendation for it. This ensures that the PERM space consumption of DBQL remains low.
    In an environment where ~1M SQLs are issued a day, comprehensive logging generates ~8..10G of DBQL data daily without XML and Summary. Fewer SQLs generate proportionally less data.

    It is worth switching on all options except XML and Summary, since the former generates a huge data volume (it roughly doubles it), and the latter is similar to the Acctg info. If you want to utilize them, they should be switched on, of course.

    What is it good for?

    It contains:

    • Query run time, duration
    • Consumed resources
    • Environment info (user, default db, etc)
    • SQL text
    • Explain
    • Step resource info
    • Objects involved
    • Etc.

    One can get a lot of useful aggregated and query specific tuning information, some of them I will share in the blog.

    CPU usage distribution info

    (Everything applies to I/O also, just replace CPU with I/O, AMPCPUTime with TotalIOCount…)

    Do you think Query optimization is rewarding?

    Yes, I know it is hard work to find out why ONE query runs sub-optimally, and what to do about it.

    But guess how many queries consume how many percent of the processing resources (CPU) within a whole day’s workload.
    Take a guess and write down your figures for these CPU% values: 5%, 10%, 25% and 50%.

    And now run the query below, which will give you the answer (replace the date value, or you may have to adjust the date filtering according to local settings):

    select 
        'How many queries?' as "_"
        ,min(limit5) as "TOP5%CPU"
        ,min(limit10) as "TOP10%CPU"
        ,min(limit25) as "TOP25%CPU"
        ,min(limit50) as "TOP50%CPU" 
        ,max(rnk) as TotalQueries
    from
    (
        select
            case when CPURatio < 5.00 then null else rnk end as limit5
            ,case when CPURatio < 10.00 then null else rnk end as limit10
            ,case when CPURatio < 25.00 then null else rnk end as limit25
            ,case when CPURatio < 50.00 then null else rnk end as limit50
            ,rnk
        from
        (
            select
                sum(ampcputime) over (order by ampcputime desc) as totalCPU
                ,sum(ampcputime) over (
                    order by ampcputime desc  
                    rows unbounded preceding
                 ) as subtotalCPU
                ,subtotalCPU * 100.00 / totalCPU as CPUratio
                ,sum(1) over (
                    order by ampcputime desc 
                    rows unbounded preceding
                 ) as rnk
            from
            (
                select *
                /* For archived DBQL
                from dbql_arch.dbqlogtbl_hst 
                where 
                    logdate=1131201 
                    and 
                    ampcputime > 0 
                */
                /* For online DBQL*/
                from dbc.dbqlogtbl 
                where
                    cast(cast(starttime as char(10)) as date) = '2013-12-10' (date)
                    and 
                    ampcputime > 0
            ) x
        ) y
    ) z
    group by 1;

    Are you surprised?
    I bet:

    • Less than 10 queries will consume 5% of the CPU
    • Less than 1% of the queries will consume 50% of the CPU

    Let's calculate.
    How much does your Teradata system cost a year? That money buys its storage and processing capacity.
    If you can save e.g. X% of CPU & I/O and X% of storage (e.g. with MVC optimization), you have saved X% of the price of the Teradata system, in the form of:

    • Improved user experience (earlier loads, faster responses)
    • Free resources for additional reports and applications
    • The possibility to postpone a very expensive Teradata hardware upgrade
    PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

    Accelerate skewed joins

    How to “re-parallelize” skewed joins

    Case description

    Assume that we have 1M customers and 4M transactions, and our top customer produces 2.5% of all transactions. The others produce the remaining 97.5% of the transactions approximately evenly.
    Scroll down to the bottom of the post for the sample table structures and the data generator SQL.

    Our task is to join the “Customer” and “Transact” tables on Customer_id.

    The join

    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    GROUP BY 1;


    We experience a pretty slow execution.
    In ViewPoint we can see that only one AMP is working, while the others are idle.

    What is the problem?
    There are two separate subsets of the Transact table from a “joinability” point of view:

    • “Peak” part (records of the top customer(s)):
      very few customers have very many Transact records. A product join would be cost effective here.
    • “Even” part (records of the other customers):
      many customers have a few Transact records each, spread evenly. A merge join would be ideal here.

    Unfortunately the Optimizer has to decide: only one join method can be chosen for the step. It will choose the merge join, which consumes far less CPU time overall.

    Execution plan looks like this:

     This query is optimized using type 2 profile T2_Linux64, profileid 21.
      1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
         RowHash to prevent global deadlock for D_DB_TMP.t.
      2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
         RowHash to prevent global deadlock for D_DB_TMP.c.
      3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
      4) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
         all-rows scan with a condition of (“NOT (D_DB_TMP.t.Customer_ID IS
         NULL)”) into Spool 4 (all_amps), which is redistributed by the
         hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  Then we do a
         SORT to order Spool 4 by row hash.  The size of Spool 4 is
         estimated with low confidence to be 125 rows (2,125 bytes).  The
         estimated time for this step is 0.01 seconds.
      5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
         RowHash match scan, which is joined to D_DB_TMP.c by way of a
         RowHash match scan.  Spool 4 and D_DB_TMP.c are joined using a
         merge join, with a join condition of (“D_DB_TMP.c.Customer_ID =
         Customer_ID”).  The result goes into Spool 3 (all_amps), which is
         built locally on the AMPs.  The size of Spool 3 is estimated with
         index join confidence to be 125 rows (10,375 bytes).  The
         estimated time for this step is 0.02 seconds.
      6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 5.  The size of Spool 5 is
         estimated with no confidence to be 94 rows (14,758 bytes).  The
         estimated time for this step is 0.02 seconds.
      7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
         an all-rows scan into Spool 1 (all_amps), which is built locally
         on the AMPs.  The size of Spool 1 is estimated with no confidence
         to be 94 rows (8,742 bytes).  The estimated time for this step is
         0.02 seconds.
      8) Finally, we send out an END TRANSACTION step to all AMPs involved
         in processing the request.
      -> The contents of Spool 1 are sent back to the user as the result of
         statement 1.  The total estimated time is 0.07 seconds.

    How to identify

    If you experience an extremely asymmetric AMP load, you can suspect this case.
    Find highly skewed JOIN steps in the DBQL (with all logging options switched on):

    select top 50
        a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.CPUTime) as Skw
        ,a.CPUTime
        ,a.MaxAMPCPUTime * (hashamp()+1) as CoveringCPUTime
        ,b.*
    from dbc.dbqlsteptbl a
    join dbc.dbqlogtbl b on a.procid = b.procid and a.queryid = b.queryid
    where
        StepName = 'JIN'
        and CPUTime > 100
        and Skw > 2
    order by CoveringCPUTime desc;




    (Note: Covering CPU time is <No-of-AMPs> * <Max AMP’s CPU time>. Virtually this amount of CPU is blocked by the query, because the asymmetric load occupies the system for the duration of the slowest AMP. For example, on a 100-AMP system a JOIN step with CPUTime = 200 seconds and MaxAMPCPUTime = 10 seconds has Skw = 10 * 100 / 200 = 5 and a covering CPU time of 1,000 seconds.)

    Or if you suspect a specific query, check the demography of the join field(s) in the “big” table:

    SELECT TOP 100 <Join_field>, count(*) Nbr
    FROM <Big_table> GROUP BY 1 ORDER BY 2 DESC;


    If the top occurrences are dramatically larger than the others (or than the average), the query likely matches this case.

    Solution

    Break the query into two parts: join the top customer(s) separately, and then all the others; finally union the results. (Sometimes an additional modification is also required if the embedding operation(s) – the GROUP BY here – is/are not decomposable along the same parameter; see the aside at the end of this section.)
    First we have to identify the top customer(s):

    SELECT TOP 5 Customer_id, count(*) Nbr
    FROM Transact GROUP BY 1 ORDER BY 2 DESC;

    Customer_id          Nbr
    -----------  -----------
            345       100004
         499873            4
         677423            4
         187236            4
          23482            4
    Replace the original query with this one:

    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    where t.Customer_id in (345)  

    /*
       ID of the top Customer(s). 
       If more customers are salient, list them, but max ~5
    */
    GROUP BY 1
    UNION ALL
    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    where t.Customer_id not in (345)  -- Same customer(s)
    GROUP BY 1
    ;

    Make sure that Customer.Customer_id, Transact.Transaction_id and Transact.Customer_id have statistics!

    This query is more complex: it has more steps and scans the Transact table twice, but it runs much faster; check it yourself.
    But why? And how do we determine which “top” customers are worth handling separately?
    Read ahead.
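 
    Before the explanation, a quick aside on the decomposability caveat mentioned in the Solution. The GROUP BY above splits cleanly, because each Customer_name falls entirely into one branch of the union. If the grouping were on an attribute that both branches share, the partial aggregates would have to be combined in an outer step. A sketch of that pattern follows; the Trans_dt column is hypothetical and does not exist in the sample tables at the bottom of the post:

    SELECT Trans_month, sum(Nbr) as Nbr
    FROM
    (
        SELECT EXTRACT(MONTH FROM t.Trans_dt) AS Trans_month, count(*) AS Nbr
        FROM Customer c
        JOIN Transact t ON c.Customer_id = t.Customer_id
        WHERE t.Customer_id in (345)
        GROUP BY 1
        UNION ALL
        SELECT EXTRACT(MONTH FROM t.Trans_dt) AS Trans_month, count(*) AS Nbr
        FROM Customer c
        JOIN Transact t ON c.Customer_id = t.Customer_id
        WHERE t.Customer_id not in (345)
        GROUP BY 1
    ) x
    GROUP BY 1;

    (If the embedding aggregate were an average instead of a count, you would carry SUM and COUNT in the branches and divide them in the outer step.)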

    Explanation

    Calculation

    Let’s do some maths:
    Assume that we are on a 125 AMP system.
    Customer table contains 1M records with unique ID.
    We have ~4.1M records in the Transact table: 100k for the top customer (ID=345) and 4 for each of the other customers. This matches the ~2.5% share we assumed above.

    If the Transact table is redistributed on hash(Customer_id), then we get ~33k records on each AMP, except AMP(hash(345)), which gets ~133k (33k + 100k).
    That means this AMP has to process ~4x more data than the others, and therefore runs ~4x longer.
    In other words, for 75% of this JOIN step’s duration, 124 of the 125 AMPs do NOTHING with the query.

    Moreover, the preparation and the subsequent steps are also problematic: the JOIN is prepared by a redistribution that produces a strongly skewed spool, and the JOIN’s result, built locally on the AMPs, is skewed as well.
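 
    You can verify the expected per-AMP distribution on the actual data with the hashing functions. A quick sketch that counts how many Transact rows each AMP would receive if the table were redistributed by Customer_id:

    SELECT hashamp(hashbucket(hashrow(Customer_id))) AS AMP_no, count(*) AS Nbr
    FROM Transact
    GROUP BY 1
    ORDER BY 2 DESC;

    The AMP holding hash(345) will stand out with roughly four times the row count of the others.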

    Optimized version

    This query consumes moderately more CPU in total, but the load is distributed evenly across the AMPs, utilizing Teradata’s full parallel capability.
    It also contains a product join, but that is no problem: it joins 1 record to the selected ~100k Transact records, which is lightning fast.

    All in all, look at the execution plan of the broken-up query:

     This query is optimized using type 2 profile T2_Linux64, profileid 21.
      1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
         RowHash to prevent global deadlock for D_DB_TMP.t.
      2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
         RowHash to prevent global deadlock for D_DB_TMP.c.
      3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
      4) We do a single-AMP RETRIEVE step from D_DB_TMP.c by way of the
         unique primary index “D_DB_TMP.c.Customer_ID = 345” with no
         residual conditions into Spool 4 (all_amps), which is duplicated
         on all AMPs.  The size of Spool 4 is estimated with high
         confidence to be 125 rows (10,625 bytes).  The estimated time for
         this step is 0.01 seconds.
      5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an
         all-rows scan, which is joined to D_DB_TMP.t by way of an all-rows
         scan with a condition of (“D_DB_TMP.t.Customer_ID = 345”).  Spool
         4 and D_DB_TMP.t are joined using a product join, with a join
         condition of (“Customer_ID = D_DB_TMP.t.Customer_ID”).  The result
         goes into Spool 3 (all_amps), which is built locally on the AMPs.
         The size of Spool 3 is estimated with low confidence to be 99,670
         rows (8,272,610 bytes).  The estimated time for this step is 0.09
         seconds.
      6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 5.  The size of Spool 5 is
         estimated with no confidence to be 74,753 rows (11,736,221 bytes).
         The estimated time for this step is 0.20 seconds.
      7) We execute the following steps in parallel.
           1) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by
              way of an all-rows scan into Spool 1 (all_amps), which is
              built locally on the AMPs.  The size of Spool 1 is estimated
              with no confidence to be 74,753 rows (22,052,135 bytes).  The
              estimated time for this step is 0.02 seconds.
           2) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
              all-rows scan with a condition of (“D_DB_TMP.t.Customer_ID <>
              3454″) into Spool 9 (all_amps), which is redistributed by the
              hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  The size
              of Spool 9 is estimated with high confidence to be 4,294,230
              rows (73,001,910 bytes).  The estimated time for this step is
              1.80 seconds.
      8) We do an all-AMPs JOIN step from D_DB_TMP.c by way of an all-rows
         scan with a condition of (“D_DB_TMP.c.Customer_ID <> 3454”), which
         is joined to Spool 9 (Last Use) by way of an all-rows scan.
         D_DB_TMP.c and Spool 9 are joined using a single partition hash
         join, with a join condition of (“D_DB_TMP.c.Customer_ID =
         Customer_ID”).  The result goes into Spool 8 (all_amps), which is
         built locally on the AMPs.  The size of Spool 8 is estimated with
         low confidence to be 4,294,230 rows (356,421,090 bytes).  The
         estimated time for this step is 0.72 seconds.
      9) We do an all-AMPs SUM step to aggregate from Spool 8 (Last Use) by
         way of an all-rows scan , grouping by field1 (
         D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
         computed globally, then placed in Spool 10.  The size of Spool 10
         is estimated with no confidence to be 3,220,673 rows (505,645,661
         bytes).  The estimated time for this step is 8.46 seconds.
     10) We do an all-AMPs RETRIEVE step from Spool 10 (Last Use) by way of
         an all-rows scan into Spool 1 (all_amps), which is built locally
         on the AMPs.  The size of Spool 1 is estimated with no confidence
         to be 3,295,426 rows (972,150,670 bytes).  The estimated time for
         this step is 0.32 seconds.
     11) Finally, we send out an END TRANSACTION step to all AMPs involved
         in processing the request.
      -> The contents of Spool 1 are sent back to the user as the result of
         statement 1.  The total estimated time is 11.60 seconds.

    Sample structures

    The table structures (simplified for the example):

    CREATE TABLE Customer
    (
      Customer_ID   INTEGER
    , Customer_name VARCHAR(200)
    )
    UNIQUE PRIMARY INDEX (Customer_id)
    ;

    insert into Customer values (1,'Cust-1');
    Run 20x: 

    insert into Customer
    select
        mx + sum(1) over (order by Customer_id rows unbounded preceding) as id
        ,'Cust-' || trim(id)
    from Customer
    cross join (select max(Customer_id) mx from Customer) x;

    collect statistics using sample on customer column (Customer_id);

    CREATE TABLE Transact
    (
      Transaction_ID   INTEGER
    , Customer_ID      INTEGER
    )
    UNIQUE PRIMARY INDEX (Transaction_id)
    ;

    insert into Transact values (1,1);
    Run 22x: 

    insert into Transact
    select
        mx + sum(1) over (order by Transaction_id rows unbounded preceding) as id
        ,id mod 1000000
    from Transact
    cross join (select max(Transaction_id) mx from Transact) x;

    insert into Transact
    select
        mx + sum(1) over (order by Transaction_id rows unbounded preceding) as id
        ,345
    from Transact t
    cross join (select max(Transaction_id) mx from Transact) x
    where t.Transaction_id < 100000;

    collect statistics using sample on Transact column (Customer_id);
    collect statistics using sample on Transact column (Transaction_id) ;
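 
    Finally, if you prefer not to hard-code the top customer ID(s), a possible variation (a sketch only, not part of the measured example above) collects them into a volatile table first. The 10000-transaction threshold is an arbitrary assumption for this demo data, and with a subquery instead of literal IDs the Optimizer may not produce the same plan as with the hard-coded version, so verify with EXPLAIN:

    CREATE VOLATILE TABLE Top_Customers AS
    (
        SELECT Customer_id
        FROM Transact
        GROUP BY 1
        HAVING count(*) > 10000  /* arbitrary "skewed customer" threshold for this demo */
    ) WITH DATA
    PRIMARY INDEX (Customer_id)
    ON COMMIT PRESERVE ROWS;

    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    WHERE t.Customer_id IN (SELECT Customer_id FROM Top_Customers)
    GROUP BY 1
    UNION ALL
    SELECT Customer_name, count(*)
    FROM Customer c
    JOIN Transact t ON c.Customer_id = t.Customer_id
    WHERE t.Customer_id NOT IN (SELECT Customer_id FROM Top_Customers)
    GROUP BY 1;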