DBQL analysis II. – Skewness – The big resource robbery detection


Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.

 

About Skew

If you are not comfortably familiar with Skewness or DBQL, please read the corresponding posts first.

As you will see, the difference between Impact[Resource] and Total[Resource] is a pure technological loss, which should be minimized (both terms are defined in the "Interpreting Skewness" section below).

Skewness is a good indicator for highlighting those workloads that waste lots of valuable CPU and I/O resources because of inefficient parallelism.

Find “Top skewed queries” in DBQL

With this SQL you can list the top impacting queries (replace the date value; you may also have to adjust the date filtering to your local settings):

select top 50
      ProcID
    , QueryID
    , AMPCPUTime
    , MaxAMPCPUTime * (hashamp () + 1) CPUImpact
    , CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
    , TotalIOCount
    , MaxAMPIO * (hashamp () + 1) IOImpact
    , CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
    , AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
    , TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
    , ParserCPUTime
    , Queryband
    , Substr(QueryText,1,2000) QueryText
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst 
where
    logdate=1131201
    and ampcputime > 0
*/
/* For online DBQL*/
    dbc.dbqlogtbl 
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and ampcputime>0
order by 
    CPUImpact desc

Explanation of extra fields:

ParserCPUTime
Time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random AMP sampling has to be done.
LHR/RHL
Larry Higa ratio (and its inverse). An empirical index that shows the CPU vs. I/O rate. By experience it should usually be around one (the typical value depends on your system configuration, but it is roughly constant for a given system). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption; however, that is a different dimension than skew.
QueryBand
Labels that sessions use to identify themselves within the DBQL logs
QueryText: First 200 characters of the query (depending on DBQL log settings)

OK, we’ve listed the terrible top consumers, but what’s next?
We have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post…), you can find out which job or report issued that SQL; in any case, you can check the QueryText field.
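
For illustration, here is a minimal sketch of how a session can tag itself with a QueryBand and how you can find its queries in DBQL afterwards (the job/step names and key names are made-up examples, not a standard):

/* In the ETL or reporting session, before issuing the workload: */
SET QUERY_BAND = 'JobName=daily_orders_load;Step=load_items;' FOR SESSION;

/* Later, locate the queries of that job in DBQL: */
select ProcID, QueryID, AMPCPUTime, QueryText
from dbc.dbqlogtbl
where QueryBand like '%JobName=daily_orders_load%';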

If you want to get the full SQL text, select it from DBQLSQLTbl (SQL logging needs to be switched on), replacing the appropriate <procid> and <queryid> values:

select 
    SQLTextInfo 
from 
    dbc.dbqlsqltbl 
where 
    procid = <procid> 
    and queryid = <queryid> 
order by 
    SQLRowNo asc

You will get the SQL in several records, broken into ~30K-character blocks; simply concatenate them. Unfortunately the SQL will be poorly formatted; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

System level Skewness

Totals

We have found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resource was wasted due to bad parallel efficiency?

The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log per-AMP information for the query execution, only the most important summary metrics.

We also cannot account for the situation when several skewed queries run at the same time but peak on different AMPs. This reduces the system-level resource waste, but it is hard to calculate; its probability and effect are negligible here anyway.

select
      sum(AMPCPUTime) AMPCPUTimeSum
    , sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
    , sum(TotalIOCount) TotalIOCountSum
    , sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
    , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
    , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst
where 
    logdate = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)
*/
/* For online DBQL*/
    dbc.dbqlogtbl
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)

Look at the last two columns. That percentage of your CPU and I/O goes down the drain…

Top bad guys

OK, let’s check how many queries accumulate 5%, 10%, 25%, 50%, 75% and 90% of this loss.
Here you are (CPU version; adapt it for I/O analogously):

select 
      'How many queries?' as "_"
    , min(limit5) "TOP5%Loss"
    , min(limit10) "TOP10%Loss"
    , min(limit25) "TOP25%Loss"
    , min(limit50) "TOP50%Loss"
    , min(limit75) "TOP75%Loss"
    , min(limit90) "TOP90%Loss"
    , max(rnk) TotalQueries
    , sum(ResourceTotal) "TotalResource"
    , sum(ResourceImpact) "ImpactResource"
from
(
    select
         case when ResRatio < 5.00 then null else rnk end limit5
        ,case when ResRatio < 10.00 then null else rnk end limit10
        ,case when ResRatio < 25.00 then null else rnk end limit25
        ,case when ResRatio < 50.00 then null else rnk end limit50
        ,case when ResRatio < 75.00 then null else rnk end limit75
        ,case when ResRatio < 90.00 then null else rnk end limit90
        ,rnk
        , ResourceTotal
        , ResourceImpact
    from
    (
        select
              sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
            , sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
            , subtotalRes *100.00 / totalRes Resratio
            , sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
            , ResourceTotal
            , ResourceImpact
        from
        (
            select
                AMPCPUTime ResourceTotal
                , (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
                , ResourceImpact - ResourceTotal ResourceLoss
            /* For archived DBQL
            from 
                dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131207
                and ampcputime > 0
            */
            /* For online DBQL*/
            from 
                dbc.dbqlogtbl
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
                and ampcputime > 0
        ) x
    ) y
) z
group by 1

I expect you are a bit shocked now at how few queries waste how much precious resource.

I think we will agree that it is worth tuning that dozen or so queries,
saving on the order of $100K..1M USD for your company annually, am I right?

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

Typical reasons for skewness – in a nutshell

  • Skewed tables: bad choice of PI, skewed data
  • Bad execution plans (typically skewed redistributions)
    • Bad data model (normalization, data types, PI, etc.)
    • Missing or stale statistics
    • Too many joins (break up the query!)
  • Hash collision (a load-time problem)

Interpreting Skewness

What does the Skew metric mean?

Overview

You can see the words “Skewness” or “Skew factor” in a lot of places regarding Teradata: documents, applications, etc. Skewed table, skewed CPU. It means something is wrong, but what exactly? How should it be interpreted?

Let’s do some explanation and a bit of simple maths.
Teradata is a massively parallel system, where uniform units (AMPs) do the same tasks on the data parcel they are responsible for. In an ideal world all AMPs share the work equally; no one has to work more than the average. The reality is far colder: it is a rare situation when this equality (called “even distribution”) exists.
It is obvious that uneven distribution causes poor utilization of the parallel infrastructure.
But how bad is the situation? Exactly that is what Skewness characterizes.

Definitions

Let “RESOURCE” mean the amount of resource (CPU, I/O, PERM space) consumed by an AMP.
Let AMPno be the number of AMPs in the Teradata system.

Skew factor := 100 – ( AVG ( “RESOURCE” ) / NULLIFZERO ( MAX (“RESOURCE”) ) * 100 )

Total[Resource] := SUM(“RESOURCE”)

Impact[Resource] := MAX(“RESOURCE”) * AMPno

Parallel Efficiency := Total[Resource] / Impact[Resource] * 100

or with some transformation:

Parallel Efficiency := 100 – Skew factor
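
To make the definitions more tangible, here is a minimal sketch that applies them to PERM space, using the standard DBC.TableSizeV dictionary view (one row per table per AMP; the database and table names are placeholders):

select
      DatabaseName
    , TableName
    , sum(CurrentPerm) TotalPerm                                               /* Total[Resource] */
    , max(CurrentPerm) * (hashamp() + 1) ImpactPerm                            /* Impact[Resource] */
    , cast(100 - avg(CurrentPerm) * 100 / nullifzero(max(CurrentPerm)) as integer) "Skew%"
from
    dbc.TableSizeV
where
    DatabaseName = 'MYDB'        /* placeholder */
    and TableName = 'MYTABLE'    /* placeholder */
group by 1,2;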

Analysis

Codomain

0 <= “Skew factor” < 100

“Total[Resource]” <= “Impact[Resource]”

0<“Parallel Efficiency”<=100

Meaning

Skew factor: this percentage of the real (Impact) resource consumption is wasted.
E.g. a 1Gbyte table with a skew factor of 75 will allocate 4Gbytes*

Total[Resource]: virtual resource consumption, the simple sum of the individual AMPs’ resource consumptions, measured as if the AMPs were independent systems

Impact[Resource]: real resource consumption imposed on the parallel infrastructure

Parallel Efficiency: as the name says. E.g. Skew factor = 80 means 20% parallel efficiency.

* Theoretically, a workload with complementary characteristics (one that consumes less resources on exactly those AMPs where my load has its excess) could compensate the parallel inefficiency from a system point of view, but the probability of that tends to zero.
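
A quick worked example for the 1Gbyte case above, assuming a 20-AMP system: a skew factor of 75 means AVG(“RESOURCE”) is 25% of MAX(“RESOURCE”). Total = 1Gbyte, so the average AMP holds 1GB / 20 = 50Mbytes, and the peak AMP holds 50MB / 0.25 = 200Mbytes. Impact = MAX * AMPno = 200MB * 20 = 4Gbytes, i.e. the parallel efficiency is 1GB / 4GB = 25%.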

Illustration

In the illustration, the “Average” level corresponds to the mathematical sum of the AMP-level resource consumptions (Total[Resource]), while the “Peak” level corresponds to the real consumption from the “parallel system view” (Impact[Resource]).
On finding skewed tables I will write a separate post later.

PRISE Tuning Assistant helps you to find the queries wasting CPU or I/O and helps to get rid of skewness.

Using Partitioned Primary Index

How to use partitioned primary index (PPI)

This post is about row partitioning and will not discuss columnar.

What is partitioning?

To explain it correctly, let’s get back to the basics…
Each Teradata table (except NoPI tables) has a “Primary Index”, aka PI, which is not a physical index but rather a logical construct: one or more columns of the table that give the input for the hashing method. The hash value determines two things:

  • Which AMP will store the record
  • Storing order of the records within the AMPs

If the PI is non-partitioned, the records are stored in order of the hash value of the PI.

If you use (row) partitioning, you define it on the Primary Index.
In this case Teradata associates a 2-byte (or 2/8-byte from V14.10) “partition code” with each record*, and the storing order becomes <partition code>,<hash value> (aka RowKey).
That way partitions are not sub-tables or other physical objects; they only influence the record storing order.

* This implies that no more than 64k(2bytes)/9Q(8bytes) partitions can exist. For details read the appropriate Teradata version’s documentation.

What is the difference between PPI and NUSI?

A NUSI (Non Unique Secondary Index) can serve similar purposes, but it is something completely different.
A NUSI is a separate subtable with a PI analogous to the base table’s, but with a different (value) ordering.
For details please read Teradata documentation.

How to define?

Non partitioned table:
create table tablePI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id)

Partitioned table:
create table tablePPI
(
  Trx_id Integer
, Trx_dt Date
)
PRIMARY INDEX (Trx_id, Trx_dt**)
PARTITION BY RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

Highlights

  • **Partitioning key (Trx_dt here) can be part of the PI or not. This is very important, see below.
  • Partitioning can be single-level or multi-level (MLPPI)***
  • RANGE_N or CASE_N functions can be used for determining the partition code
  • The RANGE_N function requires constant interval endpoints and interval length.
  • The NO RANGE and UNKNOWN partitions store the out-of-range and NULL-valued records respectively

***MLPPI is a technique where multiple, nested levels of partitioning are defined on the table. Logically it looks like sub-partitions, but in practice it only influences the calculation of the partition code values, which remain a single linear 2/8-byte value across the whole table.
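
As an illustration, a minimal MLPPI sketch combining RANGE_N and CASE_N (the Region_cd column and its values are made up; check the exact syntax in your Teradata version's documentation):

create table tableMLPPI
(
  Trx_id Integer
, Trx_dt Date
, Region_cd Byteint
)
PRIMARY INDEX (Trx_id)
PARTITION BY (
  RANGE_N(Trx_dt BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY, NO RANGE, UNKNOWN)
, CASE_N(Region_cd = 1, Region_cd = 2, NO CASE, UNKNOWN)
);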

Pros – Cons of using PPI

PPI is a very useful feature, but not a silver bullet to be used everywhere. Look at the trade-offs:

  • (+) Partition elimination
    Only the relevant partitions are scanned while accessing data
  • (+) Interval filtering is supported
  • (+) Accelerates INSERTs
    If we load increment data into a populated table, very likely fewer data blocks are affected, since only a few partitions are involved (if date is the partitioning basis)
  • (-) 2 or 8 bytes of extra space allocation per record
  • (-) Compression is not allowed on the partitioning key column
  • (-) Partitioning key inclusion problem (see below)
  • (-) Partition elimination works only with literals
    Subselects cause full table scans (but see the updated remark in the “Filtering” section below)

Design aspects

RANGE_N or CASE_N

These functions are used to define the partitioning. RANGE_N maps date (integer) intervals onto partitions, while CASE_N is like a CASE-WHEN-THEN expression, where the outcome determines the partition.
Typically RANGE_N is used when we partition a transaction table by its date or timestamp, while CASE_N is popular in special cases like categorization. You can use several columns in the logical expression, but take care: all of them must appear in the filter condition to enable partition elimination.

RANGE_N: what interval size?

It depends on the granularity of the data, the granularity of the filtering and how long an interval is to be stored in the table. Usually daily partitioning is ideal.

RANGE_N: interval extension or intervals in advance?

If we load transactional data into our partitioned table, the date column we use as the partitioning key is populated with later and later dates, while we have a finite partition range definition.
Partition ranges can be added to the RANGE_N definition periodically (the method depends on the version), or we can define partitions far in advance (365 partitions are required for a year; 65k partitions cover ~180 years, which is more than enough). Note that empty partitions do not allocate space.
One of these methods should be applied, otherwise the NO RANGE partition will grow excessively, which causes performance degradation due to less effective partition elimination.
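
A hedged sketch of the periodic range extension mentioned above, using ALTER TABLE … MODIFY; the exact options and restrictions vary by Teradata version (e.g. the handling of the NO RANGE partition), so check the documentation before use:

ALTER TABLE tablePPI
MODIFY PRIMARY INDEX
ADD RANGE BETWEEN DATE '2014-01-01' AND DATE '2014-12-31' EACH INTERVAL '1' DAY;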

Partitioning Key: include in PI or not?

This is the tricky point.
The partitioning key is the column (or columns) that determines the partition, i.e. the one used in the RANGE_N/CASE_N definition. We can include it in the Primary Index or not; it is our decision.

Let’s take an example. We have a master-detail pair of tables, nicely “equi-PI”-ed for effective join:

CREATE TABLE ORDER_HEAD
(
  ORDER_NO INTEGER
, ORDER_DT DATE
) UNIQUE PRIMARY INDEX (ORDER_NO);

CREATE TABLE ORDER_ITEM
(
  ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO INTEGER
) PRIMARY INDEX (ORDER_NO);

We modify ORDER_HEAD’s PI and add partitioning:
UNIQUE PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN)

Should we include ORDER_DT or not? Which is better, what is the difference?

  • Not include
    ORDER_HEAD and ORDER_ITEM will have the same AMP distribution, but a different physical order within the AMPs. (Note that in this case ORDER_HEAD’s PI can no longer be UNIQUE, since a unique PI must include the partitioning column.)
    Each join either requires sorting the selected ORDER_HEAD records by hash in spool, or the ORDER_ITEM table will be merge joined against each selected non-empty partition of ORDER_HEAD sequentially (called a sliding-window merge join).
  • Include
    ORDER_HEAD and ORDER_ITEM will have different AMP distributions, so each join requires a redistribution. Why do we not use the same PI for ORDER_ITEM? Because we do not have the ORDER_DT column there.

Neither of the above is acceptable in many cases. What should we do? In this case I would copy ORDER_DT to the ORDER_ITEM table as well, and use the same “Included” version of the PI on both tables. This requires some more space and some load-time logic, but gives a great gain when accessing data.
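
A sketch of how ORDER_ITEM could look in that case (ORDER_DT is denormalized into the detail table purely to enable the common PI and partitioning):

CREATE TABLE ORDER_ITEM
(
  ORDER_NO INTEGER
, ORDER_ITEM_NO INTEGER
, PROD_NO INTEGER
, ORDER_DT DATE
) PRIMARY INDEX (ORDER_NO, ORDER_DT)
PARTITION BY RANGE_N(ORDER_DT BETWEEN DATE '2010-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY , NO RANGE, UNKNOWN);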

Use cases

Filtering

This select will eliminate all partitions except those three:
select * from ORDER_HEAD where order_dt between '2013-12-12' (date) and '2013-12-14' (date);
This select will generate an all-rows scan:
select * from ORDER_HEAD where cast( order_dt as char(7)) = '2013-12';
This select (with a sub-query) will also generate an all-rows scan*:
select * from ORDER_HEAD  where order_dt in (select max(calendar_date) from sys_calendar.calendar  where year_of_calendar=2013 and month_of_year=5);
Why? The optimizer has to determine which partitions to access at the time it generates the execution plan. At that time it cannot know the result of the subquery. That is it.

* I got a pertinent comment on this, asking me to double check. Yes, right, this information is out-of-date. With current versions of Teradata (V13.10..V14.10) I experienced 3 different results:

  • Full scan
    E.g. the sub-query contains a “group by”
  • Dynamic partition elimination
    The sub-query is simple; the plan contains an “enhanced by dynamic partition elimination” section
  • Plan-time partition elimination
    Literal condition or a very simple sub-query. Parse-time evaluation enables the optimizer to determine which partitions are to be scanned. Plan: “…We do an all-AMPs … step from 3 partitions of…”. I do not know exactly what decides between full scan, dynamic or plan-time elimination… Explanations are welcome.
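
If in doubt, the simplest check is to EXPLAIN the statement and look for the partition wording quoted above, e.g.:

explain
select * from ORDER_HEAD
where order_dt between DATE '2013-12-12' and DATE '2013-12-14';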

Join

We join two tables, T1 and T2, on T1.a=T2.a (and on T1.b=T2.b where possible). The list below shows what join plan to expect, depending on whether the tables are partitioned and whether the partitioning key (b) is included in the PI (MJ = merge join):

  • T1 PI:(a) joined to T2 PI:(a)
    Join: T1.a=T2.a – direct RowHash match MJ
  • T1 PI:(a) PART(b) joined to T2 PI:(a)
    Join: T1.a=T2.a – T1 sorted by hash(a), or sliding-window MJ
  • T1 PI:(a) PART(b) joined to T2 PI:(a) PART(b)
    Join: T1.a=T2.a – T1 & T2 sorted by hash(a), or sliding-window MJ (NxM partition combinations)
  • T1 PI:(a) PART(b) joined to T2 PI:(a,b) PART(b)
    Join: T1.a=T2.a and T1.b=T2.b – T1 & T2 sorted by RowKey, RowKey-based MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a)
    Join: T1.a=T2.a – T1 redistributed & sorted by hash(a), then MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a) PART(b)
    Join: T1.a=T2.a – T1 redistributed by hash(a), T2 sorted by hash(a), then MJ
    Join: T1.a=T2.a and T1.b=T2.b – T2 redistributed and sorted by RowKey, RowKey-based MJ
  • T1 PI:(a,b) PART(b) joined to T2 PI:(a,b) PART(b)
    Join: T1.a=T2.a and T1.b=T2.b – RowKey-based MJ

Insert

Let’s take a transaction table like ORDERS. In practice we load it periodically (e.g. daily) with a new increment, which is typically concentrated on a short interval of the transaction date/time. If the ORDERS table is not partitioned, the hashing algorithm spreads the new records evenly across all data blocks of the table, therefore Teradata has to modify far more data blocks than the increment itself would fill.

But if the ORDERS table is partitioned, the physical order of the records is primarily determined by the partitioning key. This means that the increment will reside in very few partitions, close together, and the insert operation has to write approximately only as many blocks as the increment itself occupies.

For more details on PPIs please refer to the documentation of the appropriate Teradata version.

To be continued…

DBQL analysis I. – Monitor “Top CPU consumers”

DBQL analysis

About DBQL

What is it?

DataBase Query Logging.

It is a nice feature of the Teradata RDBMS, which comprehensively logs the execution of issued queries – if it is switched on.
The configuration can be checked/administered e.g. in the Teradata administration tools or from DBC.DBQLRuleTbl.
Logging can be set at global/user level, and with respect to the level of detail (see the DBQL tables below).
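
As an illustration, a typical way of switching comprehensive logging on for all users looks something like the statement below (a sketch only; adjust the option list and limits to your needs and version, and see the notes on XML and Summary in the “Cost of using DBQL” section):

BEGIN QUERY LOGGING WITH SQL, EXPLAIN, OBJECTS, STEPINFO LIMIT SQLTEXT=0 ON ALL;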

For detailed information please refer to the Teradata documentation of your version.

DBQL tables

  • DBQLogTbl: central table, one record for each query.
  • DBQLSQLTbl: the whole SQL command, broken into 30k-character blocks
  • DBQLStepTbl: execution steps of the query, one row for each step
  • DBQLObjTbl: objects participating in the query, logged at different levels (database, table, column, index, etc.)
  • DBQLExplainTbl: English explain text, broken into 30k-character blocks
  • DBQLXMLTbl: explain in XML format, broken into 30k-character blocks
  • DBQLSummaryTbl: the PEs’ aggregated table, which accounts at the desired level.

The DBQL tables are logically organized into a 1:N structure, where DBQLogTbl is the master entity and the others (except DBQLSummaryTbl) are the children.
The join fields are ProcID and QueryID together, e.g.:

...
from 
    DBQLogTbl a
    join DBQLStepTbl b on 
        a.ProcID = b.ProcID 
        and 
        a.QueryID = b.QueryID
...

Unfortunately the PIs of the DBQL tables are not in sync with the logical PK-FK relations (not even in the latest V14.10), therefore JOINed selects against the online DBQL tables are not optimal.

Cost of using DBQL

DBQL basically consumes a negligible amount of processing resources, since it uses cached and batched writes and generates data in proportion to the issued queries (the flush rate is a DBSControl parameter).
It is important to regularly purge/archive the logs from the DBC tables; Teradata has a recommendation for this. It ensures that the PERM space consumption of DBQL remains low.
In an environment where ~1M SQLs are issued a day, comprehensive logging generates ~8..10G of DBQL data daily w/o XML and Summary. Fewer SQLs generate proportionally less data.

It is worth switching on all options except XML and Summary, since the first generates a huge data volume (it roughly doubles it), and the second is similar to the accounting (Acctg) info. If you want to utilize them, they should be switched on, of course.

What is it good for?

It contains:

  • Query run time, duration
  • Consumed resources
  • Environment info (user, default db, etc)
  • SQL text
  • Explain
  • Step resource info
  • Objects involved
  • Etc.

One can extract a lot of useful aggregated and query-specific tuning information from it; I will share some of this in the blog.

CPU usage distribution info

(Everything applies to I/O also, just replace CPU with I/O, AMPCPUTime with TotalIOCount…)

Do you think Query optimization is rewarding?

Yes, I know it is hard work to find out why ONE query runs sub-optimally, and what to do about it.

But take a guess: how many queries consume what percentage of the processing resources (CPU) within a whole day’s workload?
Write down your tip for each CPU% level: 5%, 10%, 25% and 50%.

And now run the query below, which will give you the answer (replace the date value; you may also have to adjust the date filtering to your local settings).

select 
    'How many queries?' as "_"
    ,min(limit5) as "TOP5%CPU"
    ,min(limit10) as "TOP10%CPU"
    ,min(limit25) as "TOP25%CPU"
    ,min(limit50) as "TOP50%CPU" 
    ,max(rnk) as TotalQueries
from
(
    select
        case when CPURatio < 5.00 then null else rnk end as limit5
        ,case when CPURatio < 10.00 then null else rnk end as limit10
        ,case when CPURatio < 25.00 then null else rnk end as limit25
        ,case when CPURatio < 50.00 then null else rnk end as limit50
        ,rnk
    from
    (
        select
            sum(ampcputime) over (order by ampcputime desc) as totalCPU
            ,sum(ampcputime) over (
                order by ampcputime desc  
                rows unbounded preceding
             ) as subtotalCPU
            ,subtotalCPU * 100.00 / totalCPU as CPUratio
            ,sum(1) over (
                order by ampcputime desc 
                rows unbounded preceding
             ) as rnk
        from
        (
            select *
            /* For archived DBQL
            from dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131201 
                and 
                ampcputime > 0 
            */
            /* For online DBQL*/
            from dbc.dbqlogtbl 
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-10' (date)
                and 
                ampcputime > 0
        ) x
    ) y
) z
group by 1;

Are you surprised?
I bet:

  • Less than 10 queries will consume 5% of the CPU
  • Less than 1% of the queries will consume 50% of the CPU

Let’s calculate.
How much does your Teradata system cost a year? It is all paid for storage and processing capacity.
If you can save e.g. X% of CPU & I/O and X% of storage through optimization (query tuning, MVC, etc.), you have saved X% of the price of the Teradata system, by:

  • Improved user experience (earlier load, faster responses)
  • Resources for additional reports and applications
  • Postponement of a very expensive Teradata hardware upgrade

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.