Avoiding Product Joins

How to eliminate Product Joins

What is product join?

Product join is one of the implementation methods of an SQL JOIN operation.
Do not confuse it with the cross join (Cartesian product), which is one of the SQL join types.

SQL join types, e.g.: inner join, left outer join, full outer join, cross (Cartesian) join
Join implementation types, e.g.: nested join, merge join, hash join, product join.

Product join (of tables A and B) is the simplest join implementation method:

  • Produce every <A;B> record combination: take each record from A and match it with each record of B one by one.
  • Test the join condition on each produced <A;B> record pair, and eliminate the combinations where the condition fails.
The two steps are usually combined: the “testing phase” is executed right after a record combination is generated, and invalid combinations are dropped immediately. This saves a lot of temp space.

Why don’t we like product joins?

Well, it has a really bad reputation: it is slow, eats CPU, etc.
Yes, it usually is and does. It is the brute-force method of executing a join, with a cost in the order of N*M (where N and M are the record counts of the joined tables).

Still, there are situations when it is the best choice, or the only feasible way.

When is it good or necessary?

Please note that product join is the method that is always applicable, regardless of the circumstances.

Good

Product join is typically a simple, dumb and slow algorithm, which is why we do not like it, but it has a very important advantage: it requires no pre-processing.* This is why we LIKE IT :)
If we have to join a really large table to a very small one (a couple of records), product join is by far the most effective method, since sorting a very large table (order of N*logN) can cost a lot, while joining each record to 1-2 records is really not a big deal.

Necessary

There are join situations where the only way to go is the product join. Why? Because of the join condition. The “clever” joins (merge, hash) require some information and/or condition that makes it possible to avoid the full A x B comparison: reduce it to the comparisons that are really necessary, and execute them in a more effective manner.

* OK, in Teradata this only means that the matchable records from both tables must be on the same AMP. This implies that the “small” table is duplicated to all AMPs.

Merge join example


from A
join  B on A.customer_id = B.customer_id
and A.trx_dt between B.eff_dt and B.exp_dt

  • The customer_id clause is AND-ed with the other conditions
  • customer_id is selective enough that hash(customer_id) can reduce the comparisons reasonably
  • Note that A and B must be sorted (or re-sorted) by the hash of customer_id

Product join example


from A
join   B on substr(A.telephone_no,1,B.prefix_length) = B.telephone_no_prefix

  • There is no comparison-reducing partial condition
  • Note that neither table is required to be sorted in a specific order.

Unavoidable product joins

  • Non-equality condition
  • Function used (eg. substr())
  • Dependent expression is used (eg. A.x+B.y = A.z)
  • Cross join: intentional Cartesian product

Avoidable product joins

Data type mismatch

The merge join example above works only if customer_id in tables A and B has the same “style” of data type, since their hash values will match only in this case. For instance hash(13674) <> hash(‘13674’); however, integer is compatible with decimal, and char is compatible with varchar.
Pay attention to data type consistency during physical data modeling.
  • Use domains to eliminate the possibility of mismatch
  • Align to the used data types when defining temp tables, or use “create table as …” statements
  • If you cannot avoid the mismatch, relocate the necessary data to temp tables with proper data types during processing (see the sketch below).
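A minimal sketch of the last option, assuming the A/B example above and assuming B.customer_id is stored as VARCHAR while A.customer_id is INTEGER: stage B into a volatile table with the matching data type, so the hash-based join becomes possible again.

CREATE VOLATILE TABLE B_fixed AS
(
  SELECT CAST(customer_id AS INTEGER) AS customer_id
       , eff_dt
       , exp_dt
  FROM B
) WITH DATA
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

select ...
from A
join  B_fixed b on A.customer_id = b.customer_id
               and A.trx_dt between b.eff_dt and b.exp_dt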

OR condition

Let’s assume the following join condition:
select ...
from A
join  B on A.col1 = B.Col1
        or A.Col2 = B.Col2

This is equivalent to the following, without the compulsory product join:

select ...
from A
join  B on A.col1 = B.Col1

UNION

select ...
from A
join  B on A.Col2 = B.Col2

Missing/stale statistics

As I mentioned before, product join is the most effective join between a very large and a really small (couple of records) table. If the optimizer thinks that a table is pretty small but it is actually not, it may choose a product join in good faith, misled by stale or missing statistics.
Define and keep fresh those statistics by which the optimizer can determine the size of the joinable record sets properly.
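A minimal example, based on the A/B tables of the merge join example above: refreshing statistics on the join columns gives the optimizer proper cardinality information.

collect statistics on A column (customer_id);
collect statistics on B column (customer_id);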

How to find avoidable product joins

It is not trivial to list the avoidable product joins. Practically all product joins have to be examined one by one and judged to be avoidable or not, and if avoidable, what to do about them.

I strongly recommend using PRISE Tuning Assistant both for finding the product joins and for analyzing the possibility and necessity of elimination:

  • List top consuming queries with product join(s)
  • Check the PROD JOIN steps: which tables are processed that way
  • Check those join conditions for cases described above

What to do if it cannot be avoided?

In this case I recommend trying the decomposition technique described here.
It can help reduce the number of comparisons, saving CPU and runtime.

 
Have a successful optimization!
 

Teradata performance optimization project
I. Basics and prerequisites

Teradata performance optimization

Introduction

The Teradata database is a pretty expensive box, which serves a lot of people in the company – in a direct or indirect way. The efficient utilization of its resources is essential, from both financial and user satisfaction aspects.


This series of posts will follow a real-life Teradata optimization project’s methodology, steps and results, with this achievement:

System level gain: 31.6%

Levels of performance optimization

We can typically apply 3 levels of Teradata performance optimization:

Application

Meaning: Analyze user requests and system activity, then refine them, eliminating unnecessary activities.
This level checks whether the executed processes are really needed and whether they do what the user really needs. E.g.: eliminate unnecessary data loads or reports; limit a report to a shorter period.

  • Big gains possible w/o any trade-off
  • Requires comprehensive analysis (incl. finding and interviewing the appropriate persons)

SQL

Meaning: Pick individual SQLs and optimize their execution.
This approach finds the top resource consuming SQL statements and reduces their resource consumption while producing the same results.
E.g.: Change PI; add/refresh statistics; rewrite SQL

  • Easy analysis
  • Low risk (small impact, good testing possibilities)
  • Needs SQL level analysis

Infrastructure

Meaning: Tune low level system (RDBMS, OS, Hardware) parameters
Tuning on infrastructure level is a global intervention regardless of users/business areas or importance

  • Miracle in some special cases
  • High risk (affects the whole system)
  • Complex cross-dependencies

In the beginning I prefer and recommend SQL level optimization, because it delivers high value along with low risk and good project scalability. Later on, the other levels can come into focus.

Goal of SQL optimization

The goals of an optimization project can be different. Some examples:
“Finish load procedure until 6am”
“Present my report 2 hours earlier”
“Reduce CPU load by 20%”
“Let the hardware extension be postponed by 6 months”

The goal of our project was to speed up the “load batch” process, so that a specific, identified data area is loaded and available for further usage 1 hour earlier than before.

Prerequisites

The key input for an SQL optimization is the information about what exactly happens:

What, when, how:

Which SQL statements ran, when, what steps were executed, how much resource did they consume, with what skewness, etc.
All these can be found in the DBQL tables of Teradata (dbc.dbql*)
Switching them on will not cause significant extra load, but they must be archived/purged regularly (see the PDCR tool in the Teradata package).
Teradata Administrator enables switching logging on (Tools/Query logging…), and one can check the actual settings:

select * from dbc.dbqlruletbl;
The Explain/Obj/Sql/Step flags should be set to ‘T’, the others are not necessary. Global logging (1 record with UserID: 00000000) is recommended.
If continuous logging is not possible, logging should be switched on 1 week before starting the optimization project.
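Alternatively, logging can be switched on with SQL as well; a hedged example that matches the recommended flags above (check your site’s logging/PDCR standards before changing the rules):

BEGIN QUERY LOGGING WITH SQL, EXPLAIN, OBJECTS, STEPINFO ON ALL;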

What / Who runs the statements?

Teradata provides a great feature for this: Query Banding.
All executed SQL statements can be flagged with several pieces of information regarding the environment, job, script, version etc., which helps identify a specific one.
The queryband data will appear in the DBQL logs – if the running environment is configured to flag the queries being run.
Querybanding is not essential for an optimization, but it helps a lot in identifying the exact point in the data warehouse ecosystem (which job/jobstep/script) where the modifications should be done.
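A minimal illustration (hypothetical name/value pairs) of how a load script can flag its queries for the session:

SET QUERY_BAND = 'JobName=daily_load;ScriptName=load_customer.sql;Version=1.0;' FOR SESSION;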

What are the environmental conditions?

E.g. data volumes, statistics, etc.
This information is available in the data dictionary (DBC tables and views), but only as a current snapshot – not as of the time the statement ran.
In most cases this is OK for us, but this property needs to be kept in mind.

When all these are in place, we are ready to run a successful performance optimization project.

The next post is about how to identify the good candidate SQL statements for optimization.

Stay tuned!

How to speed up Teradata Fastexport

About Teradata Fastexport

Teradata Fastexport is a nice, powerful tool to export mass volumes of data to file. Its default execution flow is the following:

  1. Applies locks on the affected tables
  2. Executes the SELECT statement
  3. Places the result in a SPOOL
  4. Releases the locks
  5. Exports data from the SPOOL

It has an advantage: the locks are released immediately after the result data is copied into the SPOOL. Therefore there is no need to wait until the data is written to the export file outside the database, which is typically a single-process operation and can take a long time.

The price to be paid for this advantage: all the exported data must first be copied into the SPOOL. When the SELECT is a simple operation (no join, sort or other extras), this copy phase could be skipped, and it can approach 50% of the total resource and time consumption.

The NOSPOOL trick

For these simple cases one can use the NOSPOOL/NOSPOOLONLY mode.
While NOSPOOL is a soft switch, which is ignored if not applicable, NOSPOOLONLY makes the statement fail if the nospool option is not available for the given query.

In NOSPOOL mode the database does not make a copy of the result data. Instead it passes the data directly to the client, which has another benefit: the execution of the SELECT runs dominantly in parallel with the file write operation, instead of sequentially as in the SPOOL (default) case.

Usage:
.BEGIN EXPORT
SESSIONS …
[SPOOLMODE SPOOL|NOSPOOL|NOSPOOLONLY];
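A hedged sketch of a complete Fastexport script using NOSPOOL (hypothetical logon data, file name and table):

.LOGTABLE utillog.fexp_log;
.LOGON tdpid/username,password;
.BEGIN EXPORT SESSIONS 8 SPOOLMODE NOSPOOL;
.EXPORT OUTFILE customer_dump.dat MODE RECORD;
SELECT Customer_id, Customer_name
FROM Customer;
.END EXPORT;
.LOGOFF;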

The tradeoff

NOSPOOL mode supports only “simple” SELECT statements. Operations that are not supported:

  • SORT (ORDER BY, HAVING)
  • JOIN
  • Aggregations (GROUP BY, SUM())
  • OLAP functions
  • Non-table data (date, session, user)
  • Multiple statements

But allows:

  • Sampling
  • Column selection
  • Scalar functions (like col1 + col2, substr())
  • “Where” clause (incl. partition elimination)

The disadvantages of NOSPOOL mode:

  • No ActivityCount is available
  • Locks are maintained until the end of the export session
  • Data conversion errors can happen during the exporting phase
  • (Only available from V13.10)

So what is it really good for?
It can almost double the throughput in the case of:

  • DataMart export (to another database)
  • Archive dumps
  • Copy data to test environment
  • Migration to other database systems

Measures

We executed a Teradata to Netezza migration, where we had to transport ~12 TBytes of data through a file interface. After we activated the NOSPOOL option, the average export speed grew by 84%.

The hidden product join

A hiding enemy: the hidden “product” join

Introduction

When we check query performance, one of the first things we search for is the PRODUCT JOIN. Usually this is the component we hate and pursue, because we believe it is the root of the problem. That is not always true, but that is a different topic.
Sometimes we find a merge join that is still very slow, even though there is no skewness and the PIs seem OK.

So the question is: how can a simple merge join be so pathetically slow sometimes?

What can be the problem?

The real evil in this case is – my favourite – hash collision.
Let’s imagine the tables below (the transactions’ columns are split into two tables):

CREATE TABLE TRX_A  (
  User_id INTEGER,
  User_trx_seq INTEGER,
  ColA...,
  ColB...,
  ...
)
PRIMARY INDEX (User_id);

CREATE TABLE TRX_B  (
  User_id INTEGER,
  User_trx_seq INTEGER,
  ColC...,
  ColD...,
  ...
)
PRIMARY INDEX (User_id);


In our cited case the individual users produce a generally equal number of transactions, so table skewness is not experienced. The INSERTs into these tables are not that fast, but the daily increment is not that big, so it causes no outstanding problem – in our imagined case.
We have all the required statistics as well.

Let’s take this query:

SELECT
  …
  …
FROM TRX_A a
JOIN TRX_B   b on a.User_id = b.User_id and a.User_trx_seq = b.User_trx_seq

How does the optimizer think?
The distribution of the tables is excellent: both have the same PI, and that column is part of an “AND”-ed join condition. The matchable records reside on the same AMP, so the only thing to do is to evaluate the “residual” condition: a.User_trx_seq = b.User_trx_seq.
However it is not that simple.
In the beginning each user has few transactions, e.g. 10.
What does the system do? Each record of User1 in the TRX_A table must be compared with 10 records from TRX_B. That is 100 comparisons.
What if each user has 1000 transactions? 1 million comparisons for each user!!
The growth is in the order of X^2, where X is the number of records per PI (more exactly, per PI hash) value.
In these cases the number of records per PI value grows linearly or faster as time goes by, which means longer and longer execution times.

Please note that the total time required for the join step is determined by the maximum number of records per PI value, since all those records fall on the same AMP, and that AMP will be the slowest.

 

What is the solution?

We can experience this problem in two cases:

  • The sources are regular tables of our data model
  • At least one of the sources is a Spool during query execution

In the first case we have to think about changing the PI, if we can (to <User_id,User_trx_seq> in our case). This will speed up the INSERT procedures, as well as the joined DELETE and UPDATE operations. The disadvantage is that the simple User_id based joins will degrade: they will probably require redistribution of our TRX_x tables. The User_id based filters will turn into full-table scans instead of PI access, which can be handled by secondary or join indices.
Another option is to do a pre-fetch from the TRX_A and TRX_B tables into volatile tables with PI (User_id,User_trx_seq). But note that this may be expensive if we cannot apply a pretty selective filter, since it means redistribution of the tables.

In the second case we may apply the method of storing the partial result: put the Spool into a volatile table with the appropriate PI (User_id,User_trx_seq), which will hopefully force the other table/spool to be redistributed to this hash-collision-free distribution.
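A hedged sketch of the second case, reusing the column names from the example above (the inner SELECT stands for whatever produced the original spool):

CREATE VOLATILE TABLE VT_PART AS
(
  SELECT User_id, User_trx_seq     /* plus the columns the subsequent join needs           */
  FROM TRX_A                       /* plus the filters/joins that built the original spool */
) WITH DATA
PRIMARY INDEX (User_id, User_trx_seq)
ON COMMIT PRESERVE ROWS;

COLLECT STATISTICS ON VT_PART COLUMN (User_id, User_trx_seq);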

 

How can we catch those guys?

Unfortunately the “explain plan” will not have a blinking LED to highlight this situation; we need experience to suspect that something is wrong. The main indicator is the cost/input ratio. Check how many records are in the bigger source table and compare that with the step’s impact CPU, while keeping an eye on the sources’ and the target’s skew. If the skews are low, but the CPU (or I/O) is unreasonably high, you have probably got it!

Does it seem complicated? Maybe, but it is quite simple using PRISE Tuning Assistant:

[Screenshot: finding the hidden product join in PRISE Tuning Assistant]

Using Join Index

How to use Join Index in Teradata

About Join Index (JI)

A join index is much like a “materialized view”: it is the stored result of an SQL SELECT, similar to a table, and you can define the primary index (PI) of the stored result.

What are the main differences between a JI and a Secondary Index?

  • Different internal structure and purposes
  • Appears as a separate object (not a sub-table) in the DBC tables/views
  • Can reside in a different database than the base table(s)
  • Maintenance is logged as a separate DBQL STEP, therefore its cost can be easily measured

What is the difference between a JI and a regular table in which I could store the same query result?

  • You can’t access it directly from an SQL query
  • The Optimizer automatically decides whether to use it or not:
    • By logic: if the JI can support the query with its data content
    • By cost: if using the JI results in a “cheaper” execution (always collect statistics!)
  • It is maintained automatically if the content of the base table(s) changes
  • It “locks” the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTER of the indexed column(s).

JI types

The join index types I list below are not differentiated by SQL phrase, but by the structure of the SQL SELECT used in the JI definition.

They can also be combined in reasonable ways, e.g. <single table – aggregate – sparse> or <multi table – aggregate>, etc.

Let’s take these base tables for our examples:
CREATE TABLE TBL_A
(
  Col1 integer
, Col2 integer
, Col3 integer
)
PRIMARY INDEX (Col1)
;
CREATE TABLE TBL_B
(
  Col1 integer
, Col2 integer
)
PRIMARY INDEX (Col1)
;

Single table JI

This is the simplest case of a join index. We include only one table, and typically choose a different PI than the base table has. There are two significantly different kinds of usage:

  • Non-covering
    We select only the filtering column(s) (those will also be the PI of the JI) and the “ROWID” pseudo column in the JI definition. This is used when the filter is strongly selective: the rowids are put into a spool and joined back to the base table’s appropriate records. The JI can be very small this way, but note that we get an additional “join back” phase.
  • Covering
    The JI selects all columns (or all columns required by the SQL to be supported). That means the base table is not needed at all to satisfy the query, which results in a very fast operation. This case is typically used for eliminating frequent redistributions of some “central” tables.
This example shows a non-covering join index for the query below it:
create join index JI_1
as
SELECT Col2, ROWID
FROM TBL_A
PRIMARY INDEX(Col2)
;

select Col3 from TBL_A where Col2=1;

Multi table JI

This kind of JI accelerates frequent join statements. Technically it stores the result of a join. It can cover all the columns, or store just the key pairs, or anything in between.

create join index JI_2
as
SELECT a.Col3, b.Col1
FROM TBL_A a
join TBL_B b on a.Col2 = b.Col2
PRIMARY INDEX(Col3)
;

Aggregate JI

The JI’s SELECT contains a GROUP BY clause. This case is for caching frequent aggregations. It can be very useful for supporting OLAP applications that have no internal aggregation-caching methods. Teradata’s optimizer is quite clever: it can recognize “intermediate” aggregate JIs for further aggregation instead of using the base table. Example:

create join index JI_3
as
SELECT Col1, Col2, sum(Col3) X
FROM TBL_A
GROUP BY 1,2
PRIMARY INDEX(Col2)
;

All three SELECTs below can be served from JI_3:
SQL1: select Col1,       sum(Col3) from TBL_A group by 1;
SQL2: select Col2,       sum(Col3) from TBL_A group by 1;
SQL3: select Col1, Col2, sum(Col3) from TBL_A group by 1,2;

Sparse JI

These JIs contain a WHERE clause, so the indexing does not cover the whole table. If the WHERE condition of the JI is logically implied by the supported SQL’s WHERE condition, the JI can support the query.
Typical usage is on transactional tables: we frequently access a transaction table by Customer_id, but its PI is Trx_id. We observe that 90% of the SELECTs fall onto the last 30 days.
We can put a sparse, non-covering, single-table JI on the table with PI: Customer_id.
90% of the selects will finish in no time, and 10% of the queries will result in a full table scan, while our JI size remains pretty small: we index only 30 days instead of e.g. 365 days.
Please note that the WHERE condition of a sparse JI cannot contain “changing” values like current_date. That means the JI must be regularly redefined if a moving-window sparse JI is to be maintained.

create join index JI_4
as
SELECT Col1, Col2, Col3
FROM TBL_A
WHERE Col3 between 1 and 10
PRIMARY INDEX(Col2)
;

A query that can be served:
select Col1 from TBL_A where Col2=20 and Col3=4;
A query that cannot be served:
select Col1 from TBL_A where Col2=20 and Col3=11;
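Applying the same idea to the transaction-table scenario described above, a moving-window sparse JI could look like this hedged sketch (hypothetical table and column names; the date literals have to be rolled forward by a scheduled re-create, since current_date is not allowed):

create join index JI_TRX_LAST30
as
SELECT Customer_id, ROWID
FROM TRX
WHERE Trx_dt between DATE '2013-11-19' and DATE '2013-12-18'
PRIMARY INDEX (Customer_id)
;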

When will join index help?

Join indices typically serve two purposes (or the combination of them):

  • Store frequently produced (partial) results
  • Provide alternative access path to the data
Practical examples:
  • Frequent aggregation or join operation
  • Frequent redistribution of a table
  • Non primary index based filtering on a table (causing full table scan)
  • Non primary index based join of a table (causing redistribution)

Important things about JIs

Statistics

Always collect statistics on the PI of your JIs; it is essential for pushing the optimizer to use them.
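For the example indices above, that means something like:

collect statistics on JI_1 column (Col2);
collect statistics on JI_2 column (Col3);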

Choosing PI

The PI of the JI will be the most effective access path to the JI’s records. As with tables, use the regular PI choosing methodology: mind the frequent access paths (by which column(s) it is filtered or joined), the distribution (skewness) and hash collisions.

DBQL analysis II. – Skewness – The big resource robbery detection

The big resource robbery detection

Skewness

Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.

 

About Skew

If you are not comfortably familiar with skewness or DBQL, please read the corresponding posts first.

As you can see the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.

Skewness is a good indicator for highlighting workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.

Find “Top skewed queries” in DBQL

With this SQL you can list the top impacting queries (replace the date value; you may have to adjust the date filtering according to local settings):

select top 50
      ProcID
    , QueryID
    , AMPCPUTime
    , MaxAMPCPUTime * (hashamp () + 1) CPUImpact
    , CAST (100 - ((AmpCPUTime / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPCPUTime)) AS INTEGER) "CPUSkew%"
    , TotalIOCount
    , MaxAMPIO * (hashamp () + 1) IOImpact
    , CAST (100 - ((TotalIOCount / (hashamp () + 1)) * 100 / NULLIFZERO (MaxAMPIO) ) AS INTEGER) "IOSkew%"
    , AMPCPUTime * 1000 / nullifzero (TotalIOCount) LHR
    , TotalIOCount / nullifzero (AMPCPUTime * 1000) RHL
    , ParserCPUTime
    , Queryband
    , Substr(QueryText,1,2000) QueryText
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst 
where
    logdate=1131201
    and ampcputime > 0
*/
/* For online DBQL*/
    dbc.dbqlogtbl 
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and ampcputime>0
order by 
    CPUImpact desc

Explanation of the extra fields:

  • ParserCPUTime: The time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random AMP sampling has to be done.
  • LHR/RHL: Larry Higa (and inverse Larry Higa) index. An empirical index that shows the CPU vs I/O rate. By experience it should usually be around one (the exact value depends on your system configuration, but it is roughly constant per system). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but that is a different dimension than skew.
  • QueryBand: Labels that sessions use to identify themselves within the DBQL logs.
  • QueryText: The first 200 characters of the query (depending on DBQL log settings).

OK, we’ve listed the terrible top consumers, but what’s next?
We have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post…), you can find out which job or report issued that SQL; in any case, you can see the QueryText field.

If you want to get the full SQL text, select it from the DBQLSQLTbl (SQL logging needs to be switched on), replace the appropriate <procid> and <queryid> values:

select 
    SQLTextInfo 
from 
    dbc.dbqlsqltbl 
where 
    procid = <procid> 
    and queryid = <queryid> 
order by 
    SQLRowNo asc

You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will come with very ugly formatting; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.

System level Skewness

Totals

We have found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?

The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log AMP-level information for the query execution, only the most important metrics.

We cannot account for the situation when several skewed queries run at the same time but have their peaks on different AMPs. This reduces the system level resource waste, but it is hard to calculate with; however, its probability and effect are negligible now.

select
      sum(AMPCPUTime) AMPCPUTimeSum
    , sum(MaxAMPCPUTime * (hashamp () + 1)) CPUImpactSum
    , sum(TotalIOCount) TotalIOCountSum
    , sum(MaxAMPIO * (hashamp () + 1)) IOImpactSum
    , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
    , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
    dbql_arch.dbqlogtbl_hst
where 
    logdate = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)
*/
/* For online DBQL*/
    dbc.dbqlogtbl
where
    cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
    and (ampcputime > 0 or TotalIOCount > 0)

Look at the last two columns. That percent of your CPU and I/O goes to the sink…

Top bad guys

OK, let’s check how many queries accumulate 5%, 10%, 25%, 50%, 75% and 90% of this loss.
Here you are (CPU version; transform it to I/O analogously):

select 
      'How many queries?' as "_"
    , min(limit5) "TOP5%Loss"
    , min(limit10) "TOP10%Loss"
    , min(limit25) "TOP25%Loss"
    , min(limit50) "TOP50%Loss"
    , min(limit75) "TOP75%Loss"
    , min(limit90) "TOP90%Loss"
    , max(rnk) TotalQueries
    , sum(ResourceTotal) "TotalResource"
    , sum(ResourceImpact) "ImpactResource"
from
(
    select
         case when ResRatio < 5.00 then null else rnk end limit5
        ,case when ResRatio < 10.00 then null else rnk end limit10
        ,case when ResRatio < 25.00 then null else rnk end limit25
        ,case when ResRatio < 50.00 then null else rnk end limit50
        ,case when ResRatio < 75.00 then null else rnk end limit75
        ,case when ResRatio < 90.00 then null else rnk end limit90
        ,rnk
        , ResourceTotal
        , ResourceImpact
    from
    (
        select
              sum(ResourceLoss) over (order by ResourceLoss desc ) totalRes
            , sum(ResourceLoss) over (order by ResourceLoss desc  rows unbounded preceding) subtotalRes
            , subtotalRes *100.00 / totalRes Resratio
            , sum(1) over (order by ResourceLoss desc  rows unbounded preceding) rnk
            , ResourceTotal
            , ResourceImpact
        from
        (
            select
                AMPCPUTime ResourceTotal
                , (MaxAMPCPUTime * (hashamp () + 1)) ResourceImpact
                , ResourceImpact - ResourceTotal ResourceLoss
            /* For archived DBQL
            from 
                dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131207
                and ampcputime > 0
            */
            /* For online DBQL*/
            from 
                dbc.dbqlogtbl
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-18' (date)
                and ampcputime > 0
        ) x
    ) y
) z
group by 1

I expect you are a bit shocked now at how few queries waste how much golden resource.

I think we will agree that it is worth tuning those dozen queries,
and you can save on the order of $100K..M USD for your company annually, am I right?

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

Typical reasons of skewness – in a nutshell

  • Skewed tables: Bad choice of PI, Skewed data 
  • Bad execution plans (typically skewed redistributions)
    • Bad data model (normalization,data types,PI, etc.)
    • Missing or stale statistics
    • Too many joins (break up the query!)
  • Hash collision (load time problem)

DBQL analysis I. – Monitor “Top CPU consumers”

DBQL analysis

About DBQL

What is it?

DataBase Query Logging.

It is a nice feature of the Teradata RDBMS, which comprehensively logs the execution of the issued queries – if it is switched on.
The configuration can be checked/administered e.g. in the Teradata tools or from DBC.DBQLRuleTbl.
Logging can be set on global/user level, and with respect to the level of detail (see DBQL tables).

For detailed information please refer to the Teradata documentation of your version.

DBQL tables

  • DBQLogTbl: Central table, 1 record for each query.
  • DBQLSQLTbl: The whole SQL command, broken up into 30k blocks
  • DBQLStepTbl: Execution steps of the query, one row for each step.
  • DBQLObjTbl: Objects participating in the query. Logged on different levels (db, table, column, index, etc.)
  • DBQLExplainTbl: English explain text, broken up into 30k blocks
  • DBQLXMLTbl: Explain in XML format, broken up into 30k blocks
  • DBQLSummaryTbl: The PEs’ aggregated table, which accounts on the desired level.

The DBQL tables are logically organized into a 1:N structure, where DBQLogTbl is the master entity and the others (except DBQLSummaryTbl) are the children.
The join fields are ProcID and QueryID together, e.g.:

...
from 
    DBQLogTbl a
    join DBQLStepTbl b on 
        a.ProcID = b.ProcID 
        and 
        a.QueryID = b.QueryID
...

Unfortunately the PI of the DBQL tables is not in sync with the logical PK-FK relation (even in the latest V14.10), therefore joined selects against the online DBQL tables are not optimal.

Cost of using DBQL

DBQL basically consumes a negligible amount of processing resources, since it uses cached and batched writes and generates data proportional to the issued queries (the flush rate is a DBSControl parameter).
It is important to regularly purge/archive the logs from the DBC tables; Teradata has a recommendation for this. It ensures that the PERM space consumption of DBQL remains low.
In an environment where ~1M SQLs are issued a day, comprehensive logging generates ~8..10 GB of DBQL data daily without XML and Summary. Fewer SQLs generate proportionally less data.

It is worth switching on all options except XML and Summary, since the former generates a huge data volume (roughly doubling it), and the latter is similar to the Acctg info. If you want to utilize them, they should of course be switched on.

What is it good for?

It contains:

  • Query run time, duration
  • Consumed resources
  • Environment info (user, default db, etc)
  • SQL text
  • Explain
  • Step resource info
  • Objects involved
  • Etc.

One can get a lot of useful aggregated and query-specific tuning information from it; I will share some of it in this blog.

CPU usage distribution info

(Everything applies to I/O also, just replace CPU with I/O, AMPCPUTime with TotalIOCount…)

Do you think Query optimization is rewarding?

Yes, I know it is hard work to find out why ONE query runs sub-optimally, and what to do with it.

But guess how many queries consume how many percent of the processing resources (CPU) within a whole day’s workload.
Take a guess and write it down for these CPU% levels: 5%, 10%, 25% and 50%.

And now run the query below, which gives you the answer (replace the date value; you may have to adjust the date filtering according to local settings):

select 
    'How many queries?' as "_"
    ,min(limit5) as "TOP5%CPU"
    ,min(limit10) as "TOP10%CPU"
    ,min(limit25) as "TOP25%CPU"
    ,min(limit50) as "TOP50%CPU" 
    ,max(rnk) as TotalQueries
from
(
    select
        case when CPURatio < 5.00 then null else rnk end as limit5
        ,case when CPURatio < 10.00 then null else rnk end as limit10
        ,case when CPURatio < 25.00 then null else rnk end as limit25
        ,case when CPURatio < 50.00 then null else rnk end as limit50
        ,rnk
    from
    (
        select
            sum(ampcputime) over (order by ampcputime desc) as totalCPU
            ,sum(ampcputime) over (
                order by ampcputime desc  
                rows unbounded preceding
             ) as subtotalCPU
            ,subtotalCPU * 100.00 / totalCPU as CPUratio
            ,sum(1) over (
                order by ampcputime desc 
                rows unbounded preceding
             ) as rnk
        from
        (
            select *
            /* For archived DBQL
            from dbql_arch.dbqlogtbl_hst 
            where 
                logdate=1131201 
                and 
                ampcputime > 0 
            */
            /* For online DBQL*/
            from dbc.dbqlogtbl 
            where
                cast(cast(starttime as char(10)) as date) = '2013-12-10' (date)
                and 
                ampcputime > 0
        ) x
    ) y
) z
group by 1;

Are you surprised?
I bet:

  • Less than 10 queries will consume 5% of the CPU
  • Less than 1% of the queries will consume 50% of the CPU

Let’s calculate.
How much does your Teradata system cost a year? That money all goes for storage and processing capacity.
If you can save e.g. X% of CPU & I/O and X% of storage using MVC optimization, you have saved X% of the price of the Teradata system, gaining:

  • Improved user experience (earlier load, faster responses)
  • Resources for additional reports and applications
  • Enable postponing a very expensive Teradata hardware upgrade

PRISE Tuning Assistant helps you to find those queries and to get the hang of how to accelerate them.

 
 

Accelerate skewed joins

How to “re-parallelize” skewed joins

Case description

Assume that we have 1M customers and 4M transactions, and our top customer produces 2.5% of all transactions. The others produce the remaining 97.5% of the transactions approximately evenly.
Scroll down to the bottom of the post for sample table and data generator SQL.

Our task is to join a “Customer” and a “Transaction” tables on Customer_id.

The join

SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
GROUP BY 1;


We experience a pretty slow execution.
On the ViewPoint we see that only one AMP is working, while others are not.

What is the problem?
There are two separate subsets of the Transact table from a “joinability” point of view:

  • “Peak” part (records of the top customer(s))
    Very few customers have very many Transact records. A product join would be cost effective here.
  • “Even” part (records of the other customers)
    Many customers have many, but individually few and evenly distributed Transact records. A merge join would be ideal here.

Unfortunately the Optimizer has to decide, and only one operation type can be chosen. It will choose the merge join, which consumes far less CPU time.

Execution plan looks like this:

 This query is optimized using type 2 profile T2_Linux64, profileid 21.
  1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.t.
  2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.c.
  3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
  4) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
     all-rows scan with a condition of (“NOT (D_DB_TMP.t.Customer_ID IS
     NULL)”) into Spool 4 (all_amps), which is redistributed by the
     hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  Then we do a
     SORT to order Spool 4 by row hash.  The size of Spool 4 is
     estimated with low confidence to be 125 rows (2,125 bytes).  The
     estimated time for this step is 0.01 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of a
     RowHash match scan, which is joined to D_DB_TMP.c by way of a
     RowHash match scan.  Spool 4 and D_DB_TMP.c are joined using a
     merge join, with a join condition of (“D_DB_TMP.c.Customer_ID =
     Customer_ID”).  The result goes into Spool 3 (all_amps), which is
     built locally on the AMPs.  The size of Spool 3 is estimated with
     index join confidence to be 125 rows (10,375 bytes).  The
     estimated time for this step is 0.02 seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 5.  The size of Spool 5 is
     estimated with no confidence to be 94 rows (14,758 bytes).  The
     estimated time for this step is 0.02 seconds.
  7) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by way of
     an all-rows scan into Spool 1 (all_amps), which is built locally
     on the AMPs.  The size of Spool 1 is estimated with no confidence
     to be 94 rows (8,742 bytes).  The estimated time for this step is
     0.02 seconds.
  8) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 0.07 seconds.

How to identify

If you experience an extremely asymmetric AMP load, you can suspect this case.
Find highly skewed JOIN steps in the DBQL (set all logging options on):

select top 50
      a.MaxAMPCPUTime * (hashamp()+1) / nullifzero(a.CPUTime) Skw
    , a.CPUTime
    , a.MaxAMPCPUTime * (hashamp()+1) CoveringCPUTime
    , b.*
from
    dbc.dbqlsteptbl a
    join dbc.dbqlogtbl b on a.procid = b.procid and a.queryid = b.queryid
where
    StepName = 'JIN'
    and CPUtime > 100
    and Skw > 2
order by
    CoveringCPUTime desc;




(Note: covering CPU time is <No-of-AMPs> * <Max AMP’s CPU time>. Virtually this amount of CPU is consumed, because of the asymmetric load of the system.)

Or if you suspect a specific query, check the demography of the join field(s) in the “big” table:

SELECT TOP 100 <Join_field>, count(*) Nbr
FROM <Big_table> GROUP BY 1 ORDER BY 2 DESC;


If the top occurrences are spectacularly larger than the others (or than the average), the idea likely matches.

Solution

Break the query into two parts: join the top customer(s) separately, and then all the others. Finally union the results. (Sometimes additional modification is also required if the embedding operation(s) – the group by here – is/are not decomposable on the same parameter.)
First we have to identify the top customer(s):

SELECT TOP 5 Customer_id, count(*) Nbr
FROM Transact GROUP BY 1 ORDER BY 2 DESC;

Customer_id          Nbr
------------------------
        345       100004
     499873            4
     677423            4
     187236            4
      23482            4
     
Replace the original query with this one:

SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
where t.Customer_id in (345)  

/*
   ID of the top Customer(s). 
   If more customers are salient, list them, but max ~5
*/
GROUP BY 1
UNION ALL
SELECT Customer_name, count(*)
FROM Customer c
JOIN Transact t ON c.Customer_id = t.Customer_id
where t.Customer_id not in (345)  -- Same customer(s)
GROUP BY 1
;

Be sure that Customer.Customer_id, Transact.Transact_id and Transact.Customer_id have statistics!

This query is more complex, has more steps and scans the Transact table twice, but it runs much faster, as you can check.
But why? And how to determine which “top” customers are worth handling separately?
Read ahead.

Explanation

Calculation

Let’s do some maths:
Assume that we are on a 125 AMP system.
Customer table contains 1M records with unique ID.
We have ~4.1M records in the Transact table: 100k for the top customer (ID=345), and 4 for each of the other customers. This matches the 2.5% we assumed above.

If the Transact table is redistributed on hash(Customer_id), then we get ~33k records on each AMP, except AMP(hash(345)): that one gets ~133k (33k + 100k).
That means this AMP has to process ~4x more data than the others, therefore it runs 4x longer.
In other words, for 75% of this JOIN step’s time, 124 AMPs DO NOTHING with the query.

Moreover the preparation and the subsequent steps are also problematic: the JOIN is prepared by a redistribution which produces a strongly skewed spool, and the JOIN’s result, built locally on the AMPs, is skewed as well.

Optimized version

This query consumes moderately more CPU, but it is distributed evenly across the AMPs, utilizing Teradata’s full parallel capability.
It contains a product join as well, but that is no problem: it joins 1 record to the selected 100k records of Transact, which is lightning fast.


Look at the execution plan of the broken-up query:

 This query is optimized using type 2 profile T2_Linux64, profileid 21.
  1) First, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.t.
  2) Next, we lock a distinct D_DB_TMP.”pseudo table” for read on a
     RowHash to prevent global deadlock for D_DB_TMP.c.
  3) We lock D_DB_TMP.t for read, and we lock D_DB_TMP.c for read.
  4) We do a single-AMP RETRIEVE step from D_DB_TMP.c by way of the
     unique primary index “D_DB_TMP.c.Customer_ID = 345” with no
     residual conditions into Spool 4 (all_amps), which is duplicated
     on all AMPs.  The size of Spool 4 is estimated with high
     confidence to be 125 rows (10,625 bytes).  The estimated time for
     this step is 0.01 seconds.
  5) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of an
     all-rows scan, which is joined to D_DB_TMP.t by way of an all-rows
     scan with a condition of (“D_DB_TMP.t.Customer_ID = 345”).  Spool
     4 and D_DB_TMP.t are joined using a product join, with a join
     condition of (“Customer_ID = D_DB_TMP.t.Customer_ID”).  The result
     goes into Spool 3 (all_amps), which is built locally on the AMPs.
     The size of Spool 3 is estimated with low confidence to be 99,670
     rows (8,272,610 bytes).  The estimated time for this step is 0.09
     seconds.
  6) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 5.  The size of Spool 5 is
     estimated with no confidence to be 74,753 rows (11,736,221 bytes).
     The estimated time for this step is 0.20 seconds.
  7) We execute the following steps in parallel.
       1) We do an all-AMPs RETRIEVE step from Spool 5 (Last Use) by
          way of an all-rows scan into Spool 1 (all_amps), which is
          built locally on the AMPs.  The size of Spool 1 is estimated
          with no confidence to be 74,753 rows (22,052,135 bytes).  The
          estimated time for this step is 0.02 seconds.
       2) We do an all-AMPs RETRIEVE step from D_DB_TMP.t by way of an
          all-rows scan with a condition of (“D_DB_TMP.t.Customer_ID <>
          3454″) into Spool 9 (all_amps), which is redistributed by the
          hash code of (D_DB_TMP.t.Customer_ID) to all AMPs.  The size
          of Spool 9 is estimated with high confidence to be 4,294,230
          rows (73,001,910 bytes).  The estimated time for this step is
          1.80 seconds.
  8) We do an all-AMPs JOIN step from D_DB_TMP.c by way of an all-rows
     scan with a condition of (“D_DB_TMP.c.Customer_ID <> 3454”), which
     is joined to Spool 9 (Last Use) by way of an all-rows scan.
     D_DB_TMP.c and Spool 9 are joined using a single partition hash
     join, with a join condition of (“D_DB_TMP.c.Customer_ID =
     Customer_ID”).  The result goes into Spool 8 (all_amps), which is
     built locally on the AMPs.  The size of Spool 8 is estimated with
     low confidence to be 4,294,230 rows (356,421,090 bytes).  The
     estimated time for this step is 0.72 seconds.
  9) We do an all-AMPs SUM step to aggregate from Spool 8 (Last Use) by
     way of an all-rows scan , grouping by field1 (
     D_DB_TMP.c.Customer_name).  Aggregate Intermediate Results are
     computed globally, then placed in Spool 10.  The size of Spool 10
     is estimated with no confidence to be 3,220,673 rows (505,645,661
     bytes).  The estimated time for this step is 8.46 seconds.
 10) We do an all-AMPs RETRIEVE step from Spool 10 (Last Use) by way of
     an all-rows scan into Spool 1 (all_amps), which is built locally
     on the AMPs.  The size of Spool 1 is estimated with no confidence
     to be 3,295,426 rows (972,150,670 bytes).  The estimated time for
     this step is 0.32 seconds.
 11) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> The contents of Spool 1 are sent back to the user as the result of
     statement 1.  The total estimated time is 11.60 seconds.

Sample structures

The table structures (simplified for the example):

CREATE TABLE Customer
(
  Customer_ID   INTEGER
, Customer_name VARCHAR(200)
)
UNIQUE PRIMARY INDEX (Customer_id)
;

insert into Customer values (1,'Cust-1');
Run 20x: 

insert into Customer select mx + sum(1) over (order by Customer_id rows unbounded preceding) id, 'Cust-' || trim(id) from Customer cross join (select max(Customer_id) mx from Customer) x;

collect statistics using sample on customer column (Customer_id);

CREATE TABLE Transact
(
  Transaction_ID   INTEGER
, Customer_ID      INTEGER
)
UNIQUE PRIMARY INDEX (Transaction_id)
;

insert into Transact values (1,1);
Run 22x: 

insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, id mod 1000000 from Transact cross join (select max(Transaction_id) mx from Transact) x;

insert into Transact select mx + sum(1) over (order by Transaction_id rows unbounded preceding) id, 345 from Transact t cross join (select max(Transaction_id) mx from Transact) x where t.Transaction_id < 100000;

collect statistics using sample on Transact column (Customer_id);
collect statistics using sample on Transact column (Transaction_id) ;

Teradata Compress optimization

Teradata Compress optimization
Techniques and effects

What is Multi Value Compression (MVC)?

The Teradata RDBMS supports a nice feature: multi value compression (aka. Teradata Compress). It enables reducing the storage space allocated by the tables in the database, while – and this is incredible – processing compressed data usually requires less resource (CPU and I/O) than processing uncompressed data.
The feature needs no additional licence or hardware components.

How does Teradata Compress work?

I give a short summary, if you are interested in the details please refer to Teradata documentation.

MVC can be defined in the CREATE TABLE DDL or added/modified later by ALTER TABLE statements. The user must define a 1..255 element list of values for each compressible column. Those values will be stored compressed, while the others remain uncompressed.
If a column is compressed, each row has an additional area of 1..8 bits allocated (if N values are listed, ceil(log2(N+1)) bits are allocated). One of the bit combinations means that the value is uncompressed (and allocates its corresponding space within the row layout), while the others encode compressed values, which do not allocate the value’s place in the row.
The compress bits are allocated in every row, regardless of whether the actual value is compressed or not.
Compress bits are “compacted”, e.g.: 3 + 8 + 4 = 15 compress bits will allocate 2 bytes with only 1 wasted bit, instead of 3 byte-aligned values.
The values belonging to the bit combinations are stored in the table header.
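A hedged illustration of how MVC appears in the DDL (hypothetical table, columns and value lists), both at creation time and added later:

CREATE TABLE customer_address
(
  Customer_id INTEGER
, City        VARCHAR(50) COMPRESS ('Budapest','London','New York')
, Country_cd  CHAR(2)     COMPRESS ('HU','GB','US')
)
PRIMARY INDEX (Customer_id);

/* Adding/modifying compression on an existing column later (check version-specific restrictions): */
ALTER TABLE customer_address ADD City COMPRESS ('Budapest','London','New York','Paris');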

Multi Value Compression is:

  • Column level
    It has to be defined on each applicable column of a table separately
  • “Manual”
    You have to calculate which values are worth compressing – Teradata gives no automatism
  • Static
    Once you have defined the values, it will not adapt to changing conditions by itself

It is obvious that the currently optimal compression settings depend on the data demography and the applied data types. The optimal settings may be different later, when the data demography has changed.

Summary of most important properties of MVC once again:

  • Can be defined in the CREATE TABLE statement
  • Can be applied or modified later in an ALTER TABLE statement
  • Must be set on COLUMN level
  • Optimal value list must be calculated by you
  • Optimal settings may change in time. Optimize regularly.

Storage effects

Using MVC, tables allocate less PERM space, which can be calculated easily.
What about the increment?
Table sizes usually grow over time as more and more data is generated. The change in growth speed depends on the volatility of the data demography. If it is stable, the growth speed drops by the rate of compression. If the typical values change over time, the growth will not drop, or may even speed up in extreme cases. However, these are exactly the cases when regular optimization is necessary.

In stable demography cases the growth simply continues at a rate reduced by the compression ratio.

Performance effects

It is a key question: what has to be paid for the reduced storage?

It is obvious that the compression process requires resources during both the compress and the decompress phase. However, there are processing gains as well, which usually dominate the costs. How?

A compressed table resides in proportionally fewer data blocks, therefore fetching its data requires fewer I/O operations. In addition, moving the data in memory (during processing) requires fewer CPU cycles.
When SELECTing table data, usually only a small fragment of the row is used, and unused columns are not decompressed.
Caching is a CPU intensive operation as well, and it is more effective if fewer data blocks are processed.
Compression helps tables to be treated as “small enough to cache 100% into memory”, which results in more effective execution plans.

Summary:

  • INSERT into a compressed table usually consumes 10..50% more CPU (only the final step!)
  • SELECT usually costs no more, or even less, CPU than on uncompressed tables
  • SELECT and INSERT usually cost proportionally less I/O, in line with the compression ratio
  • System level CPU and I/O usage usually drops by 5..10% (!) when the medium and big tables of the system are compressed (caused by more effective caching)

How to set MVC?

Setting up MVC compression on a single table consists of the following 4 steps:

  1. Analyze the data demography of each compressible column of the table *1.
  2. Calculate the optimal compress settings for the columns. Note that
    •   The optimum should be calculated not on separate columns but at table level, since compress bits are packed into whole bytes.
    •   The more values are listed as compressed, the more overhead is on the compression. A proper mathematical formula is to be used for calculating the optimum. *2
    •   Take care of the exceptions: PI / FK / etc. columns and some data types are not compressible (this varies across Teradata versions).
  3. Assemble the corresponding scripts:
    CREATE TABLE DDL + INSERT SELECT + RENAME / ALTER TABLE DDL (see the skeleton below the footnotes)
  4. Implement the compression by running the script.
    Take good care of data protection: backups, locking, documenting.

*1 Simplified sample, to be run for each column: 
     select top 256 <columnX>, count(*), avg(<length(columnX)>) from <table> group by 1 order by 2 desc;
*2 About the algorithm: it is a maximum-seeking function over n, based on the expression of the gains when the TOP {(2^n)-1} most frequent values are compressed. The expression is too complex to discuss here because of the different data types, exceptions and internal storage constructions.
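Putting steps 3 and 4 together, a hedged skeleton of such a transformation script (hypothetical table and compress value lists; wrap it with backup, locking and validation in practice):

CREATE TABLE sales_trx_new
(
  Trx_id    INTEGER
, Store_id  INTEGER COMPRESS (1, 2, 3)
, Status_cd CHAR(1) COMPRESS ('A','C')
)
PRIMARY INDEX (Trx_id);

INSERT INTO sales_trx_new SELECT * FROM sales_trx;

RENAME TABLE sales_trx TO sales_trx_old;
RENAME TABLE sales_trx_new TO sales_trx;

COLLECT STATISTICS ON sales_trx COLUMN (Trx_id);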

One time or regular?

The optimal MVC setting is valid for a specific point in time, since your data changes along with your business. The daily change is usually negligible, but it accumulates.
Practice shows that it is worth reviewing the compress settings every 3..6 months, and continually optimizing new tables a couple of weeks after they come into production.

Estimate how much space and processing capacity is lost if compress optimization is neglected!

Solution in practice

There are “magic Excels” on the net which can calculate the optimal settings if you load the data demography into them, but they require lots of additional manual work (running the calculations, assembling the DDL, writing the transformation script, testing, etc.).

If you want a really simple solution, try PRISE Compress Wizard, which provides a comprehensive solution:

  • Assists to collect good candidate tables to compress
  • Automatically analyses the tables, and gives feedback:
    • How much space can be saved by compress
    • What is the current compress ratio (if there is compress already applied)
    • How much resources were used for analysis
    • What is the optimal structure
  • Generates the transformation script (+ checks, locking, logging) along with:
    • Backup (arcmain)
    • Revert process (for safety and documentation)
    • Reverse engineering (for E/R documentation update)
  • Logs the implementation:
    •  Reflects the achieved space saving: success measurement
    •  Reports the CPU and I/O resources used for the transformation