Avoiding Product Joins

How to eliminate Product Joins

What is a product join?

A product join is one of the implementation methods of an SQL JOIN operation.
Do not mix it up with the cross join (Cartesian product), which is one of the SQL join types.

SQL join types, eg.: inner join, left outer join, full outer join, cross (Cartesian) join
Join implementation types, eg.: nested join, merge join, hash join, product join.

A product join (of tables A and B) is the simplest method of join implementation:

  • Produce each <A;B> record combination: take each record from A and match it with each record of B, one by one.
  • Test the join condition on each produced <A;B> record pair, and eliminate those combinations where the condition fails.
The two steps are usually combined: the “testing phase” is executed right after a record combination is generated, and invalid combinations are dropped immediately. This saves a lot of temp space.

Why don’t we like product joins?

Well, it has a really bad reputation: it is slow, it eats up CPU, etc.
Yes, it usually is and does. It is the brute force method of executing a join, with a cost on the order of N*M (where N and M are the record counts of the joined tables). For example, joining a 1-million-record table to a 10-thousand-record table this way means 10 billion comparisons.

However, there are situations when it is the best choice, or the only feasible way.

When is it good or necessary?

Please note that the product join is the method that is always applicable, independently of all circumstances.

Good

The product join is typically a simple, dumb and slow algorithm, this is why we do not like it, but it has a very important advantage: it requires no pre-processing.* This is why we LIKE IT:)
If we have to join a really large table to a very small one (a couple of records), the product join is by far the most effective method, since sorting a very large table (order of N*logN) can cost a lot, while joining to 1-2 records is really not a big deal.

Necessary

There are join situations when the only way to go is the product join. Why? Because of the join condition. The “clever” joins (merge, hash) require some information and/or condition that enables them to cheat the full A x B comparison: reduce it to the pairs that are really necessary, and do it in a more effective manner.

* OK, in Teradata this means: it only requires that the matchable records from both tables reside on the same AMP. This implies that the “small” table is duplicated to all AMPs.

Merge join example


select ...
from A
join  B on A.customer_id = B.customer_id
       and A.trx_dt between B.eff_dt and B.exp_dt

  • The customer_id clause is in an AND relation with the others
  • customer_id is selective enough that hash(customer_id) can reduce the comparisons reasonably
  • Note that A and B must be sorted (re-sorted) by the hash of customer_id

Product join example


select ...
from A
join   B on substr(A.telephone_no,1,B.prefix_length) = B.telephone_no_prefix

  • There is no partial condition that could reduce the comparisons
  • Note that neither of the tables is required to be sorted in a specific order.

Unavoidable product joins

  • Non-equality condition
  • Function used (eg. substr())
  • Dependent expression is used (eg. A.x+B.y = A.z)
  • Cross join: intentional Cartesian product

Avoidable product joins

Data type mismatch

The merge join example above works only if customer_id in tables A and B has the same “style” of data type, since their hash values will match only in this case. Say hash(13674) <> hash(‘13674’); however, integer is compatible with decimal, and char is compatible with varchar.
Pay attention to data type consistency during physical data modeling.
  • Use domains to eliminate the possibility of a mismatch
  • Align to the used data types when defining temp tables, or use “create table as …” statements
  • If you cannot avoid the mismatch, relocate the necessary data to temp tables with proper data types during processing (see the sketch below).
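A minimal sketch of the last option, assuming the hypothetical tables of the merge join example above, where B stores customer_id as VARCHAR while A stores it as INTEGER:

-- Copy B into a volatile table with the matching data type, so the join on customer_id
-- can be executed as a merge/hash join instead of a product join.
CREATE VOLATILE TABLE B_fixed AS
(
  SELECT CAST(customer_id AS INTEGER) AS customer_id
       , eff_dt
       , exp_dt
  FROM B
) WITH DATA
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;

Then join A to B_fixed instead of B.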

OR condition

Let’s assume the following join condition:
select ...
from A
join  B on A.col1 = B.Col1
        OR A.Col2 = B.Col2

This is the equivalent, without the compulsory product join:

select ...
from A
join  B on A.col1 = B.Col1

UNION

select ...
from A
join  B on A.Col2 = B.Col2

Missing/stale statistics

As I mentioned before, the product join is the most effective join between a very large and a really small (couple of records) table. If the optimizer thinks that a table is pretty small, but in fact it is not, it may choose a product join in all good faith, misled by stale or missing statistics.
Define and keep fresh those statistics by which the optimizer can determine the size of the joinable record sets properly.
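A hedged sketch, reusing the hypothetical tables and join column of the merge join example above (adjust names to your schema):

COLLECT STATISTICS ON A COLUMN (customer_id);   -- join column of the big table
COLLECT STATISTICS ON B COLUMN (customer_id);   -- join column of the small table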

How to find avoidable product joins

It is not trivial to list the avoidable product joins. Practically, all product joins have to be examined one by one and judged whether they are avoidable or not, and if avoidable, what to do about them.

I strongly recommend using PRISE Tuning Assistant both for finding the product joins and for analyzing the possibility and necessity of elimination:

  • List top consuming queries with product join(s)
  • Check the PROD JOIN steps: which tables are processed that way
  • Check those join conditions for cases described above

What to do if it cannot be avoided?

In this case I recommend trying the decomposition technique, described here.
It can help reduce the number of comparisons, saving CPU and runtime.

 
Have a successful optimization!
 

The “No more spool space” Error Explained

Why do I get “No more spool space” error?

This is the most familiar error message in the Teradata world:
“Failure 2646 No more spool space”
What does it really mean, and what causes it?
Let’s get back to the basics.

What is spool space?

Spool space is a temporary area that is used to store partial results during query processing, as well as volatile tables. All free space in the database that is not allocated by PERM data can technically be used as spool area, since PERM data does not claim that space.

Each database user may have a “spool limit” that prevents the user from allocating more spool area at a time than its limit. Keep in mind that all active sessions of a username share the spool limit together.

Teradata is a massively parallel system, therefore the spool limit must be interpreted at AMP level:
Eg: on a 100-AMP system, a user with a 10G spool limit gets 100M spool/AMP
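A hedged sketch for checking a user's spool limit and peak spool usage per AMP (column names as found in the DBC.DiskSpaceV view; verify them against your Teradata release):

SELECT Vproc
     , MaxSpool  / 1e6 AS MaxSpool_MB
     , PeakSpool / 1e6 AS PeakSpool_MB
FROM DBC.DiskSpaceV
WHERE DatabaseName = 'your_user'     -- hypothetical user name
ORDER BY PeakSpool DESC;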

What is spool space limit good for?

This limitation method is a quite simple way to cut off those queries that would suck up too many resources. There is no exact relationship between high spool usage and an ineffective query, but statistically the correlation is high.
Practically: a bad query gets kicked off before it consumes too many resources unnecessarily.

No more spool space scenarios

System ran out of spool space

This is the rarest situation, you can practically forget about it. It means there is too little free space on the system, but this is usually avoided by defining a “SpoolReserve” database, in which no objects are created, so that area is always available for spool.
If many “big spool limit” users run high-spool queries in parallel, this rare situation can still occur.

Multiple sessions of the user are active together

This is also a quite rare situation. Check the active sessions of the user in dbc.sessioninfo.
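A quick sketch (using the dbc.SessionInfoV view; column names may differ slightly by release):

SELECT UserName, SessionNo, LogonDate, LogonTime
FROM DBC.SessionInfoV
WHERE UserName = 'your_user'     -- hypothetical user name
ORDER BY LogonDate, LogonTime;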

Volatile tables

All existing volatile tables reside in your available spool space, reducing the available amount. If you create many of them, possibly with skewed distribution, you can fill your spool up. Choose the “primary index” carefully when defining volatile tables, too.

Improper execution plan

These cause more than 90% of the “No more spool space” errors. Let’s see how:
  • “Duplication to all AMPs” of a non-small set of records
    The root cause is typically missing or stale statistics. Either the system thinks that far fewer records will be duplicated than in reality (sometimes billions of records get into this kind of spool), or it knows the number exactly, but there are more low quality estimations on the other branches of the query, and this execution seems to be the cheaper one.
  • Redistribution of records by a hash that causes skewed distribution
    Check the corresponding blog post: Accelerate skewed joins
  • Retrieving a huge amount of records into spool (locally or redistributed onto the AMPs)
    Specific query structures imply this execution, like a join to a view that “union all”-s big tables.
I suggest using PRISE Tuning Assistant to identify the problem. It spectacularly displays which execution step falls into one of the problems above.
Increasing the spool limit will not solve the problem in most cases.

Too big task

Sometimes a given SQL query requires a big spool area to be performed, even with the best execution plan.

This is the only case when raising the spool limit is the solution. But first you have to make sure that the task is really that big. PRISE Tuning Assistant is a good tool for identifying this in a minute.

SET or MULTISET table?


Choosing between SET and MULTISET table

The question is asked far less often than it should be.
The default setting is SET, which is applied if you do not specify otherwise, and everything seems to go fine.
Now let’s see what all these things mean.
Relational data modeling states that there must not be two records in a table that have all corresponding fields equal. If we design a regular 3rd normal form data model, the statement seems trivial, because the Primary Key (PK) ensures it. (The PK is not to be mixed up with the PI.)

However, we typically do not define PKs in data warehouses, because of the possibly large storage and processing demands. A PK can be implemented as:

  • UPI: no overhead, but often the PK columns are not the best choice from the access path point of view
  • USI: an extra subtable with significant perm space and maintenance processing demands

Without a PK, nothing ensures that all rows are distinct – this function is fulfilled by the “SET” property of a table. It will not allow inserting more than one record with a fully identical layout (column contents); it simply swallows those kinds of records. This is called “DUPLICATE ROW CHECK” in the documentation.
There can be only one – like in the Highlander movie…

What is the difference between PK and SET?

  • Definition
    • PK: regards a chosen subset of columns
    • SET: covers all columns of the table together
  • Efficiency/Performance
    • PK: only one matching record may exist in the table: 1 fetch, 0 or 1 match (few comparisons); only the hash value and the PK columns must be compared (fast)
    • SET: many hash synonyms can exist (many comparisons), and the whole record layout must be compared. This can be painfully slow, and in addition skewed
  • Result
    • PK: the INSERT/UPDATE statement fails if uniqueness is violated (2801: Duplicate unique primary key error)
    • SET: swallows the duplicated records, with no error message, if an INSERT SELECT statement tries to insert duplicate or already existing records. If we use an INSERT… VALUES statement, or try to UPDATE a record to be identical to another one, we get error 2802, like in the PK case

The problem of SET tables was discussed in the post Curing slow INSERTs and CREATE TABLEs I

Why do we use SET tables then?

A SET table takes good care of some cases of unintentional duplication, typically coming from bad join conditions or overlapping history tables.
Sometimes. And sometimes not, eg. if other data (a running sequence or an identity column) makes a difference in some columns.
I think using SET tables to eliminate programming mistakes is bad practice. However, the fact that a SET table ensures we will not find any “dumb” duplication in our base tables gives a good sense of calm about our data.

FastLoad and MULTISET

You may be surprised when you try to load a MULTISET table with the FastLoad utility: it will eliminate row duplicates whether you want it or not. You cannot switch this function off.
The duplicates are simply eliminated; you can only deduce it from the logfile: check the
“Total Duplicate Rows” amount.

When should we avoid using SET tables?

The short answer is: NUPI tables with high hash-collision rate for both temporary and final tables.

What does this mean?
Check the TOP frequent occurrences of the table’s PI values:
Assume this table:
CREATE TABLE DB1.TBL1
(
c1 INTEGER,
c2 CHAR(10),
c3 VARCHAR(100)
) PRIMARY INDEX (c1,c2)

Query the top frequent values of the primary index (PI) columns together:

SELECT TOP 20 c1, c2, count(*) cnt FROM DB1.TBL1 GROUP by 1,2 ORDER by 3 desc

If we see that the count of the most frequent value is greater than 100, I would surely choose MULTISET.

In my experience, 30-50% of performance problems are based on hash collisions, and a significant part of them comes from a misunderstood usage of the SET feature, suffering from these side effects.
The other part is the “hidden product join”.
Take care! MULTISET will only solve the performance problem of the INSERTs, but will not help the other part: joins, and joined UPDATE or DELETE commands.

If a table has bad skewness, it will have a bad hash collision level (HC) in 99% of the cases. But it is not reversible: a good skew in itself will not ensure a low HC.

+1 trick: if you just want to create a copy of a high-HC table for a backup, choose MULTISET and you will have a much quicker run, as sketched below.
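A minimal sketch, assuming the DB1.TBL1 example table above; the MULTISET keyword overrides the source table’s SET property, so the expensive duplicate row check is skipped during the copy:

CREATE MULTISET TABLE DB1.TBL1_SAVE AS DB1.TBL1 WITH DATA;   -- quick backup copy, no duplicate row check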

Using Join Index

How to use Join Index in Teradata

About Join Index (JI)

A join index is most like a “materialized view”: it is a stored result of an SQL SELECT, kept like a table, and you can define the primary index (PI) of the stored result.

What are the main differences between a JI and a Secondary Index?

  • Different internal structure and purposes
  • Appears as separate object (not a sub-table) in the DBC tables/views
  • Can reside in different database than the base table(s)
  • Maintenance is logged in separate DBQL STEP, therefore cost can be easily measured

What is the difference between JI and a regular table in which I can store the same query result? 

  • You can’t access it directly from an SQL query
  • The optimizer automatically decides whether to use it or not:
    • By logic: if the JI can support the query with its data content
    • By cost: if using the JI results in a “cheaper” execution (always collect statistics!)
  • It is maintained automatically if the content of the base table(s) changes
  • It “locks” the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTER of the indexed column(s).

JI types

The Join Index types I will list below are not differentiated by an SQL phrase, but by the structure of the SQL SELECT used in the JI definition.

They can also be combined in reasonable ways, eg. <single table – aggregate – sparse> or <multi table – aggregate>, etc.

Let’s take these base tables for our examples:
CREATE TABLE TBL_A
(
  Col1 integer
, Col2 integer
, Col3 integer
) PRIMARY INDEX (Col1)
;
CREATE TABLE TBL_B
(
  Col1 integer
, Col2 integer
) PRIMARY INDEX (Col1)
;

Single table JI

This is the simplest case of a join index. We include only one table, and typically choose a different PI than the base table has. There are two significantly different kinds of usage:

  • Non-covering
    We select only the filtering column(s) (those will also be the PI of the JI) and the “ROWID” pseudo column in the JI definition. In this case the filter is strongly selective, and the rowids are put into a spool to be joined back to the base table’s appropriate records. The JI can be very small this way, but note that we have an additional “join back” phase.
  • Covering
    The JI selects all columns (or all columns required by the SQL to be supported). That means the base table is not needed at all to satisfy the query. This results in a very fast operation. This case is typically used for eliminating frequent redistributions of some “central” tables.
This example shows a non-covering index for the query below:
create join index JI_1 as
SELECT Col1, Col2
FROM TBL_A
PRIMARY INDEX (Col2);

select Col3 from TBL_A where Col2=1;

Multi table JI

This kind of JI is for accelerating frequent join statements. Technically it stores the result of a join. It can cover all the columns, or store just the key pairs, or anything in between.

create join index JI_2 as
SELECT a.Col3, b.Col1
FROM TBL_A a
join TBL_B b on a.Col2=b.Col2
PRIMARY INDEX (Col3);
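A query of the kind JI_2 is meant to support could look like this (illustrative sketch; with collected statistics the optimizer may serve it from JI_2 instead of joining the base tables):

select a.Col3, b.Col1
from TBL_A a
join TBL_B b on a.Col2=b.Col2
where a.Col3 = 10;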

Aggregate JI

The JI’s SELECT contains a GROUP BY clause. This case is for caching frequent aggregations. It can typically be very useful for supporting OLAP applications that do not have internal aggregation-caching methods. Teradata’s optimizer is quite clever: it can recognize “intermediate” aggregate JIs for further aggregation, instead of using the base table. Example:

create join index JI_3 as
SELECT Col1, Col2, sum(Col3) X
FROM TBL_A
GROUP BY 1,2
PRIMARY INDEX (Col2);

All three SELECTs below can be served from JI_3:
SQL1: select Col1,       sum(Col3) from TBL_A group by 1;
SQL2: select Col2,       sum(Col3) from TBL_A group by 1;
SQL3: select Col1, Col2, sum(Col3) from TBL_A group by 1,2;

Sparse JI

These JIs contain a WHERE clause, that is, the indexing does not cover the whole table. If the WHERE condition of the supported SQL is a logical subset of the JI’s WHERE condition, then the JI can support the query.
Typical usage is on transactional tables: we have frequent accesses to a transaction table by Customer_id, but its PI is Trx_id. We observe that 90% of the SELECTs fall onto the last 30 days.
We can put a sparse, non-covering, single-table JI on the table with PI: Customer_id.
90% of the selects will finish in no time, and 10% of the queries will result in a full table scan, while our JI size remains pretty small: we index only 30 days instead of eg. 365 days.
Please note that the WHERE condition of a sparse JI cannot contain “changing” values like current_date. That means the JI must be regularly redefined if a moving-window sparse JI is to be maintained (see the sketch after the example below).

create join index JI_4 as
SELECT Col1, Col2, Col3
FROM TBL_A
WHERE Col3 between 1 and 10
PRIMARY INDEX (Col2);

A query that can be served:
select Col1 from TBL_A where Col2=20 and Col3=4;
A query that cannot be served:
select Col1 from TBL_A where Col2=20 and Col3=11;
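A hedged sketch of the moving-window refresh mentioned above, assuming a hypothetical transaction table TRX(Trx_id, Customer_id, Trx_dt) with PI(Trx_id); the date literal must be regenerated at every refresh, since current_date is not allowed in a sparse JI definition:

DROP JOIN INDEX JI_TRX_LAST30;

CREATE JOIN INDEX JI_TRX_LAST30 AS
SELECT Customer_id, Trx_id, ROWID     -- non-covering: ROWID joins back to the base table
FROM TRX
WHERE Trx_dt >= DATE '2024-01-01'     -- literal lower bound, regenerated by the refresh script
PRIMARY INDEX (Customer_id);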

When will join index help?

Join indices typically serve two purposes (or the combination of them):

  • Store the result of frequently produced (part-)results
  • Provide alternative access path to the data
Practical examples:
  • Frequent aggregation or join operation
  • Frequent redistribution of a table
  • Non primary index based filtering on a table (causing full table scan)
  • Non primary index based join of a table (causing redistribution)

Important things about JIs

Statistics

Always collect statistics on the PI of your JIs; it is essential for helping the optimizer decide to use them.
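A sketch, using the JI_1 example above (verify the exact syntax on your release):

COLLECT STATISTICS ON JI_1 COLUMN (Col2);   -- statistics on the JI’s PI column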

Choosing PI

The PI of the JI will be the most effective access path to the JI’s records; like in the case of tables, use the regular PI choosing methodology. Mind the frequent access patterns (by which column(s) the JI is filtered or joined), the distribution (skewness) and the hash collision rate.

Teradata Compress optimization

Techniques and effects

What is Multi Value Compression (MVC)?

Teradata RDBMS supports a nice feature: multi value compression (aka. Teradata Compress). It enables reducing the storage space allocated by the tables in the database, while – this is incredible – processing compressed data usually requires fewer resources (CPU and I/O) than processing the uncompressed data.
The feature needs no additional licence or hardware components.

How does Teradata Compress work?

I give a short summary here; if you are interested in the details, please refer to the Teradata documentation.

MVC can be defined in the CREATE TABLE DDL, or added/modified later by ALTER TABLE statements. The user must define a 1..255 element list of values for each compressible column. Those values will be stored compressed, while the others remain uncompressed.
If a column is compressed, each row has an additional area of 1..8 bits allocated (if N values are listed, ceiling(log2(N+1)) bits are allocated). One of the bit combinations means that the value is uncompressed (and allocates its corresponding space within the row layout), while all the others mean a compressed value, which does not allocate the value’s place in the row.
The compress bits are allocated in every row, regardless of whether the actual value is compressed or not.
Compress bits are “compacted”, eg.: 3 + 8 + 4 = 15 compress bits will allocate 2 bytes with only 1 wasted bit, instead of 3 byte-aligned values.
The values belonging to the bit combinations are stored in the table header.
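A minimal, hypothetical example of the syntax (table and value lists are made up; always derive the lists from your own data demography, and note that some data types, e.g. VARCHAR, are only compressible on newer releases):

CREATE MULTISET TABLE DB1.CALLS
(
  Call_id    INTEGER
, Call_type  CHAR(5)  COMPRESS ('VOICE','SMS','DATA')               -- 3 values: 2 compress bits
, Network_cd CHAR(2)  COMPRESS ('01','02','03','04','05','06','07') -- 7 values: 3 compress bits
) PRIMARY INDEX (Call_id);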

Multi Value Compression is:

  • Column level
    It has to be defined on each applicable column of a table separately
  • “Manual”
    You have to calculate which values are worth compressing – Teradata gives no automatism
  • Static
    Once you have defined the values, it will not adapt to changing conditions by itself

It is obvious that the currently optimal compression settings depend on the data demography and the applied data types. The optimal setting may be different later, when the data demography has changed.

Summary of most important properties of MVC once again:

  • Can be defined in the CREATE TABLE statement
  • Can be applied or modified later in an ALTER TABLE statement (see the sketch after this list)
  • Must be set on COLUMN level
  • Optimal value list must be calculated by you
  • Optimal settings may change in time. Optimize regularly.
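A hedged syntax sketch of the ALTER TABLE route, reusing the hypothetical DB1.CALLS table (check the ALTER TABLE restrictions of your release; for large tables the CREATE TABLE + INSERT SELECT + RENAME approach described below is usually preferred):

ALTER TABLE DB1.CALLS ADD Call_type COMPRESS ('VOICE','SMS','DATA','MMS');   -- redefine the value list of an existing column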

Storage effects

Tables using MVC will allocate less PERM space, which can be calculated in advance – simple.
What about the increment?
Table sizes usually grow over time, as more and more data is generated. The change in growth speed depends on the volatility of the data demography. If it is stable, the growth speed will drop by the rate of compression. If the typical values change over time, the growth will not drop, or may even speed up in extreme cases. However, these are the cases when regular optimization is necessary.

In stable demography cases the growth looks like this:

Performance effects

It is a key question: what do we have to pay for the smaller storage?

It is obvious that the compression process requires resources during both the compress and the decompress phases. However, there are processing gains as well, which usually dominate the costs. How?

A compressed table resides in proportionally fewer data blocks, therefore fetching its data requires fewer I/O operations. In addition, moving the data in memory (during processing) requires fewer CPU cycles.
When SELECTing table data, usually only a small fragment of the row is used, and unused columns will not be decompressed.
Caching is also a CPU intensive operation, which is more effective if fewer data blocks are processed.
Compression helps tables to be treated as “small enough to cache 100% into memory”, which results in more effective execution plans.

Summary:

  • INSERT into a compressed table usually consumes 10..50% more CPU (only in the final step!)
  • SELECT usually costs no more, or even less, CPU than on uncompressed tables
  • SELECT and INSERT usually cost proportionally less I/O, in line with the compression ratio
  • System level CPU and I/O usage usually drops by 5..10% (!) when the medium and big tables of the system are compressed (caused by more effective caching)

How to set MVC?

Setting up MVC compression on a single table should consist of the following 4 steps:

  1. Analyze the data demography of each compressible column of the table. *1
  2. Calculate the optimal compress settings for the columns. Note that
    •   The optimum should be calculated not on separate columns, but at table level, since compress bits are packed into whole bytes.
    •   The more values are listed as compressed, the more overhead the compression carries. A proper mathematical formula is to be used for calculating the optimum. *2
    •   Take care of the exceptions: PI / FK / etc. columns and some data types are not compressible (this varies across Teradata versions).
  3. Assemble the corresponding scripts
    CREATE TABLE DDL + INSERT SELECT + RENAME / ALTER TABLE DDL
  4. Implement the compression by running the script
    Take good care of data protection: backups, locking, documentation.

*1 Simplified sample, to be run for each column:
     select top 256 <columnX>, count(*), avg(<length of columnX>) from <table> group by 1 order by 2 desc;
*2 About the algorithm: it is a maximum-seeking function of n, based on the expression of the gains when the TOP {(2^n)-1} most frequent values are compressed. The expression is far too complex to discuss here, because of the different data types, exceptions and internal storage constructions.

One time or regular?

The optimal MVC setting is valid for a specific point in time, since your data changes along with your business. The daily change is usually negligible, but it accumulates.
Practice shows that it is worth reviewing the compress settings every 3..6 months, and continuously optimizing new tables a couple of weeks after they come into production.

Estimate how much space and processing capacity is lost if compress optimization is neglected!

Solution in practice

There are “magic Excel sheets” on the net which can calculate the optimal settings if you load the data demography into them, but they require a lot of additional manual work (running the calculations, assembling the DDL, writing the transformation scripts, testing, etc.).

If you want a really simple solution, try PRISE Compress Wizard, which supplies a comprehensive solution:

  • Assists in collecting good candidate tables to compress
  • Automatically analyzes the tables and gives feedback:
    • How much space can be saved by compression
    • What the current compress ratio is (if compression is already applied)
    • How many resources were used for the analysis
    • What the optimal structure is
  • Generates the transformation script (+ checks, locking, logging) along with
    • Backup (arcmain)
    • Revert process (for safety and documentation)
    • Reverse engineering (for E/R documentation update)
  • Logs the implementation
    • Reflects the achieved space saving: success measurement
    • Reports the CPU and I/O resources used for the transformation

Accelerate PRODUCT JOIN by decomposition

How to optimize slow product joins

Case description

There are SQL queries that cannot be executed in any other way than using the product join method, because of the content of the join condition. Eg:

  • OR predicate
    Example:
    ON (a.x = b.x OR a.y = b.y)
  • BETWEEN / LIKE operator
    Examples:

        ON (a.x LIKE b.y)
        ON (a.x LIKE b.y || '%')
        ON (a.x between b.y and b.z)
  • Comparison (=) of different datatype fields
    Example (a.x is INTEGER, b.x is VARCHAR)
    ON (a.x = b.x)
  • Arithmetic expression usage
    Example
    ON (a.x = b.x + 1)
  • SQL function (eg. substr() ) or UDF usage
    Example
    ON (substr(a.x,1,characters(b.x)) = b.x)

Product join is a technique where the execution matches each record combination from the two joined tables and evaluates the join condition on each of them. A product join usually causes huge CPU consumption and a long response time.

How to identify

The query typically runs for a long time, contains a “PRODUCT JOIN” step in the Explain description, and that step consumes high AMPCPUTime and lasts long. Those queries usually have an LHR index (Larry Higa Ratio, showing the CPU to I/O rate) far greater than 1, typically in the 10s, 100s or more.

In the DBQL you should find the high CPU usage queries (dbc.DBQLogtbl.AMPCPUTime > 1000, depending on system size) which also have a “product join” expression in the execution plan text (dbc.DBQLExplaintbl.ExplainText like ‘%product join%’), as sketched below.
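A hedged sketch of such a search (DBQL logging with EXPLAIN text must be enabled; table and column names are as in the DBC objects referenced above, verify them on your release):

SELECT l.ProcID, l.QueryID, l.UserName, l.AMPCPUTime, l.TotalIOCount
FROM dbc.DBQLogtbl l
JOIN dbc.DBQLExplaintbl e
  ON l.ProcID = e.ProcID AND l.QueryID = e.QueryID
WHERE l.AMPCPUTime > 1000                    -- threshold depends on system size
  AND e.ExplainText LIKE '%product join%'
ORDER BY l.AMPCPUTime DESC;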

PRISE Tuning Assistant supplies easy-to-use GUI based search function.

Explanation of product join execution

Let’s assume that we join two tables: Table1 (N records, the bigger table) and Table2 (M records, the smaller table). Join processing requires that the matchable record pairs reside on the same AMP. Since the product join compares each Table1 record to each Table2 record, all records of one of the tables must reside on all AMPs, therefore the PRODUCT JOIN is preceded by a “Duplicated to all AMPs” step on the smaller table.
Each record pair is evaluated: if the JOIN condition is satisfied, the pair gets into the result spool, otherwise it is discarded.
The number of required comparisons is N x M, and the cost (approximately the required CPU time) of one comparison depends on the complexity of the join expression.

Solution

In most cases the JOIN condition of the product join is satisfied by only a small fraction of all possible combinations. In practice we can identify a frequent situation:
a significant subset of the bigger table’s records can only match a small subset of the smaller table’s records.
Telco example: number analysis. Most of the call records are directed to national number areas (>90%), but the number area description table contains dominantly international number regions (>80..95%). We can declare that national calls will never match international areas. In addition, it is very simple to identify for both a “Number” and a “Number area” whether it is national or international.
The base query looks like this:

select Table1.Number,Table2.Area_code
from Table1
join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number;

Let’s decompose the query into two parts:

select Table1.Number,Table2.Area_code
from Table1
join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number
where substr(Table1.Number,1,1) = '1'   -- USA area code
and substr(Table2.Area_Start_number,1,1) = '1' -- USA area code
and substr(Table2.Area_end_number,1,1)   = '1' -- USA area code
UNION ALL
select Table1.Number,Table2.Area_code
from Table1
join Table2 ON Table1.Number BETWEEN Table2.Area_Start_number and Table2.Area_end_number
where
NOT (substr(Table1.Number,1,1) = '1')            -- non-USA calling number
and NOT
(
    substr(Table2.Area_Start_number,1,1) = '1'   -- area entirely within the USA...
and substr(Table2.Area_end_number,1,1)   = '1'   -- ...which is excluded by the NOT
);



This way we added some low cost operations (full scans on the tables to identify national/international numbers), plus the cost of UNIONing the results, but we eliminated lots of trivially non-satisfying comparisons.

The following figures illustrate the processing cost; the red area represents the number of comparisons, and therefore the cost:

Figure 1: Original case
Figure 2: Decomposed case

Let’s do some math, with imagined proportions:
N1: international calls
N2: national calls
M1: international area descriptions
M2: national area descriptions
90% of the calls (N) are national (N2)
90% of the area descriptions (M) are international (M1)
Originally we have to do N x M comparisons.
The decomposed query must do
((0.9 x N) x (0.1 x M)) + ((0.1 x N) x (0.9 x M)) = 0.09 x N x M + 0.09 x N x M = 0.18 x N x M comparisons.

The optimized query will do only 18% of the original comparisons, with the tradeoff of two full scans (I/O intensive) of the base tables and one UNION ALL (low cost) of the results.

In this case we get a roughly 4 times faster, CPU-saving execution.

Sometimes it can be worth decomposing into 3 or more processing phases, depending on the data.

It is important that if there are further joins or transformations on the result data, they should be done on the UNION ALL-ed result, and should not be duplicated in the decomposed phases, for code management reasons.

Why can’t the Teradata PE do the same?
The decomposition requires knowledge of the data, and varies from query to query, which is currently beyond the scope and intelligence of an automatic optimizer.

Summary

Eliminate trivially invalid record pairs from the PRODUCT JOIN by breaking the query into more parts.

What’s next

The next post will discuss the slow OUTER (LEFT/RIGHT) JOIN.

Introduction

Who am I?

My name is Ákos Lévai, and I have been working in Teradata environments since 2000.
I am a Teradata V2R5 master and my focus area is Teradata performance optimization.

Goal of this blog

My experience is that Teradata is a mysterious black box for lots of users and even developers: sometimes it is a lightning fast rocket, but sometimes it is as slow as a snail.
And they really do not know why.

If we look behind the scenes, everything becomes clear. My goal is to demystify Teradata and help my fellows easily write effective SQL, instead of suffering from a lack of resources or keep buying new hardware.

This blog is not a theoretical lecture and will not explain algorithms; it discusses practical problems and solutions in the following structure:

  • What is the problem?
  • How to identify it?
  • What is the solution?

This blog is not a structured tutorial. The topics are sorted randomly, in the order they came to my mind.
Search if you are interested in a specific topic, or read through just to pick up useful tricks and ideas.

Comments are welcome!

So let’s uncover the mysteries…