Avoiding Product Joins

How to eliminate Product Joins

What is product join?

A product join is one of the methods the database can use to implement an SQL JOIN operation.
Do not confuse it with the cross join (Cartesian product), which is one of the SQL join types.

SQL join types, e.g.: inner join, left outer join, full outer join, cross (Cartesian) join
Join implementation methods, e.g.: nested join, merge join, hash join, product join.

Product join (of tables A and B) is the simplest method of join implementation:

  • Produce every <A;B> record combination: take the records of A one at a time and pair each with every record of B.
  • Test the join condition on each <A;B> record pair produced, and eliminate the combinations where the condition fails.
The two steps are usually combined: the test is executed as soon as a record combination is generated, and invalid combinations are dropped immediately. This saves a lot of temp space.
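
The two combined steps can be sketched in a few lines of Python. This is only a model of the algorithm, not Teradata code: the function name, the sample records and the comparison counter are all made up for illustration.

```python
def product_join(table_a, table_b, condition):
    """Brute-force join: generate every <a, b> pair and keep the matching ones."""
    result = []
    comparisons = 0
    for a in table_a:                # take the records of A one at a time
        for b in table_b:            # pair each with every record of B
            comparisons += 1
            if condition(a, b):      # test right away, drop the pair if it fails
                result.append((a, b))
    return result, comparisons


# Hypothetical sample data: phone numbers joined to variable-length prefixes
calls = [{"telephone_no": "3620123"}, {"telephone_no": "3630999"}, {"telephone_no": "4411111"}]
prefixes = [
    {"telephone_no_prefix": "3620", "prefix_length": 4},
    {"telephone_no_prefix": "44", "prefix_length": 2},
]

rows, comparisons = product_join(
    calls,
    prefixes,
    lambda a, b: a["telephone_no"][: b["prefix_length"]] == b["telephone_no_prefix"],
)
print(len(rows), "matches from", comparisons, "comparisons")
```

The number of comparisons is always len(A) * len(B), whatever the condition is. That is the N*M cost, and also the reason why the method works with any join condition.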

Why don’t we like product joins?

Well, it has a really bad reputation: it is slow, it burns CPU, etc.
Usually this is true. It is the brute-force method of executing a join, with a cost in the order of N*M (where N and M are the record counts of the tables to be joined).

Still, there are situations where it is the best choice, or the only feasible one.

When is it good or necessary?

Please note that the product join is the one method that is always applicable, regardless of the circumstances.

Good

The product join is a simple, dumb and typically slow algorithm, which is why we do not like it, but it has a very important advantage: it requires no pre-processing.* This is why we LIKE IT :)
If we have to join a really large table to a very small one (a couple of records), the product join is by far the most effective method: sorting a very large table (in the order of N*logN) can cost a lot, while matching against 1-2 records is really not a big deal.

Necessary

There are join situations where the product join is the only way to go. Why? Because of the join condition. The “clever joins” (merge, hash) require some information and/or condition that lets them cheat on the A x B comparisons: reduce them to the ones that are really necessary, and perform those more efficiently.

* OK, in Teradata this means: it only requires that the records to be matched from both tables are on the same AMP. This implies that the “small” table is duplicated to all AMPs.

Merge join example


from A
join  B on A.customer_id = B.customer_id
and A.trx_dt between B.eff_dt and B.exp_dt

  • The customer_id clause is AND-ed with the other conditions
  • Customer_id is selective enough that hash(customer_id) reduces the number of comparisons reasonably
  • Note that A and B must be sorted (re-sorted) by the hash of customer_id
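
To see how an AND-ed equality clause reduces the comparisons, here is a simplified Python model. It groups B by customer_id in a dictionary instead of sorting both tables by row hash as a real merge join does, so it only shows the effect on the comparison count. All names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def bucketed_join(table_a, table_b, key, residual):
    """Compare only records sharing the same key value, then test the residual condition."""
    buckets = defaultdict(list)
    for b in table_b:                          # group B by the equality key
        buckets[b[key]].append(b)
    result, comparisons = [], 0
    for a in table_a:
        for b in buckets.get(a[key], []):      # only the candidates with an equal key
            comparisons += 1
            if residual(a, b):
                result.append((a, b))
    return result, comparisons


# Hypothetical sample data
txns = [
    {"customer_id": 1, "trx_dt": date(2020, 1, 15)},
    {"customer_id": 1, "trx_dt": date(2020, 3, 1)},
    {"customer_id": 2, "trx_dt": date(2020, 1, 20)},
]
periods = [
    {"customer_id": 1, "eff_dt": date(2020, 1, 1), "exp_dt": date(2020, 1, 31)},
    {"customer_id": 1, "eff_dt": date(2020, 2, 1), "exp_dt": date(2020, 12, 31)},
    {"customer_id": 2, "eff_dt": date(2020, 1, 1), "exp_dt": date(2020, 12, 31)},
    {"customer_id": 3, "eff_dt": date(2020, 1, 1), "exp_dt": date(2020, 12, 31)},
]

rows, comparisons = bucketed_join(
    txns, periods, "customer_id",
    lambda a, b: b["eff_dt"] <= a["trx_dt"] <= b["exp_dt"],
)
print(len(rows), "matches,", comparisons, "comparisons instead of", len(txns) * len(periods))
```

The date range condition alone could not be bucketed this way. It is the equality on customer_id that makes the shortcut possible.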

Product join example


from A
join   B on substr(A.telephone_no,1,B.prefix_length) = B.telephone_no_prefix

  • There is no partial condition that reduces the number of comparisons
  • Note that neither of the tables needs to be sorted in a specific order.

Unavoidable product joins

  • Non-equality condition
  • Function used (e.g. substr())
  • Dependent expression used (e.g. A.x+B.y = A.z)
  • Cross join: intentional Cartesian product

Avoidable product joins

Data type mismatch

The merge join example above works only if customer_id has the same “style” of data type in tables A and B, since the hash values match only in this case. Say hash(13674) <> hash(‘13674’); however, integer is compatible with decimal, and char is compatible with varchar.
Pay attention to data type consistency during physical data modeling.
  • Use domains to eliminate the possibility of a mismatch
  • Align to the data types already in use when defining temp tables, or use “create table as …” statements
  • If you cannot avoid a mismatch, move the necessary data into temp tables with the proper data types during processing.
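
Why the hashes differ can be illustrated with a toy hash in Python. It is NOT Teradata's real hashing algorithm; it only assumes that the hash is computed from the stored bytes of the value, and that an integer and a character string holding the same digits are stored as different bytes.

```python
import hashlib
import struct


def toy_row_hash(value):
    """Toy hash over the stored bytes of a value (an assumption, not Teradata's algorithm)."""
    if isinstance(value, int):
        raw = struct.pack(">q", value)       # integers: fixed-length binary form
    else:
        raw = str(value).encode("utf-8")     # character data: the text itself
    return hashlib.md5(raw).hexdigest()


print(toy_row_hash(13674) == toy_row_hash("13674"))        # different representation
print(toy_row_hash(13674) == toy_row_hash(int("13674")))   # same type after conversion
```

With mismatching hashes the matching records cannot be located through the hash, so the comparison has to be done the hard way.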

OR condition

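Let’s assume the following join condition:
select ...
from A
join  B on A.col1 = B.Col1
        OR A.Col2 = B.Col2

This returns the same distinct rows without a compulsory product join (note that UNION eliminates duplicate rows, so the two forms are equivalent only if the result is expected to have no duplicates):

select ...
from A
join  B on A.col1 = B.Col1
UNION
select ...
from A
join  B on A.Col2 = B.Col2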
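
The equivalence can be checked on a small example. The Python sketch below uses made-up two-column tables and a naive join helper; it only compares the result sets and says nothing about how Teradata executes either form. Keep in mind that UNION removes duplicate rows, which the set union imitates here.

```python
def join(table_a, table_b, condition):
    """Naive join helper, used only to compare the results of the two forms."""
    return [(a, b) for a in table_a for b in table_b if condition(a, b)]


# Hypothetical rows: (col1, col2)
A = [(1, "x"), (2, "y"), (3, "z")]
B = [(1, "q"), (9, "y"), (3, "z")]

or_result = join(A, B, lambda a, b: a[0] == b[0] or a[1] == b[1])

# One equality condition per branch; the set union drops duplicates like UNION does
union_result = set(join(A, B, lambda a, b: a[0] == b[0])) | set(
    join(A, B, lambda a, b: a[1] == b[1])
)

print(set(or_result) == union_result)
```

The pair (3, "z") - (3, "z") satisfies both conditions, so it appears in both branches, and the duplicate elimination of the UNION is what keeps it from being returned twice.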

Missing/stale statistics

As I mentioned before, the product join is the most effective join between a very large and a really small (a couple of records) table. If the optimizer thinks that a table is pretty small when in fact it is not, it may choose a product join in all good faith, misled by stale or missing statistics.
Define the statistics that let the optimizer determine the size of the record sets to be joined properly, and keep them fresh.

How to find avoidable product joins

It is not trivial to list the avoidable product joins. In practice every product join has to be examined one by one and judged avoidable or not, and if it is avoidable, we have to decide what to do about it.

I strongly recommend using PRISE Tuning Assistant both for finding the product joins and for analyzing whether elimination is possible and necessary:

  • List top consuming queries with product join(s)
  • Check the PROD JOIN steps: which tables are processed that way
  • Check those join conditions for cases described above

What to do if it cannot be avoided?

In this case I recommend trying the decomposition, described here.
It can help reduce the number of comparisons, saving CPU and runtime.

 
Have a successful optimization!
 

The “No more spool space” Error Explained

Why do I get “No more spool space” error?

This is the most familiar error message in the Teradata world:
“Failure 2646 No more spool space”
What does it really mean, and what causes it?
Let’s get back to the basics.

What is spool space?

Spool space is a temporary area that is used to store partial results during query processing, as well as volatile tables. Technically, all free space in the database that is not allocated to PERM data can be used as spool area, since PERM data does not claim that area.

Each database user may have a “spool limit” that prevents the user from allocating more spool area at a time than the limit. Keep in mind that all active sessions of a username share the spool limit.

Teradata is a massively parallel system, therefore the spool limit must be interpreted at AMP level:
E.g. on a 100-AMP system, a user with a 10G spool limit gets 100M spool per AMP
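
The per-AMP interpretation explains why a skewed spool can fail while the total usage is far below the limit. Below is a small Python illustration of the arithmetic; the function names and the usage figures are hypothetical, and the sizes are given in MB.

```python
def spool_limit_per_amp(spool_limit_mb, amp_count):
    """The user's spool limit divided evenly among the AMPs."""
    return spool_limit_mb / amp_count


def amps_over_limit(spool_by_amp_mb, spool_limit_mb):
    """Return the AMPs whose spool usage exceeds their share of the limit."""
    per_amp = spool_limit_per_amp(spool_limit_mb, len(spool_by_amp_mb))
    return [amp for amp, used in enumerate(spool_by_amp_mb) if used > per_amp]


# 100 AMPs, 10G (10,000 MB) limit: 99 AMPs use 20 MB each, one AMP holds a 150 MB "peak"
usage = [20] * 99 + [150]
print(spool_limit_per_amp(10_000, 100))              # share of one AMP in MB
print(sum(usage), amps_over_limit(usage, 10_000))    # total usage and the AMPs over their share
```

The total usage is only about a fifth of the limit, but the last AMP is over its own share, and that is enough for the error.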

What is spool space limit good for?

This limit is a quite simple way to cut off those queries that would suck up too many resources. There is no exact relationship between high spool usage and an ineffective query, but statistically the correlation is high.
In practice: a bad query is kicked off before it consumes too many resources unnecessarily.

No more spool space scenarios

System ran out of spool space

This is the rarest situation, forget about it. It means there is too little free space on the system, but this is usually avoided by defining a “SpoolReserve” database in which no objects are created, so that its area is always available for spool.
If many “big spool limit” users run high-spool queries in parallel, this rare situation can still occur.

Multiple sessions of the user are active together

This is also a quite rare situation. Check the active users in dbc.sessioninfo.

Volatile tables

All existing volatile tables reside in your available spool space, reducing what is left. If you create many of them, especially with skewed distribution, you can fill your spool up. Choose the “primary index” carefully when defining volatile tables, too.

Improper execution plan

These account for more than 90% of the “No more spool space” errors. Let’s see how:
  • “Duplication to all AMPs” of a non-small set of records
    The root cause is typically missing or stale statistics. Either the system thinks that far fewer records will be duplicated than actually are (sometimes billions of records end up in this kind of spool), or it knows the number exactly, but the other branch of the query has even lower quality estimations, so this execution seems to be cheaper.
  • Redistribution of records by a hash that causes skewed distribution
    Check the corresponding blog post: Accelerate skewed joins
  • Retrieving a huge amount of records into spool (locally or redistributed onto the AMPs)
    Specific query structures imply this execution, e.g. a join to a view that “union all”-s big tables.
I suggest using PRISE Tuning Assistant to identify the problem. It clearly displays which execution step falls into the categories above.
Increasing the spool limit will not solve the problem in most cases.

Too big task

Sometimes a given SQL query requires a big spool area to be performed, even with the best execution plan.

This is the only case when raising the spool limit is the solution. But first you have to make sure that the task is really big. PRISE Tuning Assistant is a good tool for identifying this in a minute.

QUICKTIPS – DELETE or CREATE+INSERT?

Delete or Insert? Which one is faster?

The question

If I have to delete rows from a table, I have practically two options:

  • Issue a DELETE command with the desired WHERE / JOIN condition
  • Create an empty table with the same structure, INSERT the remaining records using the negated WHERE condition, and then rename the tables
The first one is simple and straightforward, the second is more complicated and requires more perm space. So is it a question at all?
Yes, it is, because of what happens in the background during execution:
  • DELETE
    It modifies every block that contains record(s) to be deleted. It has to write Transient Journal (transaction log) to enable transaction-related functionality. In addition, data blocks whose fill rate falls below a threshold (e.g. 50%) will be merged. All these may cause a lot of I/O overhead. In some cases the rows are deleted in “random-like” order, therefore the I/O access will be random, which is painfully slow.
  • INSERT
    This case is quite simple: we insert into an empty table, which generates almost no Transient Journal data, and the system only has to write as many data blocks as are needed to store the remaining records, each block once. An additional advantage: the I/O is always “linear” instead of “random-like”, therefore it can be a multi-block operation, which is pretty fast.

Shall we forget about the delete?

The answer

No, absolutely not. DELETE is good enough if the number of records to be deleted is not that high.
We have two factors that will help us in the decision:

  • How many records are to be deleted?
  • How many records will remain after the deletion?

I do not have well-defined universal thresholds for when it is worth choosing one or the other; it depends.
I usually use the CREATE+INSERT+RENAME method if

  • The table is big enough (>> 0.1-0.5 GB or 10-50k rows / AMP)
    and
    I have to delete more than 20..40% of the records
  OR
  • The table has high hash collision (HC) rate on the PI
    and
    The table is MULTISET*
    and
    The delete is a JOIN-ed one

So it depends; you have to measure. PRISE Tuning Assistant can help you build your own rule of thumb, applicable to your environment, by showing the resource consumption in fine detail.

* If the HC rate is high and the table is SET, the INSERT will be very slow because of the duplicate row check

QUICKTIPS – DELETE+INSERT – together or separately?

Shall I run DELETE and INSERT separately?

We often have permanent or temporary tables in our data warehouse which must be purged before they are populated again. We can do the operation within a transaction or separately. It is not the same…

Options – (example)

The examples assume Teradata (BTET) mode, which is auto-commit by default

  A. Together: both statements run in the same transaction
    1. Explicit transaction
      begin transaction;
      delete from TABLE1;
      insert into TABLE1 SELECT …..;
      end transaction;

    2. Multi-statement
      delete from TABLE1
      ;insert into TABLE1 SELECT …..;
  B. Separately: the statements run in separate transactions
    delete from TABLE1 [ALL];
    insert into TABLE1 SELECT …..;

     (Please note that if you run this in Teradata Administrator, aka WinDDI, it will run as one explicit transaction, since the application places a BT; before the script and an ET; after it.)

    In this case the ALL clause is only cosmetic: a DELETE without a WHERE condition is handled internally as “ALL”.

What is the difference?

The result

If everything is OK, we will not see any difference at the end, as far as the result is concerned: the same data will be found in TABLE1 whichever way we choose. The difference shows if the INSERT step fails, e.g. by running out of perm space:

  • In the “A.” cases the whole transaction is rolled back, therefore we will see all the data the table had before the delete.
  • In the “B.” case the data is deleted from the table regardless of the success of the INSERT phase. After a failed INSERT the table will be empty.

There is an additional, transient difference:

  • In the A. case other sessions will never “see” an empty table: up to a point the table shows its “before” version, and when the transaction finishes it switches to the “after” image for other sessions in no time.
  • In the B. case, once the DELETE has finished, the table appears empty until the INSERT statement is committed. So if the INSERT lasts 2 hours, others will face an empty TABLE1 for 2 hours.

So I do not recommend option B. if the “empty table” state is not valid.

The performance

The A. and B. options are executed in significantly different ways. In the B. case the DELETE operation knows that all the data will be dropped, so internally it just marks the data blocks as free instead of clearing their content record by record, BUT only if it is the only statement in the transaction! In this case only the data block administration must be logged (to be able to roll back if needed) instead of every block change. This results in far less Transient Journal activity, besides fewer block I/Os.

Let’s see some measurements.
I’ve put 10M records into a table, and then executed the same DELETE + INSERT statement on it. The results are the following:

  A. (together):   CPU:    9 sec   I/O: 39K
  B. (separately): CPU: >0.1 sec   I/O:  4K

It is worth minding the difference if you have to do it with billions of records.

Monitor ongoing SQLs – progress indication

How can I monitor the progress of my SQL?

We often have SQL operations that last for hours or more, and we do want to know when they are expected to finish. There is no progress bar or similar tool in Teradata, but there are some useful pieces of information we can use for our estimations.

Let’s see how.

Execution steps

SQL statements are broken down into execution steps by the Parsing Engine (PE); these steps are executed sometimes sequentially, sometimes in parallel with each other.
This level of monitoring can easily be done with either the Teradata Viewpoint or the DB_MON tool. It shows which step(s) was/is/will be executed, and some information about them:
  • What the estimated runtime is (for all steps)
  • How long the finished ones ran (only for finished ones)
  • How long the currently executing one(s) has/have been running (only for the running ones)
First we can compare the estimated and the actual runtimes of the finished steps. This gives a ratio we can use for further estimations. Naturally it is far from an exact calculation, just a guideline. The difference comes from estimation bias and server load/priority conditions.
So in the end we can use the sum of the estimations multiplied by the actual/estimated ratio of the finished steps as a very rough guess.
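
The rough guess can be written down as a small calculation. This Python sketch assumes that the steps run one after the other and that the per-step figures were read manually from the monitoring tool; the function name and the numbers are made up.

```python
def estimate_total_runtime(estimated_secs, actual_secs_finished):
    """Scale the optimizer's estimates by the actual/estimated ratio of the finished steps.

    estimated_secs: estimated runtime of every step, in execution order
    actual_secs_finished: measured runtime of the steps finished so far
    """
    finished = len(actual_secs_finished)
    estimated_so_far = sum(estimated_secs[:finished])
    if estimated_so_far == 0:
        raise ValueError("no finished step with a non-zero estimate yet")
    ratio = sum(actual_secs_finished) / estimated_so_far
    return sum(estimated_secs) * ratio


# Hypothetical query with four steps, the first two already finished
estimates = [10, 60, 30, 100]
actuals = [20, 120]
total = estimate_total_runtime(estimates, actuals)
print(total, "sec in total,", total - sum(actuals), "sec still to go")
```

Here the finished steps took twice as long as estimated, so the 200 sec total estimate becomes 400 sec. Treat it as a guideline only, for the reasons mentioned above.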

Within a step

Typically a long-running SQL statement contains a few long-running steps besides more lightweight ones. So the runtime is dominated by a few heavy steps, and in this case the interesting question is: how far has the currently running step progressed?
The answer is not that simple; one must know, more or less, what happens behind the scenes.
Before examining specific cases, I provide some basic information and tricks.

Tricks to know

SQL scripts I typically use

Current spool used by a user (1)

We can query the data dictionary for the spool space currently allocated by a user. It is very important that this figure relates to a user, not a session (it is the sum over all logged-in sessions of that specific user).
The current spool covers all objects currently stored in spool:
  • spools used by the currently executing query step
  • all the other spools generated by previous steps and not yet dropped
  • all existing volatile tables as well.

So be careful with it.

The current spool information is usually interesting from two aspects:
  • Total spool usage of a user, now (a):
    SELECT cast(sum(currentspool) /1024/1024/1024 as decimal(15,3)) Current_spool,
    Current_spool / (hashamp()+1) Average_spool, cast(max(currentspool) /1024/1024/1024 as decimal(15,3)) MaxAMP_spool,
    cast(100.0 - ((Average_spool / nullifzero(MaxAMP_spool)) * 100.0) as decimal(4,1)) Spool_skew
    FROM dbc.diskspace WHERE databasename = '<user>';
  • Spool usage by AMP of a user, now (b):
    SELECT vproc, cast(currentspool /1024/1024 as decimal(15,3)) Current_AMP_spool_MB
    FROM dbc.diskspace WHERE databasename = '<user>'
    ORDER BY 2 DESC;
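
The Spool_skew column of query (a) compares the average spool per AMP with the spool of the most loaded AMP. The same formula is written out below in Python, fed with per-AMP values as returned by query (b); the function name and the sample figures are hypothetical.

```python
def spool_skew(spool_by_amp):
    """100 - average/maximum * 100, rounded to one decimal like decimal(4,1)."""
    average = sum(spool_by_amp) / len(spool_by_amp)
    maximum = max(spool_by_amp)
    if maximum == 0:
        return None                  # mirrors nullifzero(): no spool, no skew value
    return round(100.0 - (average / maximum) * 100.0, 1)


print(spool_skew([100, 100, 100, 100]))   # evenly distributed spool
print(spool_skew([100, 100, 100, 500]))   # one AMP holds most of the spool
```

A value of 0 means a perfectly even distribution, and the closer the value gets to 100, the more the spool is concentrated on a few AMPs.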

Current size of a table (2)

  • Total size of a table (a):
    SELECT databasename,tablename,cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Current_perm_GB
    FROM dbc.allspace
    WHERE databasename = '<db_name>' and tablename = '<table_name>'
    GROUP BY 1,2;
  • Table size by AMP (b):
    SELECT databasename,tablename,vproc,cast(currentperm /1024/1024 as decimal(15,3)) Current_perm_MB
    FROM dbc.allspace
    WHERE databasename = '<db_name>' and tablename = '<table_name>'
    ORDER BY 4 DESC;

Current allocated and free space in a database (3)

  • Total allocated/free space in a database, currently (a):
    SELECT databasename,
    cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Used_perm_GB,
    cast(sum(maxperm-currentperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB
    FROM dbc.diskspace
    WHERE databasename = '<db_name>'
    GROUP BY 1;
  • Sizes in a database by AMP (b)
    Run before the operation:
    CREATE VOLATILE TABLE CPERM as
    (SELECT vproc,currentperm, maxperm-currentperm freeperm FROM dbc.diskspace WHERE databasename = '<db_name>') with data PRIMARY INDEX( vproc) on commit preserve rows;

    Run during the operation any time of interest:
    SELECT databasename,
    cast(sum(a.currentperm-b.currentperm) /1024/1024/1024 as decimal(15,1)) Used_perm_GB,
    cast(sum(a.maxperm-a.currentperm-b.freeperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB
    FROM dbc.diskspace a join CPERM b on a.vproc=b.vproc
    WHERE databasename = '<db_name>' GROUP BY 1

Transient journal size and free space in DBC (4)

When we run an SQL statement that modifies table data, it generates more or less transaction log (called TransientJournal in Teradata). Its size sometimes helps to determine the progress of an operation. Please note that this information is indicative only, since we do not know how much journal will be written; in addition, all transactions use the same journal table, and the purge of the released journal area is an asynchronous task. It is important that the journal fits into DBC, so the following query shows DBC’s free space along with the current journal size. Please note: this query examines the conditions at total level instead of AMP level, so if you run out of DBC space on one AMP the system will fail, even though you see lots of free space in total.
  • How the transient journal is doing:
    SELECT tj.Current_perm_GB TransientJournal_current_GB, dbspc.Free_perm_GB DBC_free_perm_GB FROM
    (SELECT databasename,tablename,cast(sum(currentperm) /1024/1024/1024 as decimal(15,1)) Current_perm_GB
    FROM dbc.allspace WHERE databasename = 'DBC' and tablename = 'TransientJournal' GROUP BY 1,2 ) tj cross join
    (SELECT cast(sum(maxperm-currentperm)/1024/1024/1024 as decimal(15,1)) Free_perm_GB FROM dbc.diskspace WHERE databasename = 'DBC') dbspc;

Within a step, continued…

The progress examination method will differ depending on the type of the operation – unfortunately this is how it goes. The reasons are hidden behind the internal processes of Teradata… 

INSERT INTO xxx SELECT … FROM …

This type of operation typically consists of the following steps:

  1. Retrieve data from the source into a SpoolX
    This step can be examined by checking the executing user’s spool space (query 1.a/1.b). If the data is skewed, it is worth checking the spool at AMP level (1.b), mainly once the growth of the total slows down. This happens because most of the AMPs have already finished, but a few of them still have to process the “peak”.
  2. Merge SpoolX into “xxx” table
    I monitor this step in two phases. First check the target table’s size (query 2.a/2.b), which increases as the merge process advances. The target table’s size will increase by the size of the source spool (except in some special cases like SET table duplicate elimination), so if the target table was not empty before the insert, just watch the difference between the initial and the current size. It grows linearly in time up to a specific point, when the first AMP finishes its work. When an AMP has finished the merge, it releases the corresponding source spool area (which was produced by step 1). So when the growth of the target slows down, switch to checking the executing user’s spool space by AMP (query 1.b), and watch how many of the AMPs have zero or minimal spool occupied (the unfinished ones show the full spool size and do not decrease until they finish). This phase is quite short if the data is not skewed AND the system is well balanced (co-existing node generations…) AND other sessions do not put skewed load on the system; otherwise the finish times of the first and the last AMP may differ significantly. In addition, the last AMP can run extremely long when the hash collision problem escalates. Unfortunately the progress within a single AMP is not visible* (see note below).
    When the target table is not empty at the start of the INSERT, we usually experience a significant increase in the size of the Transient Journal (query 4). This is not a good progress indicator, since we have no guideline for how big the journal will eventually grow, and other transactions use the same journal. In addition, the release of the unused journal area is an asynchronous process, which may lag behind the end of the SQL operation. The purpose of monitoring this object, together with the DBC free space, is to ensure that we do not eat up all the free DBC space. Of course it is rare, but it is very unpleasant when a half-day run is rolled back because we ran out of DBC space…
  3. Commit transaction
    This step usually takes a very short time, except when the table is very skewed. Usually I do not monitor this step.

CREATE TABLE AS …

The basic execution structure (from our aspect) does not differ in major things, except one. The “preparation step(s)”, e.g. the retrieve into a spool, can be measured exactly the same way as in the “INSERT… SELECT…” case, but the MERGE is different: at the time of execution the target object, whose size we could measure, does not yet exist in the data dictionary. However, we are not completely lost: the free space in the target database (query 3.a/3.b) is maintained during the MERGE, so if we measure it and compare it to the amount at the start of the process, we get a good progress indicator of the MERGE step. Take care: it only works if nobody modifies the Maxperm (size) of the target database, and no other process puts data into it or removes data from it during our operation.

  1. Save the initial condition into a volatile table (it is a very small spool)
  2. Query the current increment globally

When the size increase slows down, we can go back to checking the spool size at AMP level, and watch how many AMPs have already finished.

* Note: The progress of a single AMP is not measurable with the provided methods, but we can make some indirect deductions at the MERGE step. If we know the expected ratio of the size or record count of the smallest and the biggest AMP, and we can measure the finish time of the smallest AMP (when the spool size on an AMP begins to fall to zero or a very low value, the MERGE has finished on that AMP), we can extrapolate the runtime of the biggest AMP (this will be the total time). But this works only if hash collision is not significant: MULTISET table, or low “high-mode-frequency” of the PI columns. Otherwise the runtime is dominantly determined by the hash collision level, O(sum(Nx*Nx)), where Nx are the top mode-frequency values of the PI, and not by the number of records.

SET or MULTISET table?


Choosing between SET and MULTISET table

The question is asked far less often than it should be.
The default setting is SET, which is applied if you do not specify, and everything goes fine.
Now let’s look under the hood and see what all this means.
Relational data modeling states that a table must not contain two records in which all corresponding fields are equal. If we design a regular 3rd normal form data model, the statement seems trivial, because the Primary Key (PK) ensures it. (The PK is not to be confused with the PI.)

However, we typically do not define PKs in data warehouses, because of the potentially large storage and processing demands. A PK can be implemented as a:

  • UPI: no overhead, but often the PK columns are not the best choice from the access path point of view
  • USI: it is an extra subtable, with significant perm space and maintenance processing demands.

Without a PK, nothing ensures that all rows are distinct; this function is fulfilled by the “SET” property of a table. It does not allow inserting more than one record with exactly the same content in every column; it simply swallows such records. This is called “DUPLICATE ROW CHECKS” in the documentation.
There can be only one – like in the Highlander movie…

What is the difference between PK and SET?

  • Definition
    • PK: regards a chosen subset of columns
    • SET: covers all columns of a table together
  • Efficiency/Performance
    • PK: only one such record may exist in the table: 1 fetch, 0 or 1 match (few comparisons); only the hash value and the PK columns must be compared (fast)
    • SET: many hash synonyms can exist (many comparisons), and the whole record layout must be compared. This can be painfully slow and, in addition, skewed
  • Result
    • PK: the INSERT/UPDATE statement fails if uniqueness is violated (2801: Duplicate unique primary key error)
    • SET: swallows the duplicate records; there is no error message if an INSERT SELECT command tries to insert multiple identical or already existing records. If we use an INSERT… VALUES statement, or try to UPDATE a record to be identical to another one, we get the 2802 error, similarly to the PK case
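
The cost of the duplicate row check can be modeled in a few lines of Python. The sketch assumes that every incoming row is compared with the rows already stored under the same PI value (standing in for the same row hash) until a match is found. The function and its row format are made up for illustration, and real Teradata internals are certainly more refined.

```python
from collections import defaultdict


def insert_rows(rows, pi_columns, is_set_table):
    """Insert rows one by one; return (stored row count, full-row comparisons made)."""
    table = defaultdict(list)        # PI value -> rows stored under it
    comparisons = 0
    for row in rows:
        bucket = table[tuple(row[c] for c in pi_columns)]
        if is_set_table:
            duplicate = False
            for existing in bucket:  # duplicate row check against the hash synonyms
                comparisons += 1
                if existing == row:
                    duplicate = True
                    break
            if duplicate:
                continue             # the SET table silently swallows the row
        bucket.append(row)
    return sum(len(b) for b in table.values()), comparisons


# 1000 distinct rows that all share the same PI value (a NUPI with high hash collision)
rows = [{"c1": 1, "c2": i} for i in range(1000)]
print(insert_rows(rows, ["c1"], is_set_table=True))
print(insert_rows(rows, ["c1"], is_set_table=False))
```

With N rows per PI value the SET table needs about N*N/2 comparisons, while the MULTISET table needs none, and the gap grows quadratically with N.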

The problem of SET tables was discussed in the post Curing slow INSERTs and CREATE TABLEs I

Why do we use SET tables then?

A SET table takes good care of some cases of unintentional duplication, typically coming from bad join conditions or overlapping history tables.
Sometimes. And sometimes not, if other data (e.g. a running sequence or identity column) makes the rows differ in some columns.
I think using SET tables to eliminate programming mistakes is bad practice. However, knowing that a SET table ensures we will not find any “dumb” duplicates in our base tables gives some peace of mind about our data.

FastLoad and MULTISET

You may be surprised when you try to load into a MULTISET table with the FastLoad utility: it eliminates duplicate rows whether you want it or not. You cannot switch this function off.
The duplicates are simply eliminated; you can only deduce it from the logfile: check the “Total Duplicate Rows” amount.

When should we avoid using SET tables?

The short answer is: for NUPI tables with a high hash-collision rate, both temporary and final tables.

What does it mean?
Check the most frequent values of the table’s PI.
Assume this table:
CREATE TABLE DB1.TBL1
(
c1 INTEGER,
c2 CHAR(10),
c3 VARCHAR(100)
) PRIMARY INDEX (c1,c2)

Query the top frequent values of the primary index (PI) columns together:

SELECT TOP 20 c1, c2, count(*) cnt FROM DB1.TBL1 GROUP by 1,2 ORDER by 3 desc

If the count of the most frequent value is greater than 100, I’d surely choose MULTISET.

In my experience 30-50% of performance problems are based on hash collisions, a significant part of them coming from misunderstood usage of the SET feature and suffering from these side effects.
The other part is the “hidden product join”.
Take care! MULTISET only solves the performance problem of INSERTs, but does not help with the other part: joins, and joined UPDATE or DELETE commands.

If a table is badly skewed, it will have a bad hash collision (HC) level in 99% of the cases. But the reverse is not true: a good skew by itself does not ensure low HC.

+1 trick: If you just want to create a backup copy of a high-HC table, choose MULTISET and it will run much quicker.

The hidden product join

Hiding enemy: the hidden “product” join

Introduction

When we check query performance, one of the first things we search for is the PRODUCT JOIN. Usually this is the component we hate and pursue, because we believe that it is the root problem. This is not always true, but that is a different topic.
Sometimes we find a merge join that is still very slow, even though there is no skewness and the PIs seem OK.

So the question is: how can a simple merge join be so pathetically slow sometimes?

What can be the problem?

The real evil in this case is my favourite: the hash collision.
Let’s imagine the tables below (the columns of transactions split into two tables):

CREATE TABLE TRX_1  (
  User_id INTEGER,
  User_trx_seq INTEGER,
  ColA...
  ColB...
  ...
)
PRIMARY INDEX (User_id)

CREATE TABLE TRX_2  (
  User_id INTEGER,
  User_trx_seq INTEGER,
  ColC...
  ColD...
  ...
  ...
)
PRIMARY INDEX (User_id)


In our example the individual users generally produce an equal number of transactions, so no table skewness is experienced. The INSERTs into these tables are not that fast, but the daily increment is not that big, so this causes no outstanding problem in our imagined case.
We also have all the required statistics.

Let’s take this query:

SELECT
  …
  …
FROM TRX_1 a
JOIN TRX_2   b on a.User_id =b.User_id and a.User_trx_seq = b.User_trx_seq

How does the optimizer think?
The distribution of the tables is excellent, both have the same PI, and in addition the PI column is part of an “AND”-ed join condition. The records to be matched reside on the same AMP, so the only thing left to do is to evaluate the “residual” condition: a.User_trx_seq = b.User_trx_seq.
However, it is not that simple.
In the beginning each user has few transactions, e.g. 10.
What does the system do? Each of the 10 records of User1 in the TRX_1 table must be compared with 10 records from TRX_2. That is 100 comparisons.
What if each user has 1000 transactions? 1 million comparisons for each user!!
The growth is in the order of X^2, where X is the number of records per PI value (more exactly, per PI hash value).
In cases of this type the number of records per PI value grows linearly or faster as time goes by, which means longer and longer execution times.

Please note that the total time required for the join step is determined by the maximum number of records per PI value: all these records fall onto the same AMP, therefore that one will be the slowest.
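
The quadratic growth is easy to reproduce with a few lines of Python. The sketch assumes that within a PI value every record of one table is compared with every record of the other; the function name and the sample volumes are hypothetical.

```python
from collections import Counter


def merge_join_comparisons(pi_values_a, pi_values_b):
    """Return (total comparisons, comparisons for the heaviest PI value)."""
    count_a, count_b = Counter(pi_values_a), Counter(pi_values_b)
    per_value = [count_a[v] * count_b[v] for v in count_a if v in count_b]
    return sum(per_value), max(per_value, default=0)


def user_ids(users, trx_per_user):
    """PI values of a table where each user has the same number of transactions."""
    return [u for u in range(users) for _ in range(trx_per_user)]


print(merge_join_comparisons(user_ids(3, 10), user_ids(3, 10)))       # 10 transactions per user
print(merge_join_comparisons(user_ids(3, 1000), user_ids(3, 1000)))   # 1000 transactions per user
```

The tables grew 100 times, the comparisons 10,000 times. The second number of each result belongs to the heaviest PI value, which is what determines the runtime of the slowest AMP.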

 

What is the solution?

We can experience this problem in two cases:

  • Sources are regular tables of our datamodel
  • Sources (at least one of them) is a Spool during a query execution

In the first case we should consider changing the PI, if we can (to <User_id,User_trx_seq> in our case). This will speed up the INSERTs as well as the joined DELETE and UPDATE operations. The disadvantage is that the simple User_id based joins will degrade: they will probably require redistribution of our TRX_x table. The User_id based filters will turn into full-table scans instead of PI access, which can be handled with secondary or join indices.
Another option is to pre-fetch the needed rows from the TRX_1 and TRX_2 tables into volatile tables having the PI (User_id,User_trx_seq). Note that this means redistributing the tables, so it may be expensive if we cannot apply a fairly selective filter.

In the second case we can store the partial result: put the Spool into a volatile table with the appropriate PI (User_id,User_trx_seq). This will hopefully force the other table/spool to be redistributed to this hash-collision free distribution.
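Why the composite PI helps can be seen in a rough model. The sketch below is Python, not Teradata code: it uses hypothetical data (5 users, 1000 transactions each) and assumes that different key values never collide on the same hash, which a real system cannot fully guarantee.

```python
from collections import Counter


def max_rows_per_key(rows, key):
    """Largest number of rows sharing one distribution key value."""
    return max(Counter(key(row) for row in rows).values())


# Hypothetical data: 5 users with 1000 transactions each.
rows = [(user_id, seq) for user_id in range(5) for seq in range(1000)]

# Distribution key = User_id only (the original PI).
by_user = max_rows_per_key(rows, key=lambda r: r[0])
# Distribution key = (User_id, User_trx_seq), the suggested PI.
by_user_and_seq = max_rows_per_key(rows, key=lambda r: r)

print("worst case comparisons, PI (User_id):", by_user ** 2)
print("worst case comparisons, PI (User_id, User_trx_seq):", by_user_and_seq ** 2)
```

With the composite key every key value holds a single row, so each row has exactly one candidate to match instead of a thousand.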

 

How can we catch those guys?

Unfortunately the “explain plan” has no blinking LED to highlight this situation; it takes some experience to suspect that something is wrong. The main indicator is the cost/input ratio. Check how many records are in the bigger source table and compare that with the step’s impactCPU, while keeping an eye on the skew of the sources and the target. If the skews are low, but the CPU (or I/O) is unreasonably high, you have probably got it!

Does it seem complicated? Maybe yes, but quite simple if using PRISE Tuning Assistant:

[Screenshot: hidden_product_join_pta]

Using Queryband

How to use querybanding in Teradata?

What is queryband?

Teradata is a diligent RDBMS that sometimes runs millions of SQL statements a day. You will see them in DBQL (the DataBase Query Logging area), if it is switched on, but it is hard to find your way around in that mass of queries. How can I find a specific query? What was that query run by? If I want to analyze or modify something, I need to identify the source of the execution as exactly as possible.
Queryband is a labelling facility: it flags the queries so that their source job/report/etc. can be found easily.

Who defines the queryband?

Setting the queryband is usually the responsibility of the query runner:

  • ETL software or solution that executes it
  • OLAP tool that issues it
  • Person, who runs it ad-hoc

How to set the queryband?

Technically it is quite simple: Teradata provides a command to set it:
SET QUERY_BAND = {'<variable1>=<value1>;<variable2>=<value2>;…' / NONE} [UPDATE] FOR SESSION/TRANSACTION;
where:
<variable1>=<value1>;
A queryband can consist of an arbitrary number of “variable”-“value” pairs. Both are string values. Do not forget to put the semicolon after each variable-value pair!
NONE: clears the queryband
UPDATE: if specified, the variables that have been defined previously are updated with the new value, and the others are added with the given value. An empty value string is valid content and will not remove the variable. Please note that deleting a variable is only possible by redefining the queryband without that specific variable.

SESSION/TRANSACTION: the scope of the queryband, i.e. the whole session or only the current transaction.

Where can I check queryband?

The values are reflected in dbc.SessionInfoX.QueryBand and dbc.DBQLogTbl.QueryBand. The following examples show the content:

SET QUERY_BAND='PROJECT=TeraTuningBlog;TASK=QB_example;' FOR SESSION;

(For the logged in session)
SELECT queryband FROM dbc.sessioninfoX WHERE sessionNo=session;
—————————————————-
PROJECT=TeraTuningBlog;TASK=QB_example;

(For previously run queries)

SELECT queryband FROM dbc.dbqlogtbl WHERE Queryid=…;
—————————————————-
=S> PROJECT=TeraTuningBlog;TASK=QB_example;

(For a specific variable, eg. “PROJECT”)
SELECT QB_PROJECT FROM
(
   SELECT CAST((case when index(queryband,'PROJECT=') >0 then substr(queryband,index(queryband,'PROJECT=') ) else '' end) AS VARCHAR(2050)) tmp_PROJECT
     ,CAST( (substr(tmp_PROJECT,characters('PROJECT=')+1, nullifzero(index(tmp_PROJECT,';'))-characters('PROJECT=')-1)) AS VARCHAR(2050)) QB_PROJECT
   FROM dbc.sessioninfoX 
WHERE sessionNo=session
) x ;

—————————————————-
TeraTuningBlog

(Which queries have been run by the “LoadCustomers” project?)

   SELECT a.*, CAST((case when index(queryband,'PROJECT=') >0 then substr(queryband,index(queryband,'PROJECT=') ) else '' end) AS VARCHAR(2050)) tmp_PROJECT
     ,CAST( (substr(tmp_PROJECT,characters('PROJECT=')+1, nullifzero(index(tmp_PROJECT,';'))-characters('PROJECT=')-1)) AS VARCHAR(2050)) QB_PROJECT
   FROM dbc.dbqlogtbl a 
WHERE QB_PROJECT='LoadCustomers'
;
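When the logged queryband strings are processed outside the database, for example in a reporting script, the same extraction is easier to express in a general-purpose language. The following Python sketch is an illustration only: the function name is hypothetical, and it assumes that a logged value carries at most one leading marker of the form "=S> ", as in the DBQL sample above.

```python
import re


def parse_queryband(queryband):
    """Split a queryband string such as 'PROJECT=X;TASK=Y;' into a dict."""
    # Assumption: a leading marker like '=S> ' may precede the pairs.
    text = re.sub(r"^=\w>\s*", "", queryband.strip())
    pairs = {}
    for item in text.split(";"):
        if not item.strip():
            continue  # skip the empty piece after the last semicolon
        name, _, value = item.partition("=")
        pairs[name.strip()] = value
    return pairs


logged = "=S> PROJECT=TeraTuningBlog;TASK=QB_example;"
print(parse_queryband(logged)["PROJECT"])
```

An empty value is kept as an empty string, which mirrors the rule that an empty value does not remove the variable.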

Designing querybanding

We know how to set the queryband, and it is quite easy to build it into the ETL tool, OLAP software and other query running applications. But what variables should we define, and how should we populate them? Below is a best practice, but it is just a recommendation and can be modified according to your taste.

First of all, some things to mind:

  • Use short variable names and values, since they will be logged in each DBQL record
  • Define a consistent structure across all source systems, so that the data is easy to analyze
  • Record as much detail as you need, no more, no less. Define unique values for the items you later want to differentiate. With a lookup/hierarchy table you can easily merge what you need, but you can never drill down into what is already aggregated.

I recommend these variables to be defined:

  • SYS: ID of the system that ran the query, 3 characters at most, like INF (Informatica), MST (Microstrategy), SLJ (SLJM), BO (Business Objects), AH (ad-hoc query tool)
  • ENV: P (Production) / Tx (Test x) / Dx (Development x), the identifier of the environment. x may be omitted if it does not matter
  • JOB: Which job or report contains that specific query (the name of it)
  • STP: (Step) Which SQL script, or other sub-structure does the query belong to (name of it)
  • VER: Version of the JOB. This will determine the version of the script (if available)
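A query running application could assemble the queryband from these five variables before opening its session. The Python sketch below is one possible way to do it; the function names and the sample values are hypothetical, and the character checks are a precaution of this sketch rather than a documented Teradata rule.

```python
def build_queryband(system, env, job, stp, ver):
    """Build 'SYS=..;ENV=..;JOB=..;STP=..;VER=..;' from the recommended variables."""
    if len(system) > 3:
        raise ValueError("SYS should be at most 3 characters long")
    pairs = [("SYS", system), ("ENV", env), ("JOB", job), ("STP", stp), ("VER", ver)]
    for name, value in pairs:
        # Precaution: these characters would break the pair structure
        # or the quoted SQL literal.
        if any(ch in value for ch in ";='"):
            raise ValueError(f"{name} contains a reserved character: {value!r}")
    return "".join(f"{name}={value};" for name, value in pairs)


def set_queryband_statement(queryband):
    """Wrap the queryband into the SET QUERY_BAND command for the session."""
    return f"SET QUERY_BAND = '{queryband}' FOR SESSION;"


# Hypothetical job and step names, for illustration only.
qb = build_queryband("INF", "P", "LoadCustomers", "stage_load", "1.2")
print(set_queryband_statement(qb))
```

Keeping the variable order fixed makes the logged strings easier to compare and to parse later.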

Using Partitioned Primary Index II

How to choose partitioned primary index (PPI)

This post is an extension of my basic PPI post.

What is the difference between NPPI and PPI?

  • NPPI: Non partitioned primary index
    The good old regular PI. The rows are distributed by HASHAMP(HASHBUCKET(HASHROW(PI))), and ordered by HASHROW(PI), nothing special
  • PPI: Partitioned primary index
    Distribution is the same, but the ordering is different: <PartitionID><HASHROW(PI)>. The <PartitionID> is a value stored in each row, taking 2 or 8 bytes (see below).

The only difference is the storage order of the records (and the 2/8 bytes of overhead).

What is PPI good for?

The partitioning feature, like in many other databases, usually solves performance issues: it eliminates needless work in specific situations.
  • SELECT
    Eligible “where conditions” result in serious partition elimination, which means that usually only a small fraction of the table has to be scanned instead of the whole one.
  • INSERT
    Check the storage order of NPPI tables: the records are in “hash” order, so if I insert a series of records into a Teradata table, they will land in widely scattered data blocks. If the table is big enough, my new, say, 1000 records will get into ~1000 different data blocks, which means 1000 expensive “random writes”. However, if my 1000 records go into a PPI table and they have the same PartitionID, they will with high probability get into far fewer than 1000 data blocks. In real life situations we will often write to contiguous data blocks with much cheaper “sequential writes”.
  • DELETE
    Same as INSERT
  • BACKUP
    Teradata allows archiving only one or more partitions, which saves a lot of time and tape. Older data in transaction tables usually does not change, so it is unnecessary to back it up every time

“Good-to-know”s

Costs of partitioning

Like all good things in the world, partitioning has trade-offs too:

  • Extra 2/8 bytes of storage space per record
    Depending on the maximal number of partitions. See the “Number of partitions” chapter
  • Slower “SAMPLE” scans
    Proper random sampling is more complex, since the physical storage order correlates with the partitioning value
  • Extra sort operations / Sliding window joins
    Joining to a table that has an NPPI, or a PPI with a not exactly identical definition, results in a preparatory “sort” step or leads to a “sliding window merge join”, which is technically N x M merge joins between the partitions of TableA and TableB.

Number of partitions

How many partitions should I have?
How many partitions do I have?
What does an empty partition look like?
These are all interesting questions, so let’s analyze the Teradata implementation.

Partition is not an object, it is just a calculated (and stored) value in the record, which will determine the physical storing order of the record. A partition will not allocate space, an “empty partition” technically means that no record exists with that partition’s partitionID, nothing else.
How many partitions do I have in the table? As many as there are different PartitionIDs in the existing records, which depends on the occurring values of the partitioning column.
How many partitions can I have in the table? It depends on the table definition. One must use the RANGE_N or the CASE_N function to define the PartitionID calculation. Its definition unambiguously determines how many different PartitionID values may occur. Up to V13.10 at most 65535 partitions are allowed; from V14.00 we can have as many as 9.2 quintillion (8-byte PartitionID). The table definition cannot be altered to switch between the 2 and 8 byte layouts.

What is the drawback of having many partitions? The sliding-window merge join. Consider including the partitioning column in the PI if possible (otherwise PI based filtering will cause as many accesses as there are partitions).

What happens with the out-of-range records?

The PPI definition can contain the NO RANGE and NO CASE clauses. Each of them defines an extra partition for the records that fall outside the defined ranges or cases. It can be a hidden trap: if you forget to maintain the date partition definition of a transaction table, from some point on all new records will get into this partition. The partition keeps growing fatter, and the queries keep getting slower somehow…
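The trap can be reproduced with a simplified model of a daily RANGE_N definition. The Python sketch below is not the Teradata implementation: the function name and the date range are hypothetical, and the numbering of the two catch-all partitions (NO RANGE right after the last daily range, UNKNOWN after that) is an assumption of this sketch.

```python
from datetime import date


def range_n_daily(value, first_day, last_day):
    """Return a 1-based partition number for a daily range definition."""
    days = (last_day - first_day).days + 1
    if value is None:
        return days + 2  # UNKNOWN partition (assumed numbering)
    if first_day <= value <= last_day:
        return (value - first_day).days + 1
    return days + 1  # NO RANGE partition (assumed numbering)


first, last = date(2010, 1, 1), date(2020, 12, 31)

# Inside the defined range every day gets its own partition.
print(range_n_daily(date(2010, 1, 1), first, last), range_n_daily(date(2010, 1, 2), first, last))

# After the last defined day everything collapses into one partition.
late = [date(2021, 1, 1), date(2021, 6, 15), date(2023, 3, 3)]
print({range_n_daily(d, first, last) for d in late})
```

Once the load date passes the end of the definition, every new day lands in the same partition, so date filters on recent data no longer eliminate anything inside it.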

Multi level partitioning

This is a good trick. One can define partitioning “hierarchically”, which is simply a “Cartesian product” of the partitions at each level; the result is a single PartitionID. In case of 2-byte partitioning, the size of the “Cartesian product” must not exceed 65535.

What is sensational in the Teradata implementation of multi-level PPI? You can filter on the lower level partitioning key(s) only, and partition elimination will still happen. How? Teradata calculates all possible combinations and produces the list of PartitionIDs to be scanned. Excellent.
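The idea of the combined PartitionID and the elimination on a lower level can be sketched as follows. This is a Python model, not Teradata internals: the function names are hypothetical, and the numbering scheme (levels folded like digits of a mixed-radix number, 1-based) is an assumption chosen for illustration.

```python
from itertools import product


def combined_partition_id(level_ids, level_sizes):
    """Fold 1-based per-level partition numbers into one combined number."""
    combined = 0
    for pid, size in zip(level_ids, level_sizes):
        if not 1 <= pid <= size:
            raise ValueError(f"partition {pid} is outside 1..{size}")
        combined = combined * size + (pid - 1)
    return combined + 1


def partitions_to_scan(level_sizes, fixed):
    """List the combined IDs that match a filter fixing only some levels.

    fixed maps a level index (0 = top level) to the partition number
    selected by the filter on that level.
    """
    choices = [
        [fixed[i]] if i in fixed else range(1, size + 1)
        for i, size in enumerate(level_sizes)
    ]
    return [combined_partition_id(ids, level_sizes) for ids in product(*choices)]


# Hypothetical layout: 12 partitions on level 1, 4 partitions on level 2.
sizes = (12, 4)
# Filtering only on the lower level (partition 3) still eliminates 3/4 of the table.
print(partitions_to_scan(sizes, fixed={1: 3}))
```

Only 12 of the 48 combined partitions remain on the scan list, one for each value of the unfiltered top level.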

Partitioning granularity

The next good question is: how fine should I define partitioning?
It depends. Basically I’d distinguish two main cases:
  • “Temporal” (date) partitioning
    The best partition size is the day. Most of the filtering is on day level, and we have ~365 days a year, which is not too many partitions for your lifetime. If we partition by month, the partition elimination ranges are coarser, and we still have 12 partitions a year, which is also too many in case of a PI-NPPI join.
  • All others
    It really depends on the goal and the value demographics. It is good to align the partitioning with the filtering pattern (what the frequent relevant ‘where’ conditions are).

Hope it helped, please ask, if something is missing or confusing.

Storing date&time columns

How to store date and time info effectively

Introduction

Data warehouse databases usually contain a significant amount of date/time information. The physical modeling technique can seriously influence their storage space and usability.

Aspects

Date/time information can be stored in different ways/data types, each of which has its own peculiarities.
Basic options:

  • Joint storage: Timestamp
    • Timestamp(n), where n is the number of fractional digits of seconds
  • Separate storage: Date & Time
    • Date column + Time column

Storage space

The data types require the following space (if uncompressed)

Type                                       Space
Date                                       4 bytes
Integer time (integer format '99:99:99')   4 bytes
Time(n)                                    6 bytes, independent of n*, where n:[0..6]
Time(n) with time zone                     8 bytes, independent of n*, where n:[0..6]
Timestamp(n)                               10 bytes, independent of n*, where n:[0..6]
Timestamp(n) with time zone                12 bytes, independent of n*, where n:[0..6]

* n is the number of fractional digits of seconds
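The sizes above can be turned into a quick comparison of the storage options. The Python sketch below only does the arithmetic; the row count is a hypothetical figure, and compression is ignored.

```python
# Uncompressed sizes in bytes, taken from the list above.
BYTES = {"date": 4, "integer_time": 4, "time": 6, "timestamp": 10}


def bytes_per_row(*types):
    """Total uncompressed size of the given column types in one row."""
    return sum(BYTES[t] for t in types)


layouts = {
    "Timestamp(n)": bytes_per_row("timestamp"),
    "Date + Time(n)": bytes_per_row("date", "time"),
    "Date + Integer time": bytes_per_row("date", "integer_time"),
}

rows = 1_000_000_000  # hypothetical transaction table size
for name, size in layouts.items():
    print(f"{name}: {size} bytes per row, {size * rows / 1e9:.0f} GB per date-time column pair")
```

Separate storage only saves space when the time part is an integer; Date plus Time(n) takes the same 10 bytes as a Timestamp(n).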

Usage complexity

Teradata is not the most ergonomic system for handling date-time data. Operations on these data types are typically tricky and sometimes hide traps (try add_months('2014-01-31',1) ). Conversion works differently for a date and for a timestamp, so decisions must be made by considering both storage and usage aspects.
  • Conversions
    • Date: implicit conversions work, easy and comfortable
    • Integer time: works fine, but insert-select will lose the formatting, only the integer value will remain
    • Time(n): implicit conversion to string does not work. This fails: select cast('2014-01-31' as date) || ' ' || cast('12:00:00' as time(0))
    • Timestamp(n): brrr. Different precisions will not convert automatically either. I don’t like it.
  • Filtering: comparing date/datetime values with < / <= /between operators
    • Joint storage (timestamps)
      Straightforward, just use the values – if they are equivalent data types
    • Separate storage
      You have to convert to a “joint” format first, either a string or a timestamp
  • Arithmetic
    • Date: ok, adding a constant, subtracting dates work fine
    • Integer time: do not use arithmetic, results are bad!
    • Time(n): interval types accepted. Not really comfortable, eg max 99 second long interval is accepted (V13.10)
    • Timestamp(n): same as Time(n)

    Regarding arithmetic, I suggest building your own UDF library; it will make your life easier.
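Why plain arithmetic on integer times gives bad results, and what a helper library has to do instead, can be shown in a few lines. The sketch below is Python rather than a Teradata UDF, the function names are hypothetical, and it wraps around at midnight without touching the date part.

```python
def int_time_to_seconds(hhmmss):
    """Convert an integer time such as 125959 (12:59:59) to seconds since midnight."""
    hours, rest = divmod(hhmmss, 10000)
    minutes, seconds = divmod(rest, 100)
    if not (0 <= hours < 24 and 0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"not a valid integer time: {hhmmss}")
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_int_time(total_seconds):
    """Convert seconds since midnight back to the integer HHMMSS form."""
    total_seconds %= 86400  # wrap around at midnight
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours * 10000 + minutes * 100 + seconds


def add_seconds(hhmmss, delta):
    """Add a number of seconds to an integer time the safe way."""
    return seconds_to_int_time(int_time_to_seconds(hhmmss) + delta)


print(125959 + 1)              # 125960, which is not a valid time
print(add_seconds(125959, 1))  # 130000, i.e. 13:00:00
print(130000 - 125959)         # 4041, although the real difference is 1 second
```

The same conversion to seconds is needed before subtracting two integer times.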

    Recommendations

    Choosing data type

    I recommend choosing data types depending on the table type and the usage purposes.
    I differentiate “transaction” and “all other” table types, because transaction tables usually take up most of the PERM space, while the others are many in number but take up “negligible” space.

    • Transaction
      • Separate storage
      • Integer time
    • All others
      • Joint type (timestamp)

    Saving space – store “delta”

    The biggest tables in data warehouses are the “transaction tables” (call/purchase/transfer/etc. transactions, depending on the industry), and most of them contain several date fields, most of which are strongly correlated. Let me explain what I mean. Let’s assume a call record (telco) that has the following date (and time) columns:
    • Channel_seizure
    • Call_start
    • Call_end
    • Bill_cycle_start
    • Bill_cycle_end
    The date components of the first three columns are the same in 99% of the records, and the last two differ from the first ones by at most 30 days.

    My recommendation is the following:
    • Choose a “primary date”
      Must be NOT NULL, and is typically used as the partitioning key as well, since it is the most frequent date filtering condition. In our case this will be Call_start
    • Choose separate date-time storage
      E.g. Date and Integer time, as this combination requires the least space
    • Store the non-primary dates as deltas, multi-value compressed
      Compute it in the load process, like this:
      Call_end_delta := Call_end-Call_start
    • Compress the “delta” columns
      They will show low deviation and be highly compressible; use PRISE Compress Wizard
    • Convert back to absolute dates in the view layer
      Call_start + Call_end_delta as "Call_end"
    Example:

    CREATE TABLE T2000_CALL_TRX
    (

    Call_start_date Date NOT NULL,
    Call_end_date_delta Integer COMPRESS (0)

    ) PRIMARY INDEX (…,Call_start_date)
    PARTITION BY RANGE_N ( Call_start_date BETWEEN date '2010-01-01' AND date '2020-12-31' EACH interval '1' day, NO RANGE, UNKNOWN);

    CREATE VIEW V2000_CALL_TRX
    as
    SELECT

    , Call_end_date_delta + Call_start_date as "Call_end_date"

    FROM
    T2000_CALL_TRX

    ;
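The round trip between the load process (absolute date to delta) and the view layer (delta back to absolute date) can be checked with a small model. The Python sketch below is an illustration with hypothetical function names and sample dates; it treats a missing non-primary date as a NULL delta.

```python
from datetime import date, timedelta


def to_delta(primary, other):
    """Load side: store a non-primary date as a day offset from the primary date."""
    return None if other is None else (other - primary).days


def from_delta(primary, delta):
    """View side: rebuild the absolute date from the primary date and the offset."""
    return None if delta is None else primary + timedelta(days=delta)


call_start = date(2014, 1, 31)                 # primary date, NOT NULL
call_end_same_day = date(2014, 1, 31)          # the typical case
call_end_next_day = date(2014, 2, 1)           # call running over midnight

for call_end in (call_end_same_day, call_end_next_day, None):
    delta = to_delta(call_start, call_end)
    print(delta, from_delta(call_start, delta))
```

In the typical case the stored delta is 0, which is exactly the value listed in the COMPRESS clause of the table definition above.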