Using Join Index

How to use Join Index in Teradata

About Join Index (JI)

Join index is most like a “materialized view”, say it is a stored result of an SQL SELECT , like a table: you can define the primary index (PI) of the stored result.

What are the main differences between a JI and a Secondary Index?

  • Different internal structure and purposes
  • Appears as separate object (not a sub-table) in the DBC tables/views
  • Can reside in different database than the base table(s)
  • Maintenance is logged in separate DBQL STEP, therefore cost can be easily measured

What is the difference between JI and a regular table in which I can store the same query result? 

  • You can’t access it directly from an SQL query
  • Optimizer automatically decides to be used or not:
    • By logic: if the JI can support the query with its data content
    • By cost: if usage of JI will result a “cheaper” execution (always collect statistics!)
  • Maintained automatically if the content of the base table(s) change
  • “Locks” the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTER the indexed column(s).

JI types

The Join Index types I will list below are not differentiated by SQL phrase, but the structure of the SQL SELECT used in the JI definition.

They can be combined also in reasonable ways, eg. <single table – aggregate – sparse> or <multi table – aggregate>, etc.

Let’s take these base tables for our examples:
CREATE TABLE TBL_A
(
  Col1 integer
, Col2 integer
, Col3 integer
PRIMARY INDEX (Col1)
;
CREATE TABLE TBL_B
(
  Col1 integer
, Col2 integer
PRIMARY INDEX (Col1)
;

Single table JI

This is the most simple case of a join index. We include only one table, and typically choose different PI than the base table has. There are two significantly different kinds of usage:

  • Non-covering
    We select only the filtering column(s) (those will be the PI of the JI also) and the “Rowid” pseudo column in the JI definition. In this case the filter is strongly selective, and the rowids will be put to a spool to be joined to the base table’s appropriate records. The JI can be very small this way, but note that we have an additional “join phase”.
  • Covering
    The JI is selecting all columns (or all columns required by the SQL to be supported) . That means that the base table is not required to satisfy the query at all. This will result very fast operation. This case is typically used for eliminating frequent table redistributions of some “central” tables.
This example shows a non-covering index for the query below:
create join index JI_1
as
SELECT Col1,Col2
FROM TBL_A
PRIMARY INDEX(Col2)

;

select Col3 from TBL_A where Col2=1;

 Multi table JI

This kind of JI is for accelerating frequent join statements. Technically it stores the result of a join. It can cover all the columns of just store the key-pairs, or somewhere between.

create join index JI_2
as
SELECT a.Col3,b.Col1
FROM TBL_A a

join     TBL_B b on a.Col2=b.Col2
PRIMARY INDEX(Col3)
;

Aggregate JI

The JI’s SELECT contains GROUP BY clause. This case is for caching frequent aggregations. Typically can be very useful for supporting those OLAP applications, that do not have internal aggregation-caching methods Teradata’s optimizer is quite clever, because it can recognize “intermediate” aggregate JIs for further aggregation instead using the base table. Example:

create join index JI_3
as
SELECT Col1,Col2,sum(Col3) X
FROM TBL_A

GROUP BY 1,2
PRIMARY INDEX(Col2)
;

All three SELECTs can be served from the JI_3:
SQL1: select Col1        , sum(X) from TBL_A group by 1;
SQL1: select Col2        , sum(X) from TBL_A group by 1;

SQL1: select Col1,Col2, sum(X) from TBL_A group by 1,2;

Sparse JI

Thees JIs contain where clause, say the indexing is not for the whole table. If the where condition of the JI is a logical subset of the supported SQL’s where condition than the JI can support the query.
Typical usage is on the transactional tables: we have frequent accessed on a transaction table by Customer_id, but have PI of Trx_id. We can observe that 90% of the SELECTs fall onto the last 30 days.
We can put a sparse non covering single table JI on the table with PI:Customer_id
90% of the selects will finish in no time, and 10% of the queries will result in a full table scan, meanwhile our JI size remains pretty small: we index only 30 days instead of eg. 365 days.
Please note that where condition of a sparse JI can not contain “changing” values like current_date. That means JI must be regularly redefined if a moving window sparse JI is to be maintained.

create join index JI_4
as
SELECT Col1,Col2,Col3 X
FROM TBL_A

WHERE Col3 between 1 and 10
PRIMARY INDEX(Col2)
;

Serveable query:
select Col1 from TBL_A where Col2=20 and Col3=4;
Non serveable query:
select Col1 from TBL_A where Col2=20 and Col3=11;

When will join index help?

Join indices typically serve two purposes (or the combination of them):

  • Store the result of frequently produced (part-)results
  • Provide alternative access path to the data
Practical examples:
  • Frequent aggregation or join operation
  • Frequent redistribution of a table
  • Non primary index based filtering on a table (causing full table scan)
  • Non primary index based join of a table (causing redistribution)

Important things about JIs

Statistics

Always collect statistics on the PI of the JIs, it is essential for pushing optimizer to use them

Choosing PI

Te PI of the JI will be the most effective access path to the JI’s records, like in case of tables, use the regular PI choosing methodology. Mind frequent access (bí which column(s) are filtered or joined), distribution(skewness) and hash collision.

DBQL analysis IV – Monitor index usage

Analyze “Index usage” in DBQL

Please note that the solutions found in the article works on the DBQL logs, which covers only that users’ activity, for whom the logging is switched on. “Object” option in DBQL is required “on” to use the scripts provided.

About Indices

Teradata provides possibility of creating INDEX objects for allowing alternative access path to the data records. They are quite different structures then the good old B*Tree or Bitmap indices (common in non-MPP RDBMSes)
The main goal of the index objects is to improve data access performance in exchange for storage and maintenance processing capacity.
Having indices is not free (storage and maintenance resources), those ones that bring not enough gain is better being dropped.

Index usage footprint in DBQL

If a query uses an index for accessing the data it is declared in the “explain text”, and also registered in the DBQL: dbqlobjtbl. The appearance type of indices depend on the type of index. In case of primary/secondary index only IndexId and the columns of the index are registered, while join/hash indices appear like a regular table: at database, object and column levels all.

If the join/hash index is covering, the base table may not be listed in the DBQL objects, therefore be careful if analyze table usage from DBQLobjtbl.
I recommend to use PRISE Tuning Assistant to easily find all type of access to a table data.

Examples

Primary index access (2 columns):
Plan:
  1) First, we do a single-AMP RETRIEVE step from d01.z by way of the
     primary index “d01.z.i = 1, d01.z.j = 1” with no residual

DBQLobjtbl:
ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
D01 Z i 1 Idx
D01 Z j 1 Idx

Secondary index access (1 column):
Plan:
  3) We do an all-AMPs RETRIEVE step from d01.x by way of index # 4
     without accessing the base table “d01.x.j = 1” with no residual

DBQLobjtbl:
ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
D01 X 4 Idx

Join index:
create join index d01.ji as sel b from d01.q primary index (b);
select b from d01.q where a=1;
Plan:
  1) First, we do a single-AMP RETRIEVE step from D01.JI by way of the
     primary index “D01.JI.a = 1” with no residual conditions into

DBQLobjtbl:
ObjectDatabaseName ObjectTableName ObjectColumnName ObjectNum ObjectType
D01 JI 0 Jix
D01 JI a 1 Idx
D01 JI b 1026 Col

Please note that

  • ObjectNum identifies the index (refers to dbc.indices.Indexnumber)
  • As many rows appeas as many columns the index has
  • Eg. in V13.10 Teradata Express the single column secondary index lacks the column name in the logs

Analyzing DBQL data 

Prepare data

CREATE VOLATILE TABLE DBQLIdx_tmp1 AS (
SELECT databasename,tablename,indexnumber,indextype,uniqueflag,Columnposition,Columnname
, SUM (1) OVER (partition BY databasename,tablename,indexnumber ORDER BY Columnname ROWS UNBOUNDED PRECEDING) ABCOrder
FROM dbc.indices WHERE indextype IN ('K','P','Q','S','V','H','O','I')
) WITH DATA
PRIMARY INDEX (databasename,tablename,indexnumber)
ON COMMIT PRESERVE ROWS


CREATE VOLATILE TABLE DBQLIdx_tmp2 AS (
WITH RECURSIVE idxs (Databasename,Tablename,Indexnumber,Indextype,Uniqueflag,Indexcolumns,DEPTH)
AS (
SELECT
databasename,tablename,indexnumber,indextype,uniqueflag,TRIM (Columnname) (VARCHAR (1000)),ABCorder
FROM DBQLIdx_tmp1 WHERE ABCorder = 1
UNION ALL
SELECT
b.databasename,b.tablename,b.indexnumber,b.indextype,b.uniqueflag,b.Indexcolumns||','||TRIM (a.Columnname),a.ABCOrder
FROM DBQLIdx_tmp1 a
JOIN idxs b ON a.databasename = b.databasename AND a.tablename = b.tablename AND a.indexnumber = b.indexnumber AND a.ABCOrder = b.Depth + 1
)
SELECT databasename db_name,tablename table_name,indextype,uniqueflag,indexcolumns
,indexnumber
,CASE WHEN uniqueflag = 'Y' AND indextype IN ('P','Q','K') THEN 'UPI'
WHEN uniqueflag = 'N' AND indextype IN ('P','Q') THEN 'NUPI'
WHEN uniqueflag = 'Y' AND indextype IN ('S','V','H','O') THEN 'USI'
WHEN uniqueflag = 'N' AND indextype IN ('S','V','H','O') THEN 'NUSI'
WHEN indextype = 'I' THEN 'O-SI'
ELSE NULL
END Index_code
FROM idxs
QUALIFY SUM (1) OVER (partition BY db_name,table_name,indexnumber ORDER BY DEPTH DESC ROWS UNBOUNDED PRECEDING) = 1
) WITH DATA
PRIMARY INDEX (db_name,table_name)
ON COMMIT PRESERVE ROWS
;

UPDATE a
FROM DBQLIdx_tmp2 a,DBQLIdx_tmp2 b
SET Index_code = 'PK'
WHERE a.db_name = b.db_name
AND a.table_name = b.table_name
AND a.Index_code = 'UPI'
AND a.indextype = 'K'
AND b.Index_code = 'NUPI'
AND b.indextype <> 'K'
;

Report: How many times have the indices been used?

You may need to modify the script:
  • Date filtering (use between for interval)
  • Online/archived DBQL: use commented section for archived

SELECT
  COALESCE(usg.objectdatabasename,idx.db_name) db
, COALESCE(usg.objecttablename,idx.table_name) tbl
, COALESCE(usg.ObjectNum,idx.IndexNumber) idxNo
, idx.Index_code
, idx.Indexcolumns Index_columnss
, coalesce(usg.drb,0) Nbr_of_usg
FROM
(SELECT objectdatabasename,objecttablename,objecttype,ObjectNum,COUNT (*) drb

—  Archived DBQL
—  FROM dbql_arch.dbqlobjtbl_hst WHERE logdate = ‘2014-02-20’ (date)

—  Online DBQL
  FROM dbc.dbqlobjtbl WHERE  cast(collecttimestamp as char(10)) = ‘2014-02-20’
AND objecttablename IS NOT NULL
AND ((objecttype IN (‘JIx’,’Hix’) AND objectcolumnname IS NULL)
OR
(objecttype IN (‘Idx’))
)
AND objectnum <> 1
GROUP BY 1,2,3,4
) usg
FULL OUTER JOIN
( SELECT db_name,table_name,Indextype,Uniqueflag,indexcolumns,Indexnumber,Index_code 

  FROM DBQLIdx_tmp2 a WHERE indextype NOT IN (‘P’,’Q’,’K’)
union all
SELECT databasename,tablename,tablekind,cast(null as char(1))

              ,cast(null as varchar(1000)),cast(null as smallint)
,case when tablekind=’I’ then ‘JIX’ else ‘HIX’ end from dbc.tables where tablekind in (‘I’,’N’)
) idx ON usg.objectdatabasename = idx.db_name
AND usg.objecttablename = idx.table_name
AND ((usg.ObjectNum = idx.IndexNumber) or usg.objecttype IN (‘JIx’,’Hix’))
ORDER BY 6 DESC
;