How to use Join Index in Teradata
About Join Index (JI)
Join index is most like a “materialized view”, say it is a stored result of an SQL SELECT , like a table: you can define the primary index (PI) of the stored result.
What are the main differences between a JI and a Secondary Index?
- Different internal structure and purposes
- Appears as separate object (not a sub-table) in the DBC tables/views
- Can reside in different database than the base table(s)
- Maintenance is logged in separate DBQL STEP, therefore cost can be easily measured
What is the difference between JI and a regular table in which I can store the same query result?
- You can’t access it directly from an SQL query
- Optimizer automatically decides to be used or not:
- By logic: if the JI can support the query with its data content
- By cost: if usage of JI will result a “cheaper” execution (always collect statistics!)
- Maintained automatically if the content of the base table(s) change
- “Locks” the base table(s) against DROP, RENAME, MLOAD, FASTLOAD, RESTORE and ALTER the indexed column(s).
They can be combined also in reasonable ways, eg. <single table – aggregate – sparse> or <multi table – aggregate>, etc.
Let’s take these base tables for our examples:
CREATE TABLE TBL_A
, Col2 integer
, Col3 integer
) PRIMARY INDEX (Col1)
CREATE TABLE TBL_B
, Col2 integer
) PRIMARY INDEX (Col1)
Single table JI
This is the most simple case of a join index. We include only one table, and typically choose different PI than the base table has. There are two significantly different kinds of usage:
We select only the filtering column(s) (those will be the PI of the JI also) and the “Rowid” pseudo column in the JI definition. In this case the filter is strongly selective, and the rowids will be put to a spool to be joined to the base table’s appropriate records. The JI can be very small this way, but note that we have an additional “join phase”.
The JI is selecting all columns (or all columns required by the SQL to be supported) . That means that the base table is not required to satisfy the query at all. This will result very fast operation. This case is typically used for eliminating frequent table redistributions of some “central” tables.
select Col3 from TBL_A where Col2=1;
Multi table JI
create join index JI_2
FROM TBL_A a
join TBL_B b on a.Col2=b.Col2
The JI’s SELECT contains GROUP BY clause. This case is for caching frequent aggregations. Typically can be very useful for supporting those OLAP applications, that do not have internal aggregation-caching methods Teradata’s optimizer is quite clever, because it can recognize “intermediate” aggregate JIs for further aggregation instead using the base table. Example:
create join index JI_3
SELECT Col1,Col2,sum(Col3) X
GROUP BY 1,2
All three SELECTs can be served from the JI_3:
SQL1: select Col1 , sum(X) from TBL_A group by 1;
SQL1: select Col2 , sum(X) from TBL_A group by 1;
create join index JI_4
SELECT Col1,Col2,Col3 X
WHERE Col3 between 1 and 10
select Col1 from TBL_A where Col2=20 and Col3=4;
Non serveable query:
select Col1 from TBL_A where Col2=20 and Col3=11;
When will join index help?
Join indices typically serve two purposes (or the combination of them):
- Store the result of frequently produced (part-)results
- Provide alternative access path to the data
- Frequent aggregation or join operation
- Frequent redistribution of a table
- Non primary index based filtering on a table (causing full table scan)
- Non primary index based join of a table (causing redistribution)
Important things about JIs
Always collect statistics on the PI of the JIs, it is essential for pushing optimizer to use them
Te PI of the JI will be the most effective access path to the JI’s records, like in case of tables, use the regular PI choosing methodology. Mind frequent access (bí which column(s) are filtered or joined), distribution(skewness) and hash collision.