SET or MULTISET table?

Set

Choosing between SET and MULTISET table

The question is asked far less often than it should be.
The default setting is SET, which is applied if you do not specify, and everything goes fine.
And now, let’s see down there what all these things mean.
Relational datamodeling states that there must not be two records in a table, which have all corresponding fields equal. If we design a regular 3th normal form datamodel, the statement seems to be trivial, because Primary Key (PK) will ensure it. (PK not to be mixed up with the PI)

However we typically do not define PKs in the data warehouses, because of the possibility large storage and processing demands. PK can be interpreted as an:

  • UPI: no overhead, but often PK columns are not the best choice from access path point of view
  • USI: it is an extra subtable with its significant perm space and processing demands of maintenance.

Without a PK, nothing can ensure that all rows are distinct – this function is fulfilled by the “SET” property of a table. It will not allow to insert more records with absolutely equivalent layout (column contents), it simply swallows those kind of records. This is called “DUPLICATE ROW CHECKS” in the documentations.
There can be only one – like in the Highlander movie…

What is the difference between PK and SET?

  • Definition
    • PK: regards chosen subset of columns
    • SET: covers all columns of a table together
  • Efficiency/Performance
    • PK: only one similar record may exist int the table: 1 fetch, 0 or 1 match (few comparisons) only hash value and the PK columns must be compared (fast)
    • SET: many hash synonyms can exist (many comparisons), and the whole record layout must be compared. This be painfully slow, and in addition skewed
  • Result
    • PK: INSERT/UPDATE statement fails if uniqueness is violated (2801: Duplicate unique primary key error)
    • SET: swallows the duplicated records, no error message if an INSERT SELECT command wants to insert more similar or already existing record. If we use INSERT… VALUES statement, or try to UPDATE a record to be similar to another one, we get the 2802 message like at PK

The problem of SET tables was discussed in the post Curing slow INSERTs and CREATE TABLEs I

Why do we use SET tables then?

SET table takes good care of some cases of unintentional duplications, typically coming from bad join conditions or overlapping history tables.
Sometimes. And sometimes not, if other data (eg. running sequence or identity column) makes difference in some columns.
I think using SET table to eliminate programming mistakes is a bad practice. However the SET table ensures that we will not find any “dumb” duplication in our base tables gives a good calmness about our data.

FastLoad and MULTISET

You can be surprised, when you try to load into a MULTISET table with FastLoad utility. It will eliminate row duplications regardless if you want or not. You cannot switch off this function.
The duplications will be simply eliminated, you can only deduce from the logfile: check the
“Total Duplicate Rows” amount.

When should we avoid using SET tables?

The short answer is: NUPI tables with high hash-collision rate for both temporary and final tables.

What does it means?
Check the table’s PI values TOP frequent occurences:
Assume this table:
CREATE TABLE DB1.TBL1
(
c1 INTEGER,
c2 CHAR(10),
c3 VARCHAR(100)
) PRIMARY INDEX (c1,c2)

Query the top frequent values of the primary index (PI) columns together:

SELECT TOP 20 c1, c2, count(*) cnt FROM DB1.TBL1 GROUP by 1,2 ORDER by 3 desc

If we see that the most frequent value is greater than 100, I’d surely choose MULTISET.

I experience 30-50% of the performance problems based on hash collisions, significant part coming from misunderstood usage of SET feature, suffering from these side effects.
The other part is the “hidden product join”.
Take care! MULTISET will only solve the performance problem of INSERT-s, but will not help the other part: joins, and joined UPDATE or DELETE commands.

If a table has bad skewness, it will have bad hash collision level (HC) is 99% of the cases. But it is not reversible: a good skew itself will not ensure low HC.

+1 trick: If you just want to create a copy of a high HC table for save, choose MULTISET and will have much quicker run.