Storing date&time columns

How to store date and time info effectively

Introduction

Data Warehouse databases usually contain significant amount of date/time information. Physical modeling technique can seriously influence their storage space and usability.

Aspects

Date/time information can be stored in different ways/data types, each of them will have its own specialities.
Basic options:

  • Joint storage: Timestamp
    • Timestamp(n) , when n means the fractional digits of seconds
  • Separate storage: Date & Time
    • Date column + Time column

Storage space

The data types require the following space (if uncompressed)

Type Space
Date 4 bytes
Integer time (integer format ’99:99:99′) 4 bytes
Time(n) 6 bytes, independent of n*, where n:[0..6]
Time(n) with time zone 8 bytes, independent of n*, where n:[0..6]
Timestamp(n) 10 bytes, independent of n*, where n:[0..6]
Timestamp(n) with time zone 12 bytes, independent of n*, where n:[0..6]

* n means the precision digits of second

Usage complexity

Teradata is not the most ergonomic for handling date-time data. Operations with these data types are typically tricky and sometimes hides traps (try add_months(‘2014-01-31’,1) ). Conversion of a date and a timestamp is different, decisions must be made by considering storage and usage aspects.
  • Conversions
    • Date: implicit conversions work, easy and comfortable
    • Integer time: works fine, but insert-select will loose the formatting, only the integer value will remain
    • Time(n): implicit conversion to string is not working. This fails: select cast(‘2014-01-31’ as date) || ‘ ‘ ||cast(’12:00:00’ as time(0))
    • Timestamp(n): brrr. Different precisions will not convert automatically either. I don’t like it.
  • Filtering: comparing date/datetime values with < / <= /between operators
    • Joint storage (timestamps)
      Straightforward, just use the values – if they are equivalent data types
    • Separate storage
      You have to convert to a “joint” format, either a string or a timestamp before
  • Arithmetic
    • Date: ok, adding a constant, subtracting dates work fine
    • Integer time: do not use arithmetic, results are bad!
    • Time(n): interval types accepted. Not really comfortable, eg max 99 second long interval is accepted (V13.10)
    • Timestamp(n): same as Time(n)

    Regarding arithmetic I suggest building your own UDF library, that will ease your life.

    Recommendations

    Choosing data type

    I recommend to choose data types depending on the table type and usage purposes.
    I differentiate “transaction” and “all other” table types, because transaction tables are usually allocate most of the PERM space, while others are many in number, but allocate “negligible” space.

    • Transaction
      • Separate storage
      • Integer time
    • All others
      • Joint type (timestamp)

    Saving space – store “delta”

    The biggest tables in the data warehouses are the “transaction tables” (call/purchase/transfer/etc. transactions depending on industry), and most of them contain several date fields, most of them w/strong correlation. I explain what I mean. Let’s assume a call record (telco), that will have the following date(&time) columns:
    • Channel_seizure
    • Call_start
    • Call_end
    • Bill_cycle_start
    • Bill_cycle_end
    The date component of the first three columns are the same in 99% of the records, and the last ones differ from the first ones with max. of 30 days.

    My recommendation is the following:
    • Choose a “primary date”
      Must be not null, and typically used as partitioning key also, since it is the most often date filtering condition.In our case this will be the Call_start
    • Choose separate date-time storing
      Eg. Date and Integer time , as this combination requires the least space
    • Store the non-primary dates as delta, multi value comressed
      Compute it in the load process, like this:
      Call_end_delta := Call_end-Call_start
    • Compress the “delta” columns
      They will reflect low deviation, highly compressible, use PRISE Compress Wizard
    • Convert to absolute dates back in the view layer
      Call_start + Call_end_delta as “Call_end”
    Example:

    CREATE TABLE T2000_CALL_TRX
    (

    Call_start_date Date NOT NULL
    Call_end_date_delta Integer COMPRESS (0)

    ) PRIMARY INDEX (…,Call_start_date)
    PARTITION BY RANGE_N ( Call_start_date BETWEEN date ‘2010-01-01’ AND date ‘2020-12-31’ EACH interval ‘1’ day, NO RANGE, UNKNOWN);
    ;

    CREATE VIEW V2000_CALL_TRX
    as
    SELECT

    , Call_end_date_delta +Call_start_date as “Call_end_date

    FROM
    T2000_CALL_TRX

    ;

    How to optimize a Teradata query?

    Teradata SQL optimization techniques

    Introduction

    The typical goal of an SQL optimization is to get the result (data set) with less computing resources consumed and/or with shorter response time. We can follow several methodologies depending on our experience and studies, but at the end we have to get the answers for the following questions:

    • Is the task really heavy, or just the execution of the query is non-optimal?
    • What is/are the weak point(s) of the query execution?
    • What can I do to make the execution optimal?

    Methodologies

    The common part of the methodologies that we have to understand – more or less – what is happening during the execution. The more we understand the things behind the scenes the more we can feel the appropriate point of intervention. One can start with the trivial stuff: collect some statistics, make indices, and continue with query rewrite, or even modifying the base table structures.

    What is our goal?

    First of all we should branch on what do we have to do:
    1. Optimize a specific query that has been running before and we have the execution detail info
      Step details clearly show where were the big resources burnt
    2. In general, optimize the non optimal queries: find them, solve them
      Like a.,but first find those queries, and then solve them one-by-one
    3. Optimize a query, that has no detailed execution info, just the SQL (and “explain”)
      Deeper knowledge of the base data and “Teradata way-of-thinking” is required, since no easy and trustworthy resource peak-detecting is available. You have to imagine what will happen, and what can be done better

    Optimization in practice

    This section describes the case b., and expects available detailed DBQL data.
    In this post I will not attach example SQL-s, because I also switched to use PRISE Tuning Assistant for getting all the requested information for performance tuning, instead of writing complex SQL queries and making heaps of paper notes.

    Prerequisites

    My opinion is that DBQL (DataBase Query Logging) is the fundamental basis of a Teradata system performance management – from SQL optimization point of view. I strongly recommend to switch DBQL comprehensively ON (SQL, Step, Explain, Object are important, excluding XML, that is huge, but actually has not too much extra), and use daily archiving from the online tables – just follow Teradata recommendation.

    Finding good candidate queries

    DBQL is an excellent source for selecting “low hanging fruits” for performance tuning. The basic rule: we can gain big save on expensive items only, let’s focus on the top resource consuming queries first. But what is high resource consumption? I usually check top queries by one or more of these properties:

    • Absolute CPU (CPU totals used by AMPs)
    • Impact CPU (CPU usage corrected by skewness)
    • Absolute I/O (I/O totals used by AMPs)
    • Impact I/O   (Disk I/O usage corrected by skewness)
    • Spool usage
    • Run duration
    PRISE Tuning Assistant supplies an easy to use and quick search function for that:

    Finding weak point of a query

    Examining a query begins with the following steps:
    • Does it have few or many “peak steps”, that consume much resources? 
      • Which one(s)?
      • What type of operations are they?
    • Does it have high skewness?
      Bad parallel efficiency, very harmful
    • Does it consume extreme huge spool?
      Compared to other queries…
    PRISE Tuning Assistant again.
    Check the yellow highlights in the middle, those are the top consuming steps:

      Most of the queries will have one “peak step”, that consumes most of the total resources. Typical cases:

      • “Retrieve step” with redistribution
        Large number of rows and/or skewed target spool
      • “Retrieve step” with “duplication-to-all-AMPs”
        Large number of rows duplicated to all AMPs
      • Product join
        Huge number of comparisons: N * M
      • Merge or Hash join
        Skewed base or prepared (spool) data
      • OLAP function
        Large data set or skewed operation
      • Merge step
        Skewness and/or many hash collisions
      • Any kind of step
        Non small, but strongly skewed result

      What can we do?

      Teradata optimizer tries its best when produces the execution plan for a query, however it sometimes lacks proper information or its algorithms are not perfect. We – as humans – may have additional knowledge either of the data or the execution, and we can spoil the optimizer to make better decisions. Let’s see our possibilities.
      • Supplement missing / refresh stale statistics
      • Drop disturbing statistics (sometimes occurs…)
      • Restructure the query
      • Break up the query, place part result into volatile table w/ good PI and put statistics on
      • Correct primary index of target / source tables
      • Build secondary/join index/indices
      • Add extra components to the query.
        You may know some additional “easy” filter that lightens the work. Eg. if you know that the join will match for only the last 3 days data of a year-covering table, you can add a date filter, which cost pennies compared to the join.
      • Restrict the result requirements to the real information demand.
        Do the end-user really need that huge amount of data, or just a record of it?

      What should we do?

      First of all, we have to find the root cause(s). Why does that specific top step consume that huge amount or resources or executes so skewed? If we find the cause and eliminate, the problem is usually solved.
      My method is the following:
      1. Find the top consuming step, and determine why it it high consumer
        • Its result is huge
        • Its result is skewed
        • Its work is huge
        • Its input(s) is/are huge
      2. Track the spool flow backwards from the top step, and find
        • Low fidelity results (row count falls far from estimated row count)
        • NO CONFIDENCE steps, specifically w/low fidelity
        • Skewed spool, specifically non small ones
        • Big duplications, specifically w/NO CONFIDENCE
      3. Find the solution
        • Supplement missing statistics, typically on PI, join fields or filter condition
          NO CONFIDENCE, low fidelity, big duplications
        • Break up the query
          Store that part result into a volatile table, where fidelity is very bad, or spool is skewed. Choose a better PI for that
        • Modify PI of the target table
          Slow MERGE step, typical hash-collision problem.
        • Eliminate product joins
        • Decompose large product join-s
        • E.T.C.
      Have a good optimization! 🙂