5 tips for Teradata Fastload

Teradata Fastload – use effectively

Teradata Fastload is a very fast and efficient tool for loading huge amounts of data into a Teradata database. It has good documentation to browse as a reference; however, there are a few small things that are good to know in order to use it more comfortably – especially in challenging situations.

Tip1: How to understand “Error1” table

When a record fails to load because of a Teradata error (conversion, null problem, etc.), it gets into the “Error1” table, which is specified in the “BEGIN LOADING” section of the control script.
The structure of the “Error1” table is fixed:

  1. ErrorFieldName:
    Which column was problematic
  2. ErrorCode:
    The Teradata error code; look it up in DBC.ErrorMsgs if the documentation is not at hand
  3. DataParcel:
    This field contains the record transformed into hexadecimal string format

Unfortunately, DataParcel’s hexadecimal content is unreadable for a human in Queryman or other GUIs.
With this short BTEQ script you can export the records in their original layout into a file:

.SET RECORDMODE ON
.EXPORT DATA FILE=badrecord.out
SELECT DataParcel FROM <Errortable1>;
.EXPORT RESET
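
If you only need the error codes and messages, a query along these lines can also help (a sketch; <Errortable1> is your error table, and the ErrorCode/ErrorText column names of dbc.ErrorMsgs are assumed here):

SELECT e.ErrorFieldName
     , e.ErrorCode
     , m.ErrorText            /* human-readable message from the data dictionary */
FROM <Errortable1> e
LEFT JOIN dbc.ErrorMsgs m
  ON e.ErrorCode = m.ErrorCode;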

Tip2: How to switch off deduplication?

There is no way to switch it off. Teradata Fastload always performs a full-record-layout deduplication. No matter whether we use a non-unique primary index (NUPI) and a MULTISET table definition, Teradata will deduplicate eagerly.

If you must load every record, use a preprocessor script (Perl, AWK, etc.) that adds a running-number column to your data. This way there will be no identical records at all.

Tip3: Deduplication vs. “Error2” table

Why do we have an “Error2” table if Teradata Fastload always deduplicates? This error table contains those records which passed the full-record-layout deduplication but failed the unique primary index (UPI) duplicate check. That is, the data contained another record with the same PI value but differing in one or more other columns. To load those records as well, use a NUPI or NO PI table.
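
For illustration, a minimal sketch of such a target table definition (the column names are made up):

CREATE MULTISET TABLE <db>.<table>
( customer_id  INTEGER
, load_date    DATE
, amount       DECIMAL(18,2)
)
PRIMARY INDEX ( customer_id );   /* non-unique PI: rows sharing a PI value are accepted */
/* or simply: NO PRIMARY INDEX */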

Tip4: What is a good “SESSIONS” number?

SESSIONS is a parameter specified in the Fastload script. Its value can fall between 1 and the number of AMPs. The bigger the number, the better the loading performance, but the longer the logon time. So for a small file use 1; for a large number of rows use the maximum allowed. For medium-sized files I generally use 10.
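
As a sketch, this is how the session limit (1 up to the number of AMPs) is typically given near the top of the script; the logon data is a placeholder:

SESSIONS 10;
LOGON <tdpid>/<user>,<password>;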

If many small files have to be loaded into different tables, consider using another technique, since Teradata Fastload has no advantage in this league.

Tip5: Table is “Being loaded”…

When a Teradata Fastload session is stopped because it reached the error limit, the target and the error tables cannot be accessed due to locking:
Error 2652: Operation not allowed: <db>.<table> is being Loaded.

Run a Fastload script like this (referencing the original error tables) to release the locks and get access to the tables:

LOGON <tdpid>/<user>,<password>;
BEGIN LOADING <db>.<table> ERRORFILES <db>.<err1>, <db>.<err2>;
END LOADING;
LOGOFF;

After this script finishes, the error tables become readable and the root causes of the load problems can be identified; see the “Error1” table tip above.

Teradata performance optimization project
I. Basics and prerequisites

Introduction

The Teradata database is a pretty expensive box that serves a lot of people in the company – directly or indirectly. Efficient utilization of its resources is essential, from both the financial and the user-satisfaction point of view.


This series of posts will follow a real-life Teradata optimization project’s methodology, steps and results, with this achievement:

System-level gain: 31.6%

Levels of performance optimization

We can typically apply 3 levels of Teradata performance optimization:

Application

Meaning: Analyze user requests and system activity, then refine them, eliminating unnecessary activities.
This level checks whether the executed processes are really needed and whether they do what the user really needs. E.g. eliminate unnecessary data loads or reports; limit a report to a shorter period.

  • Big gains possible w/o any trade-off
  • Comprehensive analysis needed (incl. finding and interviewing the appropriate persons)

SQL

Meaning: Pick individual SQL statements and optimize their execution.
This approach finds the top resource-consuming SQL statements and reduces their consumption while producing the same results.
E.g. change the PI; add/refresh statistics; rewrite the SQL

  • Easy analysis
  • Low risk (small effect, good testing possibilities)
  • Needs SQL level analysis

Infrastructure

Meaning: Tune low-level system (RDBMS, OS, hardware) parameters.
Tuning at the infrastructure level is a global intervention, regardless of users/business areas or importance.

  • Miracle in some special cases
  • High risk (affects the whole system)
  • Complex cross-dependencies

I prefer and recommend SQL-level optimization in the beginning, because it delivers high value along with low risk and good project scalability. Later on, the other levels can come into focus.

Goal of SQL optimization

The goals of an optimization project can be different. Some examples:
“Finish load procedure until 6am”
“Present my report 2 hours earlier”
“Reduce CPU load by 20%”
“Let the hardware extension be postponed by 6 months”

The goal of our project was to speed up the “load batch” process, so that a specific, identified data area finishes loading and becomes available for further usage 1 hour earlier than before.

Prerequisites

The key input for an SQL optimization is the information about what exactly happens:

What, when, how:

Which SQL statements ran, when, what steps were executed, how much resources they consumed and with what skewness, etc.
All of these can be found in the DBQL tables of Teradata (dbc.dbql*).
Switching them on will not cause significant extra load, but they have to be archived/purged regularly (see the PDCR tool in the Teradata package).
Teradata Administrator lets you switch logging on (Tools/Query logging…), and the actual settings can be checked:

select * from dbc.dbqlruletbl;
The Explain/Obj/Sql/Step flags should be set to ‘T’; the others are not necessary. Global logging (1 record with UserID: 00000000) is recommended.
If continuous logging is not possible, logging should be switched on at least 1 week before starting the optimization project.
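
Logging can also be switched on with plain SQL; a sketch of a global rule with the flags mentioned above (verify the exact option names against your Teradata version):

BEGIN QUERY LOGGING WITH SQL, EXPLAIN, OBJECTS, STEPINFO ON ALL;
/* and later, if needed: END QUERY LOGGING ON ALL; */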

What / Who runs the statements?

Teradata provides a great feature for this: Query Banding
All executed SQL statements can be flagged with several pieces of information regarding the environment, job, script, version, etc., which helps identify a specific one.
The query band data will appear in the DBQL logs – if the running environment is configured to flag the queries being run.
Query banding is not essential for an optimization, but it helps a lot in identifying the exact point in the data warehouse ecosystem where the modifications should be done (which job/job step/script).
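
A sketch of how a job step can flag its queries (the key–value names are arbitrary; pick a convention for your shop):

SET QUERY_BAND = 'Job=daily_sales_load;Step=030_load_fact;Script=load_fact.btq;' FOR SESSION;

The values then show up alongside the statement in the DBQL log rows.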

What are the environmental conditions?

E.g. data volumes, statistics, etc.
This information is available in the data dictionary (DBC tables and views), but only as the current state – not as of the time the statement ran.
In most cases this is OK for us, but this property needs to be kept in mind.
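
For example, current table sizes (and a simple skew indicator) can be pulled from the dictionary like this (a sketch against the dbc.TableSizeV view; the database name is a placeholder):

SELECT DataBaseName
     , TableName
     , SUM(CurrentPerm)            AS CurrentPermBytes
     , MAX(CurrentPerm) * COUNT(*) AS WorstCasePermBytes   /* skew indicator */
FROM dbc.TableSizeV
WHERE DataBaseName = '<your_database>'
GROUP BY 1,2
ORDER BY 3 DESC;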

When all of these are in place, we are ready to carry out a successful performance optimization project.

The next post is about how to identify the good candidate SQL statements for optimization.

Stay tuned!

How to speed up Teradata Fastexport

About Teradata Fastexport

Teradata Fastexport is a nice, powerful tool to export mass volumes of data to a file. Its default execution flow is the following:

  1. Applies locks on the affected tables
  2. Executes the SELECT statement
  3. Places the result in a SPOOL
  4. Releases the locks
  5. Exports the data from the SPOOL

It has an advantage: the locks are released immediately after the result data has been copied into the SPOOL. Therefore it does not have to wait until the data is written to the export file outside the database – which is typically a single-process operation and can last for a long time.

The price to be paid for this advantage: all the exported data must first be copied to the SPOOL. When the SELECT is a simple operation (no join, sort or other extras), the copy phase can be saved, which can amount to 50% of the total resource and time consumption.

The NOSPOOL trick

For these simple cases one can use the NOSPOOL/NOSPOOLONLY mode.
While NOSPOOL is a soft switch, which is ignored if not applicable, NOSPOOLONLY will make the statement fail if the nospool option is not available for the given query.

In NOSPOOL mode the database will not make a copy of the result data. Instead, it passes the data directly to the client, which has another benefit: the execution of the SELECT runs largely in parallel with the file write operation, instead of sequentially as in the SPOOL (default) case.

Usage:
.BEGIN EXPORT
SESSIONS …
[SPOOLMODE SPOOL|NOSPOOL|NOSPOOLONLY];
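
A minimal sketch of a complete NOSPOOL export script (log table, logon data, file name and the SELECT are placeholders):

.LOGTABLE <db>.<fexp_logtable>;
.LOGON <tdpid>/<user>,<password>;
.BEGIN EXPORT SESSIONS 8 SPOOLMODE NOSPOOL;
.EXPORT OUTFILE <output_file>;
SELECT * FROM <db>.<table>;
.END EXPORT;
.LOGOFF;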

The tradeoff

NOSPOOL mode supports only “simple” SELECT statements. The unsupported operations:

  • SORT (ORDER BY, HAVING)
  • JOIN
  • Aggregations (GROUP BY, SUM())
  • OLAP functions
  • Non-table data (date, session, user)
  • Multiple statements

But it allows:

  • Sampling
  • Column selection
  • Scalar functions (like col1 + col2, substr())
  • “Where” clause (incl. partition elimination)
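
For example, a SELECT along these lines should still qualify for NOSPOOL (hypothetical table and columns), since it uses only column selection, scalar expressions and a WHERE clause:

SELECT customer_id
     , SUBSTR(customer_name, 1, 20)
     , balance_amt * 1.05
FROM <db>.<table>
WHERE load_date >= DATE '2014-01-01';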

The disadvantages of NOSPOOL mode:

  • No ActivityCount is available
  • Locks maintained till the end of the export session
  • Data conversion errors can happen during the exporting phase
  • (Only available from V13.10)

What is it good for, then?
It can almost double the throughput in the case of:

  • DataMart export (to another database)
  • Archive dumps
  • Copy data to test environment
  • Migration to other database systems

Measurements

We executed a Teradata-to-Netezza migration, where we had to transport ~12 TB of data through a file interface. After we activated the NOSPOOL option, the average export speed increased by 84%.