Teradata performance optimization project
V. Multi Value Compression

Compress your Teradata with MVC

Focus on the storage

We have already finished the SQL tuning part, where the resource gain was on the order of 100k USD (projected onto one year of operation), within a 2-3 week project.

There is another great opportunity to achieve savings of the same order of magnitude: data storage.

Why do we discuss it as performance optimization?
You will be convinced later.

Teradata databases store huge amounts of data, sometimes petabytes. The efficiency of storing and processing this incredible volume of data seriously affects the Data Warehouse costs. Fortunately, Teradata includes marvelous features for optimizing this area as well.

This post presents a project that gained 16% of storage space on a system where manual compression had already been applied.

Beliefs regarding compression

There are a lot of fears regarding compression, coming from the old days.
"Compression is slow", "Compression is unsafe", "We use compression only for the archives".
These beliefs have some basis, but it is time to re-assess them.

  • Speed
    The biggest fear regarding compression is speed. Why do we discuss a method in a performance context if it slows things down?
    Because compression does not only slow things down. Some things actually speed up, even though it is hard to believe. Moreover, a lot of new, effective compression algorithms and codecs have been developed, so it is time to consider compression in the production stack.
  • Speed II
    The speed of compression and decompression is not equal: decompression is typically less costly. The write and read access frequencies differ significantly in a Data Warehouse environment – in favour of reads.
    When we access the data, the amount of I/O required depends on the number of data blocks the table allocates. If the data is compressed, we have to access far fewer data blocks with I/O operations – and I/O is the most expensive resource in a Data Warehouse.
    We do not have to compress everything. There are big, rarely used data elements for which compressed storage is very advisable.
    We do not have to decompress everything. Teradata is clever enough to decompress only the required columns. In addition, it decompresses data only at the appropriate stage of execution; until then the data "travels" compressed, consuming less spool and cache memory.
  • Reliability and comfort
    Embedded compression methods hide all these internals from the users. Everything can be handled the same way as before compression: fully transparently. Users will not even notice that compression exists.
  • Types of compression
    There is life beyond "ZIP". There are compression methods that require far fewer resources for encoding and decoding than the "ZIP"-style algorithms. They are efficient in special cases and achieve a lower compression ratio, but they are the ones we are looking for.

The Teradata compression

Teradata offers 3 types of compression

  • BLC – Block Level Compression
    This is like a "ZIP" for each data block. All the data is compressed at a high rate, but the whole block has to be decompressed to reach any data within. It is very costly, and for good performance it requires "compression card" hardware within the Teradata servers.
    It is very simple to use: it just has to be switched on/off. It can be set at system level or at finer grains.
  • ALC – Algorithmic Compression
    This is defined on table columns. Several compression algorithms are offered, and they must be chosen manually. You have to know the data and the algorithms: which are good for which types of data, and what their resource costs are.
    ALC is useful in special cases (like long URLs), but its usage is not really widespread.
  • MVC – Multi Value Compression
    MVC is the magic in Teradata regarding efficient storage and processing. It is a simple "frequent value" type of compression. You specify the values whose occurrences are stored on at most 8 bits (1 byte), regardless of the length of the field value. Teradata solves the internals – the encoding and decoding – so the usage is fully transparent. SQLs are not affected, and query execution plans do not change!

    But you are responsible for defining the values to be compressed. A bad choice may result in a bigger table than the original, while a good choice results in an average 50% compression, or even more (a minimal example is sketched after the list of advantages below).

    The really cool advantages are:

    • Fast setup
      It can be switched on with simple* SQL operations and is 100% transparent, so no other modifications or even tests are required.
    • Fast operation
      It can be used without additional hardware, with good performance. A slight CPU overhead appears, mostly on the encoding side.
    • I/O saving
      The fewer data blocks a table has, the fewer data blocks have to be moved when the data is processed. This is directly reflected in the I/O consumption of table scans. Queries may run faster.
    • CPU saving (!!!)
      How can it save CPU? Through caching. There will be significantly fewer data blocks to be processed during data reads. This means less cache activity (which is a quite CPU-intensive operation), so queries may run faster – even those not affected by the compression, thanks to a better cache hit ratio. This is the way we gain the CPU back.
    • Uncompress only when required
      Teradata only decompresses the field values when it is really required. MVC compression remains in effect throughout the spools as well, and a value is uncompressed only when it is really needed, sometimes many execution steps after the record is read from the table – as opposed to BLC.

    * Simple means ALTER TABLE commands, but the exact parameters have to be chosen carefully.
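
As a minimal illustration of how MVC is specified, here is a sketch of a table definition with COMPRESS value lists. The table, column names and value lists are made-up examples, not taken from the project; in a real case the lists come from a frequency analysis of the actual data.

    -- Hypothetical table: the listed frequent values are stored as short
    -- presence-bit codes instead of the full field value (NULLs are compressed too).
    CREATE TABLE sales_db.order_item
    (
        order_id      INTEGER NOT NULL,
        item_status   CHAR(10)     COMPRESS ('OPEN', 'SHIPPED', 'CLOSED'),
        country_code  CHAR(2)      COMPRESS ('US', 'DE', 'HU'),
        discount_amt  DECIMAL(9,2) COMPRESS (0.00)  -- a single dominant value compresses well too
    )
    PRIMARY INDEX (order_id);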

MVC is the quickest and cheapest way to release free space in a Teradata Database – without losing any data.

The compress optimization project

This section presents a real-life compress optimization project that gained 16% of storage space, while requiring no more than a couple of working days – and the PRISE Compress Wizard (PCW) software.

Purchasing this amount of storage resource on the market would cost on the order of 100k USD. In addition, installing new hardware components implies a system outage.
The project discussed below required no database downtime at all.

Scoping

The project scope in this case is quite simple: which tables to deal with. The DBA typically has a pattern in mind for which tables are good candidates:

  • Big tables
  • Not in system or temporary databases
  • Not working tables (naming conventions…)
  • Not examined/optimized for a while
  • Not planned to be dropped in the next month
  • Not re-created every day by the ETL process
  • And so on…

All of this information is accessible from the DBC, from some custom repository, or can simply be hardcoded in an SQL.
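
For illustration, here is a minimal sketch of such a candidate-selection SQL using the standard DBC views DBC.TablesV and DBC.TableSizeV. The size threshold, the excluded databases and the naming convention are illustrative assumptions, not the rules used in the project.

    -- Sketch: big "real" tables outside system databases, excluding working tables.
    SELECT  t.DatabaseName,
            t.TableName,
            SUM(s.CurrentPerm) / 1024**3 AS CurrentPermGB
    FROM    DBC.TablesV    t
    JOIN    DBC.TableSizeV s
      ON    s.DatabaseName = t.DatabaseName
     AND    s.TableName    = t.TableName
    WHERE   t.TableKind = 'T'                         -- permanent tables only
      AND   t.DatabaseName NOT IN ('DBC', 'SYSLIB')   -- system databases (illustrative list)
      AND   t.TableName NOT LIKE 'WRK%'               -- working tables by an assumed naming convention
    GROUP BY t.DatabaseName, t.TableName
    HAVING  SUM(s.CurrentPerm) > 1E10                 -- bigger than ~10 GBytes (arbitrary threshold)
    ORDER BY CurrentPermGB DESC;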

PRISE Compress Wizard

PCW has a customizable, SQL-based table-selection feature: a tailored, one-click tool that presents the tables matching your custom preferences every time.
Since it is SQL, you can use any tables in the database, like:

  • DBC tables
    Table sizes, owner, creation, last use
  • DBQL tables
    Usage frequency
  • PCW repository
    Last optimization time
  • Custom tables
    Any information previously set, like exclusions and inclusions.


Fig. 1 – Custom SQL based table selector

Analysis

When the scope (= table list) is fixed, the next step is the analysis: determining the optimal MVC settings and calculating how much space gain can be reached with them.

It is straightforward, yet exhausting work – if done manually: statistics calculation for tens of thousands of columns, optimum calculation with tons of exceptions and version-dependent knowledge, SQL creation and measurements. A set of tasks born to be performed by dedicated software.
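
To give a feel of what one such manual step looks like, here is a sketch of a frequency check for a single column; this has to be repeated and evaluated for every compressible column of every table in scope, which is exactly what PCW automates on sampled data. The table and column names are hypothetical.

    -- Sketch: the most frequent values of one column are MVC candidates
    -- (MVC allows up to 255 compressed values per column).
    SELECT  TOP 20
            item_status,
            COUNT(*)                                 AS occurrences,
            100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_rows
    FROM    sales_db.order_item
    GROUP BY item_status
    ORDER BY occurrences DESC;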

PCW executes all these tasks, even on thousands of tables, without continuous supervision. It works on sample data, which is good enough for a very precise calculation, while saving a huge amount of CPU and I/O.

The analysis ran like normal reports, and did not interfere with the daily life of the Data Warehouse.

The results went into the PCW repository, including the current compression level and the consumed CPU and I/O resources (fetched from the DBQL logs automatically, if required). An Excel report was exported from the PCW GUI, which is the basis of the decision paper: which tables to involve in the implementation phase.

Yes, PCW delivers a very accurate estimation of the reachable gain (GBytes) before any modification is done.


Fig. 2 – Result of the Analysis phase

Decision

When the analysis report is at hand, the only remaining task is to select the tables that offer a good savings potential.

The filtering in this project was:

  • The savings reach 1 GB on a table
  • The gain reaches 10%
    The lower the percentage, the more data must be processed for the same gain. Below a limit it is not worth it, except for very big tables.
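
Such a filter can be expressed as a simple query over the analysis results. The sketch below uses a hypothetical results table and column names; the actual PCW repository layout may differ.

    -- Sketch against a hypothetical analysis-results table (names are made up).
    SELECT  database_name,
            table_name,
            current_size_gb,
            estimated_gain_gb
    FROM    pcw_analysis_results
    WHERE   estimated_gain_gb >= 1                       -- at least 1 GB absolute saving
      AND   estimated_gain_gb >= 0.10 * current_size_gb  -- at least 10% relative gain
    ORDER BY estimated_gain_gb DESC;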

In this project we selected 178 tables to implement.

Implementation

The implementation is the most exciting part of the project: this is where the gains are realized.

Setting up Multi Value Compression on a table is a pure SQL procedure: it can be executed with ALTER TABLE, or with CREATE TABLE + INSERT SELECT commands.
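
Both variants are sketched below with made-up table, column and value names; in a real project the COMPRESS lists come from the analysis phase, and the surrounding safety logic (discussed next) is mandatory.

    -- Variant 1: in-place ALTER of an existing column's COMPRESS list
    -- (subject to version-specific restrictions, e.g. existing statistics
    -- may have to be dropped first on some releases).
    ALTER TABLE sales_db.order_item
        ADD item_status COMPRESS ('OPEN', 'SHIPPED', 'CLOSED');

    -- Variant 2: rebuild the table with the new COMPRESS definitions,
    -- copy the data, then swap the tables by renaming.
    CREATE TABLE sales_db.order_item_new
    (
        order_id      INTEGER NOT NULL,
        item_status   CHAR(10)     COMPRESS ('OPEN', 'SHIPPED', 'CLOSED'),
        country_code  CHAR(2)      COMPRESS ('US', 'DE', 'HU'),
        discount_amt  DECIMAL(9,2) COMPRESS (0.00)
    )
    PRIMARY INDEX (order_id);

    INSERT INTO sales_db.order_item_new
    SELECT * FROM sales_db.order_item;

    RENAME TABLE sales_db.order_item     TO sales_db.order_item_old;
    RENAME TABLE sales_db.order_item_new TO sales_db.order_item;
    -- Secondary indexes, constraints and statistics have to be re-created on the new table.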

The tricks are in the details:

  • Environmental conditions:
    Statistics: some Teradata versions do not allow ALTER if statistics exist
    Indexes / constraints: must be re-created if CREATE + INSERT is applied
  • Business continuity
    Is data accessibility ensured continuously? What level of locking is applied?
  • Process must be 100% safe
    No data loss or inconsistency is allowed. Loads must be blocked during the operation, and a revert option must exist in case of emergency.
  • Logging
    Everything must be logged and measured to be able to report the achievements. Even if thousands of tables are involved.
  • Order of execution
    If space is very limited: what is the optimal sequence of optimization? The gained space can be rolled forward to make room for further tables' optimization.

This does not seem that simple, but PCW does it all. Meanwhile, there is no black box: the transformations are done by BTEQ SQL scripts, which can be reviewed before execution – from the PCW GUI or within the company's standard patch procedure.

These scripts contain all of the above: safety checks, locks, statistics/index maintenance and logging. The DBA does not need to modify the scripts; they are ready-made.
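
For illustration only, here is a heavily simplified sketch of what such a wrapper script might look like – it is not the actual PCW output. The logon placeholder, table, column and value names are assumptions.

    .LOGON tdpid/impl_user;

    /* Log the table size before the change */
    SELECT SUM(CurrentPerm) AS PermBefore
    FROM   DBC.TableSizeV
    WHERE  DatabaseName = 'SALES_DB' AND TableName = 'ORDER_ITEM';

    /* Apply the compression settings produced by the analysis */
    ALTER TABLE sales_db.order_item
        ADD item_status COMPRESS ('OPEN', 'SHIPPED', 'CLOSED');

    /* Abort the whole script on any failure */
    .IF ERRORCODE <> 0 THEN .QUIT 8;

    /* Re-collect the statistics affected by the change */
    COLLECT STATISTICS ON sales_db.order_item COLUMN (item_status);

    /* Log the table size after the change */
    SELECT SUM(CurrentPerm) AS PermAfter
    FROM   DBC.TableSizeV
    WHERE  DatabaseName = 'SALES_DB' AND TableName = 'ORDER_ITEM';

    .LOGOFF;
    .QUIT 0;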

The implementation can even be done table by table, as it fits the system's patch schedule. The storage gains arise table by table, and the results can be continuously tracked on the PCW GUI.

For the highest data safety, the analysis and the implementation are executed by different users, each with the appropriate access level.

In the project we used the company's standard patch process: the BTEQ scripts were created and delivered to the DBAs to be checked, scheduled and executed.


Fig. 3 – Generating the optimizer BTEQ SQL scripts

Evaluation

When the implementation is done, all data is available for a comprehensive "achievements report", including the space gained, the data volume touched, the CPU and I/O consumed, and the time span of the project.

This was one of the best-ROI projects in the Data Warehouse: a few days of human work in total, for almost a terabyte of free space, which is ~10% of the total database volume.
We reached this on a system that had already gone through a comprehensive, although manual, compress optimization before.

How much does a GByte of storage space cost in Teradata?
Enter it into PCW and it will calculate how much you saved with the project.


Fig. 4 – Realized achievements of the project

Summary

  • Storage space
    16% gain on the optimized tables, resulting in a 937 GByte increase in database free space
  • Runtime
    The analysis phase ran for 42 hours, scanning 320 tables allocating 10 TBytes in total
    The implementation phase ran for 22 hours, transforming 178 tables originally allocating 5.8 TBytes
  • Resources
    The implementation consumed 79 CPU hours and 692M I/Os
  • Gains in $$$
    The space gained was needed by an urgent development. Compress optimization solved this without a hardware extension. The storage space released by the project could have been bought for ~144k USD, assuming 1 GByte costs $150.

 

The next post will demonstrate how PCW supports each phase of the compress optimization project.

Stay tuned for the demo!

 

 
