Focus on the storage
We have already covered SQL tuning, where a 2-3 week project produced a resource gain on the order of 100k USD (projected over one year of operation).
There is another excellent opportunity to add savings of the same order of magnitude: the data storage.
Why do we discuss it as a performance optimization? You will be convinced below.
Teradata databases store huge amounts of data, sometimes petabytes. How efficiently this incredible volume is stored and processed seriously affects Data Warehouse costs. Fortunately, Teradata includes marvelous features for optimizing this area as well.
This post presents a project that gained 16% of storage space on a system where manual compression had already been applied.
Beliefs regarding compression
There are many fears about compression, dating from the old days:
“Compression is slow”, “Compression is unsafe”, “We use compression only for archives”.
These beliefs have some basis, but it is time to re-assess them.
- Speed I
The biggest fear about compression is speed. Why discuss a method in a performance context when it slows things down?
Because compression does not only slow things down: some things actually speed up, hard as that may be to believe. Besides, many new, efficient compression algorithms and codecs have been developed, so it is time to consider compression in the production stack.
- Speed II
The speed of compression and decompression is not equal; decompression is typically the less costly one. Write and read access frequencies differ significantly in a Data Warehouse environment – in favour of reads.
When we access data, the amount of I/O required depends on the number of data blocks the table allocates. If the data is compressed, far fewer data blocks must be accessed. I/O is the most expensive resource in the Data Warehouse.
We do not have to compress everything. There are big, rarely used data elements for which compressed storage is very advisable.
We do not have to decompress everything either. Teradata is clever enough to decompress only the required columns. In addition, it decompresses data only at the appropriate stage of execution; until then the data “travels” compressed, consuming less spool and cache memory.
- Reliability and comfort
Embedded compression methods hide all these details from the users. Everything can be handled the same way as before compression: it is fully transparent. Users will not even notice that compression exists.
- Types of compression
There is life beyond “ZIP”. Some compression methods require far less resource for encoding and decoding than “ZIP”-style algorithms. They are efficient in special cases and achieve lower compression rates, but they are the ones we are looking for.
The Teradata compression
Teradata offers three types of compression:
- BLC – Block Level Compression
This is like a “ZIP” applied to each data block. All the data is compressed at a high rate, but the whole block has to be decompressed to reach any data within it. It is very costly, and requires “compression card” hardware within the Teradata servers for good performance.
It is very simple: it just has to be switched on or off, at system level or at finer grain.
- ALC – Algorithmic Compression
ALC is defined on table columns. Several compression algorithms are offered, and one must be chosen manually. You have to know the data and the algorithms: which are good for which types of data, and what their resource costs are.
ALC is useful in special cases (like long URLs), but its usage is not really widespread.
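As an illustration, ALC on a long URL column could look like the sketch below. The table and column names are made up; LZCOMP_L/LZDECOMP_L are Teradata’s built-in Lempel-Ziv functions for LATIN strings in TD_SYSFNLIB.

```sql
-- Hypothetical table: long URL strings compressed algorithmically.
CREATE TABLE web_visit
( visit_id INTEGER
, url      VARCHAR(2000) CHARACTER SET LATIN
             COMPRESS USING TD_SYSFNLIB.LZCOMP_L
             DECOMPRESS USING TD_SYSFNLIB.LZDECOMP_L
) PRIMARY INDEX (visit_id);
```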
- MVC – Multi Value Compression
MVC is the magic of Teradata regarding efficient storage and processing. It is a simple “frequent value” type of compression: you specify the values whose occurrences are stored on at most 8 bits (1 byte), regardless of the length of the field value. Teradata solves the internals – the encoding and decoding – so the usage is fully transparent. SQLs are not affected, and query execution plans do not change!
But you are responsible for defining the values to be compressed. A bad choice may result in a table bigger than the original, while a good choice yields an average 50% compression, or even more.
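A sketch of the syntax (the table, columns and value lists are made up for illustration):

```sql
CREATE TABLE sales_txn
( txn_id  INTEGER
, country VARCHAR(30)  COMPRESS ('GERMANY','AUSTRIA','HUNGARY') -- frequent values
, status  CHAR(1)      COMPRESS ('A','C')  -- up to 255 values per column
, note    VARCHAR(200) COMPRESS            -- no value list: compresses NULLs only
) PRIMARY INDEX (txn_id);
```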
The really cool advantages are:
- Fast setup
It can be switched on with simple* SQL operations and is 100% transparent, so no other modifications or even tests are required.
- Fast operation
It can be used without additional hardware with good performance. The quite slight CPU overhead falls mostly on encoding.
The fewer data blocks a table has, the fewer data blocks must be moved when the data is processed. This is directly reflected in table scans’ I/O consumption: queries may run faster.
- CPU saving (!!!)
How can compression save CPU? Through caching. There are significantly fewer data blocks to process during data reads, which means less cache activity (quite a CPU-intensive operation). Queries may run faster – even those not affected by the compression, thanks to a better cache hit ratio. This is how we gain the CPU back.
- Uncompress only when required
Teradata decompresses field values only when really required. MVC compression stays in effect throughout the spools as well, and a value is uncompressed only when it is really needed – sometimes many execution steps after the record is read from the table – as opposed to BLC.
* “Simple” means ALTER TABLE commands, but the exact parameters have to be chosen carefully.
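For example, the compress list of an existing column can be changed by re-specifying the column in an ALTER TABLE ADD clause (hypothetical names, continuing the earlier sketch; note the version-dependent restrictions, e.g. around statistics):

```sql
-- Replace the column's compress list with an extended one.
ALTER TABLE sales_txn
  ADD country COMPRESS ('GERMANY','AUSTRIA','HUNGARY','ITALY');
```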
- Fast space release
MVC is the quickest and cheapest way to release free space in a Teradata database – without losing any data.
The compress optimization project
This section presents a real-life compress optimization project that gained 16% of storage space, while requiring no more than a couple of working days of human effort – plus the PCW software.
This amount of storage would cost on the order of 100k USD to purchase on the market. In addition, installing new hardware components implies a system outage.
The project discussed below had no database downtime at all.
The project scope in this case is quite simple to define: which tables to deal with. The DBA typically has a pattern in mind for which tables are good candidates:
- Big tables
- Not in system or temporary databases
- Not working tables (naming conventions…)
- Not examined/optimized for a while
- Not planned to be dropped in the next month
- And not re-created every day by the ETL process
- And so on…
All of this information is accessible from the DBC, from a custom repository, or can be hardcoded into an SQL.
PCW has a customizable, SQL-based table-selection feature: a tailored one-click tool that presents your custom preference tables every time.
Since it is SQL, you can use any tables in the database, like:
- DBC tables
Table sizes, owner, creation, last use
- DBQL tables
- PCW repository
Last optimization time
- Custom tables
Any information previously set, like exclusions and inclusions.
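A simplified scoping query of this kind might look as follows; the database exclusions, naming patterns and size limit are made-up examples, while DBC.TableSizeV is a standard DBC view:

```sql
SELECT  t.DatabaseName
      , t.TableName
      , SUM(t.CurrentPerm) / 1024**3 AS SizeGB
FROM DBC.TableSizeV t
WHERE t.DatabaseName NOT IN ('DBC','SysAdmin','SystemFe')  -- skip system databases
  AND t.TableName NOT LIKE ANY ('WRK_%','TMP_%')           -- skip working tables
GROUP BY 1, 2
HAVING SizeGB >= 10                                        -- big tables only
ORDER BY 3 DESC;
```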
When the scope (= the table list) is fixed, the next step is the analysis: determining the optimal MVC settings and calculating how much space can be gained.
It is straightforward, yet exhausting, work if done manually: statistics calculation for tens of thousands of columns, optimum calculation with tons of exceptions and version-dependent knowledge, SQL creation and measurements. A set of tasks born to be performed by dedicated software.
PCW executes all these tasks on even thousands of tables, without continuous supervision. It works on sample data, which is good enough for a very precise calculation, while saving a huge amount of CPU and I/O.
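The core of this analysis for a single column – done manually – would be a frequency query on a sample (hypothetical table and column; PCW automates this for every column):

```sql
-- Candidate values for the compress list: the most frequent ones.
SELECT country, COUNT(*) AS freq
FROM (SELECT country FROM sales_txn SAMPLE 0.01) s  -- work on a 1% sample
GROUP BY 1
ORDER BY 2 DESC;
```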
The analysis ran like normal reports, with no interference with the daily life of the Data Warehouse.
The results went into the PCW repository, including the actual compression level and the consumed CPU and I/O resources (fetched automatically from the DBQL logs if required). An Excel report was exported from the PCW GUI; this is the basis of the decision paper: which tables to involve in the implementation phase.
Yes, PCW delivers a very accurate estimate of the reachable gain (GBytes) before any modification is done.
With the analysis report in hand, the only remaining task is to select the tables that offer good savings potential.
The filtering in this case was:
- The savings reach 1 GB on a table
- The gain reaches 10%
The lower the percentage, the more data must be processed for a given gain. Below a limit it is not worth it, except for very big tables.
In this project we selected 178 tables for implementation.
The implementation is the most exciting part of the project: this is where the gains are realized.
Setting up multi value compression on a table is a pure SQL procedure: ALTER TABLE, or CREATE TABLE + INSERT SELECT commands can execute it.
The tricks are in the details:
- Environmental conditions:
Statistics: some Teradata versions do not allow ALTER if statistics exist
Indexes / constraints: must be re-created if CREATE + INSERT is applied
- Business continuity
Is data accessibility ensured continuously? What level of locking is applied?
- Process must be 100% safe
No data loss or inconsistency is allowed. Loads must be blocked during the operation, and a revert possibility must exist in case of emergency.
Everything must be logged and measured to be able to report the achievements – even if thousands of tables are involved.
- Order of execution
If space is very limited: what is the optimal sequence of optimization? The gained space can be rolled forward to make room for the optimization of further tables.
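The CREATE TABLE + INSERT SELECT variant, reduced to its core steps, looks roughly like this (hypothetical names; the real scripts also handle statistics, indexes, locking and logging):

```sql
-- 1. Create the target table with the new compress lists (columns abbreviated)
CREATE TABLE sales_txn_new
( txn_id  INTEGER
, country VARCHAR(30) COMPRESS ('GERMANY','AUSTRIA','HUNGARY')
) PRIMARY INDEX (txn_id);

-- 2. Copy the data (loads must be blocked meanwhile), then swap the names
INSERT INTO sales_txn_new SELECT * FROM sales_txn;
RENAME TABLE sales_txn TO sales_txn_old;
RENAME TABLE sales_txn_new TO sales_txn;

-- 3. After verifying row counts: DROP TABLE sales_txn_old;
```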
This is not as simple as it seems, yet PCW does it all. Meanwhile there is no black box: the transformations are done by BTEQ SQL scripts that can be reviewed before execution – from the PCW GUI or in the company’s standard patch procedure.
These scripts contain all of the above: safety checks, locks, statistics/index maintenance and logging. The DBA does not need to modify the scripts; they are ready-made.
The implementation can be done even table-by-table, as fits the system’s patch timing schedule. The storage gains arise table-by-table, and the results can be continually tracked on the PCW GUI.
For the highest data safety, the analysis and the implementation are executed by different users with the appropriate access levels.
In this project we used the company’s standard patch process: we created the BTEQ scripts and delivered them to the DBAs to be checked, scheduled and executed.
When the implementation is ready, all data is available for a comprehensive “achievements report”, including space gained, data volume touched, CPU and I/O consumed, and the time span of the project.
This was one of the best-ROI projects in the Data Warehouse: a few days of human work in total, for almost a terabyte of free space – ~10% of the total database volume.
We achieved this on a system that had already undergone a comprehensive, though manual, compress optimization.
How much does a GByte of storage space cost in Teradata? Enter it into PCW and it will calculate how much you saved with the project.
- 16% gain on the optimized tables, resulting in 937 GBytes of additional database free space
- The analysis phase ran 42 hours, scanning 320 tables allocating 10 TBytes in total
- The implementation phase ran 22 hours, transforming 178 tables originally allocating 5.8 TBytes
- The implementation consumed 79 CPU hours and 692M I/Os
Gains in $$$
The space gained was needed by an urgent development, and compress optimization provided it without hardware extension. The storage space released by the project would have cost ~144k USD to buy, assuming 1 GB costs $150.
The next post will demonstrate how PCW supports each phase of the compress optimization project.
Stay tuned for the demo!