Please note that the solutions in this article work on the DBQL logs, which cover only the activity of those users for whom logging is switched on.
If you are not comfortably familiar with skewness or DBQL, please read the corresponding posts first:
As you can see, the difference between Impact[Resource] and Total[Resource] is a net technological loss, which should be minimized.
Skewness is a good indicator for highlighting workload that wastes lots of valuable CPU and I/O resources because of inefficient parallelism.
Find “Top skewed queries” in DBQL
With this SQL you can list the top impacting queries (replace the date value; you may also have to adjust the date filtering according to your local settings):
select top 50
    ProcID
  , QueryID
  , AMPCPUTime
  , MaxAMPCPUTime * (hashamp() + 1) CPUImpact
  , cast(100 - ((AMPCPUTime / (hashamp() + 1)) * 100 / nullifzero(MaxAMPCPUTime)) as integer) "CPUSkew%"
  , TotalIOCount
  , MaxAMPIO * (hashamp() + 1) IOImpact
  , cast(100 - ((TotalIOCount / (hashamp() + 1)) * 100 / nullifzero(MaxAMPIO)) as integer) "IOSkew%"
  , AMPCPUTime * 1000 / nullifzero(TotalIOCount) LHR
  , TotalIOCount / nullifzero(AMPCPUTime * 1000) RHL
  , ParserCPUTime
  , QueryBand
  , substr(QueryText, 1, 2000) QueryText
from
/* For archived DBQL
   dbql_arch.dbqlogtbl_hst
   where logdate = 1131201
     and AMPCPUTime > 0
*/
/* For online DBQL */
   dbc.dbqlogtbl
   where cast(cast(StartTime as char(10)) as date) = '2013-12-18' (date)
     and AMPCPUTime > 0
order by CPUImpact desc
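The arithmetic behind CPUImpact and CPUSkew% in the query above can be sketched in a few lines. This is an illustrative Python sketch with made-up sample values; in DBQL these numbers come from dbc.dbqlogtbl, and hashamp() + 1 is the number of AMPs:

```python
# Illustrative sketch of the CPUImpact / CPUSkew% arithmetic.
# Sample values are made up for demonstration.
amp_count = 72                 # hashamp() + 1: number of AMPs
amp_cpu_time = 120.0           # AMPCPUTime: total CPU seconds across all AMPs
max_amp_cpu_time = 10.0        # MaxAMPCPUTime: CPU seconds on the busiest AMP

# All AMPs are effectively held as long as the busiest one,
# so the real cost of the query is peak load times AMP count:
cpu_impact = max_amp_cpu_time * amp_count

# Skew%: how far the average AMP load falls below the peak AMP load.
avg_per_amp = amp_cpu_time / amp_count
cpu_skew_pct = int(100 - avg_per_amp * 100 / max_amp_cpu_time)

print(cpu_impact, cpu_skew_pct)   # 720.0 83
```

A CPUSkew% of 83 means the average AMP did only 17% of the work of the busiest one, so most of the reserved capacity was wasted.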
Explanation of extra fields:
- ParserCPUTime: Time the parser spent on producing the execution plan. This can be high if the SQL is too complex or too much random-AMP sampling has to be done.
- LHR / RHL: Larry Higa ratio (and its inverse). An empirical index that shows the CPU vs. I/O rate. By experience it should usually be around one (the exact value can differ depending on your system configuration, but it is roughly constant for a given system). If it is far from 1, that indicates CPU or I/O dominance, which means unbalanced resource consumption, but that is a different dimension than skew.
- QueryBand: Labels that sessions use to identify themselves within the DBQL logs.
- QueryText: First 2000 characters of the query (depending on DBQL log settings).
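The LHR/RHL computation from the query is plain arithmetic; a short sketch in Python, with made-up sample values, shows why a value near 1 signals balanced consumption:

```python
# Sketch of the empirical Larry Higa ratio (LHR) and its inverse (RHL).
# Sample values are made up for demonstration.
amp_cpu_time = 50.0       # AMPCPUTime: CPU seconds
total_io_count = 48000    # TotalIOCount: logical I/Os

# LHR = AMPCPUTime * 1000 / TotalIOCount (guarding against zero,
# like NULLIFZERO does in the SQL above):
lhr = amp_cpu_time * 1000 / total_io_count if total_io_count else None
rhl = total_io_count / (amp_cpu_time * 1000) if amp_cpu_time else None

print(lhr, rhl)   # ~1.04 and ~0.96: CPU and I/O are roughly balanced
```

If lhr came out at, say, 10, the query would be strongly CPU-dominant; at 0.1, I/O-dominant.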
OK, we’ve listed the terrible top consumers, but what’s next?
You have to identify those queries. If your ETL and analytics software is configured to use QueryBand properly (this area deserves a separate post…), you can find out which job or report issued the SQL; in any case, you can check the QueryText field.
If you want to get the full SQL text, select it from DBQLSQLTbl (SQL logging needs to be switched on), replacing the appropriate <procid> and <queryid> values:
select SQLTextInfo
from dbc.dbqlsqltbl
where procid = <procid>
  and queryid = <queryid>
order by SQLRowNo asc
You will get the SQL in several records, broken up into 30K blocks; simply concatenate them. Unfortunately the SQL will be poorly formatted; you can use PRISE Tuning Assistant to beautify and highlight it for easy reading.
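Reassembling the full statement is just an ordered concatenation of the chunks. A minimal Python sketch, using a made-up stand-in for the DBQLSQLTbl result set (pairs of SQLRowNo and SQLTextInfo):

```python
# Reassemble a query from its DBQLSQLTbl chunks.
# Each row is (SQLRowNo, SQLTextInfo); rows may arrive in any order,
# so we sort by SQLRowNo before joining. Sample rows are made up.
rows = [
    (2, "WHERE logdate = DATE '2013-12-18'"),
    (1, "SELECT * FROM sales "),
]

full_sql = "".join(text for _, text in sorted(rows))
print(full_sql)
```

With real data the chunks are up to ~30K characters each, but the logic is the same.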
System level Skewness
We have found those bad queries, nice. But what can we say about the whole system? What is the total parallel efficiency? Can we report how much resources were wasted due to bad parallel efficiency?
The answer is: yes, we can estimate it quite closely. We cannot calculate the exact value, because DBQL does not log per-AMP information for query execution, only the most important aggregated metrics.
We also cannot account for the situation when several skewed queries run at the same time but peak on different AMPs. This reduces the system-level resource waste, and although it is hard to calculate with, its probability and effect are negligible for now.
select
    sum(AMPCPUTime) AMPCPUTimeSum
  , sum(MaxAMPCPUTime * (hashamp() + 1)) CPUImpactSum
  , sum(TotalIOCount) TotalIOCountSum
  , sum(MaxAMPIO * (hashamp() + 1)) IOImpactSum
  , cast(100 - (AMPCPUTimeSum / CPUImpactSum) * 100 as integer) "CPUSkew%"
  , cast(100 - (TotalIOCountSum / IOImpactSum) * 100 as integer) "IOSkew%"
from
/* For archived DBQL
   dbql_arch.dbqlogtbl_hst
   where logdate = '2013-12-18' (date)
     and (AMPCPUTime > 0 or TotalIOCount > 0)
*/
/* For online DBQL */
   dbc.dbqlogtbl
   where cast(cast(StartTime as char(10)) as date) = '2013-12-18' (date)
     and (AMPCPUTime > 0 or TotalIOCount > 0)
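The system-level skew percentage is just one division over the daily sums. A sketch in Python with made-up totals:

```python
# System-level parallel efficiency from the DBQL daily sums.
# Sample numbers are made up for demonstration.
amp_cpu_time_sum = 80000.0    # sum(AMPCPUTime): CPU actually used
cpu_impact_sum = 100000.0     # sum(MaxAMPCPUTime * #AMPs): CPU effectively reserved

# Share of reserved CPU that did no useful work:
cpu_skew_pct = int(100 - amp_cpu_time_sum / cpu_impact_sum * 100)
print(cpu_skew_pct)   # 20
```

Here 20% of all CPU capacity held by queries was lost to skewed peaks; the I/O side is computed identically from TotalIOCountSum and IOImpactSum.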
Look at the last two columns. That percentage of your CPU and I/O goes down the drain…
Top bad guys
OK, let’s check: how many queries accumulate 5%, 10%, 25%, 50%, 75% and 90% of this loss?
Here you are (CPU version; transform it for I/O analogously):
select
    'How many queries?' as "_"
  , min(limit5)  "TOP5%Loss"
  , min(limit10) "TOP10%Loss"
  , min(limit25) "TOP25%Loss"
  , min(limit50) "TOP50%Loss"
  , min(limit75) "TOP75%Loss"
  , min(limit90) "TOP90%Loss"
  , max(rnk) TotalQueries
  , sum(ResourceTotal) "TotalResource"
  , sum(ResourceImpact) "ImpactResource"
from
(
  select
      case when ResRatio < 5.00  then null else rnk end limit5
    , case when ResRatio < 10.00 then null else rnk end limit10
    , case when ResRatio < 25.00 then null else rnk end limit25
    , case when ResRatio < 50.00 then null else rnk end limit50
    , case when ResRatio < 75.00 then null else rnk end limit75
    , case when ResRatio < 90.00 then null else rnk end limit90
    , rnk
    , ResourceTotal
    , ResourceImpact
  from
  (
    select
        sum(ResourceLoss) over () totalRes
      , sum(ResourceLoss) over (order by ResourceLoss desc rows unbounded preceding) subtotalRes
      , subtotalRes * 100.00 / totalRes ResRatio
      , sum(1) over (order by ResourceLoss desc rows unbounded preceding) rnk
      , ResourceTotal
      , ResourceImpact
    from
    (
      select
          AMPCPUTime ResourceTotal
        , (MaxAMPCPUTime * (hashamp() + 1)) ResourceImpact
        , ResourceImpact - ResourceTotal ResourceLoss
      /* For archived DBQL
      from dbql_arch.dbqlogtbl_hst
      where logdate = 1131207
        and AMPCPUTime > 0
      */
      /* For online DBQL */
      from dbc.dbqlogtbl
      where cast(cast(StartTime as char(10)) as date) = '2013-12-18' (date)
        and AMPCPUTime > 0
    ) x
  ) y
) z
group by 1
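The core idea of the query above is a cumulative-share computation: sort per-query losses in descending order, track the running share of the total loss, and record the first rank that reaches each threshold. A minimal Python sketch with made-up losses:

```python
# Sketch of the "how many queries cover X% of the loss" logic.
# Sample per-query loss values (ResourceImpact - ResourceTotal) are made up.
losses = [500.0, 300.0, 100.0, 60.0, 25.0, 10.0, 5.0]

total = sum(losses)

def queries_for(pct):
    """Smallest number of top queries whose losses reach pct% of the total."""
    running = 0.0
    for rank, loss in enumerate(sorted(losses, reverse=True), start=1):
        running += loss
        if running * 100.0 / total >= pct:
            return rank
    return len(losses)

print(queries_for(50), queries_for(90))   # 1 3
```

With this sample, a single query carries 50% of all the loss and just three queries carry 90% of it, which mirrors the concentration the DBQL query typically reveals.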
I expect you are a bit shocked now at how few queries waste how much golden resource. Fix those queries, and you save on the order of $100K..M USD annually for your company, am I right?
PRISE Tuning Assistant helps you find those queries and get the hang of how to accelerate them.
Typical reasons for skewness – in a nutshell
- Skewed tables: Bad choice of PI, Skewed data
- Bad execution plans (typically skewed redistributions)
- Bad data model (normalization, data types, PI, etc.)
- Missing or stale statistics
- Too many joins (break up the query!)
- Hash collision (load time problem)