It was very interesting to work on a performance optimization task for one of our clients' Azure SQL Data Warehouse, and I am delighted to share my experience from this exercise with you.
The client had a huge data warehouse with billions of rows in a fact table, while the star schema had only a couple of dimensions. Although the fact table held billions of rows, it had fewer than 10 columns. The main problem was that queries issued against the fact table ran for more than 3 minutes even though the result set was only a few rows. The client's Azure DW was running at a lower service level with low DWUs, so scaling up to higher cDWUs was certainly required; however, before that, the existing table design had to be optimized.
The main columns in the fact table were Location, SensorID, SensorValue and EffectiveDateTime, while the other columns held supporting data.
CREATE TABLE [abc].[FactMyDataCurrent] (
[SensorID] [varchar](100) NULL,
[SensorValue] [float] NULL,
[EffectiveDateTime] [datetime] NULL,
[Location] [varchar](100) NULL,
[ProcessedTimestamp] [datetime] NULL,
[OtherFields] [float] NULL
)
WITH
( DISTRIBUTION = HASH ( [Location] ),
CLUSTERED COLUMNSTORE INDEX
)
Initially, the table was distributed by hash of the Location column and stored as a clustered columnstore index. Data distribution checks revealed that the data sat in just 5 distributions, because the distribution column, Location, had only 5 distinct values. Furthermore, most of the queries issued against this fact table filtered on Location, as below.
SELECT * FROM abc.FactMyData
WHERE [Location] = 'MyLocation'
AND
(
(SensorID = 'ncuiyr8277fwcbu...' AND EffectiveDateTime BETWEEN '2019-01-01' AND '2019-01-31')
OR (SensorID = 'nssvs7fwcbu...' AND EffectiveDateTime BETWEEN '2018-12-01' AND '2019-01-31')
OR (SensorID = 'hdg77fwdegds..' AND EffectiveDateTime BETWEEN '2018-11-01' AND '2019-12-31')
OR (SensorID = 'ncgeggdgbu...' AND EffectiveDateTime BETWEEN '2019-02-01' AND '2019-02-28')
OR .....
)
Thus, each query was served by a single distribution.
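For reference, the skew can be verified with a per-distribution space-used check; a minimal sketch, assuming the original table name used above:
-- Reports rows and space per distribution; a skewed table shows most rows in only a few distributions
DBCC PDW_SHOWSPACEUSED("abc.FactMyDataCurrent");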
Azure DW has 60 distributions to parallelize queries, and the recommendation is to spread data evenly across them so that queries can be served by all distributions in parallel in the Azure MPP architecture. The distribution column must therefore be selected very carefully, as it largely determines the ultimate performance. It should not be a column that the typical queries filter on in the WHERE clause, because such a filter routes the work to a subset of distributions and limits parallelism. It should have at least 60 distinct values (ideally far more) so that rows spread evenly across the 60 distributions and maximum parallelism can be leveraged, and it should not be a date or datetime column. With just 5 distinct values, the Location column was no longer a suitable choice, so the client was advised to distribute the table by SensorID, which has a huge number of distinct values and therefore distributes the data evenly.
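Before settling on SensorID, a quick cardinality check of the candidate columns is worthwhile; a minimal sketch along these lines, using the column names above:
-- Compare candidate distribution columns: more distinct values means a better spread over the 60 distributions
SELECT COUNT_BIG(DISTINCT [Location]) AS DistinctLocations
, COUNT_BIG(DISTINCT [SensorID]) AS DistinctSensorIDs
FROM abc.FactMyDataCurrent;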
As the first step, a new table was created as below.
CREATE TABLE [abc].[FactMyData]
(
[SensorID] [varchar](100) NULL,
[SensorValue] [float] NULL,
[EffectiveDateTime] [datetime] NULL,
[Location] [varchar](100) NULL,
[ProcessedTimestamp] [datetime] NULL,
[OtherFields] [float] NULL
)
WITH
(
DISTRIBUTION = HASH ( [SensorID] ),
CLUSTERED COLUMNSTORE INDEX
)
When we analyzed the queries, they all targeted a few months' worth of data rather than summary values computed over many rows of the table. Thus, it was advised to partition the table by the EffectiveDateTime column. Partitioning in Azure DW must also be decided carefully, as over-partitioning can decrease performance. For example, if an Azure DW table is split into 12 partitions, that effectively becomes 12 x 60 = 720 partitions, because each of the 60 distributions is partitioned separately. Each partition within a distribution should hold at least 1 million rows for the clustered columnstore index to work optimally; columnstore generally compresses rows from the delta store into permanent segments only once a rowgroup reaches about 1 million rows. Therefore, a proper row count analysis was performed before deciding on the partition granularity (a sketch of such a check follows the link below). The remaining challenge was how to build new partitions for the fact table dynamically in the future, since it is not practical to create partitions for all periods at once. The following is the Microsoft guidance on dynamic partitions for Azure DW.
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition
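As a rough sketch of the row count analysis mentioned above (assuming monthly boundaries and the original table), a query along these lines compares each month against the 1-million-rows-per-partition-per-distribution guideline:
-- Rows per calendar month, and the approximate share each of the 60 distributions would hold
SELECT DATEADD(MONTH, DATEDIFF(MONTH, 0, EffectiveDateTime), 0) AS PartitionMonth
, COUNT_BIG(*) AS TotalRows
, COUNT_BIG(*) / 60 AS ApproxRowsPerDistribution
FROM abc.FactMyDataCurrent
GROUP BY DATEADD(MONTH, DATEDIFF(MONTH, 0, EffectiveDateTime), 0)
ORDER BY PartitionMonth;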
To achieve this dynamic partitioning, the fact table was initially created without specifying any partition boundaries. Another table was used as the input for the partition boundary values; let's call it the partition config table. Dynamic SQL statements were used to loop through the partition config table values and split the partitions in the fact table one at a time. (Please note that a partition must be split before data is inserted into it.)
Now, the new fact table was changed as below.
CREATE TABLE [abc].[FactMyData]
(
[SensorID] [varchar](100) NULL,
[SensorValue] [float] NULL,
[EffectiveDateTime] [datetime] NULL,
[Location] [varchar](100) NULL,
[ProcessedTimestamp] [datetime] NULL,
[OtherFields] [float] NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
,DISTRIBUTION = HASH([SensorID])
,PARTITION ( [EffectiveDateTime] RANGE RIGHT FOR VALUES () )
)
CREATE TABLE [abc].[PartitionConfig]
WITH
(
DISTRIBUTION = HASH(PartitionDate)
)
AS
SELECT PartitionDate
, ROW_NUMBER() OVER (ORDER BY (PartitionDate)) as seq_no
FROM (
SELECT CAST('20180101' AS datetime) PartitionDate
UNION ALL
SELECT CAST('20180201' AS datetime)
UNION ALL
SELECT CAST('20180301' AS datetime)
UNION ALL
SELECT CAST('20180401' AS datetime)
UNION ALL
SELECT CAST('20180601' AS datetime)
UNION ALL
......<partition values up to now>
) AS PartitionDates -- the derived table requires an alias
Initially, this table holds the complete, up-to-date set of partition boundaries; later, new boundary values can be added as required. The SQL code below was used to split the partitions in the main fact table according to the config table values.
-- Iterate over the partition boundaries and split the table
DECLARE @c INT = (SELECT COUNT(*) FROM [abc].[PartitionConfig])
, @i INT = 1 --iterator for while loop
, @q NVARCHAR(4000) --query
, @p datetime = '1900-01-01' --partition date
, @s NVARCHAR(128) = N'abc' --schema
, @t NVARCHAR(128) = N'FactMyData' --table
;
WHILE @i <= @c
BEGIN
SET @p = (SELECT PartitionDate FROM [abc].[PartitionConfig] WHERE seq_no = @i);
SET @q = (SELECT N'ALTER TABLE ' + @s + N'.' + @t + N' SPLIT RANGE (''' + CONVERT(NVARCHAR(20), @p, 120) + N''');'); -- style 120 renders the boundary as yyyy-mm-dd hh:mi:ss, independent of language settings
-- PRINT @q;
EXECUTE sp_executesql @q;
SET @i +=1;
END
-- clean-up
TRUNCATE TABLE [abc].[PartitionConfig];
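To confirm the boundaries created by the loop, the partition metadata can be checked; a minimal sketch using the standard catalog views:
-- Number of partitions now defined on the fact table (boundary count + 1)
SELECT t.name AS table_name, COUNT(DISTINCT p.partition_number) AS partition_count
FROM sys.tables t
JOIN sys.partitions p ON t.object_id = p.object_id
WHERE t.name = 'FactMyData'
GROUP BY t.name;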
This created partitions according to the boundary values defined in the config table. For future periods, the config table should be populated with the new boundary values and the dynamic SQL above should be executed before the last partition of the fact table receives data for those periods, because a partition of a clustered columnstore table cannot be split once it contains data. After this process, query execution time improved by more than 50%.
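For ongoing maintenance, the step is simply repeated; a hypothetical example (the boundary date below is only an illustrative placeholder):
-- Queue the next monthly boundary before any data for that month arrives, then re-run the split loop
INSERT INTO abc.PartitionConfig (PartitionDate, seq_no)
SELECT CAST('20190401' AS datetime), 1;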
In summary, the queries that had performance issues were mainly non-aggregate queries like the example above, filtering on SensorID and EffectiveDateTime along with Location as the first filter. SensorID is covered by the distribution and EffectiveDateTime is covered by the partitions, so the remaining optimization was for the Location column: a non-clustered index was created on it. As the final step, statistics were created on each column of the fact table as part of the tuning. Once the full data load from the old table to the new fact table was completed, all indexes were rebuilt with the new data.
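The corresponding DDL would look roughly like the following (the index and statistics names are my own illustrative choices):
-- Secondary non-clustered index to support the Location filter
CREATE INDEX IX_FactMyData_Location ON abc.FactMyData ([Location]);
-- Column statistics for the query optimizer (repeat for every column in the table)
CREATE STATISTICS stat_SensorID ON abc.FactMyData ([SensorID]);
CREATE STATISTICS stat_EffectiveDateTime ON abc.FactMyData ([EffectiveDateTime]);
CREATE STATISTICS stat_Location ON abc.FactMyData ([Location]);
-- After the full data load from the old table, rebuild all indexes
ALTER INDEX ALL ON abc.FactMyData REBUILD;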
Finally, we managed to bring the execution time down to less than 10 seconds for the query that had initially run for more than 3 minutes. Scaling up the Azure DW would now reduce the query execution time even further.
Optimization is a step-by-step, evolving process driven by the query requirements.