SQL Data Warehouse: lesson learned and practical implementation tips. Joe Yong. Sr. Program Manager. SQL Data Warehouse. BRK3377. Before we begin.

Daniel发布于2018/06/18

注脚

展开查看详情

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

11.

12.

13.

14.

15.

16.SQL Data Warehouse Resources SQL DW Free Trial https://azure.microsoft.com/en-us/services/sql-data-warehouse/extended-trial/ Migration Guide Public Preview datamigration.microsoft.com Azure Database Migration Service (Limited Preview) Preview signup: aka.ms/migrating Channel 9 Video: Oracle migrations ; Azure SQL Database migrations Best practices for Azure SQL Data Warehouse https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices Azure SQL Data Warehouse loading patterns and strategies https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/ Azure feature pack for SSIS https://docs.microsoft.com/en-us/sql/integration-services/azure-feature-pack-for-integration-services-ssis Ask questions or help others in the community: https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=AzureSQLDataWarehouse https://dba.stackexchange.com/questions/tagged/azure-sql-data-warehouse

17.SQL Data Warehouse Resources SQL DW Free Trial https://azure.microsoft.com/en-us/services/sql-data-warehouse/extended-trial/ Migration Guide Public Preview datamigration.microsoft.com Azure Database Migration Service (Limited Preview) Preview signup: aka.ms/migrating Channel 9 Video: Oracle migrations ; Azure SQL Database migrations Best practices for Azure SQL Data Warehouse https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices Azure SQL Data Warehouse loading patterns and strategies https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/ Azure feature pack for SSIS https://docs.microsoft.com/en-us/sql/integration-services/azure-feature-pack-for-integration-services-ssis Ask questions or help others in the community: https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=AzureSQLDataWarehouse https://dba.stackexchange.com/questions/tagged/azure-sql-data-warehouse

18.Row Store & Column Store

19.Before we begin Goals Azure SQL Data Warehouse architecture Good and bad workloads for SQL DW Key lessons learned and recommended practices Non-goals (but feel free to ask questions) Every detail about SQL DW Every possible scenario applicable to SQL DW Read every bullet in every slide Flashy demos with pretty charts, browsing PB of data with hololens , etc … Pre-requisites Working knowledge of SQL Server and data warehouse scenarios and workloads Thanks to SQL CAT John Hoang & Murshed Zaman

20.Query Performance Recommendations Check for SKEW (DBCC PDW_SHOWSPACEUSED) Statistics CETAS or CTAS large return operation Denormalize Tables if needed DSQL Query Plan Minimize data movement operations Distribution & aggregation compatible Minimize size of data movement Check for predicate pushdown. Rewrite query if needed Use higher resource class for memory intensive queries Load large external tables rather than querying directly All data is brought back, no push down

21.Provision, scale, pause

22.Changing to DW1000 Compute Dist_DB_1 Dist_DB_2 Dist_DB_6 SQL DW Architecture Queries Control Engine DMS SQL DB DMS SQL DB … Compute Dist_DB_7 Dist_DB_8 Dist_DB_12 DMS SQL DB … Compute Dist_DB_13 Dist_DB_14 Dist_DB_18 DMS SQL DB … Compute Dist_DB_19 Dist_DB_20 Dist_DB_24 DMS SQL DB … Compute Dist_DB_25 Dist_DB_26 Dist_DB_30 DMS SQL DB … Compute Dist_DB_31 Dist_DB32 Dist_DB_26 DMS SQL DB … Compute Dist_DB_37 Dist_DB_38 Dist_DB_42 DMS SQL DB … Compute Dist_DB_43 Dist_DB_44 Dist_DB_48 DMS SQL DB … Compute Dist_DB_49 Dist_DB_50 Dist_DB_54 DMS SQL DB … Compute Dist_DB_55 Dist_DB_56 Dist_DB_60 DMS SQL DB … DW1000 DMS = Data Movement Service Dist_DB_1.mdf Dist_DB_13.mdf Dist_DB_37.mdf Dist_DB_55.mdf Premium storage

23.Selecting a Distribution Method For large fact tables, best option is to Hash Distribute Distribute on column that is joined to other fact tables Primary or surrogate key However, be mindful of … Hash column should have highly distinct values (Minimum >60 distinct values) Avoid distributing on a date column Avoid distributing on column with high frequency of NULLs and default values (e.g. -1) Distribution column is NOT updatable For compatible joins use the same data types for two distributed tables If there are no distribution columns that make sense, then use Round Robin as last resort

24.Selecting a Distribution Method For large fact tables, best option is to Hash Distribute Distribute on column that is joined to other fact tables Primary or surrogate key However, be mindful of … Hash column should have highly distinct values (Minimum >60 distinct values) Avoid distributing on a date column Avoid distributing on column with high frequency of NULLs and default values (e.g. -1) Distribution column is NOT updatable For compatible joins use the same data types for two distributed tables If there are no distribution columns that make sense, then use Round Robin as last resort

25.Optimizing with Indexes Clustered Index Optimal for tables < 60M rows Sorting operation slows down load Optimal choice for large tables Limits scans to columns in the query Optimal compression Slower to load than Heap Keep partitions large enough to compress (> 1 million rows) Clustered ColumnStore (SQL DW Default) Non-clustered Index Use sparingly Optimize single row lookups Will slow down load Optimal choice for temporary or staging tables Fastest load performance Heap

26.PolyBase characteristics Single PolyBase load provides best performance for non-compressed files Load performance scales as you increase service level objective Automatically parallelizes data load process; no need to manually break the input data into multiple files and issue concurrent loads Each reader will slice 512 MB block from data files Max throughput depends on number of readers available on the DWU level Multiple readers will not work against a compressed text file ( gzip ) Only a single reader is used per compressed file since uncompressing the file in the buffer is single threaded Alternatively, generate multiple compressed files Number of files should be greater than or equal to the total number of readers of your service level objective (SLO)

27.Important lessons learned Avoid noisy, interactive clients like PowerBI direct query; implement caching layer or hub/spoke architecture Drain transactions before pausing/scaling Avoid real time data ingestion (ASA, Storm) Concurrency is not the root to all scalability challenges Verify your application is at least MPP aware, preferably optimized Server admin, SQL or AAD, are placed in SmallRC ; not changeable If you AAD groups, be careful with its resource class assignment and group membership Pre-populate cache for replicated tables (e.g. SELEC TOP 1..) after resume, DML or scaling operation

28.Data Migration Recommendations Data Format Conversion Date Format, Field delimiters, escaping, field order, encoding Compression Use Gzip , ORC, Parquet 7-Zip utility, .NET/JAVA libraries Export BCP for fast export Multiple files per large table, one folder per table Copy AZCopy Data Movement Library

29.Single Gated Client Parallelised Client Compute Node DMS Bridge Compute Node DMS Bridge Compute Node DMS Bridge Control Node DMS Client Client