Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi ...

Azure Data Lake Analytics (U-SQL). Azure Data Lake Store. Apache Spark on HDInsight. BI Models. Reports and Dashboards. Apache Hadoop on HDInsight.















15.Phone: download and use the Microsoft Ignite mobile app Please evaluate this session From your PC or tablet: visit MyIgnite Your input is important! Please expand notes window at bottom of slide and read. Then Delete this text box.

16.Announcing: Processing Parquet in Azure Data Lake with U-SQL

17.When to use Tables Benefits of Table clustering and distribution Faster lookup of data provided by distribution and clustering when right distribution/cluster is chosen Data distribution provides better localized scale out Used for filters, joins and grouping Benefits of Table partitioning Provides data life cycle management (“expire” old partitions) Partial re-computation of data at partition level Query predicates can provide partition elimination Do not use when… No filters, joins and grouping No reuse of the data for future queries If in doubt: use sampling (e.g., SAMPLE ANY(x) ) and test.

18.Azure Data Lake Analytics

19.JSON Processing How do I extract data from JSON documents?

20.Submit Azure Data Lake Jobs with Azure Data Factory ADF v1: Data Lake Analytics U-SQL Activity against ADLA linked service Service Principal or OAuth AAD credential Provides Parameter model Can specify special runtime ADF v2 (in preview): Data Lake Analytics U-SQL Activity against ADLA linked service Service Principal (via Application entity in AAD) Simplified with same power to ADF v1 Adds support for ADLA pipeline and recurring job insights { "name": " ComputeEventsByRegionPipeline ", "properties": { "description": "This is a U-SQL pipeline.", "activities": [ { "type": " DataLakeAnalyticsU -SQL", " typeProperties ": { " scriptPath ": "scripts\ kona \SearchLogProcessing.txt", " scriptLinkedService ": " StorageLinkedService ", " degreeOfParallelism ": 3, "priority": 100, "parameters": { "in": "/ datalake /input/ SearchLog.tsv ", "out": "/ datalake /output/ Result.tsv " } }, "inputs": [ { "name": " DataLakeTable " } ], "outputs": [ { "name": " EventsByRegionTable " } ], "policy": { "timeout": "06:00:00", "concurrency": 1, " executionPriorityOrder ": " NewestFirst ", "retry": 1 }, "scheduler": { "frequency": "Day", "interval": 1 }, "name": " EventsByRegion ", " linkedServiceName ": " AzureDataLakeAnalyticsLinkedService " } ], "start": "2015-08-08T00:00:00Z", "end": "2015-08-08T01:00:00Z", " isPaused ": false } }

21.Procedures CREATE PROCEDURE P (@ arg string = "default“) AS BEGIN …; OUTPUT @res TO …; INSERT INTO T …; END; Provides parameterization No result but writes into file or table Can contain multiple statements Can contain user-code (needs assembly reference) Will always be inlined Can contain DDL (but no CREATE, DROP FUNCTION/PROCEDURE) Allows encapsulation of U-SQL scripts

22.Cook Data and share with others

23.Scales out your custom code in .NET, Python, R over your Data Lake Familiar syntax to millions of SQL & .NET developers Unifies Declarative nature of SQL with the imperative power of C# Processing of structured, semi-structured and unstructured data Querying multiple Azure Data Sources (Federated Query) Analyzing with Batch, Interactive, Streaming, & Machine Learning in one language Introducing U-SQL A framework for Big Data

24.Load-Extract-Transform-Store & Share Cook data for other users (LETS and Share) As unstructured data As structured data Large-scale custom processing with custom code Augment big data with high-value data from where it lives Schematizing unstructured data (Load—Extract — Transform — Store) for analysis Cloud-scale Cognitive Processing

25.Cloud-scale Cognitive Processing Car Green Parked Outdoor Racing See my session BRK3350 for more details and many more examples

26.Schematize unstructured data

27.Agenda Modern Data Warehouse architectures with Data Lake: From ETL to LETS Introduction to Azure Data Lake and U-SQL Job submission with Azure Data Factory Core U-SQL “LETS”: Schematizing unstructured data Other data formats Scaling out over many files Sharing schematized data Examples of scaling out your custom ETL processing code with U-SQL: JSON Processing Custom Image processing

28.Azure Data Lake Store DATA Business apps Custom apps Sensors and devices INTELLIGENCE ACTION People Preparation, analytics and machine learning Ingestion Bulk ingestion Event ingestion Discovery Azure Data Catalog Visualization Power BI Big data pipeline and data flow in Azure HDInsight (Hadoop and Spark) Stream Analytics Data Lake Analytics Machine Learning SQL DW