Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Med
1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2.Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Medicare and Medicaid Services Donghwa John Kim, NewWave #UnifiedAnalytics #SparkAISummit
3.Customers
4.About NewWave
• Mid-Size Business, 300+ Employees, 11 Prime Contracts, Support 7 CMS Centers
• CMMI Level 4 for Services & Development, ISO 9001:2015
• Databricks Gold Level Partner, Microsoft Gold Cloud Platform, AWS Advanced Consulting Partner
• Prime Contract Vehicles: CMS SPARC – 8(a) & Small, GSA 8(A) STARS II, GSA Schedule 70 & Health IT SIN
5.Technology Vendor Partners
6.Centers for Medicare & Medicaid Services (CMS)
CMS is the largest healthcare payer in the country, with a budget of $793.7B. NewWave is its trusted partner and leading innovator.
7.A unique customer that sets the standard for industry & defines the market in healthcare
8.Data Challenge
• 2 billion data points* annually to store, analyze and disseminate
• Privacy requirements (PHI, PII) without compromising agility
• Central view of available data on multiple systems
* Just on Medicare data
9.The Objectives
The vision is to provide a simple and reliable technology and data experience for all CMS IT Portfolio stakeholders.
• Center-wide shared data services
• Robust data governance
• Single cloud-native architecture
10.The Definition of Genius Is Taking the Complex and Making it Simple – Albert Einstein
11.Solution from a Bird’s Eye View
12.Data as a Service
• Data Agility
• Improved Data Quality
• Cost Effectiveness
13.Agility - Dremio Virtual Datasets
• Built on top of the immutable physical datasets found in sources
• A layered stack of data transformations performed on top of one or more physical datasets
• Each virtual dataset is ultimately described by a SQL query
• Virtual datasets can be chained
• Data Lineage - a history of all the applied transformations is available
14.Agility - Dremio Virtual Dataset Example
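The layering idea behind virtual datasets can be sketched with ordinary SQL views. The following is only an illustration that uses Python's built-in sqlite3 module as a stand-in for Dremio; the table name, view names, and sample rows are hypothetical, and SQLite views only approximate how Dremio stores and tracks virtual datasets.

```python
import sqlite3


def build_layers():
    """Create one physical table and two chained views (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Physical dataset: the layers above never modify it.
        CREATE TABLE claims_raw (claim_id INTEGER, state TEXT, amount REAL);
        INSERT INTO claims_raw VALUES
            (1, 'MD', 100.0), (2, 'MD', NULL), (3, 'VA', 50.0), (4, 'VA', 25.0);

        -- First virtual layer: a cleaning step described purely by SQL.
        CREATE VIEW claims_clean AS
            SELECT claim_id, state, amount
            FROM claims_raw
            WHERE amount IS NOT NULL;

        -- Second virtual layer, chained on top of the first.
        CREATE VIEW claims_by_state AS
            SELECT state, COUNT(*) AS claim_count, SUM(amount) AS total_amount
            FROM claims_clean
            GROUP BY state;
        """
    )
    return conn


def lineage(conn):
    """Map each view to its defining SQL, a crude stand-in for a lineage record."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'view' ORDER BY name"
    )
    return dict(rows.fetchall())


conn = build_layers()
summary = conn.execute("SELECT * FROM claims_by_state ORDER BY state").fetchall()
```

Querying the top view runs the whole chain at read time, and the stored view definitions show which layer each dataset was derived from.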
15.Simplicity - SQL for [almost] EVERYTHING
• Ability to join data from multiple data sources including JSON, CSV, Parquet, relational databases and NoSQL
• Unified interface for the data
And suddenly ... SQL is sexy again!
16.Simplicity - SQL for [almost] EVERYTHING
17.Privacy - Row Level Masking
Use query_user() and is_member() to selectively filter rows for different users or groups without having to create multiple datasets.
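A rough sketch of the row-filtering pattern, written in Python with sqlite3 as a stand-in engine. SQLite has no query_user() or is_member(), so this example registers simplified versions of them per session; the group table, the claims table, and the user names are all hypothetical, and Dremio resolves users and groups from its own security configuration rather than from a dictionary.

```python
import sqlite3

# Hypothetical group membership; a real deployment would read this from
# the platform's identity provider.
GROUPS = {"alice": {"auditors"}, "bob": {"analysts"}}

# One query serves every user: auditors see all rows, everyone else sees
# only the rows they own.
ROW_FILTER_SQL = """
SELECT claim_id, owner, amount
FROM claims
WHERE is_member('auditors') OR owner = query_user()
ORDER BY claim_id
"""


def visible_claims(user):
    """Run the shared query as `user` and return the rows that user may see."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-ins for the two functions named on the slide.
    conn.create_function("query_user", 0, lambda: user)
    conn.create_function(
        "is_member", 1, lambda group: int(group in GROUPS.get(user, set()))
    )
    conn.execute("CREATE TABLE claims (claim_id INTEGER, owner TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO claims VALUES (?, ?, ?)",
        [(1, "alice", 120.0), (2, "bob", 75.5), (3, "bob", 310.0)],
    )
    try:
        return conn.execute(ROW_FILTER_SQL).fetchall()
    finally:
        conn.close()
```

Because the filter lives in the query itself, the same dataset definition can be shared by every group instead of maintaining one copy per audience.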
18.Privacy - Column Level Masking
19.Privacy - Column Level Masking - VDS
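Column-level masking can follow the same idea as row filtering: a CASE expression decides per user whether a sensitive column is returned in full or masked. The sketch below again uses Python's sqlite3 as a stand-in engine with a simplified, locally registered is_member(); the table, the phi_readers group, the masking format, and the sample values are assumptions made for illustration and are not taken from the slides.

```python
import sqlite3

# Hypothetical group membership used only for this illustration.
MASK_GROUPS = {"dana": {"phi_readers"}, "eli": {"analysts"}}

# Members of phi_readers get the real value; everyone else gets only the
# last four characters behind a fixed mask.
MASKED_SQL = """
SELECT beneficiary_id,
       CASE WHEN is_member('phi_readers') THEN ssn
            ELSE 'XXX-XX-' || substr(ssn, -4)
       END AS ssn
FROM beneficiaries
ORDER BY beneficiary_id
"""


def beneficiaries_as(user):
    """Return the beneficiary rows as `user` would see them."""
    conn = sqlite3.connect(":memory:")
    conn.create_function(
        "is_member", 1, lambda group: int(group in MASK_GROUPS.get(user, set()))
    )
    conn.execute("CREATE TABLE beneficiaries (beneficiary_id INTEGER, ssn TEXT)")
    conn.executemany(
        "INSERT INTO beneficiaries VALUES (?, ?)",
        [(1, "123-45-6789"), (2, "987-65-4321")],
    )
    try:
        return conn.execute(MASKED_SQL).fetchall()
    finally:
        conn.close()
```

Saving such a query as a virtual dataset would let consumers read the masked view while the physical dataset stays restricted.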
20.Centralized View - Data Catalog
• Ability to search for the data
• Collaboration experience using Wiki and content tagging
21.Data Lineage
22.Looker’s LookML = “SQL Evolved”
LookML is a language for describing dimensions, aggregates, calculations and data relationships in a SQL database.
23.Looker’s Explorer
24.LookML => SQL
25.Data Modeling with Looker
SQL models generated by Looker from LookML can be exported into Dremio to create virtual datasets.
26.Accessing Dremio from Databricks
• Add the Dremio JDBC driver JAR in Databricks
27.Accessing Dremio from Databricks
Use it!
• Driver
• Virtual Dataset
• Parallelism level
* https://docs.databricks.com/user-guide/secrets/example-secret-workflow.html
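The three callouts on the slide (driver, virtual dataset, parallelism level) map naturally onto Spark's JDBC reader options. The helper below is a hedged sketch rather than the code shown in the talk: the host, dataset path, user, partition column, bounds, and secret scope are hypothetical, and the function only assembles the options so it can run outside a Databricks notebook.

```python
def dremio_jdbc_options(host, dataset, user, password, partition_column=None,
                        lower_bound=None, upper_bound=None, num_partitions=None,
                        port=31010):
    """Build Spark JDBC reader options for a Dremio dataset (illustrative only)."""
    options = {
        "url": f"jdbc:dremio:direct={host}:{port}",
        "driver": "com.dremio.jdbc.Driver",  # class shipped in the Dremio JDBC jar
        "dbtable": dataset,                  # physical or virtual dataset path
        "user": user,
        "password": password,
    }
    if partition_column is not None:
        # Spark needs the column, both bounds and the partition count together
        # to split the read into parallel range queries.
        if None in (lower_bound, upper_bound, num_partitions):
            raise ValueError("partitioned reads need bounds and num_partitions")
        options.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower_bound),
            "upperBound": str(upper_bound),
            "numPartitions": str(num_partitions),
        })
    return options


# Hypothetical values; in a notebook the password would come from a secret.
opts = dremio_jdbc_options(
    "dremio.example.com", "governed.claims_masked", "svc_spark", "not-a-real-password",
    partition_column="claim_id", lower_bound=1, upper_bound=1000, num_partitions=8,
)

# In a Databricks notebook (not executed here, `spark` and `dbutils` exist only there):
# password = dbutils.secrets.get(scope="dremio", key="password")  # hypothetical scope/key
# df = spark.read.format("jdbc").options(**opts).load()
```

Reading the password through the secrets utility, as the linked Databricks page describes, keeps credentials out of notebook source.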
28.Demo
29.Demo