Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Med

In today’s digital age of data exploration, Apache Spark has become the de facto platform for processing large volumes of data from a variety of sources in diverse formats, serving equally disparate destinations for Business Intelligence and Advanced Analytics. The Centers for Medicare and Medicaid Services (CMS) is a federal health agency under the Department of Health and Human Services (HHS). It is the single largest payer for health care in the United States, serving nearly 90 million Americans who rely on health care benefits through Medicare, Medicaid, and the State Children’s Health Insurance Program (CHIP). CMS recently adopted Apache Spark as its big data processing platform to ingest and analyze clinical and claims data from various sources and to produce healthcare models designed to improve patients’ health while reducing costs. The data contain Personally Identifiable Information (PII) and Protected Health Information (PHI), so a data governance framework with robust security controls is a must. At the same time, the platform must serve multiple business units, each with several roles that require different levels of access to the data. This presentation covers data governance best practices, including data security, data stewardship, and data quality management, using both open source and commercial tools, based on lessons learned from the Apache Spark implementation at CMS.

2.Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Medicare and Medicaid Services Donghwa John Kim, NewWave #UnifiedAnalytics #SparkAISummit

3.Customers

4.About NewWave
• Mid-Size Business
• 300+ Employees
• 11 Prime Contracts
• Support 7 CMS Centers
• CMMI Level 4 for Services & Development
• ISO 9001:2015
• Prime Contract Vehicles: CMS SPARC – 8(a) & Small, GSA 8(A) STARS II, GSA Schedule 70 & Health IT SIN
• Databricks Gold Level Partner
• Microsoft Gold Cloud Platform
• AWS Advanced Consulting Partner

5.Technology Vendor Partners

6.Centers for Medicare & Medicaid Services (CMS)
CMS is the largest healthcare payer in the country, with a budget of $793.7B. NewWave is its trusted partner and leading innovator.

7.A unique customer that sets the standard for the industry and defines the healthcare market

8.Data Challenge
• 2 billion data points* annually to store, analyze and disseminate
• Privacy requirements (PHI, PII) without compromising agility
• Central view of available data on multiple systems
* Just on Medicare data

9.The Objectives
The vision is to provide a simple and reliable technology and data experience for all CMS IT Portfolio stakeholders.
• Center-wide shared data services
• Robust data governance
• Single cloud-native architecture

10.The Definition of Genius Is Taking the Complex and Making it Simple – Albert Einstein

11.Solution from a Bird’s Eye View

12.Data as a Service
• Data Agility
• Improved Data Quality
• Cost Effectiveness

13.Agility - Dremio Virtual Datasets
• Built on top of the immutable physical datasets found in sources
• A layered stack of data transformations performed on top of one or more physical datasets
• Each virtual dataset is ultimately described by a SQL query
• Virtual datasets can be chained
• Data Lineage - a history of all the applied transformations is available
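
The layering idea can be sketched with plain SQL views. This is only an analogy, assuming SQLite views stand in for Dremio virtual datasets; the table, view, and column names (raw_claims, paid_claims, paid_by_state) are hypothetical and do not come from the CMS implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical dataset: the layers above never modify it.
CREATE TABLE raw_claims (claim_id INTEGER, state TEXT, amount REAL, status TEXT);
INSERT INTO raw_claims VALUES
  (1, 'MD', 100.0, 'PAID'),
  (2, 'MD', 250.0, 'DENIED'),
  (3, 'VA', 400.0, 'PAID'),
  (4, 'VA', 50.0, 'PAID');

-- First virtual layer: a filter described entirely by a SQL query.
CREATE VIEW paid_claims AS
SELECT claim_id, state, amount FROM raw_claims WHERE status = 'PAID';

-- Second virtual layer, chained on top of the first: an aggregate.
CREATE VIEW paid_by_state AS
SELECT state, SUM(amount) AS total_paid FROM paid_claims GROUP BY state;
""")

# Query the top of the stack like any other table.
totals = conn.execute(
    "SELECT state, total_paid FROM paid_by_state ORDER BY state"
).fetchall()

# A crude form of lineage: the stored SQL text behind each view.
definitions = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'view'")
)

# The physical table is untouched by the layers built on it.
raw_count = conn.execute("SELECT COUNT(*) FROM raw_claims").fetchone()[0]
```

Because every layer is just a stored query, the definition of paid_by_state mentions paid_claims, which in turn mentions raw_claims; following those references is the simplest possible view of lineage.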

14.Agility - Dremio Virtual Dataset Example #UnifiedAnalytics #SparkAISummit 14

15.Simplicity - SQL for [almost] EVERYTHING
• Ability to join data from multiple data sources including JSON, CSV, Parquet, relational databases and NoSQL
• Unified interface for the data
And suddenly ... SQL is sexy again!
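
A small illustration of one SQL interface over differently formatted inputs. This sketch assumes an in-memory SQLite database as a stand-in: here the JSON and CSV data are loaded into tables first, whereas a query engine such as Dremio exposes the sources directly. The sample data and the names providers, claims, and npi are made up for the example.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical inputs in different formats.
providers_json = '[{"npi": "1001", "name": "Clinic A"}, {"npi": "1002", "name": "Clinic B"}]'
claims_csv = "claim_id,npi,amount\n1,1001,120.50\n2,1002,80.00\n3,1001,40.25\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE providers (npi TEXT, name TEXT)")
conn.execute("CREATE TABLE claims (claim_id INTEGER, npi TEXT, amount REAL)")

# Load each source; the named placeholders match the dictionary keys.
conn.executemany(
    "INSERT INTO providers VALUES (:npi, :name)", json.loads(providers_json)
)
conn.executemany(
    "INSERT INTO claims VALUES (:claim_id, :npi, :amount)",
    csv.DictReader(io.StringIO(claims_csv)),
)

# One SQL statement joins data that started life as JSON and CSV.
rows = conn.execute("""
    SELECT p.name, SUM(c.amount) AS total
    FROM claims c JOIN providers p ON p.npi = c.npi
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```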

16.Simplicity - SQL for [almost] EVERYTHING

17.Privacy - Row Level Masking
Use query_user() and is_member() for selective filtering of rows for different users or groups without having to create multiple datasets.
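
The row filtering described here can be simulated locally. This is a sketch under stated assumptions, not Dremio itself: query_user() and is_member() are registered as user-defined functions on a SQLite connection, backed by a hypothetical session dictionary, and the claims table, steward column, and auditors group are invented for the example.

```python
import sqlite3

# Hypothetical session state; in Dremio the engine supplies the user and groups.
session = {"user": "alice", "groups": {"analysts"}}

conn = sqlite3.connect(":memory:")
# Stand-ins for the two functions named on the slide.
conn.create_function("query_user", 0, lambda: session["user"])
conn.create_function("is_member", 1, lambda group: int(group in session["groups"]))

conn.executescript("""
CREATE TABLE claims (claim_id INTEGER, steward TEXT, amount REAL);
INSERT INTO claims VALUES (1, 'alice', 120.0), (2, 'bob', 75.5), (3, 'alice', 310.0);

-- One dataset for everyone: auditors see all rows, others only their own.
CREATE VIEW claims_secure AS
SELECT * FROM claims
WHERE is_member('auditors') OR steward = query_user();
""")


def visible_claim_ids(user, groups):
    """Return the claim ids the given user can see through the secured view."""
    session["user"] = user
    session["groups"] = set(groups)
    cursor = conn.execute("SELECT claim_id FROM claims_secure ORDER BY claim_id")
    return [row[0] for row in cursor]
```

The same view serves every caller: visible_claim_ids("bob", ["analysts"]) yields only bob's row, while a member of the auditors group sees all three.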

18.Privacy - Column Level Masking

19.Privacy - Column Level Masking - VDS
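
The masking expression shown on these two slides is not preserved in the transcript, so the following is only a guess at the general shape: a view that returns a column in clear text to members of a privileged group and a masked value to everyone else. It again uses SQLite with a simulated is_member(); the beneficiaries table, the ssn column, and the phi_readers group are hypothetical.

```python
import sqlite3

# Hypothetical session state standing in for the engine's notion of group membership.
session = {"groups": set()}

conn = sqlite3.connect(":memory:")
conn.create_function("is_member", 1, lambda group: int(group in session["groups"]))

conn.executescript("""
CREATE TABLE beneficiaries (bene_id INTEGER, ssn TEXT);
INSERT INTO beneficiaries VALUES (1, '123-45-6789'), (2, '987-65-4321');

-- The virtual dataset exposes the full value only to privileged readers;
-- everyone else gets the last four digits.
CREATE VIEW beneficiaries_masked AS
SELECT bene_id,
       CASE WHEN is_member('phi_readers') THEN ssn
            ELSE 'XXX-XX-' || substr(ssn, -4)
       END AS ssn
FROM beneficiaries;
""")


def read_ssns(groups):
    """Return the ssn column as seen by a user with the given groups."""
    session["groups"] = set(groups)
    cursor = conn.execute("SELECT ssn FROM beneficiaries_masked ORDER BY bene_id")
    return [row[0] for row in cursor]
```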

20.Centralized View - Data Catalog
• Ability to search for the data
• Collaboration experience using Wiki and content tagging

21.Data Lineage

22.Looker’s LookML = “SQL Evolved”
LookML is a language for describing dimensions, aggregates, calculations and data relationships in a SQL database.

23.Looker’s Explorer

24.LookML => SQL

25.Data Modeling with Looker
SQL models generated by Looker from LookML can be exported into Dremio to create virtual datasets.

26.Accessing Dremio from Databricks
• Adding Dremio JDBC Driver jar in Databricks

27.Accessing Dremio from Databricks
Use it! *
Callouts: Driver, Virtual Dataset, Parallelism level
* https://docs.databricks.com/user-guide/secrets/example-secret-workflow.html
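
The code on this slide is not preserved in the transcript. The sketch below shows one way the three callouts (driver, virtual dataset, parallelism level) could map onto Spark's JDBC reader options. The driver class com.dremio.jdbc.Driver, the jdbc:dremio:direct= URL form, and port 31010 are assumptions based on Dremio's JDBC driver as commonly documented; the host, dataset path, credentials, and partition bounds are hypothetical placeholders.

```python
def dremio_jdbc_options(host, dataset, user, password,
                        partition_column, lower_bound, upper_bound,
                        num_partitions, port=31010):
    """Build Spark JDBC reader options for a Dremio virtual dataset.

    The URL format and driver class are assumptions; check them against
    the Dremio JDBC driver jar actually installed on the cluster.
    """
    return {
        "url": f"jdbc:dremio:direct={host}:{port}",
        "driver": "com.dremio.jdbc.Driver",   # the "Driver" callout
        "dbtable": dataset,                    # the "Virtual Dataset" callout
        "user": user,
        "password": password,
        # The "Parallelism level" callout: Spark splits the read into
        # numPartitions range queries over partitionColumn.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_virtual_dataset(spark, options):
    """Load the dataset as a DataFrame using an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# In a Databricks notebook the credentials would normally come from a secret
# scope, for example dbutils.secrets.get(scope="...", key="..."), rather than
# from literals as in this placeholder call.
options = dremio_jdbc_options(
    host="dremio.example.internal",
    dataset='"analytics"."paid_by_state"',
    user="svc_user",
    password="placeholder",
    partition_column="claim_id",
    lower_bound=1,
    upper_bound=1_000_000,
    num_partitions=8,
)
```

In a notebook, read_virtual_dataset(spark, options) would then return a DataFrame backed by the virtual dataset, read over eight parallel JDBC connections.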

28.Demo

29.Demo