Apache Spark Data Governance Best Practices—Lessons Learned from Centers for Med
Apache Spark has become the de facto platform for processing large volumes of data from a variety of sources in diverse formats, serving equally varied destinations for Business Intelligence and Advanced Analytics. The Centers for Medicare and Medicaid Services (CMS) is a federal health agency within the Department of Health and Human Services (HHS). It is the single largest payer for health care in the United States, serving nearly 90 million Americans who rely on health care benefits through Medicare, Medicaid, and the State Children’s Health Insurance Program (CHIP). CMS recently adopted Apache Spark as its big data processing platform to ingest and analyze clinical and claims data and to produce healthcare models designed to improve patients’ health while reducing costs. The data come from multiple sources and contain Personally Identifiable Information (PII) and Protected Health Information (PHI), so a data governance framework with robust security controls is a must. At the same time, the platform must serve multiple business units, each with several roles that require different levels of access to the data. This presentation covers data governance best practices, including data security, data stewardship, and data quality management, using both open source and commercial tools, based on lessons learned from the Apache Spark implementation at CMS.