Building Real-Time Data Pipeline
1.Building Real-Time Data Pipeline For Diabetes Medication Recommender System Using Databricks. Arivoli Tirouvingadame, Data Platform Engineer, Qventus; Jayaradha Natarajan, Sr. Data Engineer, Change Healthcare. #DevSAIS17
2.$whoami
• Jayaradha Natarajan: Sr. Data Engineer, Change Healthcare; www.github.com/jayaradha; Organizer, Data Riders meetup group, www.meetup.com/datariders
• Arivoli Tirouvingadame: Data Platform Engineer, Qventus; Open Source Committer, http://www.github.com/olisource, https://l10n.gnome.org/teams/ta/; Organizer, Data Riders meetup group, www.meetup.com/datariders
3.AI/ML in Healthcare “AI will be ubiquitous in healthcare by 2025” https://www.techemergence.com/machine-learning-in-healthcare-executive-consensus/
4.Healthcare Data sources: Patient Visit, Prescription, Lab, IoT Sensors, R&D
5.“We are in the early days of AI assisting Physicians better prescribe medication” https://www.truveris.com/resources/ai-in-healthcare-helping-physicians-better-prescribe-treatments
6.Diabetes today and tomorrow
- Current: 1 in 11 adults is diabetic
- By 2040: The diabetic population is expected to be twice the population of the USA
https://www.alchemyfoodtech.com/copy-of-diabetes-epidemic
7.A day in the life of a diabetes patient: Problem, Challenge, Symptoms
8.How can we prescribe Diabetes medication better in near real-time?
9.Solution - Use a Big Data pipeline to collect the patient's blood glucose level and medication before/after food, and predict better medication in near real-time
o Data Collection: Collect sensor data (wearable devices)
o Model: Model data using ML algorithms
o Predict: Predict medication & alert patient’s mobile device
10.Glucose Monitors: Non-meter test strips; Hospital glucose meters; Blood testing with meters using test strips; Noninvasive meters; Continuous glucose monitors
11.Data ingestion
o Typically, raw data can be structured, semi-structured, or unstructured, with or without errors
o IoT devices such as Continuous Glucose Monitors produce structured data, with or without errors
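As an illustration of structured sensor data arriving with or without errors, the following is a minimal sketch that parses JSON records and sets malformed ones aside instead of dropping them silently. The record layout (`patient_id`, `timestamp`, `glucose_mg_dl`) and the function names are assumptions made for this example; the slides do not specify a schema.

```python
import json
from typing import NamedTuple, Optional


class GlucoseReading(NamedTuple):
    patient_id: str        # hypothetical field name
    timestamp: str         # kept as text; assumed ISO-8601
    glucose_mg_dl: float   # hypothetical field name and unit


def parse_reading(raw: str) -> Optional[GlucoseReading]:
    """Return a GlucoseReading, or None if the record is malformed."""
    try:
        obj = json.loads(raw)
        return GlucoseReading(
            patient_id=str(obj["patient_id"]),
            timestamp=str(obj["timestamp"]),
            glucose_mg_dl=float(obj["glucose_mg_dl"]),
        )
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field, or a non-numeric glucose value.
        return None


def ingest(lines):
    """Split raw lines into parsed readings and rejected raw records."""
    good, rejected = [], []
    for line in lines:
        reading = parse_reading(line)
        if reading is None:
            rejected.append(line)
        else:
            good.append(reading)
    return good, rejected
```

Keeping the rejected records, rather than discarding them, leaves room to audit how often a device sends bad data.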
12.Data Storage and Cleansing: Sensor Data -> Raw Data Storage -> Data Cleansing -> Cleansed Data Storage (Blood glucose level, Calorie intake, Age) -> Model Storage -> Recommendation/Score storage
13.Data Cleansing and modeling
o Data cleansing uses statistical analysis tools to read and audit data based on a list of pre-defined constraints.
o Flow: Streaming Data -> Range check -> Validate Data -> Split Data -> Training data / Test data
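The range check and train/test split in this flow can be sketched in plain Python as follows. The bounds `GLUCOSE_MIN` and `GLUCOSE_MAX`, the 80/20 split, and the function names are assumptions chosen for illustration; they are not taken from the talk and are not clinical guidance.

```python
import random

# Assumed plausibility bounds in mg/dL; illustrative only.
GLUCOSE_MIN = 20.0
GLUCOSE_MAX = 600.0


def in_range(value, low=GLUCOSE_MIN, high=GLUCOSE_MAX):
    """Range check: a pre-defined constraint on a single reading."""
    return low <= value <= high


def cleanse(readings):
    """Keep only readings that satisfy the range constraint."""
    return [r for r in readings if in_range(r)]


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then split into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

A fixed seed keeps the split reproducible between runs, which helps when comparing models trained on the same cleansed data.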
14.ARCHITECTURE
15.Reference Architecture (diagram labels): EMR; Raw Data -> Transformation/Cleansing -> Clean Data -> Train -> Model
16.Reference Architecture (diagram labels): EMR; Raw Data -> Transformation/Cleansing -> Clean Data -> Train -> Model -> Prediction
17.Architecture components
o Kafka: Get sensor data in real-time from wearable devices
o Apache Spark: Ingest raw data through Kafka. Use Structured Streaming for data verification, validation, cleansing, enrichment, etc., and store the results in S3 buckets
o MLlib: Process the data stored in S3 buckets with machine learning libraries, so that insulin intake can be recommended
o AWS: Deploy the model and other related services on EC2, EMR, etc.
o Mobile or Web App: Notify patients with medication recommendations
o D3/Tableau: Visualize via charts/dashboards
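The Kafka-to-Spark-to-S3 leg of these components might look like the sketch below, which uses Spark's Kafka source for Structured Streaming and a Parquet file sink. It expects an existing `SparkSession` to be passed in as `spark`, with the Kafka connector package available on the cluster. The function names, broker address, topic, and bucket paths are hypothetical, and the talk's actual job is not shown in the slides.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's Kafka source (Structured Streaming)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_raw_ingest(spark, options, output_path, checkpoint_path):
    """Read raw sensor messages from Kafka and append them to storage.

    `spark` is an existing SparkSession; the paths are placeholders
    for S3 locations chosen by the deployment.
    """
    raw = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers the payload as bytes in the `value` column.
    decoded = raw.selectExpr("CAST(value AS STRING) AS json", "timestamp")
    return (
        decoded.writeStream
        .format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

A call could look like `start_raw_ingest(spark, kafka_source_options("broker:9092", "glucose-readings"), "s3a://example-bucket/raw/", "s3a://example-bucket/checkpoints/raw/")`, where every argument is a placeholder. Validation, cleansing, and enrichment would be applied to `decoded` before the write.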
18.Pain points
o Maintaining multiple root accounts for Dev, Pre-Prod and Prod environments is expensive
o Choosing HIPAA-compliant services is hard (most serverless technologies are not HIPAA compliant)
o We have to build a secure network from scratch and maintain it (for example, using Terraform, CloudFormation, etc.)
o End-to-end encryption: Data-in-flight and Data-at-rest encryption
19.HIPAA Challenges
o HIPAA requires Healthcare Data to be protected.
o Ensure the confidentiality, integrity, and availability of Protected Health Information (PHI) created, received, maintained, or transmitted.
o Protect against any reasonably anticipated threats and hazards to the security or integrity of PHI.
o Protect against reasonably anticipated uses or disclosures of PHI not permitted by the Privacy Rule.
20.DATABRICKS PIPELINE
21.Databricks – Kinesis - Connector (diagram components): Kinesis, Structured Streaming, Spark ML, AWS Lambda, API Gateway
22.Databricks – Kafka - Connector (diagram): Kafka Connector -> Raw Data -> Spark to clean data -> Cleansed Data -> Spark ML -> Train -> Model -> Prediction
23.Deployment
o Hybrid only or single tenant
o Selected AWS BAA HIPAA services
o Databricks auxiliary services (web app and cluster management software) would be in a Databricks-owned AWS account and run on a dedicated VPC instance.
o Spark clusters would continue to be deployed to the customer's AWS account and on dedicated instances.
o End-to-end encryption: Data-in-flight and Data-at-rest encryption
o Logging and Monitoring
o Audit
https://docs.databricks.com/user-guide/advanced/hipaa-compliant-deployment.html
24.DEMO
25.Mobile App
26.Visualizations
29.Future directions
o Health: Extend the approach to any medication-management solution, including emergency medication management
o Wellness: Predict calorie intake
o Fitness: Predict the workouts needed