Operationalizing Edge Machine Learning with Apache Spark

Apache Spark runs natively on Kubernetes, a resource manager that provides isolation, elasticity, and scaling. Kubernetes has strong support for containerized microservices, machine learning frameworks, and many other workloads. By using Kubernetes as a unified resource manager, Spark can interoperate with other resource managers and services to run complex data pipelines on robust cloud infrastructure.
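As a concrete illustration, a Spark job is typically submitted to a Kubernetes cluster with `spark-submit`. The `--master k8s://` URL form and the `spark.kubernetes.container.image` and `spark.executor.instances` settings are standard Spark options; the API-server address, image name, and script path below are placeholders, and the command is only assembled here, not executed.

```python
# Minimal sketch of submitting a Spark job to Kubernetes. The master URL,
# container image, and script path are placeholders, not a real cluster.
cmd = [
    "spark-submit",
    "--master", "k8s://https://kube-apiserver:6443",    # Kubernetes API server
    "--deploy-mode", "cluster",
    "--conf", "spark.kubernetes.container.image=example/spark:3.5.1",
    "--conf", "spark.executor.instances=4",             # elastic executor scaling
    "local:///opt/spark/app/pipeline.py",               # script baked into the image
]
# subprocess.run(cmd, check=True)  # would require a reachable cluster
```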

1. Operationalizing Edge Machine Learning with Apache Spark (ParallelM)

2. Growing AI Investments; Few Deployed at Scale
Out of 160 reviewed AI use cases, 88% did not progress beyond the experimental stage. Only 20% of AI adopters have AI in production; 80% are developing, experimenting, or contemplating. But successful early AI adopters report profit margins 3–15% higher than the industry average.
Source: "Artificial Intelligence: The Next Digital Frontier?", McKinsey Global Institute, June 2017; survey of 3,073 AI-aware C-level executives.

3. Challenges of Deploying & Managing ML in Production
• Diverse focus and expertise of Data Science & Ops teams
• Increased risk from the non-deterministic nature of ML
• Current operations solutions do not address the uniqueness of ML applications

4. Challenges of Edge/Distributed Topologies
IoT is driving explosive growth in data volume. [Diagram: "Things" → Edge and Network → Data Center/Cloud → Data Lake]
• Varied resources at each level
• Scale, heterogeneity, disconnected operation

5. What We Need for Operational ML
• Accelerate deployment & facilitate collaboration between Data and Ops teams
• Monitor validity of ML predictions; diagnose data and ML performance issues
• Orchestrate training, update, and configuration of ML pipelines across distributed, heterogeneous infrastructure, with tracking

6. What We Need for Edge Operational ML
[Diagram: Edge/Cloud Orchestration spanning Central/Cloud Intelligence (streams, batches, Data Lake) and Edge Intelligence (sources)]
• Distribute analytics processing to the optimal point for each use case
• A flexible management framework enables:
  • Secure centralized and/or local learning, prediction, or combined learning/prediction
  • Granular monitoring and control of model update policies
• Support multi-layer topologies to achieve maximum scale while accommodating low-bandwidth or unreliable connectivity

7. MLOps – Managing the Full Production ML Lifecycle
[Diagram: a cycle centered on Business Value, linking ML Orchestration, Continuous Integration/Deployment, Machine Learning Models Database, ML Health, Model Governance, and Business Impact]

8. Our Approach
[Diagram: the MCenter Server, with MCenter Developer Connectors to data science platforms and more; MCenter Agents on each analytic engine exchange models, retraining, control, statistics, events, and alerts with the server; the engines consume data from data streams and data lakes]

9. Operational Abstraction
• Link pipelines (training and inference) via an ION (Intelligence Overlay Network)
• Basically a directed-graph representation with allowance for cycles; model updates are policy based
• Pipelines are DAGs within each engine
• Distributed execution over heterogeneous engines, programming languages, and geographies
Example: KMeans batch training plus streaming inference anomaly detection
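The ION abstraction described above can be sketched as a small graph structure. This is a hypothetical illustration, not the MCenter API: the `Pipeline`, `ION`, `add`, and `link` names, and the policy strings, are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    name: str
    engine: str   # e.g. a particular Spark cluster; engines may be heterogeneous
    kind: str     # "training" or "inference"

@dataclass
class ION:
    """An Intelligence Overlay Network: a directed graph of pipelines."""
    pipelines: dict = field(default_factory=dict)
    links: list = field(default_factory=list)   # (src, dst, policy) edges

    def add(self, p: Pipeline):
        self.pipelines[p.name] = p

    def link(self, src: str, dst: str, policy: str):
        # Unlike the DAG pipelines inside each engine, ION links may form
        # cycles (e.g. inference feeding anomalies back into retraining).
        self.links.append((src, dst, policy))

# The KMeans example from the slide: batch training plus streaming inference.
ion = ION()
ion.add(Pipeline("kmeans-train", "spark-cluster-2", "training"))
ion.add(Pipeline("kmeans-infer", "spark-cluster-1", "inference"))
ion.link("kmeans-train", "kmeans-infer", policy="policy-based model update")
ion.link("kmeans-infer", "kmeans-train", policy="anomaly feedback")  # cycle allowed
```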

10. An Example ION-to-Resource Mapping
[Diagram: a human-approved cloud model update, run every Tuesday at 10AM, feeding an every-5-minutes edge inference loop, mapped across Central/Cloud Intelligence (streams, batches, Data Lake) and Edge Intelligence (sources)]

11. Pipeline Examples
[Screenshots: a training pipeline (SparkML) and an inference pipeline (SparkML)]

12. Instrument, Upload, Orchestrate, Monitor

13. Integrating with Analytics Engines (Spark)
Job Management
• Via SparkLauncher, a library for launching, monitoring, and terminating Spark jobs
• The PM Agent communicates with Spark through this library for job management (it also uses the Java API to launch child processes)
Statistics
• Via SparkListener, a Spark-driver callback service
• SparkListener taps into all accumulators, which are one of the popular ways to expose statistics
• The PM Agent communicates with the Spark driver and exposes statistics via a REST endpoint
ML Health / Model Collection and Updates
• The PM Agent delivers and receives health events, health objects, and models via sockets from custom PM components in the ML pipeline
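The job-management lifecycle on this slide is implemented with Spark's Java SparkLauncher; as a rough Python analogue, the sketch below shows the same launch/monitor/terminate pattern with a plain child process standing in for a `spark-submit` invocation. The function names and timeout are illustrative, not part of any PM Agent API.

```python
import subprocess
import sys
import time

def launch(cmd):
    """Start a job as a child process (stand-in for SparkLauncher.startApplication)."""
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

def monitor(proc, timeout=10.0):
    """Poll until the job exits or the timeout expires, then terminate it."""
    deadline = time.time() + timeout
    while proc.poll() is None and time.time() < deadline:
        time.sleep(0.1)
    if proc.poll() is None:
        proc.terminate()   # analogous to killing a runaway Spark job
    return proc.wait()

# A trivial child process stands in for:
#   launch(["spark-submit", "--master", "k8s://...", "pipeline.py"])
proc = launch([sys.executable, "-c", "print('job done')"])
exit_code = monitor(proc)
```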

14. Demo Description
[Screenshots: the training and inference pipelines used in the demo]

15. Thank You!
Nisha Talagala, nisha.talagala@parallelm.com
Vinay Sridhar, vinay.sridhar@parallelm.com

17. Integrating with Analytics Engines (TensorFlow)
Job Management
• TensorFlow Python programs run as standalone applications
• Standard OS-based process-control mechanisms are used to monitor and control TensorFlow programs
Statistics Collection
• The PM Agent parses TensorBoard log files to extract the meaningful statistics and events that data scientists added
ML Health / Model Collection
• Generation of models and health objects is recorded on a shared medium
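To show the shape of the statistics-extraction step, here is a simplified sketch. Real TensorBoard event files are protobuf records (read with TensorBoard's event-file iterator APIs), so a plain-text log format is substituted here purely so the extraction logic is visible; the `step=/tag=/value=` line format and function name are invented for the example.

```python
import os
import re
import tempfile

# Pattern for the illustrative plain-text log format (not real TensorBoard data).
SCALAR = re.compile(r"step=(\d+)\s+tag=(\S+)\s+value=([\d.]+)")

def extract_stats(path):
    """Collect per-tag (step, value) series from a metrics log file."""
    stats = {}
    with open(path) as f:
        for line in f:
            m = SCALAR.search(line)
            if m:
                step, tag, value = int(m[1]), m[2], float(m[3])
                stats.setdefault(tag, []).append((step, value))
    return stats

# Write a tiny sample log and parse it back.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write("step=1 tag=loss value=0.9\nstep=2 tag=loss value=0.5\n")
    path = f.name
stats = extract_stats(path)
os.unlink(path)
```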

18. An Example ION
• Node 1: inference pipeline, run every 5 minutes on Spark cluster 1
• Node 2: (re)training pipeline, run every Tuesday at 10AM on Spark cluster 2
• Node 3: policy, triggered any time there is a new model: a human approves or rejects the model
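The policy node in this example ION can be sketched as an approval gate: a model produced by the retraining pipeline is held until a human approves it, and only approved models become visible to the inference pipeline. The class and method names here are illustrative, not MCenter's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Model:
    version: int
    approved: bool = False

class ApprovalPolicy:
    """Node 3: hold each new model for human approval before deployment."""

    def __init__(self):
        self.deployed = None   # model the inference pipeline currently uses
        self.pending = None    # model awaiting review

    def on_new_model(self, model):
        # Fired any time the retraining pipeline emits a model.
        self.pending = model

    def human_review(self, approve: bool):
        if self.pending is not None and approve:
            self.pending.approved = True
            self.deployed = self.pending   # picked up on the next inference cycle
        self.pending = None                # rejected models are discarded

policy = ApprovalPolicy()
policy.on_new_model(Model(version=2))
policy.human_review(approve=False)   # rejected: deployment unchanged
policy.on_new_model(Model(version=3))
policy.human_review(approve=True)    # approved: version 3 is deployed
```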