Splice Machine’s use of Apache Spark and MLflow
1.WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
2.Splice Machine’s use of Apache Spark and MLflow Gene Davis, Splice Machine #UnifiedAnalytics #SparkAISummit
3.Splice Machine
• What are we?
  – A scale-out RDBMS that enables simultaneous transactions (OLTP) and analytics (OLAP)
  – Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
  – Companies in financial services, healthcare, supply chain, etc.
  – One example: 7 PB, 2B record updates/day, 2M queries/day with sub-second response time
• How do we do it?
  – Transactional SQL engine on top of HBase and Spark
    • "Dual engine" architecture
  – Many delivery options: on-premise or cloud service (AWS, Azure, bespoke cloud, etc.)
4.Operational AI
Integrated data platform for real-time AI applications
• Operational Database (Operational Intelligence): Scale-out, OLTP, Fast
• Enterprise Data Warehouse (Business Intelligence): In-Memory, OLAP, Massively Parallel
• ML Models (Artificial Intelligence): Notebooks, Algorithms, Model Workflow
Outcome: Intelligent Decisions
Deployment: On Premise
5.The Three Dimensions of Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?
6.The Three Dimensions of Intelligence
Key platforms (OLAP, OLTP, ML) are duct-taped together, leading to:
• High Infrastructure Costs
• Latency in Decision Making
• Isolation from Business Processes
7.Intelligent Action - Before
8.Intelligent Action - After
9.Data Science Pain Points
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
10.Data Science Pain Points
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?

● What data did I use?
● What algorithms/parameters gave the best model?
● Why didn’t I get the same results?
● What libraries are used?
● What model version is deployed?
11.MLflow and ML Manager
• Splice Machine chose MLflow
  – MLflow Tracking: track experiment runs and parameters
  – MLflow Models: package model artifacts
• Splice ML Manager
  – Machine Learning on the Splice Machine Stack
  – MLflow Tracking and Models
  – Includes UI to deploy to Amazon SageMaker
12.ML Manager Architecture
Diagram labels: Deployment Automation; Native Spark Data Source; Splice Machine Data Platform; On Premises
13.Native Spark Datasource
• Efficient interface from the Splice relational tables into Spark DataFrames (and back again)
• No serialization/deserialization
• Examples:
  – interestingDf = spliceContext.df("select * from interesting_table")
  – spliceContext.insert(dfWithData, 'table_name')
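The two calls on the slide follow a read, transform, write-back pattern. The sketch below imitates only that call shape: Python's built-in sqlite3 module stands in for the Splice Machine tables, and plain lists of dicts stand in for Spark DataFrames. The class name SqlRoundTrip, the column names, and the sample rows are invented for illustration. It does not use pysplice and does not reproduce the serialization-free transfer the slide describes.

```python
import sqlite3
from typing import Any, Dict, List


class SqlRoundTrip:
    """Hypothetical stand-in mirroring the df()/insert() call shapes on the slide."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Return rows that can be converted to dicts keyed by column name.
        self.conn.row_factory = sqlite3.Row

    def df(self, query: str) -> List[Dict[str, Any]]:
        """Run a query and return its rows as a list of dicts."""
        return [dict(row) for row in self.conn.execute(query)]

    def insert(self, rows: List[Dict[str, Any]], table: str) -> None:
        """Append rows to an existing table (table name must be trusted input)."""
        if not rows:
            return
        columns = list(rows[0].keys())
        placeholders = ", ".join("?" for _ in columns)
        sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
        self.conn.executemany(sql, [tuple(r[c] for c in columns) for r in rows])
        self.conn.commit()


# Made-up schema and data, only to exercise the round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interesting_table (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE table_name (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO interesting_table VALUES (?, ?)", [(1, 10.0), (2, 250.0)])

ctx = SqlRoundTrip(conn)
interesting = ctx.df("select * from interesting_table")   # table -> rows
large = [r for r in interesting if r["amount"] > 100]      # transform in memory
ctx.insert(large, "table_name")                            # rows -> table
result = ctx.df("select * from table_name")
```

In the real datasource the middle step would be Spark DataFrame operations rather than a list comprehension.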
14.Accessing MLflow Capabilities
• Start with Splice’s MLManager
  – manager = MLManager()
  – Convenience class on top of MLflow
• APIs
  – manager.create_experiment()
  – manager.set_active_experiment()
  – manager.create_new_run()
  – manager.log_param()
  – manager.log_metric()
  – manager.log_spark_model()
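The method names listed on the slide suggest a workflow: create an experiment, make it active, open a run, then log parameters and metrics against that run. The sketch below walks through that workflow with a small in-memory class. InMemoryTracker, its argument lists, the best_run helper, the experiment name, and the metric values are all assumptions made for illustration. The slide does not show MLManager's signatures, and this class is neither pysplice nor MLflow. Model logging (log_spark_model) is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Run:
    """One training attempt: the parameters used and the metrics observed."""
    params: Dict[str, str] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)


class InMemoryTracker:
    """Hypothetical stand-in that reuses the method names from the slide."""

    def __init__(self) -> None:
        self.experiments: Dict[str, List[Run]] = {}
        self.active_experiment: Optional[str] = None
        self.active_run: Optional[Run] = None

    def create_experiment(self, name: str) -> None:
        self.experiments.setdefault(name, [])

    def set_active_experiment(self, name: str) -> None:
        if name not in self.experiments:
            raise KeyError(f"unknown experiment: {name}")
        self.active_experiment = name

    def create_new_run(self) -> Run:
        if self.active_experiment is None:
            raise RuntimeError("set an active experiment first")
        run = Run()
        self.experiments[self.active_experiment].append(run)
        self.active_run = run
        return run

    def _require_run(self) -> Run:
        if self.active_run is None:
            raise RuntimeError("create a run first")
        return self.active_run

    def log_param(self, key: str, value: object) -> None:
        self._require_run().params[key] = str(value)

    def log_metric(self, key: str, value: float) -> None:
        self._require_run().metrics[key] = float(value)

    def best_run(self, metric: str) -> Run:
        """Return the run in the active experiment with the highest `metric`."""
        if self.active_experiment is None:
            raise RuntimeError("set an active experiment first")
        runs = [r for r in self.experiments[self.active_experiment] if metric in r.metrics]
        return max(runs, key=lambda r: r.metrics[metric])


manager = InMemoryTracker()
manager.create_experiment("claims")            # hypothetical experiment name
manager.set_active_experiment("claims")
for depth, auc in [(3, 0.81), (5, 0.86), (8, 0.84)]:   # made-up values
    manager.create_new_run()
    manager.log_param("max_depth", depth)
    manager.log_metric("auc", auc)

best = manager.best_run("auc")
```

Recording every run this way is what lets a question such as "What algorithms/parameters gave the best model?" be answered by lookup rather than from memory.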
15.MLflow UI
16.Deployment Automation
17.ML Manager
• Beta launched in March
  – MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API open source at:
  – https://github.com/splicemachine/pysplice
  – (subject to change per MLflow 1.0 API)
18.DEMO
19.DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
20.Many Disparate Tools
• Data Sources: OLTP - Oracle, Cassandra, Dynamo; OLAP - Redshift, Snowflake, S3
• Notebooks: Apache Zeppelin, Jupyter
• Data Manipulation: Python Pandas, Spark
• Machine Learning: Scikit, MLlib, R
• Experimentation Tracking: MLflow
• Deployment: SageMaker, AzureML
21.Insurance Claim Example
22.Insurance Claim Example