Splice Machine’s use of Apache Spark and MLflow
2 .Splice Machine’s use of Apache Spark and MLflow
Gene Davis, Splice Machine
#UnifiedAnalytics #SparkAISummit
3 .Splice Machine
• What are we?
  – A scale-out RDBMS that enables simultaneous transactions (OLTP) and analytics (OLAP)
  – Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
  – Companies in financial services, healthcare, supply chain, etc.
  – One example: 7PB, 2B record updates/day, 2M queries/day with sub-second response time
• How do we do it?
  – Transactional SQL engine on top of HBase and Spark
    • "Dual engine" architecture
  – Many delivery options: on-premise or cloud service (AWS, Azure, bespoke cloud, etc.)
4 .Operational AI
Integrated data platform for real-time AI applications
[Diagram: three components combine to drive INTELLIGENT DECISIONS]
• Operational Database (OPERATIONAL INTELLIGENCE): Scale-out, OLTP, Fast
• Enterprise Data Warehouse (BUSINESS INTELLIGENCE): In-Memory, OLAP, Massively Parallel
• ML Models (ARTIFICIAL INTELLIGENCE): Notebooks, Algorithms, Model Workflow
• Deployment label: On Premise
5 .The Three Dimensions of Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?
6 .The Three Dimensions of Intelligence
Key platforms (OLAP, OLTP, ML) are duct-taped together, leading to:
• High Infrastructure Costs
• Latency in Decision Making
• Isolation from Business Processes
7 .Intelligent Action - Before
8 .Intelligent Action - After
9 .Data Science Pain Points
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
10 .Data Science Pain Points
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
● What data did I use?
● What algorithms/parameters gave the best model?
● Why didn’t I get the same results?
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
● What libraries are used?
● What model version is deployed?
11 .MLflow and ML Manager
• Splice Machine chose MLflow
  – MLflow Tracking: track experiment runs and parameters
  – MLflow Models: package model artifacts
• Splice ML Manager
  – Machine Learning on the Splice Machine Stack
  – MLflow Tracking and Models
  – Includes UI to Deploy to Amazon SageMaker
12 .ML Manager Architecture
[Diagram labels: Deployment Automation; Native Spark Data Source; Splice Machine Data Platform; On Premises]
13 .Native Spark Datasource
• Efficient interface from the Splice relational tables into Spark DataFrames (and back again)
• No serialization/deserialization
• Examples:
  – interestingDf = spliceContext.df("select * from interesting_table")
  – spliceContext.insert(dfWithData, 'table_name')
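The read-then-write-back pattern on this slide can be sketched without a cluster. The snippet below is a plain-Python stand-in, not the Splice Machine API: `SketchContext` is a hypothetical class backed by an in-memory SQLite database, and lists of dicts stand in for Spark DataFrames. Only the method names `df` and `insert` and the table names are taken from the slide. The table schema and values are made up for illustration.

```python
import sqlite3


class SketchContext:
    """Hypothetical stand-in for the slide's spliceContext (not the real API)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.row_factory = sqlite3.Row

    def df(self, sql):
        # Run a query and return the rows as a list of dicts
        # (a stand-in for a Spark DataFrame).
        return [dict(row) for row in self.conn.execute(sql)]

    def insert(self, rows, table_name):
        # Write rows (dicts sharing the same keys) back into a table.
        if not rows:
            return
        columns = list(rows[0])
        placeholders = ", ".join("?" for _ in columns)
        sql = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"
        self.conn.executemany(sql, [tuple(r[c] for c in columns) for r in rows])
        self.conn.commit()


ctx = SketchContext()
# Assumed schema, for illustration only.
ctx.conn.execute("CREATE TABLE interesting_table (id INTEGER, amount REAL)")
ctx.conn.execute("CREATE TABLE table_name (id INTEGER, amount REAL)")
ctx.conn.executemany(
    "INSERT INTO interesting_table VALUES (?, ?)", [(1, 10.0), (2, 25.5)]
)

# Table -> in-memory rows -> transformed rows -> table, mirroring the slide's two calls.
interesting_rows = ctx.df("select * from interesting_table")
doubled = [{"id": r["id"], "amount": r["amount"] * 2} for r in interesting_rows]
ctx.insert(doubled, "table_name")
result = ctx.df("select * from table_name order by id")
```

In the real datasource the slide claims the transfer avoids serialization/deserialization; this stand-in only mirrors the call pattern and says nothing about performance.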
14 .Accessing MLflow Capabilities
• Start with Splice’s MLManager
  – manager = MLManager()
  – Convenience class on top of MLflow
• APIs
  – manager.create_experiment()
  – manager.set_active_experiment()
  – manager.create_new_run()
  – manager.log_param()
  – manager.log_metric()
  – manager.log_spark_model()
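The order in which these calls would typically be used can be sketched with an in-memory stand-in. `SketchManager` below is hypothetical and is not the pysplice `MLManager`: the method names come from the slide, but every argument list is an assumption, `best_run` is an added helper, and `log_spark_model` is left out because it needs a Spark session. The experiment name, parameter, and metric values are illustrative, not results from the talk.

```python
class SketchManager:
    """Hypothetical in-memory stand-in; signatures are assumed, not pysplice's."""

    def __init__(self):
        self.experiments = {}          # experiment name -> list of runs
        self.active_experiment = None
        self.active_run = None

    def create_experiment(self, name):
        self.experiments.setdefault(name, [])

    def set_active_experiment(self, name):
        if name not in self.experiments:
            raise KeyError(f"unknown experiment: {name}")
        self.active_experiment = name

    def create_new_run(self):
        if self.active_experiment is None:
            raise RuntimeError("set an active experiment first")
        run = {"params": {}, "metrics": {}}
        self.experiments[self.active_experiment].append(run)
        self.active_run = run

    def log_param(self, key, value):
        self.active_run["params"][key] = value

    def log_metric(self, key, value):
        self.active_run["metrics"][key] = value

    def best_run(self, metric):
        # Added helper (not on the slide): the run with the highest metric value.
        runs = self.experiments[self.active_experiment]
        return max(runs, key=lambda run: run["metrics"][metric])


manager = SketchManager()
manager.create_experiment("claims_model")
manager.set_active_experiment("claims_model")

# One run per candidate setting; the numbers are made up for illustration.
for max_depth, auc in [(3, 0.81), (5, 0.86), (8, 0.84)]:
    manager.create_new_run()
    manager.log_param("max_depth", max_depth)
    manager.log_metric("auc", auc)

best = manager.best_run("auc")
```

Recording parameters and metrics per run is what lets the earlier pain point, "What algorithms/parameters gave the best model?", be answered by a lookup rather than from memory.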
15 .MLflow UI
16 .Deployment Automation
17 .ML Manager
• Beta Launched in March
  – MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API Open Source at:
  – https://github.com/splicemachine/pysplice
  – (subject to change per MLflow 1.0 API)
18 .DEMO
20 .Many Disparate Tools
• Data Sources: OLTP - Oracle, Cassandra, Dynamo; OLAP - Redshift, Snowflake, S3
• Notebooks: Apache Zeppelin, Jupyter
• Data Manipulation: Python, Spark, Pandas
• Machine Learning: Scikit, MLLib, R
• Experimentation Tracking: MLflow
• Deployment: Sagemaker, AzureML
21 .Insurance Claim Example
22 .Insurance Claim Example