Splice Machine’s use of Apache Spark and MLflow

Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource, which enables seamless data movement between Spark DataFrames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine, combined with the many DataFrame-compatible libraries and MLflow, manage a complete, real-time workflow from data to insights to action. In this presentation we will demonstrate Splice Machine’s Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.


2. Splice Machine’s use of Apache Spark and MLflow
Gene Davis, Splice Machine
#UnifiedAnalytics #SparkAISummit

3. Splice Machine
• What are we?
  – A scale-out RDBMS that enables simultaneous transactions (OLTP) and analytics (OLAP)
  – Powers Operational AI: the ability to run AI applications in real time
• Who uses us?
  – Companies in financial services, healthcare, supply chain, etc.
  – One example: 7PB, 2B record updates/day, 2M queries/day with sub-second response time
• How do we do it?
  – Transactional SQL engine on top of HBase and Spark
    • “Dual engine” architecture
  – Many delivery options: on-premise or cloud service (AWS, Azure, bespoke cloud, etc.)

4. Operational AI
Integrated data platform for real-time AI applications
[Diagram, On Premise]
• Operational Database: Scale-out, OLTP, Fast
• Enterprise Data Warehouse: In-Memory, OLAP, Massively Parallel
• ML Models: Notebooks, Algorithms, Model Workflow
• Diagram labels: Operational Intelligence, Business Intelligence, Artificial Intelligence, Intelligent Decisions

5. The Three Dimensions of Intelligence
• OLAP: What has happened in the past that might impact you?
• OLTP: What is happening right now?
• ML: What will happen in the future?

6. The Three Dimensions of Intelligence
Key platforms (OLAP, OLTP, ML) are duct-taped together, leading to:
• High Infrastructure Costs
• Latency in Decision Making
• Isolation from Business Processes

7. Intelligent Action - Before

8. Intelligent Action - After

9. Data Science Pain Points
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?

10. Data Science Pain Points
Data Engineer
● The ETL process changed again – now what?
● The Data Scientist requested a different level of granularity – how do I do that?
Data Scientist
● Is my data ready to go?
● Is it still relevant?
● Do my features still align?
Additional questions
● What data did I use?
● What libraries are used?
● What algorithms/parameters gave the best model?
● What model version is deployed?
● Why didn’t I get the same results?

11. MLflow and ML Manager
• Splice Machine chose MLflow
  – MLflow Tracking: track experiment runs and parameters
  – MLflow Models: package model artifacts
• Splice ML Manager
  – Machine Learning on the Splice Machine Stack
  – MLflow Tracking and Models
  – Includes UI to deploy to Amazon SageMaker

12. ML Manager Architecture
[Diagram: Deployment Automation; Native Spark Data Source; Splice Machine Data Platform; On Premises]

13. Native Spark Datasource
• Efficient interface from the Splice relational tables into Spark DataFrames (and back again)
• No serialization/deserialization
• Examples:
  – interestingDf = spliceContext.df("select * from interesting_table")
  – spliceContext.insert(dfWithData, "table_name")
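The round trip described on this slide (query a table into a DataFrame, transform it, write it back) can be sketched as follows. This is a minimal illustration only, not the Splice Machine implementation: `StandInSpliceContext` is a hypothetical stand-in built on Python's in-memory SQLite, a list of dicts stands in for a Spark DataFrame, and `run_ddl`, `scored_table`, and the column names are assumptions made for the example. Only the `df(...)` and `insert(...)` call shapes come from the slide. Unlike the real data source, this stand-in copies rows between the database and Python.

```python
import sqlite3


class StandInSpliceContext:
    """Hypothetical stand-in exposing only the df() and insert() calls from the slide."""

    def __init__(self):
        # In-memory SQLite plays the role of the Splice Machine database here.
        self._conn = sqlite3.connect(":memory:")

    def run_ddl(self, statement):
        # Helper for this sketch only; it does not appear on the slide.
        self._conn.execute(statement)

    def df(self, sql):
        # Run a query and return the rows as dicts (the stand-in for a DataFrame).
        cursor = self._conn.execute(sql)
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]

    def insert(self, rows, table_name):
        # Append rows to an existing table, matching columns by name.
        if not rows:
            return
        columns = list(rows[0])
        placeholders = ", ".join("?" for _ in columns)
        statement = (
            f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"
        )
        self._conn.executemany(
            statement, [tuple(row[c] for c in columns) for row in rows]
        )
        self._conn.commit()


splice_context = StandInSpliceContext()
splice_context.run_ddl("CREATE TABLE interesting_table (id INTEGER, amount REAL)")
splice_context.run_ddl(
    "CREATE TABLE scored_table (id INTEGER, amount REAL, is_large INTEGER)"
)
splice_context.insert(
    [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 15.5}], "interesting_table"
)

# Read from the table, derive a feature, and write the result back.
interesting_df = splice_context.df("select * from interesting_table")
scored_rows = [dict(row, is_large=int(row["amount"] > 100)) for row in interesting_df]
splice_context.insert(scored_rows, "scored_table")

result = splice_context.df("select id, is_large from scored_table order by id")
```

In the platform itself the same two calls would operate on Spark DataFrames and Splice Machine tables, so the transformation step could be any DataFrame-compatible library.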

14. Accessing MLflow Capabilities
• Start with Splice’s MLManager
  – manager = MLManager()
  – Convenience class on top of MLflow
• APIs
  – manager.create_experiment()
  – manager.set_active_experiment()
  – manager.create_new_run()
  – manager.log_param()
  – manager.log_metric()
  – manager.log_spark_model()
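The order in which these calls would typically be used can be sketched as below. The slide lists method names only, so every argument shown here is an assumption, and `StandInMLManager` is a hypothetical in-memory class written for this example rather than the pysplice `MLManager`. `best_run` and `toy_score` are also illustrative helpers; `toy_score` is a placeholder for real model evaluation, and `log_spark_model()` is omitted because it needs a fitted Spark model.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One tracked run: the parameters tried and the metrics they produced."""
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class StandInMLManager:
    """Hypothetical in-memory tracker mirroring the method names on the slide."""

    def __init__(self):
        self.experiments = {}          # experiment name -> list of Run
        self._active_experiment = None
        self._active_run = None

    def create_experiment(self, name):
        self.experiments.setdefault(name, [])

    def set_active_experiment(self, name):
        if name not in self.experiments:
            raise KeyError(f"unknown experiment: {name}")
        self._active_experiment = name

    def create_new_run(self):
        if self._active_experiment is None:
            raise RuntimeError("set an active experiment first")
        self._active_run = Run()
        self.experiments[self._active_experiment].append(self._active_run)

    def log_param(self, key, value):
        self._active_run.params[key] = value

    def log_metric(self, key, value):
        self._active_run.metrics[key] = value

    def best_run(self, metric):
        # Illustrative helper (not on the slide): run with the highest metric.
        runs = self.experiments[self._active_experiment]
        return max(runs, key=lambda run: run.metrics[metric])


def toy_score(max_depth):
    # Placeholder for training and evaluating a model; peaks at max_depth == 5.
    return 1 - abs(max_depth - 5) / 10


manager = StandInMLManager()
manager.create_experiment("claims_model")       # experiment name is an assumption
manager.set_active_experiment("claims_model")

# One run per candidate parameter value, each with its parameter and metric logged.
for max_depth in (3, 5, 8):
    manager.create_new_run()
    manager.log_param("max_depth", max_depth)
    manager.log_metric("score", toy_score(max_depth))

best = manager.best_run("score")
```

Recording each parameter and metric per run is what lets a later lookup answer questions such as which parameters gave the best model.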

15. MLflow UI

16. Deployment Automation

17. ML Manager
• Beta launched in March
  – MLflow v0.8
• Available at cloud.splicemachine.com
• MLManager() API open source at:
  – https://github.com/splicemachine/pysplice
  – (subject to change per MLflow 1.0 API)

18. DEMO


20. Many Disparate Tools
• Data Sources: OLTP - Oracle, Cassandra, Dynamo; OLAP - Redshift, Snowflake, S3
• Notebooks: Apache Zeppelin, Jupyter
• Data Manipulation: Python, Spark, Pandas
• Machine Learning: Scikit, MLlib, R
• Experimentation Tracking: MLflow
• Deployment: SageMaker, AzureML

21. Insurance Claim Example

22. Insurance Claim Example