R and Spark as Yin and Yang of Scalable Machine Learning in Azure ...

Describe scalable machine learning offerings available in Azure HDInsight. Recommend joint use cases for R Server, Spark MLlib and Deep learning libraries.

1.BR002

2.R & Spark as Yin and Yang of scalable machine learning in Azure HDInsight. Maxim Lukiyanov, Senior Program Manager, Big Data, Microsoft

3.Session objectives and takeaways. Session objective(s): Describe scalable machine learning offerings available in Azure HDInsight. Recommend joint use cases for R Server, Spark MLlib and deep learning libraries. Key takeaway 1: R and Spark are better together. Key takeaway 2: HDInsight Premium R Server on Spark offers a comprehensive set of capabilities for scalable machine learning, including the two most popular data science languages: R and Python.

4.Yin and Yang spark R Yin-Yang picture by DonkeyHotey is licensed under CC BY

5.Cortana Intelligence Suite Action People Automated Systems Apps Web Mobile Bots Intelligence Dashboards & Visualizations Cortana Bot Framework Cognitive Services Power BI Information Management Event Hubs Data Catalog Data Factory Machine Learning and Analytics HDInsight (R Server and Spark) Stream Analytics Intelligence Data Lake Analytics Machine Learning Big Data Stores SQL Data Warehouse Data Lake Store Data Sources Apps Sensors and devices Data

6.Cortana Intelligence Suite Action People Automated Systems Apps Web Mobile Bots Intelligence Dashboards & Visualizations Cortana Bot Framework Cognitive Services Power BI Information Management Event Hubs Data Catalog Data Factory Machine Learning and Analytics HDInsight (R Server and Spark) Stream Analytics Intelligence Data Lake Analytics Machine Learning Big Data Stores SQL Data Warehouse Data Lake Store Data Sources Apps Sensors and devices Data

7.Infinite world of scalable machine learning. Traditional algorithms: logistic regression, linear models, basic statistics, hypothesis testing, k-means, decision trees. Specialized algorithms: page rank, collaborative filtering, graph processing, SVD, PCA, Bayesian models, … Deep learning algorithms: deep learning over various types of networks.

8.Use cases of scalable machine learning Traditional algorithms Specialized algorithms Deep learning algorithms product recommendations intelligent search routing robotics ad placement predictive maintenance image, video recognition sentiment analysis text comprehension natural language processing robotics bots augmented reality predictive maintenance Retail Financial services Healthcare Manufacturing loyalty programs customer acquisition pricing strategy supply chain management customer churn fraud detection risk & compliance cross-sell & upsell personalization bill collection operational efficiency patient demographics pay for performance demand forecasting pricing strategy supply chain optimization predictive maintenance remote monitoring

9.Scalable machine learning offerings in HDInsight Server

10.R Server

11.What is R? The most popular statistical programming language. A data visualization tool. Open source. 2.5+M users. Taught in most universities. Thriving user groups worldwide. 8000+ contributed packages. New and recent grads use it. Language Platform, Community, Ecosystem. Rich application & platform integration.

12.R Adoption is on a Tear, but Open Source R is not Enterprise Class. Data flows overwhelm open source R: in-memory operation; lack of parallelism; expensive data movement & duplication. Not enterprise ready: inadequacy of community support; lack of guaranteed support timeliness; no SLAs or support models.

13.R Server: scale-out R, Enterprise Class! 100% compatible with open source R. Full ecosystem access: any code/package that works today with R will work in R Server. Wide range of scalable and distributed R functions. Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict(). Ability to parallelize any R function: ideal for parameter sweeps, simulation, scoring. Enterprise-grade offering: stable bits, SLAs, support.
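The "parallelize any R function" point on the slide above can be sketched as a small parameter sweep. This is a hedged illustration, not code from the session: it assumes the `rxExec()` and `rxElemArg()` functions of the RevoScaleR package that ships with R Server, and it falls back to base R `lapply()` when that package is not installed. The helper `fit_kmeans`, the `iris` data and the range of `k` are hypothetical choices made only for the example.

```r
# Hypothetical sweep: fit k-means for several values of k and
# compare the total within-cluster sum of squares.
fit_kmeans <- function(k) {
  set.seed(1)  # fixed seed so every run of the sweep is repeatable
  km <- kmeans(iris[, 1:4], centers = k, nstart = 5)
  km$tot.withinss
}

ks <- 2:6

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Assumption: R Server is available; rxExec runs one task per element
  # of ks in whatever compute context is currently set.
  results <- RevoScaleR::rxExec(fit_kmeans, k = RevoScaleR::rxElemArg(ks))
} else {
  # Sequential stand-in so the sketch also runs on open source R.
  results <- lapply(ks, fit_kmeans)
}

wss <- unlist(results)
names(wss) <- ks
print(wss)
```

Because the function passed to the sweep is ordinary R, the same `fit_kmeans` works unchanged in both branches; only the dispatch call differs.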

14.Open source R mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv") mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
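For readers who want to run the open source R version without relying on the URL above, here is a minimal sketch under stated assumptions. The data frame is simulated, not the real admissions data: the column names match the slide, but the distributions and the coefficients used to generate `admit` are invented for illustration only.

```r
set.seed(42)
n <- 400

# Simulated stand-in for binary.csv (assumption: same column names only).
mydata <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 115)),
  gpa  = round(runif(n, min = 2.2, max = 4.0), 2),
  rank = sample(1:4, n, replace = TRUE)
)

# Invented relationship, used only to generate plausible 0/1 labels.
lin <- -3.5 + 0.002 * mydata$gre + 0.8 * mydata$gpa - 0.55 * mydata$rank
mydata$admit <- rbinom(n, size = 1, prob = plogis(lin))

# Same model call as on the slide.
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
print(summary(mylogit)$coefficients)

# Score a new observation: predicted probability of admission.
newdata <- data.frame(gre = 700, gpa = 3.8, rank = 1)
p <- predict(mylogit, newdata = newdata, type = "response")
print(p)
```

Everything here runs in memory on a single machine, which is the limitation the R Server slides address.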

15.R Server mydata <- RxTextData("/data/binary.csv", fileSystem = hdfsFS) mylogit <- rxLogit(admit ~ gre + gpa + rank, data = mydata) Switch functions

16.R Server parallelized by Spark rxSetComputeContext(RxSpark(…)) mydata <- RxTextData("/data/binary.csv", fileSystem = hdfsFS) mylogit <- rxLogit(admit ~ gre + gpa + rank, data = mydata) Switch compute context
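The "switch compute context" idea above can be laid out as a fuller sketch. Several pieces are assumptions rather than content from the slides: `hdfsFS` is never defined in the deck, so it is created here with `RxHdfsFileSystem()`; `RxSpark()` is called with its default arguments because the slide elides them; and `fit_admissions` is a hypothetical wrapper. The block needs RevoScaleR on a cluster edge node, so it is guarded and simply skips on open source R.

```r
# Hypothetical wrapper: the modelling code is written once and does not
# change when the compute context changes.
fit_admissions <- function(path, file_system) {
  mydata <- RevoScaleR::RxTextData(path, fileSystem = file_system)
  RevoScaleR::rxLogit(admit ~ gre + gpa + rank, data = mydata)
}

has_revo <- requireNamespace("RevoScaleR", quietly = TRUE)

if (has_revo) {
  # Assumed definition of the hdfsFS object used on the slides.
  hdfsFS <- RevoScaleR::RxHdfsFileSystem()

  # Assumption: default RxSpark() settings; real clusters usually need
  # explicit arguments, which the slide does not show.
  RevoScaleR::rxSetComputeContext(RevoScaleR::RxSpark())
  mylogit <- fit_admissions("/data/binary.csv", hdfsFS)
  print(summary(mylogit))

  # Return to local execution afterwards.
  RevoScaleR::rxSetComputeContext("local")
} else {
  message("RevoScaleR is not installed; skipping the R Server sketch.")
}
```

The design point is that only the `rxSetComputeContext()` call differs between local and Spark execution; the data source and model lines stay the same.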

17.R Server on HDInsight. [Diagram labels: Data Scientists; R Server; Edge Node; Head Nodes; Data Nodes (R processes).]

18.R Server and Spark resource sharing. [Diagram labels: YARN; Spark Application (x4); Livy server; Thrift server; Jupyter notebooks; REST; ODBC; Default Queue; Thrift Queue; IntelliJ IDEA; BI Tools; Head node; Edge node; R Server; DeployR; SSH; R Tools for VS; RStudio.]

19.Performance: R Server on HDInsight Spark. Scales linearly to billions of rows. [Chart: Elapsed Time (seconds) vs. Billions of rows; annotation: 2.2 TB.] Preliminary results.

20.Performance: R Server on HDInsight Spark. Spark is much faster than MapReduce and Local. [Chart annotations: 36X; axes: Billions of rows, Millions of rows.] Preliminary results.

21.Parallelized & Distributed Algorithms. ETL: data import (Delimited, Fixed, SAS, SPSS, ODBC); variable creation & transformation; recode variables; factor variables; missing value handling; sort, merge, split; aggregate by category (means, sums). Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test. Sampling: subsample (observations & variables); random sampling. Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations. Predictive Statistics: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Predictions/scoring for models; Residuals for all models. Machine Learning: K-Means Clustering; Decision Trees; Decision Forests; Gradient Boosted Decision Trees; Naïve Bayes. Simulation: simulation (e.g. Monte Carlo); parallel random number generation. Custom Parallelization: rxDataStep; rxExec; PEMA-R API. Variable Selection: Stepwise Regression.

22.Spark

23.Apache Spark. [Diagram labels: Data Sources; Apache Spark; Spark SQL; Spark Streaming; GraphX (graph); MLlib (machine learning); R Server.]

24.Spark MLlib algorithms. Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation. Classification and regression: linear models (SVMs, logistic, linear), naive Bayes, decision trees, ensembles of trees (random forests, gradient-boosted trees), isotonic regression. Simulation: Monte Carlo. Collaborative filtering: alternating least squares (ALS). Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA). Dimensionality reduction: SVD, PCA. Frequent pattern mining: FP-growth, association rules. Legend: parity, Spark-only, R Server-only.

25.Spark MLlib algorithms in R language. Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation. Classification and regression: linear models (SVMs, logistic, linear), naive Bayes, decision trees, ensembles of trees (random forests, gradient-boosted trees), isotonic regression. Simulation: Monte Carlo. Collaborative filtering: alternating least squares (ALS). Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA). Dimensionality reduction: SVD, PCA. Frequent pattern mining: FP-growth, association rules.

26.Spark MLlib algorithms in Python language. Basic statistics: summary statistics, correlations, stratified sampling, hypothesis testing, random data generation. Classification and regression: linear models (SVMs, logistic, linear), naive Bayes, decision trees, ensembles of trees (random forests, gradient-boosted trees), isotonic regression. Simulation: Monte Carlo. Collaborative filtering: alternating least squares (ALS). Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA). Dimensionality reduction: SVD, PCA. Frequent pattern mining: FP-growth, association rules.

27.Popularity of languages for data science (link to the poll). R and Python are the two dominant languages.

28.R and Spark are better together. R Server contributes: a premium experience for scalable machine learning in the R language. Spark contributes: scalable machine learning in the Python, Scala and Java languages; rich data exploration and transformation capabilities; a faster scale-out engine.

29.Demo Maxim Lukiyanov