An Update on Scaling Data Science with SparkR (2018)

Spark has become the most popular platform for advanced analytics applications. It integrates deeply with the Hadoop ecosystem and provides a powerful set of libraries with support for Python and R. For these reasons, data scientists have begun adopting Spark to train and deploy their models. When Spark 1.4 was released in 2015, it included the new SparkR library: this API gave R users an exciting new option for running R code on Spark.

1. An Update on Scaling Data Science with SparkR, Heiko Korndorf, Wireframe #DSSAIS18

2. Agenda
• About Me
• Spark & R
• Spark Architecture
• Spark DataFrames and SparkSQL
• Natively Distributed ML with Spark ML
• Big Compute: Parallelization with Spark UDFs
• ML & Data-in-Motion: Spark Streaming
• Tips & Pitfalls
• What About Python?
• Summary & Outlook

3. About Me
• MSc in Computer Science, University of Zurich
• > 20 years in consulting
  • Finance, Energy, Telco, Pharma, Manufacturing
  • EAI, BI/Data Warehousing, CRM, ERP, Technology
• Speaker at Spark Summit, Hadoop Summit, and others
• Founder & CEO, Wireframe AG
  • PatternFinder: Data Science Data Warehouse / Business Machine Intelligence
  • PatternGenerator: Development Tool for Streaming Applications
• https://wireframe.digital

4. SparkR Architecture
• Execute R on the cluster
  • Master/slave integration
  • Out-of-memory datasets
  • Access to the data lake
• Powerful libraries
  • Machine learning
  • Core, SQL, Streaming
• R integration through SparkR
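The R integration above starts with a SparkR session that connects the local R process to the cluster; from then on, DataFrames live on the executors rather than in local R memory. A minimal sketch (the master URL, app name, and memory setting are illustrative assumptions, not taken from the slides):

```r
library(SparkR)

# Connect this R process to a Spark cluster; "yarn" and the
# resource setting below are placeholder values for illustration.
sparkR.session(
  master = "yarn",
  appName = "sparkr-demo",
  sparkConfig = list(spark.executor.memory = "2g")
)

# DataFrames created from here on are distributed across the cluster,
# so datasets can exceed the memory of the local R process.
df <- createDataFrame(mtcars)
head(df)

sparkR.session.stop()
```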

5. Data (Lake) Access
• Ability to read big-data file formats
  • HDFS, AWS S3, Azure WASB, …
  • Parquet, CSV, JSON, ORC, …
• Security
  • Fine-grained authorization
  • Role-/attribute-based access control
• Governance
  • Metadata management
  • Lineage
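In SparkR, all of these file formats go through the same `read.df()` entry point, with the format selected via the `source` argument. A sketch, assuming a running SparkR session; the paths and bucket names below are hypothetical:

```r
library(SparkR)
sparkR.session()

# Read big-data file formats directly from a data lake.
# S3, HDFS, and WASB URIs are handled uniformly by read.df().
parquet_df <- read.df("hdfs:///lake/events.parquet", source = "parquet")
csv_df     <- read.df("s3a://bucket/events.csv", source = "csv",
                      header = "true", inferSchema = "true")
json_df    <- read.df("/lake/events.json", source = "json")

printSchema(parquet_df)
```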

6. SparkSQL
• Execute SQL against a Spark DataFrame
  • SELECT: specify projection
  • WHERE: filter criteria
  • GROUP BY: group/aggregate
  • JOIN: join tables
• Alternatively, use select(), where(), groupBy(), count(), etc.
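The two routes on this slide are equivalent: register the DataFrame as a temporary view and run SQL, or chain the DataFrame functions directly. A sketch using the built-in `mtcars` data as a stand-in (column names and the filter threshold are illustrative):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# SQL route: register the DataFrame as a temp view and query it.
createOrReplaceTempView(df, "cars")
heavy <- sql("SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
              FROM cars WHERE wt > 3 GROUP BY cyl")

# Equivalent DataFrame-API route with where()/groupBy()/agg().
heavy2 <- agg(groupBy(where(df, df$wt > 3), "cyl"),
              n = count(df$cyl), avg_mpg = avg(df$mpg))

collect(heavy)
```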

7. Spark MLlib
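SparkR exposes MLlib's distributed algorithms behind R-style formula interfaces such as `spark.glm()` and `spark.randomForest()`. A minimal sketch of training, predicting, and persisting a model (the dataset, formula, and save path are illustrative assumptions):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# SparkR wraps MLlib estimators behind R-style formulas;
# the model is trained in a distributed fashion on the cluster.
model <- spark.glm(df, mpg ~ wt + cyl, family = "gaussian")
summary(model)

# Predictions come back as a distributed SparkDataFrame.
preds <- predict(model, df)
head(select(preds, "mpg", "prediction"))

# Models can be persisted to shared storage and reloaded later.
write.ml(model, "/tmp/mpg-glm")
reloaded <- read.ml("/tmp/mpg-glm")
```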

8. SparkR & Streaming
Use R to process data streams (the data stream is treated as an unbounded table):
• Structured Streaming
  • DataFrames with streaming sources
  • New data in the stream is appended to an unbounded table (i.e. a DataFrame)
• Seamless integration
  • read.stream("kafka", …)
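A sketch of the `read.stream("kafka", …)` integration named above, assuming the Kafka connector package is on the Spark classpath; the broker address and topic name are illustrative:

```r
library(SparkR)
sparkR.session()

# Treat a Kafka topic as an unbounded table.
stream_df <- read.stream("kafka",
                         kafka.bootstrap.servers = "broker:9092",
                         subscribe = "events")

# Kafka delivers key/value as binary; cast the value to a string.
msgs <- selectExpr(stream_df, "CAST(value AS STRING) AS value")

# Each micro-batch of new data is appended to the running query's output.
query <- write.stream(msgs, "console", outputMode = "append")
# awaitTermination(query)  # block until the stream is stopped
```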

9. SparkR UDFs
SparkR functions: gapply()/gapplyCollect(), dapply()/dapplyCollect()
• Apply a function to each group/partition of a Spark DataFrame
• Input: grouping key and DataFrame partition as an R data.frame
• Output: R data.frame
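With `gapply()`, arbitrary R code runs on each group while Spark handles the distribution; the output schema must be declared up front. A sketch computing a per-group mean over `mtcars` (the grouping column and aggregate are illustrative):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# The output schema of the R function must be declared up front.
schema <- structType(structField("cyl", "double"),
                     structField("avg_mpg", "double"))

# The function receives the grouping key and the group as a plain R
# data.frame, and must return a data.frame matching the schema above.
per_group <- gapply(df, "cyl",
                    function(key, x) {
                      data.frame(cyl = key[[1]], avg_mpg = mean(x$mpg))
                    },
                    schema)
head(per_group)

# dapply() is the per-partition analogue; gapplyCollect()/dapplyCollect()
# skip the schema declaration and return a local data.frame directly.
```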

10. SparkR UDFs
spark.lapply():
• Run a function over a list of elements and
• Distribute the computation with Spark
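`spark.lapply()` mirrors base R's `lapply()`: each list element is shipped to an executor, the function runs in a local R process there, and the results come back as an ordinary R list. A sketch fitting one local model per parameter value (the family list and formula are illustrative):

```r
library(SparkR)
sparkR.session()

# One list element per task; each runs plain local R code on an executor.
families <- list("gaussian", "poisson")
models <- spark.lapply(families, function(fam) {
  glm(dist ~ speed, data = cars, family = fam)  # ordinary local glm()
})
length(models)  # one fitted model per family
```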

11. Big Compute
Areas where massively parallel computation is relevant:
• Ensemble learning for time series
  • Hyperparameter sweeps
  • High-dimensional problem/search space
• Wireframe PatternFinder
  • Shape/motif detection
  • IoT patterns/shapes
• Monte Carlo simulation
  • Value-at-Risk (finance)
  • Catastrophe modeling (reinsurance)
  • Inventory optimization (oil & gas, manufacturing)

12. Big Compute
Massive time-series forecasting:
• Sequential computation: > 22 hours
• Single server, parallelized: > 4.5 hours
• SparkR, cluster with 25 nodes: ~ 12 minutes
Wireframe PatternFinder:
• 15,500,000 models to be computed
  • 50 DataFrames x 5 dependent variables x 10 independent variables x 5 models x 100 segments
• 0.1 sec per model
  • Sequential: ~ 18 days
  • 1,000 cores: ~ 26 minutes
Implications:
• Minor refactoring of R code
• Massive cost reduction by elastically scaling cloud resources
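Workloads like these are embarrassingly parallel: each model is independent, so the whole sweep maps onto `spark.lapply()` over a parameter grid. A toy-scale sketch of the pattern (the grid, the AR model, and the built-in `lh` series are illustrative stand-ins for the real per-segment data):

```r
library(SparkR)
sparkR.session()

# One row of the grid per model; a toy stand-in for the 15.5M-model workload.
grid <- expand.grid(segment = 1:10, order = 1:3)
tasks <- split(grid, seq_len(nrow(grid)))

results <- spark.lapply(tasks, function(task) {
  # Fit one AR model per (segment, order) combination on a toy series;
  # a real job would load that segment's data here instead.
  fit <- arima(lh, order = c(task$order, 0, 0))
  data.frame(segment = task$segment, order = task$order, aic = AIC(fit))
})
do.call(rbind, results)
```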

13. Tips & Pitfalls
• Generate diagrams in SparkR
  • PDF, HTML, R Markdown
  • Store in shared persistence or serialize into a Spark DataFrame
  • SPARK-21866 might be helpful?
• Store complex objects (models) from SparkR
  • saveRDS() saves to local storage
  • Store in shared persistence or serialize into a Spark DataFrame
• Run R on a YARN cluster without locally installed R
• Mixing Scala & R
  • Not supported by Oozie's SparkAction
    • Can be replaced with ShellAction
  • Not supported by Apache Livy
    • Only supports Scala, Java, and Python

14. And What About Python? "Do I need to learn Python?"
Let's compare (Spark)R & Py(Spark):
• Language: both interpreted languages, R (1993), Python (1991)
• Libraries: CRAN (> 10,000 packages) vs. NumPy, scikit-learn, pandas
• Package management
• IDEs/Notebooks: RStudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM Watson Studio, Cloudera Data Science Workbench, …
And there's more to compare: market momentum, deep learning, Spark support, Spark integration.

15. Market Momentum
RedMonk (January 2018):
1. JavaScript
2. Java
3. Python
4. PHP
5. C#
…
10. Swift
11. Objective-C
12. R
TIOBE index (May 2018): 4. Python, 11. R
https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/

16. Deep Learning
• Python is a first-class citizen in the deep learning / neural network world
• Using R with these DL frameworks is possible but more complex

Framework    R    Python  Other APIs
TensorFlow   No   Yes     C++, Java, Go, Swift
Keras        Yes  Yes     No
MXNet        Yes  Yes     C++, Scala, Julia, Perl
PyTorch      No   Yes     No
CNTK         No   Yes     C++

17. SparkR vs. PySpark
• Both R and Python can access the same types of Spark APIs

Feature                SparkR  PySpark
Data Lake Integration  Yes     Yes
Spark SQL              Yes     Yes
Spark ML               Yes     Yes
UDFs                   Yes     Yes
Streaming              Yes     Yes

18. Spark Integration
JVM vs. non-JVM:
• Spark executors run in the JVM
• R & Python run in separate processes
• Data must be moved between both environments (SerDe)
• Low performance

19. Apache Arrow
In-memory columnar data format:
• Cross-language
• Optimized for numeric data
• Zero-copy reads (no serialization)
• DataFrame support in Spark 2.3
  • Python
  • R bindings not available yet
• And more: Parquet, HBase, Cassandra, …
• Will also improve R-Python integration (rpy2, reticulate)
See the RStudio blog (2018-04-19): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/

20. Summary & Outlook
• Spark is the best option to scale R
  • See also sparklyr, R Server for Spark
• Common environment for development and production
  • "Looks like R to data scientists, looks like Spark to data engineers"
• Security & data governance
  • Row-/column-level access control
  • Full data lineage (up- and downstream)
• Shared memory format
  • Apache Arrow!
• Mix and match R, Python, and Scala
• Towards an open data science platform

21. Thank You! Heiko Korndorf, heiko.korndorf@wireframe.digital