An Update on Scaling Data Science with SparkR (2018)
1. An Update on Scaling Data Science with SparkR
Heiko Korndorf, Wireframe
#DSSAIS18
2. Agenda
• About Me
• Spark & R
• Spark Architecture
• Spark DataFrames and SparkSQL
• Natively Distributed ML with Spark ML
• Big Compute: Parallelization with Spark UDFs
• ML & Data-in-Motion: Spark Streaming
• Tips & Pitfalls
• What About Python?
• Summary & Outlook
3. About Me
• MSc in Computer Science, University of Zurich
• > 20 years in consulting
  • Finance, Energy, Telco, Pharma, Manufacturing
  • EAI, BI/Data Warehousing, CRM, ERP, Technology
• Speaker at Spark Summit, Hadoop Summit, and others
• Founder & CEO, Wireframe AG
  • PatternFinder: Data Science Data Warehouse / Business Machine Intelligence
  • PatternGenerator: Development Tool for Streaming Applications
• https://wireframe.digital
4. SparkR Architecture
• Execute R on the cluster
  • Master/slave integration
  • Out-of-memory datasets
  • Access to the Data Lake
• Powerful libraries: Machine Learning, SQL, Streaming, Core
• R integration through SparkR
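As a minimal sketch of the R-side entry point (the cluster manager and memory setting below are placeholders, not taken from the slides), a SparkR session is started like this:

```r
library(SparkR)

# Start a SparkR session: the local R process acts as the driver and Spark
# distributes the actual work to executors on the cluster.
sparkR.session(
  master      = "yarn",                               # placeholder cluster manager
  appName     = "sparkr-demo",
  sparkConfig = list(spark.executor.memory = "4g")    # illustrative setting
)

# ... SparkR code ...

sparkR.session.stop()
```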
5. Data (Lake) Access
• Ability to read Big Data file formats into a Spark DataFrame
  • HDFS, AWS S3, Azure WASB, …
  • Parquet, CSV, JSON, ORC, …
• Security
  • Fine-grained authorization
  • Role-/attribute-based access control
• Governance
  • Metadata management
  • Lineage
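A hedged example of reading Data Lake files into SparkDataFrames; the paths, bucket, and options are illustrative:

```r
# Parquet from HDFS and CSV from S3 into SparkDataFrames. The data stays
# distributed on the cluster; only the DataFrame handle lives in the R session.
events  <- read.df("hdfs:///datalake/events", source = "parquet")
ratings <- read.df("s3a://my-bucket/ratings.csv", source = "csv",
                   header = "true", inferSchema = "true")

printSchema(events)
head(ratings)   # pulls only a few rows into the local R session
```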
6. SparkSQL
• Execute SQL against a Spark DataFrame
  • SELECT: specify projection
  • WHERE: filter criteria
  • GROUP BY: group/aggregate
  • JOIN: join tables
• Alternatively, use the DataFrame API: select(), where(), groupBy(), count(), etc.
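For illustration (the view and column names are made up), the same query expressed as SQL and via the DataFrame API:

```r
# Register the SparkDataFrame as a temporary view and query it with SQL.
createOrReplaceTempView(ratings, "ratings")

top_movies <- sql("
  SELECT movieId, COUNT(*) AS n, AVG(rating) AS avg_rating
  FROM   ratings
  WHERE  rating >= 3.0
  GROUP  BY movieId
  ORDER  BY n DESC
")

# Equivalent filtering and grouping with the DataFrame API.
top_movies_api <- count(groupBy(where(ratings, ratings$rating >= 3.0), "movieId"))

head(top_movies)
```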
7. Spark MLlib
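As a hedged sketch of natively distributed ML with SparkR's MLlib wrappers (the data set and formula are assumptions, not from the slide):

```r
# Fit a generalized linear model directly on a SparkDataFrame; training runs
# distributed on the cluster, nothing is collected into the R process.
model <- spark.glm(events, delay ~ distance + day_of_week, family = "gaussian")
summary(model)

# Predictions are returned as a SparkDataFrame as well.
preds <- predict(model, events)
head(select(preds, "delay", "prediction"))

# Other wrappers follow the same pattern, e.g. spark.kmeans(), spark.randomForest().
```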
8. SparkR & Streaming
Use R to process data streams (a data stream is treated as an unbounded table)
• Structured Streaming:
  • DataFrames with streaming sources
  • New data in the stream is appended to an unbounded table (i.e. a DataFrame)
• Seamless integration: read.stream("kafka", …)
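A minimal sketch around the read.stream() call from the slide; the broker address, topic, and sink are placeholders, and the Kafka source additionally needs the spark-sql-kafka package available on the cluster:

```r
# Read a Kafka topic as an unbounded SparkDataFrame (Structured Streaming).
stream <- read.stream("kafka",
                      kafka.bootstrap.servers = "broker:9092",
                      subscribe = "sensor-events")

# Kafka delivers key/value as binary; cast the value to a string.
messages <- selectExpr(stream, "CAST(value AS STRING) AS value")

# Continuously append results to a sink (console here, for demonstration).
query <- write.stream(messages, "console", outputMode = "append")
# stopQuery(query) ends the streaming query
```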
9. SparkR UDFs
SparkR functions: gapply()/gapplyCollect(), dapply()/dapplyCollect()
• Apply a function to each group/partition of a Spark DataFrame
• Input: grouping key and DataFrame partition as an R data.frame
• Output: R data.frame
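A hedged gapply() example (the grouping column, output schema, and aggregation are illustrative):

```r
# The output schema of the R function must be declared up front.
schema <- structType(structField("movieId", "integer"),
                     structField("avg_rating", "double"))

avg_by_movie <- gapply(
  ratings,                 # SparkDataFrame
  "movieId",               # grouping column(s)
  function(key, pdf) {
    # key: the grouping key; pdf: this group's rows as a plain R data.frame
    data.frame(movieId = key[[1]], avg_rating = mean(pdf$rating))
  },
  schema
)
head(avg_by_movie)

# gapplyCollect() does the same but returns a local R data.frame directly.
```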
10. SparkR UDFs
SparkR spark.lapply()
• Run a function over a list of elements and distribute the computation with Spark
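A short spark.lapply() sketch, close to the example in the SparkR documentation; it assumes the e1071 package is installed on all worker nodes:

```r
# Train one SVM per cost value; each list element is handled by an R worker
# somewhere on the cluster and the fitted models come back as an R list.
costs <- c(0.1, 1, 10, 100)

models <- spark.lapply(costs, function(cost) {
  library(e1071)                       # must be available on every worker
  svm(Species ~ ., data = iris, cost = cost)
})

length(models)   # 4 fitted models
```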
11. Big Compute
Areas where massively parallel computation is relevant:
• Ensemble learning for time series
• Hyperparameter sweeps
• High-dimensional problem/search spaces
• Wireframe PatternFinder
  • Shape/motif detection
  • IoT patterns/shapes
• Monte Carlo simulation
  • Value-at-Risk (Finance)
  • Catastrophe modeling (Reinsurance)
  • Inventory optimization (Oil & Gas, Manufacturing)
12. Big Compute
Massive time series forecasting
• Sequential computation: > 22 hours
• Single server, parallelized: > 4.5 hours
• SparkR, cluster with 25 nodes: ~ 12 minutes
Wireframe PatternFinder
• 15,500,000 models to be computed
  • 50 DataFrames x 5 dependents x 10 independents x 5 models x 100 segments
• 0.1 sec per model
  • Sequential: ~ 18 days
  • 1000 cores: ~ 26 minutes
Implications
• Minor refactoring of R code (a sketch follows below)
• Massive cost reduction by elastically scaling cloud resources
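A hedged sketch of how such a forecasting workload maps onto SparkR; the `sales` SparkDataFrame, its columns, and the use of the forecast package are assumptions, as the slide does not specify the models used:

```r
# One forecasting model per segment, fitted in parallel via gapply().
schema <- structType(structField("segment", "string"),
                     structField("horizon", "integer"),
                     structField("forecast", "double"))

forecasts <- gapply(
  sales,                    # assumed SparkDataFrame with columns segment, value
  "segment",
  function(key, pdf) {
    library(forecast)       # must be installed on every worker node
    fit <- auto.arima(ts(pdf$value, frequency = 12))
    fc  <- forecast(fit, h = 12)
    data.frame(segment  = key[[1]],
               horizon  = 1:12,
               forecast = as.numeric(fc$mean))
  },
  schema
)
```

The per-segment function is ordinary R code, which is why only minor refactoring is needed to move an existing sequential loop onto the cluster.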
13. Tips & Pitfalls
• Generate diagrams in SparkR
  • PDF, HTML, R Markdown
  • Store in shared persistence or serialize into a Spark DataFrame
  • SPARK-21866 might be helpful
• Store complex objects (models) from SparkR (see the sketch below)
  • saveRDS() saves to local storage
  • Store in shared persistence or serialize into a Spark DataFrame
• Run R on a YARN cluster without a locally installed R
• Mixing Scala & R
  • Not supported by Oozie's SparkAction; can be replaced with a ShellAction
  • Not supported by Apache Livy (only Scala, Java, and Python)
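A hedged sketch of the model-persistence tip (paths and the model are illustrative): MLlib wrapper models can go to shared storage via write.ml(), while arbitrary R objects saved with saveRDS() land on local storage and must be copied to shared persistence separately:

```r
# MLlib wrapper models: persist directly to shared storage (the HDFS path is
# a placeholder) and reload them later.
model <- spark.glm(events, delay ~ distance, family = "gaussian")
write.ml(model, "hdfs:///models/delay_glm")
restored <- read.ml("hdfs:///models/delay_glm")

# Arbitrary R objects (e.g. models fitted inside a UDF): saveRDS() writes to
# storage local to that process, so copy the file to shared persistence
# afterwards, or serialize the object into a Spark DataFrame column instead.
saveRDS(model_object, "/tmp/model.rds")   # model_object is hypothetical
```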
14. And What About Python?
"Do I need to learn Python?" Let's compare (Spark)R & Py(Spark):
• Language: interpreted languages; R (1993), Python (1991)
• Libraries: CRAN (> 10,000 packages) vs. NumPy, scikit-learn, Pandas
• Package management
• IDEs/Notebooks: RStudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM Watson Studio, Cloudera Data Science Workbench, …
And there's more: market momentum, deep learning, Spark support, Spark integration
15. Market Momentum
RedMonk (January 2018):
1. JavaScript
2. Java
3. Python
4. PHP
5. C#
…
10. Swift
11. Objective-C
12. R
TIOBE index (May 2018): Python #4, R #11
https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/
16. Deep Learning
• Python is a first-class citizen in the Deep Learning/neural network world
• Using R with these DL frameworks is possible but more complex

Framework    R    Python   Other APIs
TensorFlow   No   Yes      C++, Java, Go, Swift
Keras        Yes  Yes      No
MXNet        Yes  Yes      C++, Scala, Julia, Perl
PyTorch      No   Yes      No
CNTK         No   Yes      C++
17. SparkR vs. PySpark
• Both R and Python can access the same types of Spark APIs

Feature                 SparkR   PySpark
Data Lake integration   Yes      Yes
Spark SQL               Yes      Yes
Spark ML                Yes      Yes
UDFs                    Yes      Yes
Streaming               Yes      Yes
18. Spark Integration: JVM vs. Non-JVM
• Spark executors run in the JVM
• R and Python run in separate processes
• Data must be moved between both environments (SerDe)
• Low performance
19. Apache Arrow
In-memory columnar data format
• Cross-language
• Optimized for numeric data
• Zero-copy reads (no serialization)
• DataFrame support
  • Spark 2.3: Python
  • R bindings not available yet
• And more: Parquet, HBase, Cassandra, …
• Will also improve R-Python integration (rpy2, reticulate)
See the RStudio blog (2018-04-19): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
20. Summary & Outlook
• Spark is the best option to scale R
  • See also sparklyr and R Server for Spark
• Common environment for development and production
  • "Looks like R to data scientists, looks like Spark to data engineers"
• Security & data governance
  • Row-/column-level access control
  • Full data lineage (up- and downstream)
• Shared memory format: Apache Arrow!
• Mix and match R, Python, and Scala
• Towards an Open Data Science Platform
21. Thank You!
Heiko Korndorf
heiko.korndorf@wireframe.digital
#DSSAIS18