第四讲:Spark for ETL & Data Science -JeffZhang

第 4 讲:Spark for ETL and Data Science
主要介绍如何用Spark来做ETL以及交互式数据分析的最佳实践,
主讲嘉宾 章剑锋,阿里巴巴高级技术专家, Apache Tez、Livy 、Zeppelin PMC ,Apache Pig Committer

加入钉钉群了解更多技术信息

展开查看详情

1.Spark for ETL & Data Science 章剑锋 · 阿⾥巴巴 /⾼级技术专家

2.2008 Who Am I 2011 2013 2014 2018

3. 01 What is ETL & Data Science 02 CONTENT How to do ETL in Spark ⽬录 >> 03 How to do Data Science in Spark 04 Demo via Spark on Zeppelin

4. 01 What is ETL & Data Science Apache Flink 中⽂学习⽹站: ververica.cn © Apache Flink Community China 严禁商业⽤途

5.Data Pipeline ETL Data Science

6.线下交流 微信公众号 钉钉群 We are hiring (校招,社招) jeffzhang.zjf@alibaba-inc.com

7. 02 How to do ETL in Spark Apache Flink 中⽂学习⽹站: ververica.cn © Apache Flink Community China 严禁商业⽤途

8.What is ETL • Extract - Read raw data from single/multiple sources (no schema, uncompressed, dirty) • Transform - Transform raw data (Filtering/Aggregation/Normalization/Join) • Load - Write data into sinks (compressed, structured, cleaned, well-organized) Source Sink

9.Why Spark • Architecture • Performance • Ecosystem • API

10.ETL Example in Spark spark.read.csv(“source_path”) Extract .filter(...) .agg(...) Transform .write.mode(“append”) .orc(“sink_path”) Load

11.Handling bad record Text format (csv, json) supports 3 parsing mode • PERMISSIVE - sets other fields to `null` when it meets a corrupted record and puts the malformed string into a new field configured by `spark.sql.columnNameOfCorruptRecord`. • DROPMALFORMED - ignores the whole corrupted records • FAILFAST - throws an exception when it meets corrupted records

12.Handling record corrupted

13.Keep in mind • You have no control on source data (format / scale / schema) • You have no control on hardware/network (fault tolerance)

14. 03 How to do data science in Spark Apache Flink 中⽂学习⽹站: ververica.cn © Apache Flink Community China 严禁商业⽤途

15.Data Science via Spark BI AI

16.Spark SQL Operation • Add or update columns • Drop column • Where | Filter • Group by • Aggregation • Join • Union • UDF

17.Visualization • Zeppelin Notebook (SQL) • PySpark (Python) • Sparkr (R)

18.Zeppelin Notebook

19.PySpark • Matplotlib • Pandas • Bokeh • Seaborn • Plotnine • Holoviews

20.R • R builtin • ggplot2 • googlevis

21.Three types of Machine Learning • Supervised Learning - Labeled data is available - Classification / Regression • Unsupervised Learning - No labeled data is available • Reinforcement Learning - Model is continuously learned and relearn based on the action and effects/rewards based on the actions

22.Machine Learning Basics

23.Spark ML Pipeline

24. 04 Demo via Spark on Zeppelin Apache Flink 中⽂学习⽹站: ververica.cn © Apache Flink Community China 严禁商业⽤途

25.Apache Zeppelin Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python and more.

26.

阿里巴巴开源大数据EMR技术团队成立Apache Spark中国技术社区,定期打造国内Spark线上线下交流活动。请持续关注。 团队群号:HPRX8117 微信公众号:Apache Spark技术交流社区