Lecture 4: Spark for ETL & Data Science - Jeff Zhang

Apache Spark China Technical Exchange Community

Published 4 years ago

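Lecture 4: Spark for ETL and Data Science
This lecture covers best practices for ETL and interactive data analysis with Spark.
Speaker: Jeff Zhang (章剑锋), Senior Technical Expert at Alibaba; PMC member of Apache Tez, Livy, and Zeppelin; Apache Pig committer.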

1. Spark for ETL & Data Science. Jeff Zhang (章剑锋) · Alibaba / Senior Technical Expert

2. Who Am I (timeline years shown on the slide: 2008, 2011, 2013, 2014, 2018)

3. Contents
01 What is ETL & Data Science
02 How to do ETL in Spark
03 How to do Data Science in Spark
04 Demo via Spark on Zeppelin

4. 01 What is ETL & Data Science (slide footer: Apache Flink Chinese-language learning site: ververica.cn. © Apache Flink Community China. Commercial use strictly prohibited.)

5. Data Pipeline: ETL, Data Science

6. Offline meetups, WeChat official account, DingTalk group. We are hiring (campus and experienced hires): jeffzhang.zjf@alibaba-inc.com

7. 02 How to do ETL in Spark

8. What is ETL
• Extract - Read raw data from single/multiple sources (no schema, uncompressed, dirty)
• Transform - Transform raw data (Filtering/Aggregation/Normalization/Join)
• Load - Write data into sinks (compressed, structured, cleaned, well-organized)
(Diagram labels: Source, Sink)

9. Why Spark
• Architecture
• Performance
• Ecosystem
• API

10. ETL Example in Spark
Extract:   spark.read.csv("source_path")
Transform: .filter(...) .agg(...)
Load:      .write.mode("append") .orc("sink_path")

11. Handling bad records
Text formats (CSV, JSON) support three parsing modes:
• PERMISSIVE - sets other fields to `null` when it meets a corrupted record and puts the malformed string into a new field configured by `spark.sql.columnNameOfCorruptRecord`.
• DROPMALFORMED - ignores corrupted records entirely.
• FAILFAST - throws an exception when it meets a corrupted record.

12. Handling corrupted records

13. Keep in mind
• You have no control over the source data (format / scale / schema)
• You have no control over hardware/network (fault tolerance)

15. Data Science via Spark: BI, AI

16. Spark SQL Operations
• Add or update columns
• Drop column
• Where | Filter
• Group by
• Aggregation
• Join
• Union
• UDF

17. Visualization
• Zeppelin Notebook (SQL)
• PySpark (Python)
• SparkR (R)

18. Zeppelin Notebook

19. PySpark
• Matplotlib
• Pandas
• Bokeh
• Seaborn
• Plotnine
• Holoviews

20. R
• R built-in
• ggplot2
• googleVis

21. Three types of Machine Learning
• Supervised Learning
  - Labeled data is available
  - Classification / Regression
• Unsupervised Learning
  - No labeled data is available
• Reinforcement Learning
  - The model is continuously learned and relearned based on its actions and the resulting effects/rewards

22. Machine Learning Basics

23. Spark ML Pipeline

25. Apache Zeppelin: a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, and more.
