Connecting the Dots: Integrating Apache Spark into Production Pipelines
2 .Connecting the Dots: Integrating Apache Spark into Production Pipelines
Hanna Torrence, Data Scientist
#UnifiedAnalytics #SparkAISummit
3 .Amazon Prime for everyone else: Our 6 million members get free two-day shipping, returns, and deals across a growing network of 140+ retailers.
4 .Data Science Projects
• Trending products
• Product recommendations
• Retailer propensity models
• Churn modeling
• Taxonomy classification
• Attribute tagging
5 .Data Science Product: business need, data exploration, modelling, production
8 .Important Business Need or
9 .Exploratory Phase
• Wrangling relevant data
• Playing with different models
• Continuing conversations to clarify the business problem
10 .Exploration
11 .Production
• Maintainable Code
• Scheduled Jobs
• APIs
12 .Maintainable Code
• Scripts cleaned up and turned into functions/classes
• Code review to improve code and share knowledge
• Unit tests and continuous integration for safer, easier changes
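As a rough illustration of the first and third bullets, the sketch below shows what a notebook snippet might look like once it is wrapped in a function and covered by unit tests. The function name `top_trending` and the sample data are hypothetical and do not come from the talk; the point is only the shape of "script turned into a tested function".

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_trending(product_ids: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n most frequently seen product ids with their counts.

    Hypothetical example function: in exploratory code this logic would
    typically live inline in a notebook cell.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return Counter(product_ids).most_common(n)


def test_top_trending_orders_by_count() -> None:
    # Small, hand-checkable input so the expected result is obvious.
    views = ["shoes", "hat", "shoes", "bag", "hat", "shoes"]
    assert top_trending(views, n=2) == [("shoes", 3), ("hat", 2)]


def test_top_trending_handles_empty_input() -> None:
    assert top_trending([], n=2) == []
```

Tests written this way can be picked up by a runner such as pytest in a CI job, which is one way to get the "safer, easier changes" the slide mentions.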
16 .Scheduled Jobs
• Databricks job scheduler manages clusters
• Jenkins manages library updates
• We wrote apparate to manage communication between the two
17 .Scheduled Jobs
Create a Job
Update library: Build a new egg → Upload to Databricks (manual update in UI) → Find all jobs using that library (manual inspection) → Update each job (manual update in UI)
19 .Scheduled Jobs
Create a Job
Update library: Build a new egg → Upload to Databricks (apparate) → Find all jobs using that library (apparate) → Update each job (apparate)
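The "find all jobs using that library, then update each job" steps can be sketched in plain Python. This is not apparate's actual implementation or API: the job dictionaries are a simplified stand-in loosely modeled on job settings with a `libraries` list of `{"egg": path}` entries, and the egg naming scheme, `egg_name`, `jobs_using_library`, and `with_updated_library` are all assumptions made for illustration.

```python
import copy
import re
from typing import Any, Dict, List, Optional

Job = Dict[str, Any]

# Assumed egg naming scheme: <name>-<version>-py<python version>.egg
EGG_PATTERN = re.compile(
    r"(?:^|/)(?P<name>[A-Za-z0-9_]+)-(?P<version>[0-9.]+)-py[0-9.]+\.egg$"
)


def egg_name(path: str) -> Optional[str]:
    """Extract the library name from an egg path, or None if it does not match."""
    match = EGG_PATTERN.search(path)
    return match.group("name") if match else None


def jobs_using_library(jobs: List[Job], library_name: str) -> List[Job]:
    """Return the jobs whose library list includes an egg of the given library."""
    return [
        job
        for job in jobs
        if any(
            egg_name(lib.get("egg", "")) == library_name
            for lib in job.get("libraries", [])
        )
    ]


def with_updated_library(job: Job, library_name: str, new_path: str) -> Job:
    """Return a copy of the job with matching egg entries pointed at new_path."""
    updated = copy.deepcopy(job)  # leave the caller's job definition untouched
    for lib in updated.get("libraries", []):
        if egg_name(lib.get("egg", "")) == library_name:
            lib["egg"] = new_path
    return updated


# Hypothetical job definitions for illustration only.
jobs = [
    {"job_id": 1, "name": "trending", "libraries": [{"egg": "dbfs:/libs/recs-1.2.0-py3.7.egg"}]},
    {"job_id": 2, "name": "churn", "libraries": [{"egg": "dbfs:/libs/churn-0.4.1-py3.7.egg"}]},
]
new_egg = "dbfs:/libs/recs-1.3.0-py3.7.egg"
updated_jobs = [
    with_updated_library(job, "recs", new_egg)
    for job in jobs_using_library(jobs, "recs")
]
```

A real tool would fetch the job list from Databricks and push the updated settings back; those network calls are left out so the sketch stays self-contained.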
22 .Scheduled Jobs
GitHub repo: https://github.com/ShopRunner/apparate
Databricks blog post: Apparate: Managing Libraries in Databricks with CI/CD
23 .APIs
• Approach to serving results varies by use case
• Flask API in a Docker container deployed on a Kubernetes cluster via Spinnaker
• Deploy APIs using ShopRunner’s standard production pipeline
25 .APIs
api: crookshanks/cat_or_dog
post: image: “cat_1.jpg” → vector: […]
post: image: “dog_1.jpg” → vector: […]
post: image: “dog_2.jpg” → vector: […]
26 .APIs
image: “cat_1.jpg” → prediction: “cat”
image: “dog_1.jpg” → prediction: “dog”
image: “dog_2.jpg” → prediction: “cat”
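The request/response flow on these slides (an image name is posted, turned into a vector, and answered with a prediction) might be handled by a function along the following lines. This is a sketch under stated assumptions, not the talk's actual service: `handle_post`, `fake_featurize`, and `fake_classify` are hypothetical names, the featurizer and classifier are trivial stand-ins for a real model, and the Flask route that would call such a function is omitted so the example runs with only the standard library.

```python
import json
from typing import Callable, List

Vector = List[float]


def handle_post(
    body: str,
    featurize: Callable[[str], Vector],
    classify: Callable[[Vector], str],
) -> str:
    """Turn a JSON request body like {"image": "cat_1.jpg"} into a JSON response."""
    payload = json.loads(body)
    image = payload["image"]
    vector = featurize(image)  # image -> feature vector
    prediction = classify(vector)  # feature vector -> label
    return json.dumps({"image": image, "prediction": prediction})


def fake_featurize(image_name: str) -> Vector:
    """Stand-in featurizer: a real service would load and embed the image."""
    return [1.0] if image_name.startswith("cat") else [0.0]


def fake_classify(vector: Vector) -> str:
    """Stand-in classifier: a real service would call a trained model."""
    return "cat" if vector[0] > 0.5 else "dog"


response = handle_post('{"image": "cat_1.jpg"}', fake_featurize, fake_classify)
```

Passing the featurizer and classifier in as arguments keeps the handler easy to unit test without loading a model, which fits the code review and testing workflow described earlier in the talk.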
27 .So we have …
• Moved exploratory code into a Python package
  • code reviewed
  • tested
  • version-controlled
  • single source of truth
  • Branch/Pull Request workflow with Jenkins CI
• Scheduled batch jobs
  • Updates with apparate
• Deployed an API
  • Deploys via Spinnaker CD
28 .Data Science Product: business need, data exploration, modelling, production
29 .Lessons Learned
• Take advantage of existing infrastructure and best-in-class tools
• Be aware of friction points in the process
• Build out solutions to ease frustrating connections