Connecting the Dots: Integrating Apache Spark into Production Pipelines
1. WiFi SSID: SparkAISummit | Password: UnifiedAnalytics
2. Connecting the Dots: Integrating Apache Spark into Production Pipelines | Hanna Torrence, Data Scientist | #UnifiedAnalytics #SparkAISummit
3. Amazon Prime for everyone else: Our 6 million members get free two-day shipping, returns, and deals across a growing network of 140+ retailers.
4. Data Science Projects • Trending products • Product recommendations • Retailer propensity models • Churn modeling • Taxonomy classification • Attribute tagging
5. Data Science Product [diagram: business need → data exploration → modelling → production]
8. Important Business Need
9. Exploratory Phase • Wrangling relevant data • Playing with different models • Continuing conversations to clarify the business problem
10. Exploration
11. Production • Maintainable Code • Scheduled Jobs • APIs
12. Maintainable Code • Scripts cleaned up and turned into functions/classes • Code review to improve code and share knowledge • Unit tests + continuous integration for safer, easier changes
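As a toy illustration of the first and third bullets (the function and test names here are invented for the example, not taken from the talk): exploratory notebook logic gets pulled into a small, importable function that a unit test can exercise in CI.

```python
# Before: this logic would live inline in a one-off notebook cell.
# After: a named, documented function that can be imported, code-reviewed,
# and covered by a unit test that runs on every pull request.

def normalize_retailer_name(raw_name: str) -> str:
    """Lowercase, trim, and collapse internal whitespace in a retailer name."""
    return " ".join(raw_name.lower().split())


def test_normalize_retailer_name():
    # A unit test like this is what makes later refactors safe and easy.
    assert normalize_retailer_name("  Acme   Shoes ") == "acme shoes"
    assert normalize_retailer_name("ACME") == "acme"
```

Once functions live in a package like this, the same code can be imported both by notebooks during exploration and by scheduled production jobs, giving a single source of truth.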
13.–15. Maintainable Code [code screenshots]
16. Scheduled Jobs • Databricks job scheduler manages clusters • Jenkins manages library updates • We wrote apparate to manage communication between the two
17. Scheduled Jobs [diagram, manual workflow: Create a Job (manual update in UI); Update library: build a new egg → upload to Databricks (manual update in UI) → find all jobs using that library (manual inspection) → update each job (manual update in UI)]
18. Scheduled Jobs
19. Scheduled Jobs [diagram, with apparate: Create a Job; Update library: build a new egg → upload to Databricks (apparate) → find all jobs using that library (apparate) → update each job (apparate)]
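The bookkeeping that apparate automates in the diagram above can be sketched in plain Python. The record shape below is a simplified, hypothetical stand-in for Databricks job settings, not the real Jobs API schema, and the function names are invented; it only illustrates the "find all jobs using that library, update each job" step.

```python
def _basename(lib_path: str) -> str:
    """Last path component of a library path, e.g. 'churn-0.1.0.egg'."""
    return lib_path.rsplit("/", 1)[-1]


def jobs_using_library(jobs, library_name):
    """Return jobs whose pinned libraries include any version of the named egg."""
    return [job for job in jobs
            if any(_basename(lib).startswith(library_name + "-")
                   for lib in job["libraries"])]


def repoint_library(job, library_name, new_egg):
    """Replace any pinned version of the library with the freshly built egg."""
    job["libraries"] = [new_egg if _basename(lib).startswith(library_name + "-")
                        else lib
                        for lib in job["libraries"]]
    return job


# Hypothetical job records, standing in for what the scheduler would return.
jobs = [
    {"name": "churn_model", "libraries": ["dbfs:/libs/churn-0.1.0.egg"]},
    {"name": "recs", "libraries": ["dbfs:/libs/recs-2.3.0.egg"]},
]

# After building and uploading churn-0.2.0.egg, repoint every stale job.
for job in jobs_using_library(jobs, "churn"):
    repoint_library(job, "churn", "dbfs:/libs/churn-0.2.0.egg")
```

Doing this by hand for every job in the UI is the friction the previous slide shows; apparate performs the equivalent lookup and update against Databricks itself.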
21. Scheduled Jobs
22. Scheduled Jobs • GitHub repo: https://github.com/ShopRunner/apparate • Databricks blog post: "Apparate: Managing Libraries in Databricks with CI/CD"
23. APIs • Approach to serving results varies by use case • Flask API in a Docker container deployed on a Kubernetes cluster via Spinnaker • Deploy APIs using ShopRunner's standard production pipeline
24. APIs
25. APIs [diagram: POST to api crookshanks/cat_or_dog with image: "cat_1.jpg" / "dog_1.jpg" / "dog_2.jpg"; each request returns vector: […]]
26. APIs [diagram: image: "cat_1.jpg" → prediction: "cat"; image: "dog_1.jpg" → prediction: "dog"; image: "dog_2.jpg" → prediction: "cat"]
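The request/response contract on the two slides above might look like the following minimal Flask endpoint. This is an illustrative sketch, not ShopRunner's implementation: the route name, the `classify_image` stub, and the filename-based "model" are all hypothetical stand-ins for real inference code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def classify_image(image_name: str) -> str:
    """Hypothetical stand-in for a real model: in production this would load
    a trained classifier and run inference on the decoded image bytes."""
    return "cat" if "cat" in image_name else "dog"


@app.route("/cat_or_dog", methods=["POST"])
def cat_or_dog():
    # Expects a JSON body like {"image": "cat_1.jpg"} and returns a prediction.
    payload = request.get_json()
    prediction = classify_image(payload["image"])
    return jsonify({"image": payload["image"], "prediction": prediction})
```

Packaged into a Docker image, an app of this shape can be deployed to a Kubernetes cluster via Spinnaker like any other service in the standard production pipeline.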
27. So we have… • Moved exploratory code into a Python package: branch/pull-request, code-reviewed workflow with Jenkins CI; tested; version-controlled; a single source of truth • Scheduled batch jobs: updated with apparate • Deployed an API: deploys via Spinnaker CD
28. Data Science Product [diagram: business need → data exploration → modelling → production]
29. Lessons Learned • Take advantage of existing infrastructure + best-in-class tools • Be aware of friction points in the process • Build out solutions to ease frustrating connections