Connecting the Dots: Integrating Apache Spark into Production Pipelines

Have you ever struggled to integrate Spark smoothly into a production workflow? Spark is an excellent tool for processing data at the terabyte scale, but building a system that moves from raw data through featurization, modeling, and prediction serving involves interacting with numerous other components. Over the past year and a half, my team at ShopRunner has built a production Spark workflow for data science from scratch. In this talk you’ll learn about the tools we use, the challenges we encountered, an open-source library we wrote to work through them, and how you can avoid the detours we took along the way.

Data science work often begins in an interactive notebook environment, exploring data and testing different modeling approaches. Moving to production, however, means building reproducible workflows, packaging libraries, scheduling and monitoring jobs, and finding ways to serve results to clients in real time. After testing a variety of tools, we at ShopRunner settled on a stack that includes Databricks, Snowflake, Datadog, Jenkins, and, from the AWS suite, S3, ECS, and RDS. Each of these tools offers unique benefits in its area of focus, but crafting a cohesive pipeline from them was a challenge.

Come learn how to integrate a Spark workflow into a pipeline that analyzes many terabytes of data, builds machine learning models at scale, and serves predictions to a variety of customer-facing tools. Whether you’re just getting started with Spark in production systems or you already run Spark in production and want to smooth the process, this talk will leave you better equipped to find and connect the tools that suit your needs.

2. Connecting the Dots: Integrating Apache Spark into Production Pipelines • Hanna Torrence, Data Scientist • #UnifiedAnalytics #SparkAISummit

3. Amazon Prime for everyone else: Our 6 million members get free two-day shipping, returns, and deals across a growing network of 140+ retailers.

4. Data Science Projects • Trending products • Product recommendations • Retailer propensity models • Churn modeling • Taxonomy classification • Attribute tagging

5. Data Science Product • business need • data exploration • modelling • production

8. Important Business Need or

9. Exploratory Phase • Wrangling relevant data • Playing with different models • Continuing conversations to clarify the business problem

10. Exploration

11. Production • Maintainable Code • Scheduled Jobs • APIs

12. Maintainable Code • Scripts cleaned up and turned into functions/classes • Code review to improve code and share knowledge • Unit tests and continuous integration for safer, easier changes
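
As an illustration of the first and third bullets, here is a minimal sketch of a notebook-style calculation turned into a function with unit tests. The function name `top_trending`, the event shape, and the tie-breaking rule are all hypothetical and are not taken from the talk; plain Python is used instead of Spark so the sketch runs without a cluster.

```python
import unittest
from collections import Counter


def top_trending(events, n=3):
    """Return the n most frequently seen product ids, most frequent first.

    `events` is an iterable of dicts with a "product_id" key (an assumed
    shape). Ties are broken by product id so the result is deterministic.
    """
    counts = Counter(event["product_id"] for event in events)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [product_id for product_id, _ in ranked[:n]]


class TopTrendingTest(unittest.TestCase):
    def test_orders_by_count_then_id(self):
        events = [{"product_id": p} for p in ["b", "a", "b", "c", "a", "b"]]
        self.assertEqual(top_trending(events, n=2), ["b", "a"])

    def test_empty_input(self):
        self.assertEqual(top_trending([], n=5), [])
```

Saved in a module, tests like these can be run with `python -m unittest`, which is the kind of command a CI server such as Jenkins could execute on every pull request.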

13. Maintainable Code

16. Scheduled Jobs • Databricks job scheduler manages clusters • Jenkins manages library updates • We wrote apparate to manage communication between the two

17. Scheduled Jobs • Create a Job • Update library • Build a new egg • Upload to Databricks (manual update in UI) • Find all jobs using that library (manual inspection) • Update each job (manual update in UI)

18. Scheduled Jobs

19. Scheduled Jobs • Create a Job • Update library • Build a new egg • Upload to Databricks (apparate) • Find all jobs using that library (apparate) • Update each job (apparate)
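
The "find all jobs using that library, then update each job" step can be sketched in a few lines. This is not apparate's actual implementation. The job dictionaries below are only loosely modeled on what a job listing might return, the egg naming convention is an assumption, and the rule "replace any different version of the same library" is a simplification chosen for illustration. No network calls are made; the function only plans the updates.

```python
import re

# Assumed naming convention: <name>-<major.minor.patch><anything>.egg
EGG_PATTERN = re.compile(
    r"(?P<name>[A-Za-z0-9_]+)-(?P<version>\d+\.\d+\.\d+)[^/]*\.egg$"
)


def parse_egg(path):
    """Return (library_name, version) for an egg path, or None if it does not match."""
    match = EGG_PATTERN.search(path)
    return (match.group("name"), match.group("version")) if match else None


def plan_updates(jobs, new_egg_path):
    """Return [(job_id, new_libraries)] for jobs that use another version of the library.

    `jobs` is a list of dicts shaped like
    {"job_id": 1, "settings": {"libraries": [{"egg": "dbfs:/..."}]}}
    (a hypothetical shape for this sketch).
    """
    parsed_new = parse_egg(new_egg_path)
    if parsed_new is None:
        raise ValueError("unrecognized egg path: " + new_egg_path)
    new_name, new_version = parsed_new

    updates = []
    for job in jobs:
        libraries = job["settings"].get("libraries", [])
        new_libraries = []
        changed = False
        for library in libraries:
            parsed = parse_egg(library.get("egg", ""))
            if parsed and parsed[0] == new_name and parsed[1] != new_version:
                # Same library, different version: point the job at the new egg.
                new_libraries.append({"egg": new_egg_path})
                changed = True
            else:
                new_libraries.append(library)
        if changed:
            updates.append((job["job_id"], new_libraries))
    return updates
```

Keeping the planning step free of side effects makes it easy to unit test; a separate, thin layer would then send each planned update to the job scheduler.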

21. Scheduled Jobs

22. Scheduled Jobs • GitHub repo: • Databricks blog post: “Apparate: Managing Libraries in Databricks with CI/CD”

23. APIs • Approach to serving results varies by use case • Flask API in a Docker container deployed on a Kubernetes cluster via Spinnaker • Deploy APIs using ShopRunner’s standard production pipeline

24. APIs

25. APIs • api: crookshanks/cat_or_dog • post: image: “cat_1.jpg” → vector: […] • post: image: “dog_1.jpg” → vector: […] • post: image: “dog_2.jpg” → vector: […]

26. APIs • image: “cat_1.jpg” → prediction: “cat” • image: “dog_1.jpg” → prediction: “dog” • image: “dog_2.jpg” → prediction: “cat”
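
The request/response flow on the last two slides can be sketched as a tiny prediction endpoint. The talk describes a Flask API; to stay dependency-free, this sketch uses the standard library's WSGI interface instead, which is the same interface a Flask application exposes. The feature vectors, the `predict` rule, and the JSON payload shape are all hypothetical stand-ins, not ShopRunner's service.

```python
import io
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical precomputed feature vectors, keyed by image name.
FEATURES = {
    "cat_1.jpg": [0.9, 0.1],
    "dog_1.jpg": [0.2, 0.8],
    "dog_2.jpg": [0.6, 0.4],
}


def predict(vector):
    """Toy stand-in for a trained model: compare a cat score with a dog score."""
    cat_score, dog_score = vector
    return "cat" if cat_score > dog_score else "dog"


def respond(start_response, status, payload):
    """Serialize `payload` as JSON and send it with the given status line."""
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def app(environ, start_response):
    """WSGI app: POST {"image": "<name>"} and receive {"image", "prediction"}."""
    if environ["REQUEST_METHOD"] != "POST":
        return respond(start_response, "405 Method Not Allowed", {"error": "use POST"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        image = json.loads(environ["wsgi.input"].read(length))["image"]
    except (ValueError, KeyError, TypeError):
        return respond(start_response, "400 Bad Request",
                       {"error": "expected JSON with an 'image' key"})
    if not isinstance(image, str) or image not in FEATURES:
        return respond(start_response, "404 Not Found", {"error": "unknown image"})
    return respond(start_response, "200 OK",
                   {"image": image, "prediction": predict(FEATURES[image])})


def call(wsgi_app, method, body=b""):
    """Invoke the app in-process (no server) and return (status, parsed JSON)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    environ["CONTENT_LENGTH"] = str(len(body))
    environ["wsgi.input"] = io.BytesIO(body)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = wsgi_app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))
```

For example, `call(app, "POST", json.dumps({"image": "cat_1.jpg"}).encode("utf-8"))` exercises the endpoint without starting a server, which keeps the handler easy to test before it is packaged into a container.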

27. So we have … • Moved exploratory code into a Python package: Branch/Pull Request workflow with Jenkins CI, code reviewed, tested, version-controlled, single source of truth • Scheduled batch jobs: updates with apparate • Deployed an API: deploys via Spinnaker CD

28. Data Science Product • business need • data exploration • modelling • production

29. Lessons Learned • Take advantage of existing infrastructure + best-in-class tools • Be aware of friction points in the process • Build out solutions to ease frustrating connections