Connecting the Dots: Integrating Apache Spark into Production Pipelines


Spark开源社区 / 8,516 views
Have you ever struggled to smoothly integrate Spark into a production workflow? Spark is an excellent tool for processing data on the terabyte scale, but building a system that moves from raw data through featurization, modeling, and prediction serving involves interacting with numerous other components. Over the past year and a half, my team at ShopRunner has built a production Spark workflow for data science from scratch. In this talk you’ll learn about the tools we use, the challenges we encountered, an open-source library we wrote to work through them, and how you can avoid the detours we took along the way.

Data science work often begins in an interactive notebook environment, exploring data and testing out different modeling approaches. Moving toward a production environment, however, means building reproducible workflows, packaging libraries, setting up scheduling and monitoring of jobs, and figuring out ways to serve results to clients in real time. After testing a variety of tools, we at ShopRunner settled on a stack that includes Databricks, Snowflake, Datadog, Jenkins, and S3, ECS, and RDS from the suite of AWS services. Each of these tools offers unique benefits in its area of focus, but crafting a cohesive pipeline from such a range of tools presented a challenge.

Come learn how to integrate a Spark workflow into a pipeline that analyzes many terabytes of data, builds machine learning models at scale, and serves predictions to a variety of customer-facing tools. Whether you’re just getting started using Spark in production systems or you already have Spark running in production and want to smooth the process, this talk will leave you better equipped to find and connect the tools that suit your needs.
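To make the raw-data-to-serving flow above concrete, here is a minimal PySpark sketch of a featurization-and-modeling pipeline. This is not ShopRunner's actual code or the library the talk mentions: the table name, column names, and S3 paths are hypothetical, and the logistic regression simply stands in for whatever model a real pipeline would train.

```python
# A minimal sketch (illustrative only) of the raw data -> featurization ->
# modeling -> prediction flow described above. Table, columns, and paths
# are assumptions, not ShopRunner's pipeline.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("production-pipeline-sketch").getOrCreate()

# Raw data: in a stack like the one described, this might be loaded from
# Snowflake or S3; here we assume a registered table for simplicity.
raw = spark.table("raw_events")

# Featurization: encode a categorical column, then assemble feature vector.
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(
    inputCols=["category_idx", "price", "click_count"],
    outputCol="features",
)
model = LogisticRegression(featuresCol="features", labelCol="label")

# Packaging the steps as a single Pipeline keeps the workflow reproducible
# once it moves out of a notebook and into a scheduled job.
pipeline = Pipeline(stages=[indexer, assembler, model])
fitted = pipeline.fit(raw)

# Persist the fitted model so a separate serving layer (e.g. a service on
# ECS backed by RDS, as in the stack above) can load and use it.
fitted.write().overwrite().save("s3://example-bucket/models/purchase-model")

# Batch predictions written back out for downstream, customer-facing tools.
fitted.transform(raw).select("id", "prediction").write.mode("overwrite") \
    .parquet("s3://example-bucket/predictions/purchase-model")
```

The design point the sketch illustrates is that wrapping featurization and modeling in one Pipeline object gives a single artifact to version, schedule, and monitor, rather than a loose collection of notebook cells.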