How Upwork Rebuilt Its Data and ML Platform with Databricks

Upwork has the largest closed-loop online dataset of jobs and job seekers in labor history (>10M profiles, >100M job posts, job recommendations and hiring decisions, >10B messages, plus transaction and feedback data). Beyond sheer volume, our data is also contextually rich: we have data from both clients and contractors across the entire job funnel, from finding work to getting the work done.

1. How to Rebuild a Data and ML Platform Using Kinesis, S3, Spark, MLlib, Databricks and Airflow. Thanh Tran, Upwork Data Science Infrastructure Team

2. Introduction

3. Our team:
● Nikolay Melnik, Lead ML Engineer (Ukraine)
● Roman Tkachuk, Senior Data Engineer (Ukraine)
● Andrei Demus, Data/ML Engineer (Ukraine)
● Anna Lysak, Data/ML Engineer (Ukraine)
● Igor Korsunov, ML Engineer (Russia)
● Dimitris Manikis, Senior Data Engineer (Greece)
● Yongtao Ma, Senior ML Engineer (Germany)
● Giannis Koutsoubos, Lead Backend Engineer (Greece)
● Artem Moskvin, Data/ML Engineer (Germany)

4. With Upwork, our new hires AND I are better off!
Me:
● QUALITY: highest-skilled experts for the job
● COST/EARNING: competitive/lower rates
● AGILITY: mix of long-term and project-based staff
My Team:
● Work on cutting-edge projects
● Happy with competitive compensation + flexibility in location and work hours
● Work only when they want work

5. We believe significant welfare improvements can be achieved through data-science-driven optimization of the online labor marketplace.

6. We have the biggest closed-loop online dataset of jobs and job seekers in labor history:
● Profiles (~10M)
● Job posts (~10M)
● Proposals (~100M)
● Messages (~100M)
● Hiring decisions (~10M)
● Contract progress (~1B)
● Feedback (~10M)
● Web site activity (~10B)
● Money transactions (~100M)

7. What do we need to ship data science products?

8. We need to support an agile data science workflow to provide quick and validated improvements!
● Data science analytics
○ Complete and cleansed data, a single source of ground truth
○ Tools for computing metrics, continuous validation
● Data science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Batch + live data mismatches
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, models, feature computation pipelines)
○ Flexibility to accommodate a variety of ML frameworks
● Data science model productionalization
○ Minimize differences between the trained model and production code
○ Code modularized, tested, integrated into the CI/CD workflow
○ Standardized model serving that is scalable, available, high-throughput, low-latency...
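One item in the list above, catching batch + live data mismatches, can be framed as a parity check: compute the same feature through both the batch-style and the streaming-style code path on a shared sample of events and fail when they diverge. The sketch below is a hypothetical illustration in plain Python; the function names, event shape and tolerance are all assumptions, not Upwork's actual code.

```python
# Hypothetical sketch: detect batch/live feature mismatches by computing the
# same feature through both code paths over a shared event sample.
# All names, the event shape, and the tolerance are illustrative assumptions.
import math

def batch_avg_rate(events):
    # Batch path: aggregate over the full cleansed sample at once,
    # as a nightly Spark job would.
    rates = [e["rate"] for e in events if e.get("rate") is not None]
    return sum(rates) / len(rates) if rates else 0.0

def live_avg_rate(events):
    # Live path: incremental running mean, as a streaming job might keep it.
    count, mean = 0, 0.0
    for e in events:
        if e.get("rate") is None:
            continue
        count += 1
        mean += (e["rate"] - mean) / count
    return mean

def parity_check(events, tol=1e-9):
    """Return True when the batch and live feature values agree within tol."""
    return math.isclose(batch_avg_rate(events), live_avg_rate(events),
                        rel_tol=tol, abs_tol=tol)

sample = [{"rate": 25.0}, {"rate": 40.0}, {"rate": None}, {"rate": 31.0}]
assert parity_check(sample)
```

Run over a golden sample in CI, a check like this turns a silent feature drift between the two implementations into a failing test.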

9. Upwork Data & ML Platform


11.
● Kinesis and Spark Structured Streaming for high-throughput live event data processing
● Moving away from a traditional DWH solution to distributed Spark-based batch data processing, avoiding performance issues and workload limitations
● Spark MLlib + TensorFlow as core ML libraries, balancing the tradeoff between flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow
● Experiment management and model versioning as an integral part of the CI/CD workflow
● The engineering CI/CD workflow adapted to data science using Jenkins, Databricks and Airflow: standalone model tests + live regression tests help identify batch/live data mismatches
● The Spark-based pipeline developed by data scientists is used directly for model scoring in the production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments and feature engineering code
● Graphite, ELK and PagerDuty for logging, monitoring and alerts
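One way to make model versioning an integral part of CI/CD, as the list above calls for, is to key every registered model artifact by the git revision of the training code plus a content hash of the training data snapshot, so any deployed model is traceable back to the exact experiment that produced it. The scheme below is a minimal in-memory sketch with made-up names; a real registry would back onto S3 or a database, and nothing here reflects Upwork's actual implementation.

```python
# Minimal sketch of a model-versioning scheme: each registered model is keyed
# by the git SHA of the training code plus a content hash of the training
# data snapshot. All names and structures are illustrative assumptions.
import hashlib
import json

def model_version(git_sha: str, training_rows: list) -> str:
    """Derive a reproducible version id from code revision + data snapshot."""
    data_digest = hashlib.sha256(
        json.dumps(training_rows, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f"{git_sha[:8]}-{data_digest}"

class ModelRegistry:
    """Toy in-memory registry; a real one would persist to S3/a database."""
    def __init__(self):
        self._models = {}

    def register(self, git_sha, training_rows, artifact):
        version = model_version(git_sha, training_rows)
        self._models[version] = artifact
        return version

    def load(self, version):
        return self._models[version]

registry = ModelRegistry()
rows = [{"job_id": 1, "hired": True}, {"job_id": 2, "hired": False}]
v = registry.register("3f2a9c1d7b", rows, artifact={"weights": [0.1, 0.9]})
assert registry.load(v) == {"weights": [0.1, 0.9]}
# Same code revision + same data always yields the same version id:
assert v == model_version("3f2a9c1d7b", rows)
```

The design choice worth noting is determinism: because the version id is a pure function of code and data, retraining with unchanged inputs cannot silently produce a "new" model, which simplifies cache reuse and audit trails in the pipeline.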

12. Batch Data & ML Environment


15. Live Data & ML Pipeline


18. CI/CD Workflow


21. Pitfalls and Lessons Learned

22.
• Microservices can lead to data fragmentation and high downstream processing overhead
• Structured Streaming latency grows when the number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data remain a recurring source of bugs
• Differences between the trained and the deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflows and artifacts
• Databricks notebooks are very convenient for collaboration, documentation, code sharing and reuse, and results dissemination
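The Kinesis-consumer pitfall above is commonly mitigated with a single-reader fan-out: one consumer reads each event from the stream once and dispatches it internally to all interested handlers, instead of every downstream job holding its own stream consumer (per-shard read throughput is limited, so many consumers raise latency). The sketch below shows only the dispatch idea in plain Python; the handler names and event shapes are hypothetical, and no Kinesis API is involved.

```python
# Sketch of single-reader fan-out: one consumer reads the stream once and
# dispatches each event to every interested handler, instead of each handler
# holding its own Kinesis consumer (which increases read latency/throttling).
# Handler names and event shapes are illustrative assumptions.
from collections import defaultdict

class FanOutDispatcher:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def dispatch(self, event):
        # The event is read from the stream once, then delivered locally
        # to every subscriber for its type.
        for handler in self._handlers[event["type"]]:
            handler(event)

dispatcher = FanOutDispatcher()
seen_by_features, seen_by_metrics = [], []
dispatcher.subscribe("job_posted", seen_by_features.append)
dispatcher.subscribe("job_posted", seen_by_metrics.append)

for event in [{"type": "job_posted", "id": 1}, {"type": "proposal", "id": 2}]:
    dispatcher.dispatch(event)  # "proposal" has no subscribers, so it is skipped

assert seen_by_features == [{"type": "job_posted", "id": 1}]
assert seen_by_metrics == seen_by_features
```

In a real deployment the same effect is often achieved by landing the stream once into shared storage (e.g. S3 or a Delta table) and letting downstream jobs read from there rather than from Kinesis directly.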

23. Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring! Interested in doing work only when you want work? Join Upwork as a contractor!