How Upwork Rebuilt Its Data and ML Platform with Databricks

Upwork has the largest closed-loop online dataset of jobs and job seekers in labor history (>10M profiles, >100M job posts, job recommendations and hiring decisions, >10B messages, plus transaction and feedback data). Beyond sheer volume, our data is also contextually rich: we have data from both clients and contractors across the entire job funnel, from finding work to getting the work done.

1. How to Rebuild a Data and ML Platform Using Kinesis, S3, Spark, MLlib, Databricks and Airflow. Thanh Tran, Upwork Data Science Infrastructure Team

2. Introduction

3. Our team:
● Nikolay Melnik, Lead ML Engineer (Ukraine)
● Roman Tkachuk, Senior Data Engineer (Ukraine)
● Andrei Demus, Data/ML Engineer (Ukraine)
● Anna Lysak, Data/ML Engineer (Ukraine)
● Igor Korsunov, ML Engineer (Russia)
● Dimitris Manikis, Senior Data Engineer (Greece)
● Yongtao Ma, Senior ML Engineer (Germany)
● Giannis Koutsoubos, Lead Backend Engineer (Greece)
● Artem Moskvin, Data/ML Engineer (Germany)

4. With Upwork, our new hires AND I are better off!
Me:
● QUALITY: highest-skilled experts for the job
● COST/EARNING: competitive/lower rates
● AGILITY: mix of long-term and project-based staff
My Team:
● Work on cutting-edge projects
● Happy with competitive compensation + flexibility in location and work hours
● Work only when they want work

5. We believe significant welfare improvements can be achieved through data-science-driven optimization of the online labor marketplace.

6. We have the biggest closed-loop online dataset of jobs and job seekers in labor history:
● Profiles (~10M)
● Job posts (~10M)
● Proposals (~100M)
● Messages (~100M)
● Hiring decisions (~10M)
● Contract progress (~1B)
● Feedback (~10M)
● Web site activity (~10B)
● Money transactions (~100M)

7. What do we need to ship data science products?

8. We need to support an agile data science workflow to provide quick and validated improvements!
● Data science analytics
○ Complete and cleansed data, a single source of ground truth
○ Tools for computing metrics, continuous validation
● Data science model development
○ Business objects and UI event data
○ Scaling complex data processing and feature computation
○ Discoverability of data and features
○ Batch + live data mismatches
○ Managing, monitoring and versioning of models and experiments
○ Knowledge sharing and code reuse (experiments, models, feature computation pipelines)
○ Flexibility to accommodate a variety of ML frameworks
● Data science model productionalization
○ Minimize differences between the trained model and production code
○ Code modularized, tested, integrated into the CI/CD workflow
○ Standardized model serving that is scalable, available, high-throughput, low-latency...
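One item in the list above, catching batch + live data mismatches, can be framed as a parity check: compute the same feature through both the batch-style and the streaming-style code path on a shared sample of events and fail when they diverge. The sketch below is a hypothetical illustration in plain Python; the function names, event shape and tolerance are all assumptions, not Upwork's actual code.

```python
# Hypothetical sketch: detect batch/live feature mismatches by computing the
# same feature through both code paths over a shared event sample.
# All names, the event shape, and the tolerance are illustrative assumptions.
import math

def batch_avg_rate(events):
    # Batch path: aggregate over the full cleansed sample at once,
    # as a nightly Spark job would.
    rates = [e["rate"] for e in events if e.get("rate") is not None]
    return sum(rates) / len(rates) if rates else 0.0

def live_avg_rate(events):
    # Live path: incremental running mean, as a streaming job might keep it.
    count, mean = 0, 0.0
    for e in events:
        if e.get("rate") is None:
            continue
        count += 1
        mean += (e["rate"] - mean) / count
    return mean

def parity_check(events, tol=1e-9):
    """Return True when the batch and live feature values agree within tol."""
    return math.isclose(batch_avg_rate(events), live_avg_rate(events),
                        rel_tol=tol, abs_tol=tol)

sample = [{"rate": 25.0}, {"rate": 40.0}, {"rate": None}, {"rate": 31.0}]
assert parity_check(sample)
```

Run over a golden sample in CI, a check like this turns a silent feature drift between the two implementations into a failing test.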

9. Upwork Data & ML Platform


11.
● Kinesis and Spark Structured Streaming for high-throughput live event data processing
● Moving away from a traditional DWH solution to distributed Spark-based batch data processing, avoiding performance issues and workload limitations
● Spark MLlib + TensorFlow as core ML libraries, balancing the tradeoff between flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow
● Experiment management and model versioning as an integral part of the CI/CD workflow
● The engineering CI/CD workflow adapted to data science using Jenkins, Databricks and Airflow: standalone model tests + live regression tests help identify batch/live data mismatches
● The Spark-based pipeline developed by data scientists is used directly for model scoring in the production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments and feature engineering code
● Graphite, ELK and PagerDuty for logging, monitoring and alerts
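One way to make model versioning an integral part of CI/CD, as the list above calls for, is to key every registered model artifact by the git revision of the training code plus a content hash of the training data snapshot, so any deployed model is traceable back to the exact experiment that produced it. The scheme below is a minimal in-memory sketch with made-up names; a real registry would back onto S3 or a database, and nothing here reflects Upwork's actual implementation.

```python
# Minimal sketch of a model-versioning scheme: each registered model is keyed
# by the git SHA of the training code plus a content hash of the training
# data snapshot. All names and structures are illustrative assumptions.
import hashlib
import json

def model_version(git_sha: str, training_rows: list) -> str:
    """Derive a reproducible version id from code revision + data snapshot."""
    data_digest = hashlib.sha256(
        json.dumps(training_rows, sort_keys=True).encode()
    ).hexdigest()[:12]
    return f"{git_sha[:8]}-{data_digest}"

class ModelRegistry:
    """Toy in-memory registry; a real one would persist to S3/a database."""
    def __init__(self):
        self._models = {}

    def register(self, git_sha, training_rows, artifact):
        version = model_version(git_sha, training_rows)
        self._models[version] = artifact
        return version

    def load(self, version):
        return self._models[version]

registry = ModelRegistry()
rows = [{"job_id": 1, "hired": True}, {"job_id": 2, "hired": False}]
v = registry.register("3f2a9c1d7b", rows, artifact={"weights": [0.1, 0.9]})
assert registry.load(v) == {"weights": [0.1, 0.9]}
# Same code revision + same data always yields the same version id:
assert v == model_version("3f2a9c1d7b", rows)
```

The design choice worth noting is determinism: because the version id is a pure function of code and data, retraining with unchanged inputs cannot silently produce a "new" model, which simplifies cache reuse and audit trails in the pipeline.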

12. Batch Data & ML Environment


15. Live Data & ML Pipeline


18. CI/CD Workflow


21. Pitfalls and Lessons Learned

22.
• Microservices can lead to data fragmentation and high downstream processing overhead
• Structured Streaming latency grows when the number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data remain a recurring source of bugs
• Differences between the trained and the deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflows and artifacts
• Databricks notebooks are very convenient for collaboration, documentation, code sharing and reuse, and results dissemination
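The Kinesis-consumer pitfall above is commonly mitigated with a single-reader fan-out: one consumer reads each event from the stream once and dispatches it internally to all interested handlers, instead of every downstream job holding its own stream consumer (per-shard read throughput is limited, so many consumers raise latency). The sketch below shows only the dispatch idea in plain Python; the handler names and event shapes are hypothetical, and no Kinesis API is involved.

```python
# Sketch of single-reader fan-out: one consumer reads the stream once and
# dispatches each event to every interested handler, instead of each handler
# holding its own Kinesis consumer (which increases read latency/throttling).
# Handler names and event shapes are illustrative assumptions.
from collections import defaultdict

class FanOutDispatcher:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def dispatch(self, event):
        # The event is read from the stream once, then delivered locally
        # to every subscriber for its type.
        for handler in self._handlers[event["type"]]:
            handler(event)

dispatcher = FanOutDispatcher()
seen_by_features, seen_by_metrics = [], []
dispatcher.subscribe("job_posted", seen_by_features.append)
dispatcher.subscribe("job_posted", seen_by_metrics.append)

for event in [{"type": "job_posted", "id": 1}, {"type": "proposal", "id": 2}]:
    dispatcher.dispatch(event)  # "proposal" has no subscribers, so it is skipped

assert seen_by_features == [{"type": "job_posted", "id": 1}]
assert seen_by_metrics == seen_by_features
```

In a real deployment the same effect is often achieved by landing the stream once into shared storage (e.g. S3 or a Delta table) and letting downstream jobs read from there rather than from Kinesis directly.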

23. Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring! Interested in doing work only when you want work? Join Upwork as a contractor!