1. How to Rebuild Data and ML Platform using Kinesis, S3, Spark, MLlib, Databricks, Airflow and Upwork
Data Science Infrastructure Team, Thanh Tran, Upwork
2. Introduction
3.
● Nikolay Melnik (Lead ML Engineer, Ukraine)
● Roman Tkachuk (Senior Data Engineer, Ukraine)
● Andrei Demus (Data/ML Engineer, Ukraine)
● Anna Lysak (Data/ML Engineer, Ukraine)
● Igor Korsunov (ML Engineer, Russia)
● Dimitris Manikis (Senior Data Engineer, Greece)
● Yongtao Ma (Senior ML Engineer, Germany)
● Giannis Koutsoubos (Lead Backend Engineer, Greece)
● Artem Moskvin (Data/ML Engineer, Germany)
4. With Upwork, our new hires AND I are better off!
Me:
● QUALITY: highest-skilled experts for the job
● COST/EARNING: competitive/lower rate
● AGILITY: mix of long-term and project-based staff
My team:
● QUALITY: work on cutting-edge projects
● COST/EARNING: happy with competitive compensation + flexibility in location and work hours
● AGILITY: work only when they want to work
5. We believe significant welfare improvements can be achieved through data-science-driven optimization of the online labor marketplace.
6. We have the biggest closed-loop online dataset of jobs and job seekers in labor history:
● Profiles (~10M)
● Job Posts (~10M)
● Proposals (~100M)
● Messages (~100M)
● Hiring decisions (~10M)
● Contract progress (~1B)
● Feedback (~10M)
● Web site activity (~10B)
● Money transactions (~100M)
7. What do we need to ship data science products?
8. We need to support an agile data science workflow to provide quick and validated improvements!
● Data Science analytics
  ○ Complete and cleansed data, single ground truth
  ○ Tools for computing metrics, continuous validation
● Data Science model development
  ○ Business objects and UI event data
  ○ Scaling complex data processing and feature computation
  ○ Discoverability of data and features
  ○ Batch + live data mismatches
  ○ Managing, monitoring and versioning of models and experiments
  ○ Knowledge sharing and code reuse (experiments, models, feature computation pipelines)
  ○ Flexibility to accommodate a variety of ML frameworks
● Data Science model productionalization
  ○ Minimize differences between trained model and production code
  ○ Code modularized, tested, integrated into the CI/CD workflow
  ○ Standardized model serving that is scalable, available, high-throughput, low-latency...
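One of the needs above, minimizing differences between the trained model and production code, can be illustrated with a shared feature-computation function: if the batch training path and the live scoring path call the exact same code, their features cannot diverge. This is a minimal, hypothetical sketch; all names (`compute_features`, the event fields) are illustrative, not Upwork's actual code.

```python
# Sketch: one feature function shared by batch training and live scoring,
# so batch and live features are computed identically by construction.
# Field names (job_title, budget, proposals) are invented for illustration.

def compute_features(event: dict) -> dict:
    """Derive model features from a raw business event."""
    title = event.get("job_title", "")
    return {
        "title_length": len(title),
        "has_budget": event.get("budget") is not None,
        "proposal_count": event.get("proposals", 0),
    }

# Batch path: applied over historical records in a training job.
batch_rows = [{"job_title": "Build an ETL pipeline", "budget": 500, "proposals": 12}]
batch_features = [compute_features(r) for r in batch_rows]

# Live path: the same function, called on a live event.
live_features = compute_features({"job_title": "Build an ETL pipeline",
                                  "budget": 500, "proposals": 12})
assert batch_features[0] == live_features
```

The same principle is what motivates reusing the Spark-based pipeline itself for production scoring, as the platform slides below describe.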
9. Upwork Data & ML Platform
10. [diagram]
11.
● Kinesis and Spark Structured Streaming for high-throughput live event data processing
● Moving away from a traditional DWH solution to distributed Spark-based batch data processing to avoid performance issues and workload limitations
● Spark MLlib + TensorFlow as core ML libraries to balance the tradeoff between flexibility and standardized model engineering
● Data processing, feature computation and pipeline retraining jobs scheduled and orchestrated via Airflow
● Experiment management and model versioning as an integral part of the CI/CD workflow
● Adapting the engineering CI/CD workflow to data science using Jenkins, Databricks and Airflow: standalone model testing + live regression tests help identify batch and live data mismatches
● Spark-based pipelines developed by data scientists used directly for model scoring in the production environment
● Microservices for streamlined model serving, scalability, availability...
● Extensive use of Databricks notebook-based documentation of models, experiments and feature engineering code
● Graphite, ELK and PagerDuty for logging, monitoring and alerts
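The "standalone model testing + live regression test" idea above can be sketched in miniature: replay the same inputs through the offline (batch-trained) scorer and the deployed scorer, and flag any disagreement beyond a tolerance. The two scoring functions below are illustrative stand-ins, not the actual Spark MLlib pipeline or serving microservice.

```python
# Sketch of a live regression test: the deployed scorer should reproduce
# the offline pipeline's scores, since both are meant to be the same
# artifact. Any mismatch flags batch/live drift. Names are illustrative.
import math

def batch_score(features: dict) -> float:
    # Stand-in for the offline (batch-trained) pipeline's score.
    return 0.3 * features["title_length"] + 0.7 * features["proposal_count"]

def live_score(features: dict) -> float:
    # Stand-in for the production scorer; built from the same artifact,
    # so it must agree with batch_score within tolerance.
    return 0.3 * features["title_length"] + 0.7 * features["proposal_count"]

def regression_test(samples, tol=1e-6):
    """Return the feature rows whose live and batch scores disagree."""
    return [f for f in samples
            if not math.isclose(batch_score(f), live_score(f), abs_tol=tol)]

mismatches = regression_test([
    {"title_length": 21, "proposal_count": 12},
    {"title_length": 8, "proposal_count": 0},
])
assert mismatches == []  # any nonempty result fails the CI/CD gate
```

In a real pipeline the sample inputs would be recorded live events, and the check would run as a gated step in the Jenkins/Airflow workflow.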
12. Batch Data & ML Environment
13. [diagram]
14. [diagram]
15. Live Data & ML Pipeline
16. [diagram]
17. [diagram]
18. CI/CD Workflow
19. [diagram]
20. [diagram]
21. Pitfalls and Lessons Learned
22.
• Microservices can lead to data fragmentation and high downstream processing overhead
• Structured Streaming latency grows when the number of Kinesis consumers is high
• Stream-to-stream/stream-to-batch joins are not yet suitable for real-time use cases
• Differences between live and batch data
• Differences between the trained and the deployed ML pipeline can be minimized
• CI/CD needs to be customized to support data science workflows and artefacts
• Databricks notebooks are very convenient for collaboration, documentation, code sharing and reuse, and results dissemination
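A lightweight guard against the "differences between live and batch data" pitfall is to diff each live event's fields against the batch table schema before it reaches feature computation, so drift surfaces immediately rather than as silent model degradation. A hypothetical sketch (the schema and field names are invented for illustration):

```python
# Sketch: detect live-vs-batch schema drift by comparing a live event's
# fields against the batch schema. All field names are illustrative.

BATCH_SCHEMA = {"job_id", "job_title", "budget", "proposals"}

def schema_drift(event: dict) -> dict:
    """Report fields missing from the event and fields unknown to batch."""
    keys = set(event)
    return {"missing": sorted(BATCH_SCHEMA - keys),
            "unexpected": sorted(keys - BATCH_SCHEMA)}

drift = schema_drift({"job_id": 1, "job_title": "ETL", "client_tz": "UTC"})
# drift == {"missing": ["budget", "proposals"], "unexpected": ["client_tz"]}
```

Wired into the streaming path, a nonempty report would feed the Graphite/ELK monitoring and PagerDuty alerting mentioned earlier.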
23. Interested in search & recommendations, multi-sided matching or online labor marketplace optimization? We are hiring!
Interested in doing work only when you want to work? Join Upwork as a contractor!