Building Intelligent Applications & Experimenting with ML on Uber's Data Science Workbench

In this article, we explore how Uber enables rapid experimentation with machine learning models and optimization algorithms through its Data Science Workbench (DSW). DSW covers the full range of stages in a data scientist's workflow, including data exploration, feature engineering, machine learning model training, testing, and production deployment. It provides interactive notebooks in multiple languages with on-demand resource allocation, and lets users share their work through community features.

1.Building Intelligent Applications & Experimental ML with Uber’s Data Science Workbench Felix Cheung & Atul Gupte Uber Technologies, Inc.

2.Contents / Data at Uber / Analytics Stack / Spark at Uber / Machine Learning at Uber / Data Science Workbench / Common User Flows & Impact

3./ About Atul
Engineer turned Product Manager
Previously: built FarmVille & the mobile advertising platform @ Zynga
Currently: Product Manager for Data Science Workbench & Data Warehouse

4./ About Felix
Apache Spark PMC & Committer
Engineer, Tech Lead & Area Owner of Spark @ Uber

5./ Data at Uber

6.Uber's mission is to bring reliable transportation - to everyone, everywhere

7.Data informs every decision at the company

8.Uber’s massive data holds deep, hidden insights. We surface them

9.6,000+ data scientists, engineers, and operations managers rely on us to support the business

10. Data is what differentiates Uber, but data at Uber is unlike anywhere else.

11. What makes Uber unique
- Business: pluggable mobility platform; delicate marketplace with spatio-temporal network effects; bits to atoms; new LOBs spun up in a snap
- Analytics: real-time, real-world; apps and machine-generated queries; ML is Uber's brain; sheer scale
- Consumers: 6,000 and growing; varied skills, from BI to DNN; internal and external

12.Data Platform Team MISSION Move the world with global data, local insights, and intelligent decisions.

13./ Data Analytics Stack

14. The Data Team (org diagram): spans Ad-Hoc & Streaming Analytics, Business Intelligence, Visualization, Machine Learning, Experimentation/Segmentation, Metadata/Knowledge, Data Platforms, Data Services & Analytics, Data Infrastructure (ingest, store, disperse), and Workflow Management.

15. Data infrastructure (stack diagram): applications (Experimentation, ML, BI Apps, Ad-hoc, Notebooks, Observability) on top of engines (Streaming: Kafka, AthenaX; Hadoop: Hive, Presto, Spark; Warehouse: Vertica; Real-time: Schemaless, Apollo) over raw and modeled data tables, with cross-cutting Security, Cluster Management, All-Active, Metadata/Workflow Management, and SOA.

16./ Spark At Uber

17. Spark at Uber Scale
- 100,000+ Spark jobs per day
- ~96% of YARN job resource use (in vcore-seconds) on Spark
- ~98% of ETL pipelines on Spark
- 11,000+ machines across multiple data centers
- Many tens of petabytes of data
- Runs on one of the largest production HDFS clusters

18. Introducing Uber's Spark Compute Service: simplifies the lives of developers & cluster operators
- Consolidate infrastructure investments: standardized Spark builds across Uber; YARN and Mesos; available across multiple data centers
- Improve developer experience: bring-your-own-stack (optional); advanced monitoring & debugging; better language support (R/Python/Java)
- Serve multiple use cases: exploratory, bursty & scheduled batch; manage the full Spark application lifecycle
- Proliferate consumption: consumption interfaces (CLI/REST/GUI)

19. Session Recap (June 5th): Karthikeyan Natarajan, Senior Software Engineer; Bo Yang, Senior Software Engineer

20./ Machine Learning At Uber

21. The hype
- Ability of a machine to learn without being explicitly programmed
- Identify hidden patterns in the world based on current and historical data and use them to predict the future
- Ability of a machine to get better at a task with data and experience
- Learn from mistakes and improve when given newer or more information
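The "learn without being explicitly programmed" idea above can be shown with a minimal, self-contained sketch (not Uber's code): a model recovers a hidden linear rule purely from example pairs, via gradient descent, with no hand-coded formula for the answer.

```python
# Minimal illustration of machine learning: fit y = w * x from examples
# by gradient descent. The rule y = 3x is never written into the program;
# the model recovers it from data alone.

def fit_slope(pairs, lr=0.01, epochs=200):
    """Learn the slope w that best maps x -> y over the given (x, y) pairs."""
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            error = w * x - y      # how wrong the current model is on this example
            w -= lr * error * x    # nudge w to reduce that error
    return w

# Data generated by the hidden rule y = 3x; more data/epochs -> better estimate.
data = [(1, 3), (2, 6), (3, 9)]
w = fit_slope(data)
print(round(w, 2))  # close to 3.0
```

The model also "improves with data and experience": rerunning with more epochs or more examples tightens the estimate, which is the third bullet above in miniature.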

22. ML use cases at Uber: demand prediction; supply modeling; elasticity estimation; object detection/tracking; motion prediction; low-latency image classifiers; route planning and optimization; ETA; road modeling; pick-up and drop-off clustering; occupancy modeling; voice recognition; speech generation; natural language generation.

23. Typical ML Workflow (launch and iterate): 1. define → 2. prototype → 3. productionize → 4. measure

24. Problem Definition (1. define)
- Understand business need(s): customers + cross-functional team; define objectives and key results
- Define minimum viable product (MVP): data-driven; research; ruthless prioritization

25. Exploration (2. prototype)
- Get data: SQL, Spark
- Data preparation: data cleansing and pre-processing
- Train models: R / Python, on CPU or GPU
- Evaluate models: validation, computational cost, interpretability
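A minimal sketch of the data-preparation step in this exploration stage (all names and sample records are illustrative, not Uber data): get raw rows, cleanse out incomplete ones, and shape the survivors into (feature, label) pairs ready for training.

```python
# Hypothetical exploration-stage pipeline: cleanse raw records, then
# turn them into (x, y) training pairs for an ETA-style model.

raw_trips = [
    {"distance_km": 5.2, "duration_min": 14.0},
    {"distance_km": None, "duration_min": 9.0},   # missing value -> dropped
    {"distance_km": 2.1, "duration_min": 7.5},
    {"distance_km": 8.8, "duration_min": 24.0},
]

def cleanse(rows):
    """Data cleansing: keep only records with no missing fields."""
    return [r for r in rows if all(v is not None for v in r.values())]

def to_features(rows):
    """Feature engineering: (distance, duration) pairs for model training."""
    return [(r["distance_km"], r["duration_min"]) for r in rows]

features = to_features(cleanse(raw_trips))
print(len(features))  # 3 complete examples remain
```

In practice the "get data" step would run as SQL or Spark at scale; this pure-Python version only shows the shape of the cleanse-then-featurize flow.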

26. Production (3. productionize)
- Productionize models: engineers + data scientists; Java or Go; unit tests
- Deploy models: experimentation and rollout monitoring; retraining strategy
- Make predictions: real-time or batch
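One way to picture the deploy-then-predict handoff in this stage, as a hedged sketch: training produces a serialized model artifact, and a separate batch-prediction path loads that artifact and scores inputs. The JSON model and field names are illustrative; at Uber the serving side would be a Java or Go service, not Python.

```python
import json

# Pretend output of the training stage: a tiny linear ETA model.
model = {"slope": 2.7, "intercept": 1.5}
artifact = json.dumps(model)   # "deploy": ship the model as a versioned artifact

def predict_batch(serialized_model, distances_km):
    """Batch prediction path: load the deployed artifact, score all inputs."""
    m = json.loads(serialized_model)
    return [m["slope"] * d + m["intercept"] for d in distances_km]

etas = predict_batch(artifact, [2.0, 5.0])
print([round(e, 1) for e in etas])  # [6.9, 15.0]
```

Separating the artifact from the code is what lets engineers rewrite the serving side in Java or Go, with unit tests, without retraining the model.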

27. Measure (4. measure)
- Monitor predictions: automatically detect degradations
- Gather and analyze insights: deep-dive analyses inform the future product roadmap
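The "automatically detect degradations" idea can be sketched as a simple threshold check (the tolerance value and error numbers below are illustrative assumptions, not Uber's monitoring logic): compare recent prediction error against the error observed at launch, and flag the model when it drifts too far.

```python
# Sketch of automatic degradation detection: alert when recent mean absolute
# error grows well past the baseline error recorded at rollout.

def degraded(baseline_errors, recent_errors, tolerance=1.5):
    """Flag the model if recent MAE exceeds tolerance x baseline MAE."""
    baseline = sum(baseline_errors) / len(baseline_errors)
    recent = sum(recent_errors) / len(recent_errors)
    return recent > tolerance * baseline

print(degraded([1.0, 1.2, 0.9], [1.1, 1.0]))  # False: still healthy
print(degraded([1.0, 1.2, 0.9], [2.5, 3.0]))  # True: time to retrain
```

A check like this closes the loop back to stage 1: a degradation alert feeds the retraining strategy and the deep-dive analyses that shape the next iteration.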

28. Our world in 2016
- 3x growth in the data science community (Python and R)
- Machine learning was mostly DIY, often on laptops
- Moving Python models to production was hard
- Proliferation of tools, libraries, and infrastructure, none of which could scale to thousands
- Collaboration and sharing were non-existent
- Security / compliance / data-center redundancy concerns

29.Data Science Workbench eng.uber.com/dsw