Building Intelligent Applications & Experimental ML with Uber's Data Science Workbench
1 .Building Intelligent Applications & Experimental ML with Uber’s Data Science Workbench. Felix Cheung & Atul Gupte, Uber Technologies, Inc.
2 .Contents / Data at Uber / Analytics Stack / Spark at Uber / Machine Learning at Uber / Data Science Workbench / Common User Flows & Impact
3 ./ About Atul: Engineer turned Product Manager. Previously: building FarmVille & the mobile advertising platform @ Zynga. Currently: Product Manager for Data Science Workbench & Data Warehouse
4 ./ About Felix: Apache Spark PMC & Committer. Engineer, Tech Lead & Area Owner of Spark @ Uber
5 ./ Data at Uber
6 .Uber's mission is to bring reliable transportation - to everyone, everywhere
7 .Data informs every decision at the company
8 .Uber’s massive data holds deep, hidden insights. We surface them
9 .6,000+ data scientists, engineers, and operations managers rely on us to support the business
10 .Data is what differentiates Uber, but data at Uber is unlike anywhere else.
11 .What makes Uber unique (Business / Analytics / Consumers): Pluggable mobility platform; Real-time. Real-world.; 6,000 and growing; Delicate marketplace with spatio-temporal network effects; Apps/Machine generated queries; Bits to atoms; ML is Uber’s brain; Varied skills: BI to DNN; New LOBs spun up in a snap; Sheer scale; Internal and external
12 .Data Platform Team mission: Move the world with global data, local insights, and intelligent decisions.
13 ./ Data Analytics Stack
14 .The Data Team [diagram: layers labeled Data Platforms, Data Services & Analytics, and Data Infrastructure; boxes include Ad-Hoc & Streaming Analytics, Business Intelligence, Visualization, Machine Learning, Metadata/Knowledge, Experimentation/Segmentation, Workflow Management, and the stages Produce, Model, Disperse, Ingest, Store]
15 .Data Infrastructure [diagram labels: Experimentation, ML, BI, Apps, Ad-hoc, Notebooks; Streaming (Kafka, AthenaX); Hadoop (Hive, Presto, Spark); Warehouse (Vertica); Real-time; Observability; Security; Cluster Management; All-Active; Raw Data, Raw Tables, Modeled Tables; Schemaless; Apollo; SOA; Metadata/Workflow Management]
16 ./ Spark At Uber
17 .Spark at Uber Scale: 100,000+ Spark jobs per day; ~96% of ETL pipelines on Spark; ~98% of YARN job resource use (in vcore-seconds) on Spark ● 11,000+ machines across multiple data-centers ● Many tens of petabytes of data ● Runs on one of the largest production HDFS clusters
18 .Introducing Uber’s Spark Compute Service: simplifies lives of developers & cluster operators. Consolidate Infrastructure Investments: standardized Spark builds across Uber; YARN, Mesos; available across multiple data-centers. Improve Developer Experience: bring-your-own-stack (optional); advanced monitoring & debugging. Serve Multiple Use Cases: exploratory, bursty & scheduled batch; manage full Spark application lifecycle. Proliferate: better language support (R/Python/Java); consumption interfaces (CLI/REST/GUI)
19 .Session Recap (June 5th): Karthikeyan Natarajan, Senior Software Engineer; Bo Yang, Senior Software Engineer
20 ./ Machine Learning At Uber
21 .The hype ● Ability of a machine to learn without being explicitly programmed ● Identify hidden patterns in the world based on current and historical data and use it to predict the future ● Ability of a machine to get better at a task with data and experience ● Learn from mistakes and improve when given newer/more information
22 .Demand prediction; Object detection/tracking; Motion prediction; Route planning; Pick-up clustering; Voice recognition; Supply modeling; Occupancy modeling; Route planning, ETA, road modeling, image classifiers, drop-off clustering; Elasticity estimation, ETA, route optimization, demand prediction; Speech generation, natural language generation, low-latency image classifier
23 .Typical ML Workflow (launch and iterate): 1. define; 2. prototype; 3. productionize; 4. measure
24 .Problem Definition (1. define): UNDERSTAND BUSINESS NEED(S) ○ Customers + cross-functional team ○ Define objectives and key results; DEFINE MINIMUM VIABLE PRODUCT (MVP) ○ Data-driven ○ Research ○ Ruthless prioritization
25 .Exploration (2. prototype): GET DATA (SQL, Spark); DATA PREPARATION (data cleansing and pre-processing, R / Python); TRAIN MODELS (CPU or GPU); EVALUATE MODELS (validation, computational cost, interpretability)
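The four prototype stages on this slide (get data, data preparation, train models, evaluate models) can be sketched as a small loop. This is only an illustrative sketch using the Python standard library: the synthetic trip data, the function names, and the single-feature least-squares model are all assumptions made for the example, not anything taken from Uber's stack, where the data would come from SQL or Spark.

```python
import random
import statistics


def get_data(n=200, seed=0):
    """GET DATA: stand-in for a SQL/Spark query (hypothetical synthetic rows)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        distance_km = rng.uniform(1, 20)
        minutes = 3.0 * distance_km + 5.0 + rng.gauss(0, 1.0)
        # Simulate dirty data: about 5% of rows lose their label.
        rows.append((distance_km, None if rng.random() < 0.05 else minutes))
    return rows


def prepare(rows):
    """DATA PREPARATION: drop rows whose label is missing."""
    return [(d, m) for d, m in rows if m is not None]


def train(rows):
    """TRAIN MODELS: ordinary least squares with one feature."""
    xs = [d for d, _ in rows]
    ys = [m for _, m in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(model, rows):
    """EVALUATE MODELS: mean absolute error on held-out rows."""
    slope, intercept = model
    return statistics.fmean([abs(slope * d + intercept - m) for d, m in rows])


rows = prepare(get_data())
split = int(0.8 * len(rows))          # simple 80/20 train/holdout split
model = train(rows[:split])
mae = evaluate(model, rows[split:])
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} holdout MAE={mae:.2f}")
```

Keeping each stage as its own function mirrors the boxes in the diagram, so any one stage can be swapped out without touching the others.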
26 .Production (3. productionize): PRODUCTIONIZE MODELS (Engineers + Data Scientists, Java or Go, unit tests); DEPLOY MODELS (experimentation and rollout monitoring; retraining strategy); MAKE PREDICTIONS (real-time or batch)
27 .Measure (4. measure): MONITOR PREDICTIONS (automatically detect degradations); GATHER AND ANALYZE INSIGHTS (deep-dive analyses inform future product roadmap). Full cycle: UNDERSTAND BUSINESS NEED(S), DEFINE MINIMUM VIABLE PRODUCT (MVP), GET DATA, DATA PREPARATION, TRAIN MODELS, EVALUATE MODELS, PRODUCTIONIZE MODELS, DEPLOY MODELS, MAKE PREDICTIONS, MONITOR PREDICTIONS, GATHER AND ANALYZE INSIGHTS
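One way to read "automatically detect degradations" is a rolling check of prediction error against a baseline. The sketch below is a hypothetical illustration of that idea only: the `DegradationMonitor` class, its window size, and its tolerance factor are assumptions for the example, and the slides do not say how Uber implements monitoring.

```python
from collections import deque


class DegradationMonitor:
    """Flags degradation when the rolling mean absolute error exceeds
    tolerance * baseline_mae (hypothetical rule, for illustration only)."""

    def __init__(self, baseline_mae, window=50, tolerance=1.5):
        self.baseline_mae = baseline_mae
        self.tolerance = tolerance
        self.errors = deque(maxlen=window)  # keeps only the latest errors

    def observe(self, predicted, actual):
        """Record one prediction/outcome pair; return True if degraded."""
        self.errors.append(abs(predicted - actual))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough observations to judge yet
        rolling_mae = sum(self.errors) / len(self.errors)
        return rolling_mae > self.tolerance * self.baseline_mae


monitor = DegradationMonitor(baseline_mae=1.0, window=5, tolerance=1.5)
# Five healthy observations with error 1.0, then one with error 4.0.
healthy = [monitor.observe(10.0, 11.0) for _ in range(5)]
degraded = monitor.observe(10.0, 14.0)
print(healthy, degraded)
```

A fixed-size window keeps memory bounded and lets the alert clear once errors return to the baseline.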
28 .Our world in 2016: 3x growth in Data Science community; Py and R; Machine Learning was mostly DIY - and on laptops; Moving Py models to production was hard; Proliferation of tools, libraries, infra; None of which could scale to 1000s; Collaboration and Sharing non-existent; Security / Compliance / DC redundancy
29 .Data Science Workbench eng.uber.com/dsw