MLOps: TFX Pipelines and Large-Scale Machine Learning Applications
1 . MLOps TFX Pipelines: Large-Scale Machine Learning Applications • 江骏, Machine Learning GDE (Google Developer Expert) • Ant Group · Senior Technical Expert
2 . MLOps • An ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). https://www.youtube.com/watch?v=6gdrwFMaEZ0&ab_channel=GoogleCloudTech
3 . MLOps Goals • Faster experimentation and model development • Faster deployment of updated models into production • Quality Assurance • Reusability and Reproducibility
4 . Needs for MLOps • Version the source data and its attributes • Experimentation: build the model (feature selection/generation, algorithm selection, hyperparameter tuning, etc.) • Learn from your mistakes: track metrics, source-control the code, checkpoint steps in the ML lifecycle, automate validation and deployment • Model drift and data drift (model-centric view, data-centric view) • Monitoring model performance
5 . Needs for MLOps • Waze: the world's largest community-based traffic and navigation app • Predicting ETA • Matching Riders & Drivers (Carpool) • Serving The Right Ads • Challenges: • Multiple ML frameworks, you name it (sklearn, xgboost, TensorFlow, fbprophet, Java PMML, hand-made, etc.) • ML & Ops disconnect: models and feature engineering are embedded in (Java) backend servers by engineers, with limited monitoring and validation capabilities • Semi-manual operations for training, validation and deployment • A hideously long development cycle from idea to production
6 . Needs for MLOps • Airbus & International Space Station (ISS) • Ensure the health of the crew as well as hundreds of systems onboard the Columbus module • Challenges: • Keep track of many telemetry data streams, which are constantly beamed to Earth • 10 years' worth of historical data • About 5 trillion data points (10y * 365d * 24h * 60min * 60s * 17K params) https://blog.tensorflow.org/2020/04/how-airbus-detects-anomalies-iss-telemetry-data-tfx.html
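The data-point estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes one sample per parameter per second, which is what the slide's formula implies, and 365-day years; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the telemetry volume quoted on the slide.
# Assumptions: one sample per parameter per second, 365-day years.
YEARS = 10
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
PARAMS = 17_000

total_points = YEARS * SECONDS_PER_YEAR * PARAMS
print(f"{total_points:,} data points (~{total_points / 1e12:.1f} trillion)")
```

Under these assumptions the product comes out slightly above 5 trillion, consistent with the rounded figure on the slide.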
7 . Needs for MLOps • Large Models for NLP • ZeRO trains large models of over 100B parameters with super-linear speedup on 400 GPUs, achieving a throughput of 15 petaflops • ZeRO can train large models of up to 13B parameters without requiring model parallelism • For comparison: BERT-large (0.3B), GPT-2 (1.5B), Megatron GPT (8.3B), T5 (11B)
8 . TensorFlow Extended (TFX) Goals • Building one machine learning platform for many different learning tasks • Continuous training and serving • Human-in-the-loop & Easy-to-use • Production-level reliability and scalability
10 . Data Analysis (pipeline stages: Data Analysis, Data Transformation, Data Validation, Training, Model Evaluation, Model Validation, Serving) • Continuous features: quantiles, equi-width histograms, mean, standard deviation • Discrete features: top-K values by frequency • ...... • On large training data, some of these statistics become difficult to compute exactly, and the component resorts to distributed streaming algorithms that give approximate results.
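As a rough illustration of the statistics listed above, the following sketch computes them exactly, in memory, with the Python standard library. It is not how TFX computes them (the slide notes that large datasets need approximate streaming algorithms); the function names, bucket count, and sample values are assumptions made for the example.

```python
import statistics
from collections import Counter


def continuous_stats(values, num_buckets=4):
    """Exact summary statistics for one continuous feature."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / num_buckets or 1.0
    histogram = [0] * num_buckets
    for v in values:
        # Clamp so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        histogram[idx] += 1
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "quartiles": statistics.quantiles(values, n=4),
        "histogram": histogram,  # equi-width buckets between lo and hi
    }


def discrete_stats(values, k=2):
    """Top-K values by frequency for one discrete feature."""
    return {"top_k": Counter(values).most_common(k)}


# Hypothetical sample data.
fare_stats = continuous_stats([2.0, 4.0, 4.0, 6.0, 10.0])
payment_stats = discrete_stats(
    ["Cash", "Credit Card", "Cash", "Cash", "Credit Card", "Dispute"]
)
print(fare_stats)
print(payment_stats)
```

An exact pass like this needs all values in memory, which is the cost that motivates approximate streaming algorithms at scale.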
11 . Data Analysis - Schema • Features present in the data. • The expected type of each feature. • The expected presence of each feature, in terms of a minimum count and fraction of examples that must contain the feature. • The expected valency of the feature in each example, i.e., minimum and maximum number of values. • The expected domain of a feature, i.e., the small universe of values for a string feature, or range for an integer feature.
Schema:
    feature {
      name: "fare"
      value_count { min: 1 max: 1 }
      type: FLOAT
      presence { min_fraction: 1.0 min_count: 1 }
    }
    feature {
      name: "trip_start_hour"
      value_count { min: 1 max: 1 }
      type: INT
      presence { min_fraction: 1.0 min_count: 1 }
    }
    feature {
      name: "payment_type"
      value_count { min: 1 max: 1 }
      type: BYTES
      domain: "payment_type"
      presence { min_fraction: 1.0 min_count: 1 }
    }
    string_domain {
      name: "payment_type"
      value: "Cash"
      value: "Credit Card"
      value: "Dispute"
      value: "No Charge"
      value: "Pcard"
      value: "Unknown"
      value: "Prcard"
    }
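To make the schema properties concrete, here is a minimal sketch of checking type, presence, and valency for one feature in plain Python. The `FeatureSpec` class and `check_feature` function are hypothetical and are not part of the TFX or TensorFlow Metadata API; examples are assumed to be dictionaries mapping a feature name to a list of values.

```python
from dataclasses import dataclass


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for one schema entry."""
    name: str
    py_type: type              # expected type of every value
    min_values: int = 1        # valency: minimum values per example
    max_values: int = 1        # valency: maximum values per example
    min_fraction: float = 1.0  # presence: share of examples with the feature


def check_feature(spec, examples):
    """Return a list of human-readable problems for one feature."""
    problems = []
    present = [ex[spec.name] for ex in examples if spec.name in ex]
    fraction = len(present) / len(examples) if examples else 0.0
    if fraction < spec.min_fraction:
        problems.append(
            f"{spec.name}: present in {fraction:.0%} of examples, "
            f"expected at least {spec.min_fraction:.0%}"
        )
    for values in present:
        if not spec.min_values <= len(values) <= spec.max_values:
            problems.append(
                f"{spec.name}: {len(values)} values, "
                f"expected {spec.min_values}..{spec.max_values}"
            )
        if any(not isinstance(v, spec.py_type) for v in values):
            problems.append(f"{spec.name}: value of unexpected type")
    return problems


fare = FeatureSpec(name="fare", py_type=float)
# Hypothetical data: one good example, one with two values, one missing "fare".
problems = check_feature(fare, [{"fare": [12.5]}, {"fare": [3.0, 4.0]}, {}])
for p in problems:
    print(p)
```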
12 . Data Validation - Key design principles • The user should understand at a glance which anomalies are detected and their coverage over the data. • Each anomaly should have a simple description that helps the user understand how to debug and fix the data. • In some cases the anomalies correspond to a natural evolution of the data, and the appropriate action is to change the schema (rather than fix the data). • We want the user to treat data errors with the same rigor and care that they deal with bugs in code. • These principles have affected both the logic to detect anomalies and the presentation of anomalies in the UI component of TFX.
13 . Data Validation - Anomalies
Anomalies:
    anomaly_info {
      key: "company"
      value {
        description: "Examples contain values missing from the schema: 2092 - 61288 Sbeih company (<1%), 2192 - ..."
        severity: ERROR
        short_description: "Unexpected string values"
        reason {
          type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
          short_description: "Unexpected string values"
          description: "Examples contain values missing from the schema: 2092 - 61288 Sbeih com<1%), ..."
        }
      }
    }
    anomaly_info {
      key: "payment_type"
      value {
        description: "Examples contain values missing from the schema: Prcard (<1%). "
        severity: ERROR
        short_description: "Unexpected string values"
        reason {
          type: ENUM_TYPE_UNEXPECTED_STRING_VALUES
          short_description: "Unexpected string values"
          description: "Examples contain values missing from the schema: Prcard (<1%). "
        }
      }
    }
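The "unexpected string values" anomaly can be approximated in a few lines: count the values of a string feature and report those outside the schema's domain, along with their share of the data. This is a sketch only; the function names, the sample domain, and the sample data are assumptions, and the real detection logic in TensorFlow Data Validation may differ.

```python
from collections import Counter


def unexpected_string_values(values, domain):
    """Return (value, fraction) pairs for values outside the allowed domain."""
    counts = Counter(values)
    total = len(values)
    return [(v, n / total) for v, n in counts.items() if v not in domain]


def describe(anomalies):
    """Build a short description in the style shown on the slide."""
    parts = []
    for value, fraction in anomalies:
        share = "<1%" if fraction < 0.01 else f"{fraction:.0%}"
        parts.append(f"{value} ({share})")
    return "Examples contain values missing from the schema: " + ", ".join(parts)


# Hypothetical domain and data: "Prcard" is a misspelling not in the domain.
domain = {"Cash", "Credit Card"}
values = ["Cash"] * 150 + ["Credit Card"] * 49 + ["Prcard"]
anomalies = unexpected_string_values(values, domain)
message = describe(anomalies)
print(message)
```

Reporting the affected share alongside the value follows the design principle that the user should see an anomaly's coverage at a glance.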
14 . Data Transformation • TFX exports any data transformations as part of the trained model. • Feature Store: e.g. Feast
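One way to picture why transformations are exported with the model: if the fitted preprocessing parameters travel with the weights, serving code cannot apply a different scaling than training did. The sketch below is a toy illustration in plain Python, not the TensorFlow Transform API; the `ExportedModel` class, the z-score scaling, and the weight values are all assumptions.

```python
import statistics


class ExportedModel:
    """Toy model that bundles preprocessing parameters with its weights."""

    def __init__(self, mean, std, weight, bias):
        self.mean, self.std = mean, std
        self.weight, self.bias = weight, bias

    def transform(self, raw_fare):
        # The same z-score scaling is applied at training and serving time.
        return (raw_fare - self.mean) / self.std

    def predict(self, raw_fare):
        # Callers pass raw values; the transformation is part of the model.
        return self.weight * self.transform(raw_fare) + self.bias


def fit_transform_params(raw_fares):
    """Compute the statistics the transformation depends on (analysis pass)."""
    return statistics.fmean(raw_fares), statistics.pstdev(raw_fares)


mean, std = fit_transform_params([2.0, 4.0, 6.0])  # hypothetical training data
model = ExportedModel(mean, std, weight=2.0, bias=1.0)  # hypothetical weights
print(model.predict(4.0))
```

Because `predict` accepts raw values, a serving system needs no separate copy of the preprocessing logic.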
15 . Training • Warm-Starting • High-Level Model Specification API • Kubeflow (tf-operator, pytorch-operator, mpi-operator, …)
16 . Model Evaluation & Model Validation • Model evaluation: human-facing metrics of model quality • Model validation: machine-facing judgment of model goodness • Having a reusable component that automatically evaluates and validates models to ensure that they are “good” before serving them to users can help prevent unexpected degradations in the user experience.
17 . Model Evaluation • Since it is costly and time-consuming to run A/B experiments on live traffic, models are evaluated offline on held-out data to determine if they are promising enough to start an online A/B experiment.
18 . Model Validation • Any new model failing any of these checks is not pushed to serving, and product teams are alerted.
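The gating behaviour described here can be sketched as a function that runs a list of checks and only pushes when none fail. The specific checks are not listed in the slide text, so the two below (an absolute AUC floor and a maximum regression against the current baseline), their thresholds, and the `push`/`alert` callbacks are all hypothetical.

```python
def validate_model(candidate, baseline, min_auc=0.7, max_regression=0.01):
    """Return the list of failed checks; an empty list means 'safe to push'."""
    failures = []
    if candidate["auc"] < min_auc:
        failures.append(f"AUC {candidate['auc']:.2f} is below floor {min_auc:.2f}")
    if baseline["auc"] - candidate["auc"] > max_regression:
        failures.append(
            f"AUC regressed from {baseline['auc']:.2f} to {candidate['auc']:.2f}"
        )
    return failures


def maybe_push(candidate, baseline, push, alert):
    """Push the candidate only if every check passes; otherwise alert."""
    failures = validate_model(candidate, baseline)
    if failures:
        alert(failures)
        return False
    push(candidate)
    return True


# Hypothetical offline metrics computed on held-out data.
pushed, alerts = [], []
baseline = {"auc": 0.80}
accepted = maybe_push({"auc": 0.81}, baseline, pushed.append, alerts.append)
rejected = maybe_push({"auc": 0.78}, baseline, pushed.append, alerts.append)
print(accepted, rejected, alerts)
```

Keeping the checks in one function makes the pass/fail decision machine-facing, while the failure messages give product teams something readable in the alert.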
19 . Serving • Low latency and high efficiency. • A specialized protocol buffer parser. • ...
20 . TFX (TensorFlow Extended) [Figure: pipeline overview. Stages: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Training, Model Evaluation, Model Validation, Serving. Artifacts shown: data, TFRecords, stats (training), stats (eval), schema, anomalies, transform steps, model, metrics, TensorBoard plots, eval result, saved model, serving.] Copyright © 江骏 (ohmystack)
21 . TFX Paper • https://dl.acm.org/citation.cfm?id=3098021
22 . Open Source TFX • https://www.tensorflow.org/tfx/ • Schema & Anomalies: https://github.com/tensorflow/metadata • Data Validation: https://github.com/tensorflow/data-validation • Data Transformation: https://github.com/tensorflow/transform • Model Validation: https://github.com/tensorflow/model-analysis • Serving: https://github.com/tensorflow/serving
23 . I’m 江骏 / ohmystack • https://www.slideshare.net/jiangjun1990/presentations • https://github.com/ohmystack ⭐ • http://ohmystack.com/ • https://www.linkedin.com/in/ohmystack/