Zhang Xin: Cloud-Native Intelligence Powering Enterprise Digital and Intelligent Transformation

2.7th Global Software Case Study Summit | AI / AIOps / DevOps | 2018-11-30 to 12-03 | 100+

4.Market Competition

5. AI and ML spending: from $12 billion in 2017 to $57.6 billion in 2021 (Deloitte). Nearly 0.8 million GPUs in datacenters; the number of ML activities worldwide is doubling each year. 61% of interviewees plan to use ML in 2019; 58% already have ML in use. 70% will adopt AI by 2030. The GPU market in China is booming: a 230% increase, to 3.5 billion RMB in 2017.

6.“Can I tweak the bought models and APIs?” “I don’t want to give out my sensitive data and business ideas.”

7.“Which framework, models, hyperparameters to use?”

8.“How do I speed up training of a really deep network on a really huge amount of data?”

9.“How do we allocate our 400 GPUs across 20 model development efforts?”

10.From Enterprise Almanac 2018 by Work-Bench

12.“Ge tech wor is a chal And

13. [Figure: components surrounding ML code in a real-world system: Configuration, Data Collection, Feature Extraction, ML Code, Data Verification, Machine Resource Management, Analysis Tools, Serving Infrastructure, Monitoring] Source: Sculley et al.: Hidden Technical Debt in Machine Learning Systems

16.Building a model

17. [Figure: ML pipeline stages: Data ingestion, Data analysis, Data transformation, Data validation, Data splitting, Trainer, Building a model, Model validation, Training at scale, Roll-out, Serving, Monitoring, Logging]
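
The data validation and data splitting stages named on this slide can be illustrated with a minimal Python sketch. The record schema (`feature`, `label`), the function names, and the 80/20 split are assumptions made for illustration only; the talk does not prescribe them.

```python
import random

def validate(records):
    """Keep only records that have a numeric feature and a 0/1 label."""
    valid = []
    for r in records:
        if isinstance(r.get("feature"), (int, float)) and r.get("label") in (0, 1):
            valid.append(r)
    return valid

def split(records, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed, then split into train and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical input: ten good records plus one that fails validation.
raw = [{"feature": float(i), "label": i % 2} for i in range(10)]
raw.append({"feature": None, "label": 1})

clean = validate(raw)
train, evaluation = split(clean)
print(len(clean), len(train), len(evaluation))
```

Fixing the shuffle seed keeps the split repeatable, which matters once the same pipeline has to be rerun on another machine.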

19. Experimentation, Training, Multi-Cloud

20.https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/

21.Kubeflow's Mission: Make it easy for everyone to develop, deploy, and manage portable, distributed ML on Kubernetes

22. 1. Kubeflow = cloud-native, multi-cloud solution for ML. 2. Kubeflow provides a platform for composable, portable, and scalable ML pipelines. 3. If you have a Kubernetes-conformant cluster, you can run Kubeflow.

23.Experimentation, Training, Cloud, Kubeflow

24.Goal: Low bar; High ceiling ● Low bar - make it super easy to get started a. Minimize number of K8s concepts users need to learn b. Optimize Kubeflow deployments with scaffolding for apps ● High ceiling - allow sysadmins to do complex customizations a. Extensibility has been critical to K8s success b. Users should be able to easily customize individual components

25.More Tools and Frameworks [Figure: the pipeline stages (Data ingestion, Data analysis, Data transformation, Data validation, Data splitting, Trainer, Building a model, Model validation, Training at scale, Roll-out, Serving, Monitoring, Logging) annotated with tools: TF.Transform, TF.Data, Numpy, Spark, TFJob, Jupyter, MXNet, PyTorch, TF.Serving, Seldon, Prometheus, TensorRT] ...And lots more I can’t fit into this slide

26.Kubeflow is initially targeting ML Engineers ● Target Persona ML Engineers ○ Responsible for productionizing ML ○ Enterprise concerns: reliability, scalability, security ■ Kubernetes is a natural choice ○ More DevOps expertise ● Data scientists/researchers are important but secondary at this stage ○ Build models ○ Less DevOps expertise ○ Kubernetes can be a tougher sell; they might be quite happy with a single machine with lots of resources

27. 3 Areas Of Development 1. Simple deployment of ML components on K8s a. kfctl b. GitOps for ML 2. K8s Native ML components/tools and integrations a. K8s packaging for TF/TFX components & libraries b. Katib c. Jupyter 3. Example Solutions a. Natural Language Code Search

28.Simple Deployment

29.Getting Started Is Difficult: Grab bag of components ● Cobble together an ML platform out of tens of components ● ML bits ○ Jupyter ○ TFJob/PyTorch ● K8s bits ○ Networking (Ambassador, Istio, CertManager) ○ GPU installers ● Cloud bits ○ K8s cluster ○ Storage

30.With Kubeflow it's O(1) commands or clicks
    ${KUBEFLOW_REPO}/scripts/kfctl.sh init ${KFAPP} --platform none
    cd ${KFAPP}
    ${KUBEFLOW_REPO}/scripts/kfctl.sh generate k8s
    ${KUBEFLOW_REPO}/scripts/kfctl.sh apply k8s
or via UI

31.Deploying notebooks ● Easily launch notebooks ● Access notebooks securely: https://kubeflow.acme.com/jupyter ● Shared storage (NFS) for collaboration ● Connect to data warehouse ● Enforce enterprise security policies

32.Enable GitOps for ML ● Fully declarative ● kfctl is a two-step process ○ Create configs ○ Apply configs ● GitOps reduces the toil of managing infrastructure ○ Automation (e.g. Weave Flux) keeps infrastructure up to date ○ Allows for automatic policy enforcement
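
The two-step "create configs, then apply configs" flow can be sketched in a few lines of Python. This is an illustration of the declarative idea only, not how kfctl is implemented; `generate`, `apply`, the config layout, and the app name `my-kfapp` are all hypothetical.

```python
import json

def generate(app_name, replicas):
    """Step 1: render declarative config; this text is what gets committed to Git."""
    config = {"app": app_name, "components": {"jupyter": {"replicas": replicas}}}
    return json.dumps(config, indent=2, sort_keys=True)

def apply(desired_text, live_state):
    """Step 2: make the live state match the committed config; return changed keys."""
    desired = json.loads(desired_text)
    changed = [key for key in desired if live_state.get(key) != desired[key]]
    live_state.update(desired)
    return changed

live = {}                       # stands in for the cluster's current state
committed = generate("my-kfapp", replicas=1)
first = apply(committed, live)
second = apply(committed, live)  # re-applying the same config changes nothing
print(first, second)
```

Because the second apply is a no-op, an automation loop can re-run it on every commit without side effects, which is the property GitOps tooling relies on.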

33.K8s Native Components and Integrations

34. TFJob
● Integrates TF distributed training and estimator API with K8s
● Use K8s to scale out training and leverage accelerators
● TF-specific controller takes care of managing all the K8s resources ○ K8s services ○ Pods
● Users benefit from K8s toolchain ○ kubectl for CLI ○ K8s dashboard for monitoring
    apiVersion: kubeflow.org/v1alpha2
    kind: TFJob
    metadata:
      name: tf-job-simple
      namespace: kubeflow
    spec:
      tfReplicaSpecs:
        Worker:            # replica type; the key is singular in the TFJob schema
          replicas: 3
          template:
            spec:
              containers:
              - image: acme/myjob
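
As a rough illustration, the same TFJob manifest can be built programmatically and emitted as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `tfjob_manifest` is hypothetical, and the replica key is written as `Worker` on the assumption that the v1alpha2 schema uses the singular form. A real cluster may require fields the slide omits, such as a container name.

```python
import json

def tfjob_manifest(name, image, workers, namespace="kubeflow"):
    """Build a TFJob manifest carrying the same fields as the slide's YAML."""
    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                # Assumption: replica type key is "Worker" (singular).
                "Worker": {
                    "replicas": workers,
                    "template": {"spec": {"containers": [{"image": image}]}},
                }
            }
        },
    }

manifest = tfjob_manifest("tf-job-simple", "acme/myjob", workers=3)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from a function makes it easy to vary the worker count or image per experiment while keeping the rest of the spec identical.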

35. TFServing: model push ≠ binary push ● Collaborating with TF to create a K8s native story for TFServing ● Adding Prometheus exporter for metrics ● Istio for telemetry and traffic splitting ● Opportunity to leverage K8s to simplify pushing models ○ Need to measure model quality ○ Global rollout needed to uniformly sample traffic across
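
The idea behind traffic splitting for a model rollout can be sketched without any service mesh: send a fixed percentage of requests to the new model version and the rest to the current one. In the talk this job is delegated to Istio; the hashing scheme, the function name `pick_version`, and the 10% canary share below are assumptions for illustration.

```python
import hashlib

def pick_version(request_id, canary_percent):
    """Route a stable share of requests to the canary model by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"

# Simulate 10,000 requests with 10% of traffic intended for the canary.
counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[pick_version(f"req-{i}", canary_percent=10)] += 1
print(counts)
```

Hashing the request id, rather than drawing a random number per request, keeps a given id on the same model version across retries, so quality measurements for the two versions are not mixed.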

36.Katib (HP Tuner) ● Pluggable microservice architecture for HP tuning ○ Different optimization algorithms ○ Different frameworks ● StudyJob (K8s CR) (example) ○ Hides complexity from user ○ No code needed to do HP tuning
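
To make "different optimization algorithms" concrete, here is a minimal random-search sketch, one of the simplest hyperparameter tuning strategies. It does not use Katib's API; the toy `objective`, the search ranges, and the trial count are hypothetical stand-ins for a real training run.

```python
import random

def objective(learning_rate, layers):
    """Toy stand-in for a training run; returns a loss to minimize."""
    return (learning_rate - 0.01) ** 2 + 0.001 * abs(layers - 4)

def random_search(trials, seed=0):
    """Sample hyperparameters at random and keep the best (loss, params) pair."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sample
            "layers": rng.randint(1, 8),
        }
        loss = objective(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best

best_loss, best_params = random_search(trials=50)
print(best_loss, best_params)
```

A tuning service moves this loop out of user code: each trial becomes a separate training job, and the sampling strategy becomes a pluggable component.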

37. K8s Native Jupyter ● JupyterHub is a "monolithic" application ○ Reverse proxy ○ Notebook launcher/manager ○ UI ○ Auth modules ● K8s Native ○ Use K8s to manage resources ○ Envoy/Istio provide proxy and auth ○ Just need a simple web app to create/delete K8s resources for notebooks [Diagram: JupyterHub monolith (Reverse Proxy, Auth, UI, Spawner) next to a K8s microservice architecture (UI, K8s API Server, Envoy/Istio)]

38.User Experience [Diagram steps: Deploy Kubeflow, Experiment in Jupyter, Build Docker Image, Train at Scale, Build Model Server, Deploy Model, Integrate Model into App, Operate]

39.Projects being developed within Kubeflow ● K8s CRDs for several ML frameworks ○ tf-operator, PyTorch Operator, caffe-2 ○ Horovod for TF ● KVC ○ Kubernetes volume controller ○ Efficiently manage data for ML workloads ● Katib ○ Hyperparameter tuning system; clone of Vizier (Google's HP tuning system) ● Docker images for ML ○ TFServing images ○ Curated Jupyter Notebook images

40.Projects integrated with Kubeflow ● Argo ○ CRD for workflows ● JupyterHub ○ Multi-user server for Jupyter notebooks ● Pachyderm ○ Deploy and manage multi-stage data pipelines while maintaining complete reproducibility and provenance ● SeldonIO ○ CRD and tooling for serving and deploying models ● Tensor2Tensor ○ Library of TensorFlow models and datasets for a variety of applications ● TFX Libraries ○ OSS libraries from Google's TensorFlow-based ML platform (TFX) ○ Currently available: TF Serving, TF Transform and TF Model Analysis (TFMA)

41.3 Core Principles

42.Open ● Why ○ Building an ML platform is too big a challenge to do alone ○ Kubernetes' success illustrates the value of building a broad, energetic community ● What this means ○ All members of the community have equal opportunity ■ Except: Google is currently sole owner of kubeflow.org domain ○ All test/release infrastructure is community owned ■ Release/test teams include members from multiple organizations ● Success will depend on everyone carrying water and chopping wood ○ # PRs per week is 2x # commits -> Need more reviewers

43.Low bar; high ceiling ● Low bar - make it super easy to get started ○ Minimize number of K8s concepts/APIs users need to learn just to get started ○ Optimize Kubeflow deployments ■ Work with sig-apps to define appropriate scaffolding for apps ○ Very active area in the community ● High ceiling - allow system administrators to do complex customizations ○ Extensibility has been critical to K8s success ○ Users should be able to easily customize individual components

44.Kubernetes Native ● Run anywhere Kubernetes runs ● Reuse K8s concepts/APIs; don't reinvent the wheel ● Hard dependency on K8s ○ Kubeflow will not invest in running on other platforms

45.Applying these Principles

46.How is Kubeflow K8s Native? ● Kubeflow uses K8s APIs and concepts ○ TFJob & other controllers don't hide K8s APIs ■ Use requests/limits for resource scheduling ■ Let users customize image, arguments, environment variables, etc. ○ Volumes for storage ● Kubeflow is managed declaratively, matching K8s best practices ○ Config intended to be checked into source control ○ Embracing GitOps ● Leveraging the K8s ecosystem ○ Use CRDs ○ Want to align with sig-apps app CRD for app management
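
Because the controllers do not hide the K8s API, a training container is described with ordinary pod-spec fields. The sketch below assembles such a container entry in Python. The field names (`image`, `args`, `env`, `resources.requests`, `resources.limits`, `nvidia.com/gpu`) are standard Kubernetes ones, while the helper `container_spec`, the container name, and the example values are assumptions for illustration.

```python
def container_spec(image, args, env, cpu, memory, gpus=0):
    """Build a container entry using standard Kubernetes pod-spec field names."""
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    if gpus:
        # GPUs are an extended resource and are declared under limits.
        resources["limits"]["nvidia.com/gpu"] = gpus
    return {
        "name": "tensorflow",   # hypothetical container name
        "image": image,
        "args": list(args),
        "env": [{"name": k, "value": v} for k, v in sorted(env.items())],
        "resources": resources,
    }

spec = container_spec(
    image="acme/myjob",
    args=["--epochs", "10"],
    env={"DATA_DIR": "/data"},
    cpu="2",
    memory="8Gi",
    gpus=1,
)
print(spec)
```

The scheduler uses the requests to place the pod, and the limits cap what the container may consume, so no ML-specific scheduling concept is needed.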

47. Can we reconcile K8s Native & Low bar? ● Hot topic in the community ● K8s has a steep learning curve for data scientists ● Can we make K8s approachable and avoid users falling off a cliff? ○ Learn as you go [Chart: K8s knowledge required versus task complexity, rising from "Run Kubeflow examples" to "Your models on your data"; what we want to avoid is a sudden jump into K8s logging, resources, RBAC, volumes, etc.]

48.• Data! x N • Data collection and storage • Data analysis and visualization • Data transformation • Data cleaning • Data validation • Data management Sad example

49.• Model Selection and Evaluation • Model selection • from linear to non-linear to deep learning… A LOT • Choose appropriate hyper-parameters • Learning rate, network layers, normalization, … A LOT • Model evaluation • First look at the curve • Tune the hyper parameters? • Need more data? • Chose the wrong model?

50.• Collaborative development
Alice: Can you help me on this problem? I am stuck! Want a second opinion and help!
Bob (delightedly): Where is your training code?
Alice: On my laptop.
Bob: Where are your model results?
Alice: On my laptop.
Bob: Where are your pre-processing scripts?
Alice: On my laptop.
Bob: Where is your data so I can test-run?
Alice: On my laptop.
Bob: Copy everything to me … doesn’t work at all

51.• Go-to-Production ○ Model to API ○ Requires collaboration: API design, deployment testing, engineering. Tony and Dereck: “I know Algorithm1, Algorithm2, Algorithm3…” “I know Docker, Kubernetes…” “Wait a minute, which service broke again?”

52.• Reproducibility. Tony and Dereck: “My model is 99% accuracy; sending to you via WeChat and you take care of the rest!” “WeChat transfer is slow today; will run it once received.” “Why only 97% in my workstation? Only 93% at the customer’s!”

53.• DevOps for ML (CI/CD) • Even worse, frequent retraining is needed as new data arrives.

54.● System and Model Ops ○ Monitor model effectiveness ○ Monitor data anomalies ○ Monitor system resource usage and anomalies
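
One simple way to monitor for data anomalies is to compare the mean of a live feature against the mean seen at training time. The sketch below flags a shift when the difference exceeds a chosen number of standard errors. The function name `drifted`, the threshold of 3, and the synthetic feature values are assumptions, and production monitoring would typically use richer statistics.

```python
from statistics import mean, stdev

def drifted(baseline, live, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` standard errors away."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(live) != mu
    standard_error = sigma / len(live) ** 0.5
    return abs(mean(live) - mu) / standard_error > threshold

baseline = [0.1 * (i % 10) for i in range(100)]   # feature values seen in training
same = [0.1 * (i % 10) for i in range(50)]        # live data, same distribution
shifted = [v + 0.5 for v in same]                 # live data with a shifted mean
print(drifted(baseline, same), drifted(baseline, shifted))
```

A check like this can run on each batch of serving traffic and raise an alert, or trigger retraining, when the input distribution moves.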
