How to Put TensorFlow on Kubernetes into Practice

Presents a complete deployment architecture for TensorFlow on Kubernetes, including code references, and highlights some approaches to debugging and tuning.

1.TensorFlow on Kubernetes @ vivo xidianwangtao@gmail.com

2.Agenda •  Distributed TensorFlow •  Why TensorFlow on Kubernetes •  How TensorFlow on Kubernetes •  Deploy Architecture •  Step By Step •  The Major Problems I Have Encountered •  Todo List

3.Outrageously large models Improving accuracy with up to 68 billion parameters https://www.cs.toronto.edu/~hinton/absps/Outrageously.pdf

4. Distributed TensorFlow Derek Murray @ TensorFlow DEV SUMMIT 2017 "Distributed TensorFlow" https://www.youtube.com/watch?time_continue=703&v=la_M6bCV91M

5.

6.

7.Distributed TensorFlow Model

8.In-graph Replication

9.Between-graph Replication

10.Async/Sync Training
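The difference between the two training modes can be illustrated with a toy example. The sketch below is not TensorFlow code: it assumes a single scalar weight, a squared-error loss, and two hypothetical workers, and it only shows how the update rule differs. In the synchronous step the gradients are averaged into one update; in the asynchronous step each worker's update is applied on arrival, even though it was computed from a stale copy of the weight.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def sync_step(w, shards, lr):
    """Synchronous: all workers read the same w, gradients are averaged,
    and a single update is applied."""
    grads = [grad(w, x, y) for x, y in shards]
    return w - lr * sum(grads) / len(grads)


def async_step(w, shards, lr):
    """Asynchronous: all workers read w at the start of the round, but
    each update is applied as it arrives, so later updates use a stale w."""
    stale_w = w
    for x, y in shards:
        w = w - lr * grad(stale_w, x, y)
    return w


# One (x, y) sample per hypothetical worker; the true relation is y = 2 * x.
shards = [(1.0, 2.0), (2.0, 4.0)]

w_sync = w_async = 0.0
for _ in range(50):
    w_sync = sync_step(w_sync, shards, lr=0.1)
    w_async = async_step(w_async, shards, lr=0.1)

print(w_sync, w_async)
```

Both variants converge on this toy problem; in real training the asynchronous mode trades gradient staleness for not having to wait on slow workers.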

11.Between-graph + Async
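With between-graph replication, every ps and worker task runs as its own process and needs to know the full cluster layout plus its own role. The following is a minimal sketch of how that layout could be generated. The host naming scheme, the port 2222, and the flag names `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index` are assumptions borrowed from common Distributed TensorFlow examples, not something the slides confirm. The resulting mapping has the shape that `tf.train.ClusterSpec` accepts.

```python
def make_cluster_spec(prefix, num_ps, num_workers, port=2222):
    """Return a {"ps": [...], "worker": [...]} mapping of host:port strings.

    The "<prefix>-<job>-<index>" host names are hypothetical; on Kubernetes
    they would typically be Service names.
    """
    return {
        "ps": [f"{prefix}-ps-{i}:{port}" for i in range(num_ps)],
        "worker": [f"{prefix}-worker-{i}:{port}" for i in range(num_workers)],
    }


def task_args(cluster, job_name, task_index):
    """Build the command-line flags for one task of the cluster."""
    if not 0 <= task_index < len(cluster[job_name]):
        raise ValueError(f"no task {task_index} in job {job_name!r}")
    return [
        "--ps_hosts=" + ",".join(cluster["ps"]),
        "--worker_hosts=" + ",".join(cluster["worker"]),
        f"--job_name={job_name}",
        f"--task_index={task_index}",
    ]


cluster = make_cluster_spec("mnist", num_ps=1, num_workers=2)
for flag in task_args(cluster, "worker", 1):
    print(flag)
```

Every task receives the same host lists and differs only in its job name and task index, which is what lets each worker build its own graph against shared parameter servers.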


13.Motivation •  TensorFlow Task •  GPU GPU •  Task •  Task •  HDFS Read •  TensorFlow

14.Kubernetes is Suitable •  ResourceQuota, LimitRanger •  GPU (only limits) •  PLEG •  EFK •  Read Glusterfs, Ceph) •  TensorFlow

15. HDFS vs GlusterFS vs Ceph: GlusterFS Read Performance is Best •  GlusterFS – 12GB/s •  HDFS – 3GB/s •  CephFS – 2GB/s http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf


17.GlusterFS + K8S + TF

18. HDFS + K8S + TF •  GCEPersistentDisk •  AWSElasticBlockStore •  AzureFile •  AzureDisk •  FC (Fibre Channel) •  FlexVolume •  Flocker •  NFS •  iSCSI •  RBD •  CephFS •  Cinder •  Glusterfs •  VsphereVolume •  Quobyte Volumes •  HostPath •  VMware Photon •  Portworx Volumes •  ScaleIO Volumes •  StorageOS

19. Job •  Kube-apiserver •  etcd •  Kube-scheduler •  Kube-controller-manager (JobController) •  .spec.completions •  .spec.parallelism •  .spec.activeDeadlineSeconds •  .spec.backoffLimit •  .spec.template.spec.restartPolicy: Never or OnFailure •  JobController: NewJobController, Run, syncJob, manageJob, Indexer, SatisfiedExpectations •  func (jm *JobController) syncJob(key string) (bool, error) •  func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error)
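To show where the Job fields listed on this slide sit in a manifest, here is a small sketch that builds a `batch/v1` Job as a Python dictionary. The function name, the Job name, the image, and the default values are hypothetical and chosen only for illustration. Note that `backoffLimit` and `activeDeadlineSeconds` belong to the Job spec, while `restartPolicy` belongs to the pod template.

```python
import json


def make_job(name, image, completions, parallelism,
             deadline_seconds=3600, backoff_limit=3, restart_policy="Never"):
    """Build a batch/v1 Job manifest as a plain dict (illustrative defaults)."""
    if restart_policy not in ("Never", "OnFailure"):
        # Job pod templates accept only these two restart policies.
        raise ValueError("restartPolicy must be Never or OnFailure")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,        # pods that must succeed
            "parallelism": parallelism,        # pods allowed to run at once
            "activeDeadlineSeconds": deadline_seconds,
            "backoffLimit": backoff_limit,     # retries before marking failed
            "template": {
                "spec": {
                    "restartPolicy": restart_policy,
                    "containers": [{"name": "tensorflow", "image": image}],
                }
            },
        },
    }


# Hypothetical name and image, for illustration only.
job = make_job("tf-worker-0", "registry.example/tensorflow:latest",
               completions=1, parallelism=1)
print(json.dumps(job, indent=2))
```

The JSON output can be saved to a file and submitted with `kubectl create -f`, since Kubernetes accepts JSON manifests as well as YAML.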


21.Components

22.


24.Step 1- •  User Node $AlgorithmName copy User Node `/var/www/html/$UserName/$AlgorithmName/` run.sh ( $AlgorithmName ) •  $AlgorithmName •  : User Node `/var/www/html/` httpd HTTP `http://$UserNodeIP:80/$UserName/$AlgorithmName`
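The surviving fragments of this step suggest that the algorithm directory is copied under the httpd document root on the User Node and is then reachable over HTTP. Under that reading, the mapping from user and algorithm name to the on-disk directory and the download URL could be sketched as follows. The helper names and the example values are hypothetical; only the path and URL patterns come from the slide.

```python
import posixpath
from urllib.parse import quote

WEB_ROOT = "/var/www/html"  # httpd document root named on the slide


def code_dir(user_name, algorithm_name):
    """Directory on the User Node that holds the algorithm code and run.sh."""
    return posixpath.join(WEB_ROOT, user_name, algorithm_name) + "/"


def code_url(user_node_ip, user_name, algorithm_name, port=80):
    """URL at which httpd serves that directory."""
    return (f"http://{user_node_ip}:{port}/"
            f"{quote(user_name)}/{quote(algorithm_name)}")


# Hypothetical values, for illustration only.
print(code_dir("alice", "mnist"))
print(code_url("10.0.0.5", "alice", "mnist"))
```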

25.Step 2- •  User Node `/opt/tensorflow/` tfcluster_template.yaml.jinja •  HDFS https://github.com/tensorflow/ecosystem/blob/master/render_template.py •  GlusterFS
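The slide refers to a Jinja template (`tfcluster_template.yaml.jinja`) rendered with `render_template.py` from the tensorflow/ecosystem repository. The sketch below illustrates the same idea, generating one manifest per task from a template, but it deliberately swaps in Python's standard-library `string.Template` instead of Jinja so that it has no dependencies. The template body, the image name, and the choice to render every task as a Job are assumptions for illustration and do not reproduce the author's template.

```python
from string import Template

# Hypothetical per-task template; the author's real template is not shown.
TASK_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: ${name}-${job}-${index}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: ${image}
        args: ["--job_name=${job}", "--task_index=${index}"]
""")


def render_cluster(name, image, num_ps, num_workers):
    """Render one YAML document per ps/worker task, joined by '---'."""
    docs = []
    for job, count in (("ps", num_ps), ("worker", num_workers)):
        for index in range(count):
            docs.append(TASK_TEMPLATE.substitute(
                name=name, image=image, job=job, index=index))
    return "---\n".join(docs)


manifest = render_cluster("demo", "registry.example/tensorflow:latest",
                          num_ps=1, num_workers=2)
print(manifest)
```

Keeping the cluster size as a render-time parameter means the same template can serve training runs of different sizes.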

26.

27.

28.

29.