How to Deploy TensorFlow on Kubernetes in Production
1 .TensorFlow on Kubernetes @ vivo xidianwangtao@gmail.com
2 .Agenda • Distributed TensorFlow • Why TensorFlow on Kubernetes • How TensorFlow on Kubernetes • Deploy Architecture • Step By Step • The Major Problems I Have Encountered • Todo List
3 .Outrageously large models Improving accuracy with up to 68 billion parameters https://www.cs.toronto.edu/~hinton/absps/Outrageously.pdf
4 .Distributed TensorFlow • Derek Murray @ TensorFlow Dev Summit 2017, "Distributed TensorFlow" • https://www.youtube.com/watch?time_continue=703&v=la_M6bCV91M
7 .Distributed TensorFlow Model
8 .In-graph Replication
9 .Between-graph Replication
10 .Async/Sync Training
11 .Between-graph + Async
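The difference between synchronous and asynchronous training can be shown with a toy example. The sketch below does not use TensorFlow: it assumes a single scalar parameter, a hypothetical quadratic loss, and one made-up data point per worker. Asynchronous updates are serialized in list order to keep the run deterministic, so gradient staleness is not modelled.

```python
# Toy illustration only: one scalar parameter, a hypothetical quadratic loss,
# and hand-picked per-worker data. No TensorFlow is involved.

def gradient(w, target):
    """Gradient of the loss 0.5 * (w - target)**2 with respect to w."""
    return w - target


def sync_step(w, targets, lr):
    """Synchronous update: every worker reads the same w, the gradients
    are averaged, and the parameter is updated once."""
    grads = [gradient(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads)


def async_step(w, targets, lr):
    """Asynchronous update: each worker's gradient is applied as soon as
    it arrives, so later workers read a parameter that earlier workers
    have already changed. Workers run in list order here (an assumption
    that keeps the example deterministic)."""
    for t in targets:
        w = w - lr * gradient(w, t)
    return w


def train(step_fn, steps=50, lr=0.1):
    """Run `steps` rounds with three hypothetical workers."""
    w = 0.0
    targets = [1.0, 2.0, 3.0]  # one made-up data point per worker
    for _ in range(steps):
        w = step_fn(w, targets, lr)
    return w
```

Both variants end up close to the mean of the worker targets. The synchronous one pays for a barrier on every step, while the asynchronous one applies three smaller updates per round and settles at a slightly different point.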
13 .Motivation • TensorFlow Task • GPU GPU • Task • Task • HDFS Read • TensorFlow
14 .Kubernetes is Suitable • ResourceQuota, LimitRanger • GPU (only limits) • PLEG • EFK • Read Glusterfs, Ceph) • TensorFlow
15 .HDFS vs GlusterFS vs CephFS • GlusterFS – 12GB/s • HDFS – 3GB/s • CephFS – 2GB/s • GlusterFS read performance is best • http://iopscience.iop.org/article/10.1088/1742-6596/513/4/042014/pdf
17 .GlusterFS + K8S + TF
18 .HDFS + K8S + TF • GCEPersistentDisk • AWSElasticBlockStore • CephFS • Cinder • AzureFile • Glusterfs • AzureDisk • VsphereVolume • FC (Fibre Channel) • Quobyte Volumes • FlexVolume • HostPath • Flocker • VMware Photon • NFS • Portworx Volumes • iSCSI • ScaleIO Volumes • RBD • StorageOS
19 .Job • Components: kube-apiserver, etcd, kube-scheduler, kube-controller-manager (JobController) • .spec.completions • .spec.parallelism • .spec.activeDeadlineSeconds • .spec.backoffLimit • .spec.template.spec.restartPolicy: Never or OnFailure • JobController: NewJobController, Run, syncJob, manageJob • func (jm *JobController) syncJob(key string) (bool, error) • func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error) • Indexer, SatisfiedExpectations
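To show where the Job fields listed on this slide sit in a manifest, here is a minimal sketch that builds a `batch/v1` Job as a plain dict and prints it as JSON, which `kubectl` accepts alongside YAML. The helper `build_job`, the job name, the image, and all numeric values are hypothetical placeholders, not settings from the talk.

```python
import json


def build_job(name, image, completions, parallelism,
              active_deadline_seconds, backoff_limit,
              restart_policy="Never"):
    """Return a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    # A Job's pod template only allows these two restart policies.
    if restart_policy not in ("Never", "OnFailure"):
        raise ValueError("a Job's restartPolicy must be Never or OnFailure")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,        # successful pods required
            "parallelism": parallelism,        # pods allowed to run at once
            "activeDeadlineSeconds": active_deadline_seconds,
            "backoffLimit": backoff_limit,     # retries before the Job fails
            "template": {
                "spec": {
                    "restartPolicy": restart_policy,
                    "containers": [{"name": name, "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    job = build_job("tf-worker-0", "example/tensorflow:latest", 1, 1, 3600, 3)
    print(json.dumps(job, indent=2))
```

Note that `backoffLimit` and `activeDeadlineSeconds` belong to the Job spec, while `restartPolicy` belongs to the pod template spec.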
21 .Components
24 .Step 1 • On the User Node, copy $AlgorithmName to `/var/www/html/$UserName/$AlgorithmName/`, together with run.sh ( $AlgorithmName ) • $AlgorithmName • The User Node serves `/var/www/html/` via httpd over HTTP: `http://$UserNodeIP:80/$UserName/$AlgorithmName`
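The copy in Step 1 can be sketched as a small helper that places an algorithm directory under the web root and returns the URL from which httpd would serve it. The function name `publish_algorithm` is hypothetical, and the check that `run.sh` sits at the top of the algorithm directory is an assumption drawn from the slide, not a confirmed requirement.

```python
import shutil
from pathlib import Path


def publish_algorithm(src_dir, web_root, user_name, algorithm_name,
                      user_node_ip, port=80):
    """Copy an algorithm directory under the web root and return its URL.

    Hypothetical helper: assumes run.sh is at the top level of src_dir.
    """
    src = Path(src_dir)
    if not (src / "run.sh").is_file():
        raise FileNotFoundError("expected run.sh in " + str(src))
    dest = Path(web_root) / user_name / algorithm_name
    # dirs_exist_ok needs Python 3.8+; it lets a re-publish overwrite files.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return "http://{}:{}/{}/{}".format(
        user_node_ip, port, user_name, algorithm_name)
```

On the User Node the web root would be `/var/www/html/`; any other directory works for a local dry run.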
25 .Step 2- • User Node `/opt/tensorflow/` tfcluster_template.yaml.jinja • HDFS https://github.com/tensorflow/ecosystem/blob/master/render_template.py • GlusterFS
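A cluster template such as `tfcluster_template.yaml.jinja` has to produce one entry per parameter server and per worker. The plain-Python sketch below stands in for that expansion and is not the template itself. It assumes one Service per task named `<name>-<job>-<index>` on port 2222, and it assumes the training script defines the flags `--ps_hosts`, `--worker_hosts`, `--job_name`, and `--task_index`; these are common conventions in distributed TensorFlow examples, not built-in TensorFlow options.

```python
def cluster_hosts(name, ps_count, worker_count, port=2222):
    """Build host lists, assuming one Service per task named
    <name>-<job>-<index> (a hypothetical naming scheme)."""
    return {
        job: ["{}-{}-{}:{}".format(name, job, i, port) for i in range(count)]
        for job, count in (("ps", ps_count), ("worker", worker_count))
    }


def task_args(hosts):
    """Yield (task_name, argv) for every ps and worker task.

    The flag names are assumed to be defined by the training script.
    """
    ps = ",".join(hosts["ps"])
    workers = ",".join(hosts["worker"])
    for job in ("ps", "worker"):
        for index, host in enumerate(hosts[job]):
            task_name = host.split(":")[0]
            yield task_name, [
                "--ps_hosts=" + ps,
                "--worker_hosts=" + workers,
                "--job_name=" + job,
                "--task_index=" + str(index),
            ]


if __name__ == "__main__":
    # Placeholder cluster: one parameter server and two workers.
    for task_name, argv in task_args(cluster_hosts("mnist", 1, 2)):
        print(task_name, " ".join(argv))
```

Every task receives the same two host lists and differs only in its job name and task index, which is what lets a single template describe the whole cluster.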