This presentation introduces a new feature for running Spark on Kubernetes.


1. Weiting Chen (weiting.chen@intel.com), Zhen Fan (fanzhen@jd.com)

2. ABOUT US
Zhen Fan - Software Development Engineer at JD.com. Zhen focuses on big data and machine learning platform development and management.
Weiting Chen (William) - Senior Software Engineer at Intel. Weiting works in Intel's Software and Services Group on big-data-on-cloud solutions; one of his responsibilities is helping customers integrate big data solutions into their cloud infrastructure.
Optimization Notice: Copyright © 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

3. INTEL NOTICE & DISCLAIMER
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications, and roadmaps. The products and services described may contain defects or errors known as errata, which may cause deviations from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel, the Intel logo, and Intel® are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright © 2018 Intel Corporation.

4. AGENDA
BACKGROUND
SPARK ON KUBERNETES
- Why use Spark-on-K8s
- How it works
JD.COM CASE STUDY
- JD.com's MoonShot
- Network Choice
- Storage Choice
SUMMARY


6. ABOUT SPARK-ON-KUBERNETES
https://github.com/apache-spark-on-k8s/spark
Spark* on Kubernetes* (K8s) is a project proposed by companies including Bloomberg, Google, Intel, Palantir, Pepperdata, and Red Hat. The goal is to bring native support for Spark to use Kubernetes as a cluster manager, just as it uses Spark Standalone, YARN*, or Mesos*. The feature was merged into the Spark 2.3.0 release (SPARK-18278).

7. WHY JD.COM CHOSE SPARK-ON-K8S
Heterogeneous computing: CPU + GPU + FPGA. Customers are asking for a unified cloud platform to manage their applications. Based on Kubernetes*, we can easily set up a platform that supports CPU, GPU, and FPGA resources for Big Data/AI workloads.

8. HETEROGENEOUS CLOUD SOLUTION
[Architecture diagram, four layers. USER INTERFACE: web service, command line, Jupyter*/Zeppelin*. COMPUTING FRAMEWORK: Spark* (SQL, Streaming, MLlib*, BigDL*), TensorFlow*/Caffe*, Storm*/Flink*, Hadoop*. CONTAINER CLUSTER: Docker* containers managed by a Kubelet* on each server. HARDWARE RESOURCE: servers with CPU, GPU, and FPGA resources plus storage.]


10. SPARK ON DOCKER SOLUTIONS
Solution 1 - Spark* Standalone on Docker*
- Run a Spark standalone cluster in Docker.
- Two-tier resource allocation (K8s -> Spark Cluster -> Spark Applications).
- Less effort to migrate an existing architecture into a container environment.
Solution 2 - Spark on Kubernetes*
- Run Spark natively on Kubernetes, as with Spark Standalone, YARN, or Mesos.
- Single-tier resource allocation (K8s -> Spark Applications) for higher utilization.
- Requires rewriting the resource-allocation logic to go through K8s.

11. SOLUTION 1 - SPARK STANDALONE ON DOCKER
[Diagram: the Kubernetes* Master schedules a Spark Master pod and Spark Slave pods onto the Kubelets* (Steps 1-2); spark-submit sends App1 and App2 to the Spark Master (Step 3); the Spark Master places App1/App2 executors inside the Spark Slave pods (Step 4).]

12. SOLUTION 2 - SPARK ON KUBERNETES
[Diagram: spark-submit sends App1 and App2 to the Kubernetes Master (Step 1); the master creates an App1/App2 Driver pod on a Kubelet* (Steps 2-3); each driver then requests its own executor pods directly from Kubernetes (Step 4).]

13. HOW TO USE SPARK ON K8S
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://http://127.0.0.1:8080 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=5 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=4g \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=localhost:5000/spark-driver \
  --conf spark.kubernetes.executor.docker.image=localhost:5000/spark-executor \
  --conf spark.kubernetes.initcontainer.docker.image=localhost:5000/spark-init \
  --conf spark.kubernetes.resourceStagingServer.uri=http://$ip:31000 \
  hdfs://examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar
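The long option list above is easy to mistype; it can be generated from a plain config map. A minimal Python sketch (the helper name and the trimmed config set are illustrative, not part of Spark):

```python
# Build the spark-submit invocation shown above from a dict,
# so per-cluster settings live in one place.
conf = {
    "spark.executor.instances": "5",
    "spark.executor.cores": "4",
    "spark.executor.memory": "4g",
    "spark.app.name": "spark-pi",
    "spark.kubernetes.driver.docker.image": "localhost:5000/spark-driver",
    "spark.kubernetes.executor.docker.image": "localhost:5000/spark-executor",
}

def build_submit_command(master, main_class, jar, conf):
    """Return the spark-submit argv as a list of strings."""
    cmd = ["bin/spark-submit", "--deploy-mode", "cluster",
           "--class", main_class, "--master", master]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    return cmd + [jar]

cmd = build_submit_command(
    "k8s://http://127.0.0.1:8080",
    "org.apache.spark.examples.SparkPi",
    "hdfs://examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar",
    conf,
)
print(" ".join(cmd))
```

Joining the argv with " \\\n  " instead of a space reproduces the multi-line form on the slide.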

14. DATA PROCESSING MODEL
Storage plan for Spark* on K8s*. The design rule is based on whether the data must be persisted.
spark.local.dir (for Spark data shuffling): use an ephemeral volume. Today this uses docker-storage with different storage backends; EmptyDir support is WIP.
File Staging Server (for sharing data such as jars or dependency files between computing nodes): today this uses docker-storage; Persistent Volume (PV) support is WIP.
PATTERN 1 - Internal HDFS*: use HDFS as the file-sharing server; HDFS runs on the same hosts as the virtual cluster, giving elasticity to add or remove compute nodes on request.
PATTERN 2 - External HDFS: use HDFS as the file-sharing server; HDFS runs outside the virtual cluster as a long-running cluster to make sure data is persisted. Please refer to Spark and HDFS.
PATTERN 3 - Object Store: launch a File Staging Server to share data between nodes; input and output data can be put in an object store, streaming data directly via object-level storage such as Amazon S3 or Swift. For local storage support, please refer to PR-350.
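The "ephemeral volume" idea for spark.local.dir maps onto a K8s emptyDir. A minimal sketch of an executor pod spec as a Python dict, assuming illustrative pod, image, and path names (the slide notes EmptyDir support was still WIP at the time):

```python
# Sketch of a K8s pod spec (as a Python dict) that mounts an emptyDir
# volume at the Spark local directory, so shuffle data stays ephemeral
# and disappears with the pod.
executor_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "spark-executor-1"},  # illustrative name
    "spec": {
        "containers": [{
            "name": "executor",
            "image": "localhost:5000/spark-executor",
            # Point Spark's scratch space at the mounted volume.
            "env": [{"name": "SPARK_LOCAL_DIRS", "value": "/tmp/spark-local"}],
            "volumeMounts": [{"name": "spark-local",
                              "mountPath": "/tmp/spark-local"}],
        }],
        # emptyDir lives exactly as long as the pod: nothing is persisted.
        "volumes": [{"name": "spark-local", "emptyDir": {}}],
    },
}
print(executor_pod["spec"]["volumes"])
```

Serialized to YAML, this is the shape a driver (or an operator) would hand to the Kubernetes API.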

15. KEY FEATURES
- Cluster mode is supported; client mode support is under review.
- File staging in local storage, HDFS, or via a File Staging Server container.
- Scala, Java, and PySpark support.
- Static and dynamic allocation for executors.
- HDFS running inside K8s or externally.
- Support for Spark 2.3.
- Pre-built Docker images.

16. STATIC RESOURCE ALLOCATION
Resources are allocated up front and cannot change while the executors are running. Static resource allocation uses local storage (docker-storage) for data shuffle; EmptyDir in K8s is used for this temporary shuffle data.
[Diagram: spark-submit sends App1 to the Kubernetes Master (Step 1); the master creates the App1 Driver pod (Steps 2-3), which requests App1 Executor pods (Step 4); each executor shuffles through docker-storage on its host.]

17. DYNAMIC RESOURCE ALLOCATION
Resources are allocated at the start, but applications can change them at runtime. Dynamic resource allocation uses a shuffle service for data shuffle. There are two methods: the first runs the shuffle service in its own pod; the second runs the shuffle service as a container alongside an executor.
[Diagram: the same submit flow as static allocation, with Shuffle Service pods running next to the App1 Executor pods on each Kubelet.]
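The second method above amounts to a sidecar pattern: executor and shuffle service share one pod and one volume, so shuffle files outlive any single executor process. A sketch under assumed names and images (none of these identifiers come from the project):

```python
# Sketch: executor pod with a shuffle-service sidecar container.
# Both containers mount the same emptyDir, so the shuffle service can
# keep serving shuffle files while dynamic allocation scales executors.
shuffle_mount = {"name": "shuffle-dir", "mountPath": "/var/spark/shuffle"}

executor_with_shuffle = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app1-executor-1"},  # illustrative
    "spec": {
        "containers": [
            {"name": "executor",
             "image": "localhost:5000/spark-executor",
             "volumeMounts": [shuffle_mount]},
            {"name": "shuffle-service",        # the sidecar (2nd method)
             "image": "localhost:5000/spark-shuffle",
             "volumeMounts": [shuffle_mount]},
        ],
        "volumes": [{"name": "shuffle-dir", "emptyDir": {}}],
    },
}
print(len(executor_with_shuffle["spec"]["containers"]), "containers share the shuffle volume")
```

The first method (a standalone shuffle-service pod per node) trades this co-location for independent lifecycle management.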


19. JD.com (NASDAQ: JD)
Founded in 2004 in Beijing by CEO Richard Liu. The largest online retailer in China and a member of the Fortune Global 500. Business spans e-commerce, Internet finance, logistics, cloud computing, and smart technology. A technology-driven company with an ABC strategy. Joybuy.com, an affiliate of JD.com, serves US customers.

20. JD.com MOONSHOT
JD has used K8s* for cloud infrastructure management for several years and would like to use K8s to manage all computing resources, including CPU, GPU, FPGA, etc.
- Targets all AI workloads, using the same cluster for training and inference.
- Spans multiple machine learning frameworks, including Caffe, TensorFlow, XGBoost, MXNet, BigDL, etc.
- Optimizes workloads for different resource allocations.
- Multi-tenancy support through separate user accounts and resource pools.
Reference: https://mp.weixin.qq.com/s?__biz=MzA5Nzc2NDAxMg%3D%3D&mid=2649864623&idx=1&sn=f476db89b3d0ec580e8a63ff78144a37

21. MOONSHOT ARCHITECTURE
[Architecture diagram, top to bottom. Applications: image recognition, NLP, finance solutions, public cloud. Computing Task: Spark engine (SQL, Streaming, MLlib, BigDL, DeepLearning4j) plus TensorFlow*, Caffe*, MXNet*, XGBoost*. Container Cluster: Docker* + Kubernetes*. Infrastructure: CPU, GPU, FPGA; Ethernet, InfiniBand, Omni-Path; SSD, HDD; network file system. Side services: management center (security, authority, task and procedure management), monitor center, logging center.]


23. TYPES OF CONTAINER NETWORK
Bridge: the default network (docker0) in Docker*. A Linux bridge provides a host-internal network on each host and leverages iptables for NAT and port mapping. It is simple and easy, but performs poorly.
Host: the container shares its network namespace with the host. This gives high performance without NAT, but is limited by port conflicts.
Overlays: overlays use networking tunnels (such as VXLAN) to communicate across hosts. An overlay network can separate networks by project.
Underlays: underlays expose host interfaces directly to containers running on the host, with support for popular drivers such as MACvlan and IPvlan. Other underlay approaches include Direct Routing, Fan Networking, and Point-to-Point.

24. NETWORK SOLUTIONS
Flannel*: a simple, easy-to-configure layer-3 network fabric designed for K8s. It runs flanneld on each host to allocate subnets and uses etcd to store network configuration. Flannel supports several backends, including VXLAN, host-gw, and UDP.
Weave*: creates a virtual network that connects Docker containers across multiple hosts and enables their automatic discovery.
OpenVSwitch*
Calico*: an approach to virtual networking and network security for containers, VMs, and bare-metal services that provides a rich set of security-enforcement capabilities on top of a highly scalable and efficient virtual network fabric. Calico uses BGP to set up the network and also supports IPIP to build a tunnel network.

25. WHY CALICO?
No overlay required: little overhead compared to bare metal. An overlay network (encapsulating packets inside an extra IP header) is sometimes an option, not a must; using Calico with BGP is much faster.
Simple & scalable: the architecture is simple, and so is the deployment. We can easily deploy thousands of nodes in K8s using a YAML file.
Policy-driven network security: in many JD.com scenarios, multi-tenancy requires network isolation. Calico lets developers and operators define fine-grained network policy, such as allowed or blocked connections.
Widely deployed and proven at scale: we leverage the experience of other big companies who share their issues in the community, which was very valuable at the start of MoonShot. Calico has since been verified in our production environment.

26. NETWORK PERFORMANCE RESULT
All scenarios use the ab command to connect to an nginx* server at different IP addresses:
ab -n 1000000 -c 100 -H "Host: nginx.jd.local" 172.20.141.72:80/index.html

No. | Scenario | Concurrency # | Total Time (s) | Requests per Second | Response Time (ms)
1 | Client -> Nginx | 50 | 50.044 | 19982 | 0.05
2 | Weave: Client -> iptables -> Weave -> Nginx container | 50 | 132.839 | 7527 | 0.133
3 | Calico with IPIP: Client -> iptables -> Calico -> Nginx container | 50 | 111.136 | 8998 | 0.111
4 | Calico with BGP: Client -> iptables -> Calico -> Nginx container | 50 | 59.218 | 16886 | 0.059

JD.com picked Calico because it delivers performance close to bare metal.
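The throughput and latency columns follow directly from the request count (-n 1000000) and the total time; a quick check in Python:

```python
# Derive requests/second and mean time per request (ms) from the
# ab run for each scenario in the table above.
N_REQUESTS = 1_000_000

scenarios = {
    "baseline":     50.044,
    "weave":       132.839,
    "calico_ipip": 111.136,
    "calico_bgp":   59.218,
}

for name, total_s in scenarios.items():
    rps = N_REQUESTS / total_s                 # requests per second
    ms_per_req = total_s / N_REQUESTS * 1000   # mean time per request, ms
    print(f"{name:12s} {rps:8.0f} req/s  {ms_per_req:.3f} ms")
```

The derived figures match the table: Calico with BGP reaches roughly 85% of bare-metal throughput, while the tunneled options (Weave, Calico with IPIP) lose more than half.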


28. STORAGE CHOICES
Separate compute and storage clusters:
- Use Kubernetes to allocate resources for compute.
- Use a stand-alone HDFS cluster for data persistence.
- The data-locality impact depends on the workload type.

29. DATA LOCALITY IMPACT

Workload | Type | Locality | Data Size | Cluster Size | Network | Execution Time | Relative | Setup
Terasort | IO | Local | 320GB | 5 | 1Gb | 2119.926 s | 1x | 5 Spark + 3 Hadoop
Terasort | IO | Remote | 320GB | 5 | 1Gb | 4212.029 s | 1.98x | 5 Spark + 3 Hadoop
Terasort | IO | Local | 320GB | 5 | 10Gb | 500.198 s | 1x | 5 Spark + 3 Hadoop
Terasort | IO | Remote | 320GB | 5 | 10Gb | 548.549 s | 1.10x | 5 Spark + 3 Hadoop
Kmeans | CPU | Local | 240GB | 5 | 10Gb | 1156.235 s | 1x | 5 Spark + 3 Hadoop
Kmeans | CPU | Remote | 240GB | 5 | 10Gb | 1219.138 s | 1.05x | 5 Spark + 3 Hadoop

Note 1: this test uses a 5-node bare-metal cluster.
Note 2: 4 SATA SSDs per Spark and Hadoop node.
Note 3: performance may vary with configuration, including the number of disks, network bandwidth, and platform.
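The "Relative" column is simply the ratio of remote to local execution time; recomputing it from the measured times:

```python
# Remote-vs-local slowdown for each (workload, network) pair in the table.
runs = [
    ("Terasort, 1Gb",  2119.926, 4212.029),
    ("Terasort, 10Gb",  500.198,  548.549),
    ("Kmeans, 10Gb",   1156.235, 1219.138),
]

for name, local_s, remote_s in runs:
    slowdown = remote_s / local_s
    print(f"{name:15s} remote is {slowdown:.2f}x local")
```

The ratios reproduce the slide's 1.98x / 1.10x / 1.05x figures: the locality penalty is large for IO-bound Terasort on a 1Gb network but nearly vanishes with 10Gb networking or a CPU-bound workload like Kmeans.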