京东 陈韦廷&范振 - 《京东如何基于容器打造高性能及效率的大数据平台》

展开查看详情

1. 京东如何基于容器 打造高性能及效率的大数据平台 Zhen Fan fanzhen@jd.com Weiting Chen weiting.chen@intel.com

2. INTEL NOTICE & DISCLAIMER No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm. Intel, the Intel logo, Intel® are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others Copyright © 2018 Intel Corporation. 4

3. AGENDA BACKGROUND SPARK ON KUBERNETES - Why use Spark-on-K8s - How it works - Current Status & Issues JD.COM CASE STUDY - JD.com’s MoonShot - Network Choice - Storage Choice SUMMARY 5

4.BACKGROUND 6

5. ABOUT SPARK-ON-KUBERNETES • https://github.com/apache-spark-on-k8s/spark • Spark* on Kubernetes*(K8s) is a new project proposed by the companies including Bloomberg, Google, Intel, Palantir, Pepperdata, and Red Hat. • The goal is to bring native support for Spark to use Kubernetes as a cluster manager like Spark Standalone, YARN*, or Mesos*. • The feature is planning to be put into Spark 2.3.0 release(SPARK- 18278). 7

6.WHY JD.COM CHOOSE SPARK-ON-K8S Heterogeneous Computing CPU + GPU + FPGA Customers are asking to use an unified cloud platform to manage their applications. Based on Kubernetes*, we can ease to set up a platform to support CPU, GPU, as well as FPGA resources for Big Data/AI workloads. 8

7. HETEROGENEOUS CLOUD SOLUTION USER Web Service Command Line Jupyter*/Zeppelin* INTERFACE Spark* Spark BigDL* Mllib* SQL Streaming COMPUTING TensorFlow*/ Storm* Hadoop* FRAMEWOR Caffe* /Flink* K Spark Docker* Docker Docker Docker CONTAINER CLUSTER Kubelet* Kubelet Kubelet Kubelet Servers Servers Servers Servers HARDWARE RESOURCE CPU CPU CPU CPU CPU CPU FPGA FPGA CPU CPU GPU GPU Storage Storage 9

8.SPARK ON KUBERNETES 10

9. SPARK ON DOCKER SOLUTIONS • Solution1 - Spark* Standalone on Docker* - Run Spark standalone cluster in Docker. - Two-tiers resource allocation(K8s->Spark Cluster->Spark Applications). - Less efforts to migrate existing architecture into container environment. • Solution2 - Spark on Kubernetes* - Use native way to run Spark on Kubernetes like Spark Standalone, YARN, or Mesos. - Single tier resource allocation(K8s->Spark Applications) for higher utilization. - Must re-write the entire logical program for resource allocation via K8s. 11

10.SOLUTION1 - SPARK STANDALONE ON DOCKER Spark- Kubelet Kubelet* Step3 Submit App1 Spark Master Spark Slave Pod Step1 Spark Slave Pod App1 Executor Step4 App1 Executor App2 Executor Step2 Kubernetes* Step2 Master Kubelet Kubelet Spark Slave Pod Spark Slave Pod App1 Executor App2 Executor Spark- App2 Executor Submit App2 12

11. SOLUTION2 - SPARK ON KUBERNETES Spark- Kubelet Kubelet* Submit Step2 App1 App1 Executor App1 Driver Pod Pod Step1 Step3 Step4 App2 Executor Pod Kubernetes Master Kubelet Kubelet App1 Executor Step4 Pod App2 Executor Spark- App2 Driver Pod Pod Submit App2 13

12. HOW TO USE SPARK ON K8S # bin/spark-submit \ --deploy-mode cluster \ --class org.apache.spark.examples.SparkPi \ --master k8s://http://127.0.0.1:8080 \ --kubernetes-namespace default \ --conf spark.executor.instances=5 \ --conf spark.executor.cores=4 \ --conf spark.executor.memory=4g \ --conf spark.app.name=spark-pi \ --conf spark.kubernetes.driver.docker.image=localhost:5000/spark-driver \ --conf spark.kubernetes.executor.docker.image=localhost:5000/spark-executor \ --conf spark.kubernetes.initcontainer.docker.image=localhost:5000/spark-init \ --conf spark.kubernetes.resourceStagingServer.uri=http://$ip:31000 \ hdfs://examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 14

13. KEY FEATURES • Support Cluster Mode • Client Mode Support is under reviewing. • Support File Staging in local, HDFS, or running a File Stage Server container. • Support Scala, Java, and PySpark. • Support Static and Dynamic Allocation for Executors. • Support running HDFS inside K8s or externally. • Support for Kubernetes 1.6 - 1.7 • Pre-built docker images 15

14. DATA PROCESSING MODEL PATTERN 1: PATTERN 2: PATTERN 3: Storage Plan Internal HDFS* External HDFS Object Store for Spark* on K8s* Virtual Cluster Virtual Virtual The design rule is based on “whether the data must be Cluster Cluster persisted”. spark.local.dir: Computing HDFS Computing Computing For Spark Data Shuffling. Task Task Task Use Ephemeral Volume. Now it uses docker-storage Docker1 Docker2 with diff. storage backend. Docker1 HDFS Docker1 Object Store EmptyDir is WIP. Host Host Host Host Host File Staging Server: For sharing data such as Jar Use HDFS as file sharing Use HDFS as file sharing Launch a File Staging Server to or dependence file between server. HDFS runs in the same server. share data between nodes. computing nodes. host to give elasticity to HDFS runs outside in a long- Input and Output data can put in Now it uses docker-storage. add/reduce compute nodes by running cluster to make sure an object store Local Storage support in request. data is persisted. Streaming data directly via Persist Volume(PV) is WIP. object level storage like Please refer to Spark and Please refer to PR-350 Amazon S3, Swift. HDFS. 16

15. STORAGE SUPPORT • Spark* Shuffle: Uses Ephemeral Volumes Spark Docker • Docker* Storage: Use devicemapper Executor Storage • Spark Shuffle • Shared Volumes: • File Staging 1. #439 Use EmptyDir for File Staging to share jar file. File 2. Local in Spark Executors(Docker Storage) Staging 3. Remote HDFS* Server • File Staging 4. Create a Staging Server Container • Persistent Volumes(Ongoing): HDFS/ GlusterF 1. #306 Use PV S 2. Input/Output Data • File Staging 3. Remote HDFS • Input Data 4. Remote Object Storage such as GlusterFS* • Output Data 17

16. STATIC RESOURCE ALLOCATION The resources are allocated in the beginning and cannot change during the executors are running. Static resource allocation uses local storage(docker-storage) for data shuffle. Kubelet Spark- Kubelet Step2 Submit App1 App1 Executor App1 Driver Pod Pod Step1 Step3 Step4 docker storage Kubernetes Master Kubelet Step4 Kubelet App1 Executor Step4 App1 Executor Pod Pod Use EmptyDir in K8s for this temporary data shuffle. docker docker storage storage 18

17. DYNAMIC RESOURCE ALLOCATION The resources are allocated in the beginning, but applications can change the resource in run time. Dynamic resource allocation uses shuffle service container for data shuffle. Kubelet Spark- Kubelet Step2 Submit App1 App1 Executor App1 Driver Pod Pod Step1 Step3 Step4 Shuffle Service Pod Kubernetes Master Kubelet Step4 Kubelet App1 Executor Step4 App1 Executor Pod Pod There are two implementations: 1st is to run shuffle service in a pod. Shuffle Service Shuffle Service 2nd is to run shuffle service as a container with a Pod Pod executor. 19

18. CURRENT STATUS & ISSUES • Spark* Shell for Client Mode hasn’t verified yet. • Spark Cluster for Long Time Job • Data Locality Support • Storage Backend Support • Container Launch Time may take too long • Performance Issues • Reliability 20

19.JD.COM CASE STUDY 21

20. JD.com (NASDAQ: JD) • Founded in 2004 in Beijing by CEO, Richard Liu. • Largest online retailer in China • Member of the Fortune Global 500 • Business including e-commerce, Internet finance, logistics, cloud computing and smart technology • Technology-driven company, ABC strategy • Joybuy.com for US customers - affiliate to JD.com 22

21. JD.com MOONSHOT • JD has used K8s* as cloud infrastructure management for several years. • JD would like to use K8s to manage all the computing resources including CPU, GPU, FPGA, …etc. • Target for all AI workloads; Using the same cluster for training/inference. • Across multiple Machine Learning framework including Caffe, TensorFlow, XGBoost, MXNet, BigDL …etc. • To optimize workloads for different resource allocation. • Multi-tenancy support by different user accounts and resource pool. Reference: https://mp.weixin.qq.com/s?__biz=MzA5Nzc2NDAxMg%3D%3D&mid=2649864623&idx=1&sn=f476db89b3d0ec580e8a63ff78144a37 23

22. MOONSHOT ARCHITECTURE Management Image Security Center Applications Recognition NLP Solutions Finance Public Cloud Authority Mgmt. Spark Strea DeepLe BigDL MLlib Computing XGBoost SQL ming arning4j Task TensorFlow* Caffe* MXNet* Mgmt. Engine * Spark Procedure Mgmt. Container Docker* + Kubernetes* Cluster Monitor Center Network File System Logging Infrastructure CPU GPU FPGA Omini- Center Ethernet InfiniBand SSD HDD Path 24

23.NETWORK CHOICE by JD.com 25

24. TYPES OF CONTAINER NETWORK • Bridge Bridge is the default network(docker0) in Docker*. Linux bridge provides a host internal network for each host and leverages iptables for NAT and port mapping. It is simple and easy, but with bad performance. • Host Container shares its network namespace with the host. This way provides high performance without NAT support, but limits with port conflict issue. • Overlays Overlays use networking tunnels(such as VXLAN) to communicate across hosts. Overlay network provides the capability to separate the network by projects. • Underlays Underlays expose host interfaces directly to containers running on the host. It supports many popular drivers like MACvlan, IPvlan, …etc. Some other ways via Underlay network are Direct Routing, Fan Networking, Point-to-Point. 26

25. NETWORK SOLUTIONS • Flannel* A simple and easy to configure layer 3 network fabric designed for K8s. It runs flanneld on each host to allocate subnet and uses etcd to store network configuration. Flannel supports several backends including VXLAN, host-gw, UDP, …etc. • Calico* An approach to virtual networking and network security for containers, VMs, and bare metal services, which provides a rich set of security enforcement capabilities running on top of a highly scalable and efficient virtual network fabric. Calico uses BGP to set up the network and it also supports IPIP methods to build up a tunnel network. • Weave* Weave creates a virtual network that connects Docker containers across multiple hosts and enables their automatic discovery. • OpenVSwitch* • Others 27

26. Why CALICO? • No overlay required Little overhead comparing to bare metal. Sometimes, overlay network(encapsulating packets inside an extra IP header) is an option, not MUST. • Simple & Scalable The architecture is simple, the deployment is simple as well. We can easily deploy thousands of nodes in k8s by using yaml file. • Policy-driven network security In many scenarios of JD.com, for example, multi-tenancy is necessary to make network isolation. Calico enables developers and operators to easily define network policy with fine granularity such as allowed or blocked connections. • Widely deployed, and proven at scale We leverage the experience from other big companies who share their issues in the community. These experience are very valuable for us at the very beginning of moonshot. Fortunately, Calico has passed the verified in our production environment. 28

27. NETWORK PERFORMANCE RESULT All scenarios use ab command to connect to nginx* server with different IP address. “ab -n 1000000 -c 100 -H"Host: nginx.jd.local" 172.20.141.72:80/index.html “ Concurrency Request Waiting No. Scenario Total Time(s) # per Second Time(ms) 1 Client -> Nginx 50 50.044 19982 0.05 Weave: 2 50 132.839 7527 0.133 Client -> iptables -> Weave -> Pod Calico with IPIP: 3 50 111.136 8998 0.111 Client -> iptables -> Calico -> Pod Calico with BGP: 4 50 59.218 16886 0.059 Client -> iptables -> Calico -> Pod JD.com decides to pick up Calico since Calico provides better performance than Weave and Calico can still provide tunnel method(via IPIP) to set up network. 29

28.STORAGE CHOICE by JD.com 30

29. STORAGE CHOICES • Separate Compute and Storage cluster • Use Kubernetes to allocate resources for compute • Use Stand-alone HDFS Cluster for data persistent • Data locality depends on the workload types 31