A Hybrid Container Cloud with Kubernetes and Hadoop YARN - Jian He, Alibaba and Bushuang Gao, Alibaba

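Hadoop YARN is a resource management platform for running big data applications such as MapReduce and Spark. Its architecture differs from that of Kubernetes, which is well suited to long-running services. Many companies run both to serve different types of workloads. However, this approach incurs staggering operational and hardware overhead. Given the differences between the two types of workloads, is there a way for them to share one cluster while the two resource management systems work together in harmony? What are the requirements, and what obstacles need to be overcome? In this talk, we will introduce a framework developed at Alibaba that runs Kubernetes and Hadoop seamlessly in a single cluster with elastic resource sharing. We will also share what we have learned from managing both workloads in production to support Alibaba's large-scale business platform.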

2.
- Jian He
  - Staff Engineer @Alibaba cluster management team
  - Staff Engineer @Hortonworks
  - Hadoop Committer & Project Management Committee member
- Bushuang Gao
  - Senior Engineer @Alibaba

5.
- Gartner has long talked about the "80% rule": 80 percent of IT budgets get spent simply "keeping the lights on"
- The average data center CPU utilization is about 10%

8.
| | Online service | Batch jobs |
|---|---|---|
| Category | Online shopping web apps, payment service | MR, Spark, Flink |
| Latency | Sensitive | Insensitive |
| Priority | High | Low |
| Traffic pattern | Peak at day time | Peak at night time |
| Fault tolerance | Should not fail | Fail and retry |

Complementary!

13. The Borg paper mentions that segregating prod and non-prod workloads would require 20% to 30% more machines

15. [Diagram labels] Workloads: retail, search, ads; Spark, MR, Flink. Schedulers: Sigma, Fuxi; Kubernetes, YARN. Shared layer: node.

16. [Chart labels] Co-located: 40%. Separated: 10%, 30%.

18. Resource contention
- Scheduling: efficient placement of service containers and tasks
- Isolation: when placed together, workloads don't affect each other

19.
- Online workload is low from 1:00 am to 6:00 am
- Offline jobs scale up while the online workload remains idle
- Offline jobs scale down when the online workload comes back
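
The scale-up and scale-down behaviour above can be illustrated with a minimal sketch. It assumes a hypothetical rule that is not taken from the talk: the offline share of a node is whatever the online service is not using, minus a fixed safety headroom. The function name, the 64-core node, the 10% headroom, and the hourly usage figures are all invented for illustration.

```python
def offline_cores(capacity, online_used, headroom_ratio=0.1):
    """Cores that offline jobs may use on one node (hypothetical rule).

    Offline jobs get the capacity left over by the online service, minus a
    safety headroom, and never less than zero.
    """
    headroom = capacity * headroom_ratio
    return max(0.0, capacity - online_used - headroom)


# Illustrative online usage (in cores) on a 64-core node at a few hours of the day.
online_usage_by_hour = {3: 4.0, 9: 30.0, 14: 48.0, 20: 60.0}

# Offline capacity grows when online usage is low and shrinks as it comes back.
offline_by_hour = {
    hour: offline_cores(64, used) for hour, used in online_usage_by_hour.items()
}
```

With these made-up numbers, the offline share is largest at 3:00 and drops to zero at 20:00, when online usage plus headroom exceeds the node's capacity.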

23.
- Kubernetes: focuses on long-running services; drives the current state towards the desired state with control loops
- YARN: focuses on scheduling jobs
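
The control-loop idea can be sketched in a few lines. This is a toy model, not Kubernetes code: the state is reduced to a single replica count, and the function names `reconcile_once` and `run_control_loop` are hypothetical.

```python
def reconcile_once(desired, current):
    """Take one corrective step from the current replica count towards the desired one."""
    if current < desired:
        return current + 1  # start one replica
    if current > desired:
        return current - 1  # stop one replica
    return current  # already converged


def run_control_loop(desired, current, max_iterations=100):
    """Repeat reconcile_once until the observed state matches the desired state.

    Returns the sequence of observed replica counts, starting with the initial one.
    """
    history = [current]
    for _ in range(max_iterations):
        if current == desired:
            break
        current = reconcile_once(desired, current)
        history.append(current)
    return history
```

For example, `run_control_loop(3, 0)` walks through the counts 0, 1, 2, 3 and then stops, because the current state has reached the desired state.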

24.
- Kubernetes: container centric, bottom up. The container is the primitive; other primitives such as ReplicaSet and Deployment are built around containers.
- YARN: application centric, top down. Scheduling sequence: queue -> user -> application -> container request
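
The top-down sequence queue -> user -> application -> container request can be sketched as a nested traversal. This is only an illustration of the ordering: the data layout, the function name, and the "first entry in insertion order" selection rule are assumptions, whereas a real YARN scheduler applies capacity or fairness policies at each level.

```python
def next_container_request(queues):
    """Walk queue -> user -> application and return the first pending request.

    `queues` is a nested dict: queue name -> user name -> application id ->
    list of pending container requests. Iteration follows insertion order.
    Returns (queue, user, application, request), or None if nothing is pending.
    """
    for queue_name, users in queues.items():
        for user_name, apps in users.items():
            for app_id, requests in apps.items():
                if requests:
                    return queue_name, user_name, app_id, requests[0]
    return None


# Hypothetical cluster state: only app-3 and app-4 have pending requests.
cluster = {
    "prod": {"alice": {"app-1": []}},
    "batch": {
        "bob": {"app-2": [], "app-3": ["2 vcores, 4 GB"]},
        "carol": {"app-4": ["1 vcore, 2 GB"]},
    },
}

selected = next_container_request(cluster)
```

Here the traversal skips the `prod` queue, which has nothing pending, and selects the request of `app-3` owned by `bob` in the `batch` queue.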

25.
- Kubernetes: based on the api-server watch mechanism; everything is stored in etcd
- YARN: based on RPC; only application-level metadata is persisted, container data is not. Recovers in-memory state from peers
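
A watch mechanism can be pictured as a store that pushes every change to its subscribers, rather than having clients poll it over RPC. The sketch below is a toy in-memory analogy only: `TinyStore` and its methods are hypothetical and are not the API of the Kubernetes api-server, a Kubernetes client, or etcd.

```python
class TinyStore:
    """In-memory key-value store that notifies watchers of every change."""

    def __init__(self):
        self._data = {}
        self._revision = 0  # increases by one on every write
        self._watchers = []

    def watch(self, callback):
        """Register a callback invoked as callback(revision, key, value)."""
        self._watchers.append(callback)

    def put(self, key, value):
        """Store the value, then push the change to all watchers."""
        self._revision += 1
        self._data[key] = value
        for callback in self._watchers:
            callback(self._revision, key, value)
        return self._revision


events = []
store = TinyStore()
store.watch(lambda revision, key, value: events.append((revision, key, value)))
store.put("pods/web-1", "Pending")
store.put("pods/web-1", "Running")
```

After the two writes, `events` holds both changes in order, each tagged with the revision at which it happened, so a watcher can react to state transitions without re-reading the whole store.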

26.
- Kubernetes: CRI compatible (Docker etc.)
- YARN: Docker + tarball

28. [Architecture diagram labels] Top: Resource management, Console, Online service, Offline jobs. Middle: Apiserver, Co-location, Scheduler, YARN-RM. Links: RPC: VTRON (x2), L&W (x2), GRPC, RPC. Node: kubelet, agent, YARN-NM, cgroup, pod (x5), task (x3).
VTRON: Virtual Total Resources Of Node

29. [Chart labels] Kubernetes: online service usage, online service resource quota. YARN: offline job resource usage, offline job resource quota.