京东大规模全域调度平台实践

京东的YARN集群以社区2.7.1版本为基础构建而成,经过多年的打磨已发展成京东大数据的全域调度平台。在此次演讲中,我们将分享YARN在京东的应用,并延伸到如何借助YARN的特性构建全域调度平台。我们会介绍如何实现作业跨机房的迁移,如何借助K8s资源实现临时扩容,怎样结合超卖、物理资源、资源隔离等提升集群资源利用率。

展开查看详情

1.Large-scale global scheduling platform practice in JD.com Wanqiang Ji Hadoop YARN @ JD.com 2020/09/26

2.INTRODUCE Wanqiang Ji • Staff Software Engineer @ JD.com, Teach Lead of YARN Team • Apache Submarine Committer

3.Agenda • Control Plane: a stateless multi data center router • Job migration across the data center • Stateless job management • Lease resource between YARN and K8s • Orochi: a federated scheduling policy scheduler • Multithreading scheduling with different policy • Node scoring and filtering by physical utilization • Stability: container scoring and eviction etc.

4.Control Plane YARN • stateless Control Plane Federation Cluster • generate yarn app id IDC-1 b K8s jo • queue mapping it Federation Cluster m b • YARN queue su su b m • K8s namespace it YARN jo Control Plane b Federation Cluster • lease resource IDC-2 K8s • lease from K8s Federation Cluster • lease from yarn A stateless multi datacenter router

5.Queue Mapping Logical Queue Physical Queue f() queue-a idc1.yarn.queue-a f() queue-b idc2.yarn.queue-a f() queue-c idc1.k8s.ns-a f() queue-d idc2.k8s.ns-a f() queue-e idc2.yarn.queue-b Job migration cross the datacenter by Queue Mapping

6.Get Application Id • Why support this feature? • Resolve each request should reach to RM • YARN application id rule • Consists of three components • Use application as the prefix • The second component is the server start timestamp • New application id rule • Keep three components and prefix • Use prefix and suffix to change the second component

7.Lease Resource YARN Cluster Heart Beat NM Periodic Request RM NM rcm-yarn-controller NM rcm-k8s-controller RCM Queue Manager NM Plane Portal kubelet API Server Control Plane Node NM This design was the first version in our clusters, kubelet nowadays we upgrade to the second version. Node K8s Cluster

8.Orochi A federated scheduling policy scheduler • Multithreading scheduling with different policy, supports customize • FIFO Policy • Fair Policy • Gang Policy • Partition Policy • Long tail Policy • ... • Node scoring and filtering, move the low score node to rest queue • Stability: container scoring and eviction etc.

9.Multithread Scheduling Thread - FIFO Policy Thread - Fair Policy Commit Service scheduler write lock Thread - Gang Policy a6 a5 a4 a3 a2 a1 backlogs take Thread - Partition Policy backlogs commit Thread - Partition Policy scheduler write unlock ……. The scheduling thread supports dynamic change of the policy and the Prepare Scheduling capacity of the backlogs can be changed without restart the RM server.

10.Node scoring and filtering We use of node score to sort and filter the node list. • Add a new HB(heart-beat) between RM and NM to collect the physical resource utilization, such as CPU/Memory/IO/Net/system load etc. • Define a scoring function to calculate the node score • Virtual Resource • Physical Resource • Node Label • Topology • ……

11.Each Thread Scheduling Process Filter the node list by policy queue Attempt scheduling one node read lock Sort the queue by policy child queue/running apps Assign the container read unlock Put the assignment to backlogs Different thread maybe retrieves the identical app attempt to assign the container.

12.Container Scoring We use the container score to dynamic scheduling and eviction • Business Level: High, Middle, Low. Each level has the configurable factor to calculate the oversubscription resource, such as: • High Level Factor: [1.2 - 1.4] • Middle Level Factor: 1.0 • Low Level Factor: [0.8, 1.0] • Application Priority, used by preemption and eviction • Generated by AI system

13.Container Scheduler The container scheduler is responsible for dynamically Calculate the usage with weight changing the CFS hard limits by the container score. Initialize: cpu.cfs_quota_us = 200000 Sort the container by score cpu.cfs_period_us = 100000 In Scheduling: Adjust the factor of containers factor = [1.2, 1.4] with low score First Scheduling: Recalculate the usage with cpu.cfs_quota_us = 280000 weight cpu.cfs_period_us = 100000 Second Scheduling: Update the cgroup cpu.cfs_quota_us = 240000 cpu.cfs_period_us = 100000

14.Isolation We divide the resource into 3 parts based on our business situation. Such as: NodeManager, Containers and the others. Containers CPU: 80% The container scheduler can change the CFS limits of container periodically. Memory: Due to enable oversubscription, the NM and containers were isolated with other services. 10% NM IO: Others 10% Experimental Node

15.Container Eviction RPC ResourceTrackerService KILL REJECT scheduler AM 2 NM ContainerManager RMContainerAllocator C0 3 Node Manager ContainerLauncher getUsage Node Monitor TaskAttemptListener checkBusy C1 getContainers AM C2 signalEvict TaskAttemptListener 1

16.Future Works • Co-location • Intelligent Scheduling • Containerization

17.火热招聘中 岗位:存储研发工程师/专家 岗位:计算引擎研发工程师/专家

18.感谢您的时间 THANKS