Tangram: Distributed Scheduling Framework for Apache Spark at Facebook

Tangram is a state-of-the-art resource allocator and distributed scheduling framework for Spark at Facebook, with hierarchical queues and a resource-based container abstraction. We support scheduling and resource management for a significant portion of Facebook’s data warehouse and machine learning workloads, which equates to running millions of jobs across several clusters with tens of thousands of machines. In this talk, we will describe Tangram’s architecture, discuss Facebook’s need for a custom scheduler, and explain how Tangram schedules Spark workloads at scale. We will focus in particular on several important features for improving Spark’s efficiency, usability, and reliability:
1. IO-rebalancer (Tetris) support
2. User-fairness queueing
3. Heuristic-based backfill scheduling optimizations

1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics

2. Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Rui Jian, Hao Lin, Facebook Inc.
rjian@fb.com, hlin@fb.com

3. About Us
• Rui Jian
  – Software Engineer at Facebook (Data Warehouse & Graph Indexing)
  – Master of Computer Science (Shanghai Jiao Tong University)
• Hao Lin
  – Research Scientist at Facebook (Data Warehouse Batch Scheduling)
  – PhD in Parallel Computing (Purdue ECE)

4. Agenda
• Overview
• Tangram Architecture
• Scheduling Policies & Resource Allocation
• Future Work

5. What is Tangram? The scheduling platform for
• reliably running various batch workloads
• with efficient heterogeneous resource management
• at scale

6. Tangram Scheduling Targets
• Single jobs: ad hoc/periodic
• Batch jobs: ad hoc/periodic, malleable
• Gang jobs: ad hoc/periodic, rigid
• Long-running jobs: steady and regular, e.g. online training
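
To make this taxonomy concrete, here is a minimal sketch of how a scheduler might model these job classes. All type and field names are illustrative assumptions, not Tangram's actual API.

```python
from dataclasses import dataclass
from enum import Enum

class Arrival(Enum):
    AD_HOC = "ad hoc"        # submitted on demand
    PERIODIC = "periodic"    # recurring on a schedule

class Elasticity(Enum):
    MALLEABLE = "malleable"  # can make progress with fewer containers
    RIGID = "rigid"          # all containers must be granted together (gang)

@dataclass
class JobSpec:
    name: str
    arrival: Arrival
    elasticity: Elasticity
    containers: int             # requested container count
    long_running: bool = False  # e.g. online training

# The four scheduling targets above, expressed as specs:
single = JobSpec("single-task", Arrival.AD_HOC, Elasticity.RIGID, containers=1)
batch = JobSpec("spark-batch", Arrival.PERIODIC, Elasticity.MALLEABLE, containers=100)
gang = JobSpec("gang-run", Arrival.PERIODIC, Elasticity.RIGID, containers=200)
online = JobSpec("online-training", Arrival.PERIODIC, Elasticity.RIGID,
                 containers=16, long_running=True)
```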

7. Why Tangram?
• Various workload characteristics
  – ML
  – Apache Spark
  – Apache Giraph
  – Single jobs
• Customized scheduling policies
• Scalability
  – Fleet size: hundreds of thousands of worker nodes
  – Job scheduling throughput: hundreds of millions of jobs per day

8. Overview
• What is Tangram?
[Architecture diagram: ML, SQL query, and Admin DB clients feed a Job Manager; the Resource Manager places gang jobs (Spark Master), Giraph jobs, single jobs, and ML Elastic Scheduler work onto Agents; numbered arrows (1–6) trace the submission flow.]

9. Client Library
• Job management
• Request/release resources
• Resource grant
• Preemption notification
• Launch containers
• Container status change event
[Diagram: the Application embeds the Tangram client, which exchanges these messages (1–6) with the Resource Manager and the Agent.]
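
As a rough illustration of that request/grant/launch flow, here is a hedged sketch of how a client library might dispatch protocol events. The event names and payloads are assumptions for illustration, not Facebook's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # RESOURCE_GRANT | PREEMPTION_NOTICE | CONTAINER_STATUS_CHANGE
    payload: dict

def handle(event: Event) -> str:
    """Dispatch one scheduling-protocol event the way a client library might."""
    if event.kind == "RESOURCE_GRANT":
        # The RM granted resources on some agent: launch a container there.
        return f"launch container on {event.payload['agent']}"
    if event.kind == "PREEMPTION_NOTICE":
        # The RM signals preemption intent: checkpoint, then release the grant.
        return f"checkpoint and release {event.payload['container']}"
    if event.kind == "CONTAINER_STATUS_CHANGE":
        # Surface container status (RUNNING, FINISHED, ...) to the application.
        return f"container {event.payload['container']} is {event.payload['status']}"
    return "ignore"

# One plausible event sequence over a job's lifetime:
for e in [Event("RESOURCE_GRANT", {"agent": "agent-17"}),
          Event("CONTAINER_STATUS_CHANGE", {"container": "c-1", "status": "RUNNING"}),
          Event("PREEMPTION_NOTICE", {"container": "c-1"})]:
    print(handle(e))
```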

10. Agent
• Report schedulable resources and runtime usage
• Health check reports
• Detect labels
• Launch/kill containers
• Container recovery
• Resource isolation with cgroup v2
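
On the isolation point: cgroup v2 exposes per-group CPU and memory limits as plain files under the unified hierarchy. Below is a minimal sketch of what an agent could write for a container; the `tangram/` group path is a made-up example, while `cpu.max`, `memory.max`, and `cgroup.procs` are the real cgroup v2 interface files.

```python
import os

def apply_cgroup_v2_limits(container_id: str, cpu_milli: int, mem_bytes: int,
                           root: str = "/sys/fs/cgroup") -> None:
    """Create a cgroup for the container and write CPU/memory limits.

    cgroup v2 expresses the CPU quota as "<max_usec> <period_usec>" in cpu.max,
    and memory.max takes a byte count. The cpu and memory controllers must be
    enabled on the parent group via cgroup.subtree_control.
    """
    cg = os.path.join(root, "tangram", container_id)
    os.makedirs(cg, exist_ok=True)
    period = 100_000                     # 100 ms scheduling period
    quota = cpu_milli * period // 1000   # 3000 milli-cores -> 300000 usec
    with open(os.path.join(cg, "cpu.max"), "w") as f:
        f.write(f"{quota} {period}")
    with open(os.path.join(cg, "memory.max"), "w") as f:
        f.write(str(mem_bytes))
    # The agent would then write the container's PID into cgroup.procs.

# Example (requires root): the {cpuMilliCores: 3000, memoryBytes: 200GB}
# specification shown on a later slide.
# apply_cgroup_v2_limits("c-1", cpu_milli=3000, mem_bytes=200 * 1024**3)
```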

11. Failure Recovery
• Agent failure
  – Scan the recovery directory and recover the running containers
• RM failure
  – Both agents and clients hold off communication to the RM until the new master shows up
  – Clients sync session info to the new master to help it rebuild its state
  – Agents add themselves back to the new master
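
A hedged sketch of the agent-side recovery scan: on restart, the agent reads one record per container from a recovery directory and re-adopts containers whose processes are still alive. The file layout and record fields are assumptions.

```python
import json
import os

def recover_containers(recovery_dir: str) -> list:
    """Scan the recovery directory and re-adopt still-running containers."""
    recovered = []
    for name in os.listdir(recovery_dir):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(recovery_dir, name)) as f:
            record = json.load(f)       # e.g. {"container_id": "c-1", "pid": 4242}
        try:
            os.kill(record["pid"], 0)   # signal 0: existence check only
        except ProcessLookupError:
            continue                    # process is gone: report container as lost
        recovered.append(record)        # still alive: resume supervising it
    return recovered
```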

12. Scheduling Policies
• Hierarchical queue structure / DRF
• Jobs are queued on the leaves
• Queue configs:
  – min/max resources
  – Policy:
    • FIFO
    • Dominant Resource Fairness (DRF)
    • User fairness
    • Global
    • …
[Diagram: an example queue hierarchy. The root splits 80%/20% between ads and feed under DRF; ads splits 50%/50% between pipelines (FIFO) and interactive (user fairness); interactive splits 50%/50% between user1 and user2 (FIFO each); jobs hang off the leaf queues.]
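
As a worked illustration of DRF at one level of the hierarchy: a queue's dominant share is its largest per-resource share of the cluster, and the policy grants the next container to the queue with the smallest dominant share. The cluster size and usage numbers below are made up.

```python
# Dominant Resource Fairness: serve the queue with the lowest dominant share.
CLUSTER = {"cpu": 1000, "mem_gb": 4000}

def dominant_share(usage: dict) -> float:
    """A queue's dominant share is its max share across resource types."""
    return max(usage[r] / CLUSTER[r] for r in CLUSTER)

queues = {
    "pipelines":   {"cpu": 300, "mem_gb": 400},   # dominant: cpu, 0.30
    "interactive": {"cpu": 100, "mem_gb": 1600},  # dominant: mem, 0.40
}
next_queue = min(queues, key=lambda q: dominant_share(queues[q]))
print(next_queue)  # -> pipelines (dominant share 0.30 < 0.40)
```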

13. Scheduling Policies
• Jobs ordered by priority, then submission time, within a queue
• Gang jobs are first-class in scheduling and resource allocation
• Lookahead scheduling for better throughput and utilization
• Job starvation prevention
[Diagram: a queue holding jobs Gang 200, Gang 20, Single, Gang 4, Single.]
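
The lookahead point is essentially backfilling: when the job at the head of the queue (say, a 200-container gang) cannot fit, the scheduler looks past it and starts smaller jobs that fit now instead of idling. Starvation prevention additionally keeps the head job from being passed over forever, e.g. by reserving capacity for it, which this simplified sketch omits.

```python
def schedule_pass(jobs: list, free: int) -> list:
    """One lookahead pass over a queue pre-sorted by (priority, submit time).

    Gang jobs need all of their containers at once, so a blocked gang job is
    skipped and smaller jobs behind it are backfilled to keep utilization high.
    """
    started = []
    for job in jobs:
        if job["containers"] <= free:   # all-or-nothing fit for gang jobs
            free -= job["containers"]
            started.append(job["name"])
        # else: leave it queued and look ahead to the next job
    return started

queue = [{"name": "gang-200", "containers": 200},
         {"name": "gang-20", "containers": 20},
         {"name": "single-a", "containers": 1},
         {"name": "gang-4", "containers": 4}]
print(schedule_pass(queue, free=30))  # -> ['gang-20', 'single-a', 'gang-4']
```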

14. Resource Allocation
• Fine-grained resource specification:
  – {cpuMilliCores: 3000, memoryBytes: 200GB}
• Constraints:
  – "dataCenter = dc1 & type in [1,2] & kernelVersion > 4.10"
• Job affinity:
  – inSameDatacenter
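
A hedged sketch of how that constraint string could be evaluated against a host's attributes. Rather than guessing at Tangram's parser, the conjuncts are written as plain predicates; the host attributes are invented for the example.

```python
# Evaluate the slide's example constraint against one host's attributes.
host = {"dataCenter": "dc1", "type": 2, "kernelVersion": (4, 16)}

constraints = [
    lambda h: h["dataCenter"] == "dc1",
    lambda h: h["type"] in [1, 2],
    # Compare versions as tuples so that e.g. 4.9 < 4.10 < 4.16 works out.
    lambda h: h["kernelVersion"] > (4, 10),
]

print(all(c(host) for c in constraints))  # -> True: host is eligible
```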

15. Resource Allocation
• Prefetched Host Cache
  – Bypass the steps of host filtering and scoring
  – Speed up the allocation process
• Host Filtering
  – Hard & soft constraints
  – Resource constraint
  – Label constraint
  – Job affinity
• Host Scoring and Ordering
  – Packing efficiency
  – Host healthiness
  – Data locality
• Commit Allocation
  – Book-keeping of host resources
  – Update cluster & queue parameters
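
Putting the four stages together, here is a schematic of one allocation pass under stated assumptions: a single CPU resource, equality-only label constraints, and a tightest-fit packing score. It illustrates the pipeline's shape, not Tangram's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    free_cpu: int                  # milli-cores still available
    healthy: bool = True
    labels: dict = field(default_factory=dict)

def allocate(request_cpu: int, constraints: dict, hosts: list) -> "Host | None":
    # 1. Prefetched host cache: `hosts` is a cached snapshot, so no
    #    per-request host discovery is needed.
    # 2. Filtering: host health, free resources, and label constraints.
    feasible = [h for h in hosts
                if h.healthy
                and h.free_cpu >= request_cpu
                and all(h.labels.get(k) == v for k, v in constraints.items())]
    if not feasible:
        return None
    # 3. Scoring and ordering: prefer the tightest fit for packing efficiency.
    feasible.sort(key=lambda h: h.free_cpu - request_cpu)
    # 4. Commit: book-keep the chosen host's resources.
    best = feasible[0]
    best.free_cpu -= request_cpu
    return best

cache = [Host("host1", 8000, labels={"type": "1"}),
         Host("host4", 4000, labels={"type": "2"})]
print(allocate(3000, {"type": "2"}, cache).name)  # -> host4
```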

16. Constraint-based Scheduling
• Machine type
• Datacenter
• Region
• CPU architecture
• Host prefix
• …
[Diagram: one queue holding jobs with constraint type=1 and jobs with constraint type=2; Hosts 1–3 are labeled {"type":"1"} and Hosts 4–5 are labeled {"type":"2"}, together forming a merged host pool serving type 1 & 2.]

17. Preemption
• Guarantee resource availability SLOs within and across queues
• Identify the starving jobs and over-allocated jobs
• Minimize preemption cost: two-phase protocol
  – Only candidates appearing in both phases will be preempted
  – The Resource Manager notifies the client of the preemption intent so that necessary action can be taken, e.g. checkpointing
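
A hedged sketch of two-phase victim selection: one phase nominates candidates that would free enough capacity for starving jobs, the other nominates jobs in over-allocated queues, and only the intersection is preempted. The per-phase heuristics here (priority order, an over-allocation flag) are assumptions; the intersection rule is the slide's point.

```python
def choose_victims(running: list, starving_demand: int) -> set:
    """Two-phase selection: only jobs nominated by BOTH phases are preempted."""
    # Phase 1: walk low-priority jobs first until the starving demand is covered.
    phase1, freed = set(), 0
    for job in sorted(running, key=lambda j: j["priority"]):
        if freed >= starving_demand:
            break
        phase1.add(job["name"])
        freed += job["containers"]

    # Phase 2: jobs whose queue is over its fair allocation.
    phase2 = {j["name"] for j in running if j["over_allocated"]}

    # Intersection minimizes preemption cost; clients are then notified of the
    # preemption intent so they can checkpoint before containers are killed.
    return phase1 & phase2

running = [{"name": "a", "priority": 1, "containers": 50, "over_allocated": True},
           {"name": "b", "priority": 2, "containers": 40, "over_allocated": False},
           {"name": "c", "priority": 3, "containers": 30, "over_allocated": True}]
print(choose_victims(running, starving_demand=60))  # -> {'a'}
```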

18. Cross Datacenter Scheduling
• The growing demand for computation and storage for Hive tables spans data centers
• Stranded capacity due to imbalanced load
• Poor data locality and wasted network bandwidth
• Slow reaction when recovering from crises and disasters

19–21. Cross Datacenter Scheduling
• Dispatcher
  – Monitors resource consumption across data centers
  – Decides the Resource Manager for scheduling jobs
  – Provides location hints to the Resource Manager for enforcement
• Planner
  – Decides where the data will be placed based on utilization and available resources
[Diagram, built up across slides 19–21: a Proxy forwards a job (e.g. with constraint datacenter=1) to the Dispatcher, which routes it to Resource Manager 1 or Resource Manager 2; the managers span Datacenters 1–3, where the table data lives.]
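
To illustrate the Dispatcher's routing decision, here is a minimal sketch that sends a pinned job (e.g. constraint datacenter=1) to its datacenter and otherwise picks the least-utilized one. Routing purely on utilization is an assumption; the slides only say the Dispatcher monitors consumption and provides location hints.

```python
def dispatch(job: dict, utilization: dict) -> str:
    """Pick the datacenter (and thus the Resource Manager) for a job.

    A job pinned by a datacenter constraint goes where its data lives;
    otherwise route to the least-utilized datacenter to reduce stranded
    capacity and load imbalance.
    """
    pinned = job.get("constraint_datacenter")
    if pinned is not None:
        return pinned
    return min(utilization, key=utilization.get)

util = {"dc1": 0.92, "dc2": 0.55, "dc3": 0.70}
print(dispatch({"name": "adhoc-query"}, util))                             # -> dc2
print(dispatch({"name": "pinned", "constraint_datacenter": "dc1"}, util))  # -> dc1
```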

22. Future Work
• Mixed workloads managed by one resource manager
• Run batch workloads with off-peak resources from online services
• Automatic resource tuning for high utilization
• We’re hiring! Contact: rjian@fb.com
