Deep Dive Scheduler in Apache Spark
1. Deep Dive: Scheduler in Apache Spark (Xingbo Jiang, Spark Summit | SF | Jun 2018)
2. About Me
Xingbo Jiang (GitHub: jiangxb1987)
• Software Engineer at Databricks
• Active contributor of Apache Spark
3. Databricks' Unified Analytics Platform
• Collaborative notebooks: unify Data Engineers and Data Scientists
• Databricks Runtime: unifies data and AI technologies (SQL, Streaming; powered by Delta)
• Cloud native service: eliminates infrastructure complexity
4. Architecture
SparkContext creates the DAGScheduler, TaskScheduler, and a SchedulerBackend
SchedulerBackend implementations: CoarseGrainedSchedulerBackend, LocalSchedulerBackend
5. SparkContext
• Main entry point
• Creates SchedulerBackend/TaskScheduler/DAGScheduler on initialization
• submitJob()/cancelJob()
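The wiring described on slides 4-5 can be sketched as a toy model. These are Python stand-ins for the real Scala classes, not Spark's actual API; the class and attribute names are illustrative only:

```python
class SchedulerBackend:
    """Stub for the component that talks to the cluster manager."""

class TaskScheduler:
    """Schedules tasks through a SchedulerBackend."""
    def __init__(self, backend):
        self.backend = backend

class DAGScheduler:
    """Turns jobs into stages and hands TaskSets to the TaskScheduler."""
    def __init__(self, task_scheduler):
        self.task_scheduler = task_scheduler

class SparkContext:
    """Main entry point: wires up all three scheduler components on init."""
    def __init__(self):
        self.backend = SchedulerBackend()
        self.task_scheduler = TaskScheduler(self.backend)
        self.dag_scheduler = DAGScheduler(self.task_scheduler)
```

A job submitted through the context would flow from DAGScheduler to TaskScheduler to the backend, mirroring the architecture slide.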
6. Scheduling Process
RDD Objects → DAGScheduler → TaskScheduler → Worker
(DAG → TaskSet → Task → Task Threads; the TaskScheduler launches tasks through the Cluster Manager, workers store data in the Block Manager)
7. RDD → Stage
Lineage: RDD → ShuffledRDD → MapPartitionsRDD → ShuffledRDD → MapPartitionsRDD
Example: rdd1.join(rdd2).groupBy(...).filter(...) is cut into Stage0, Stage1, Stage2 at the shuffle boundaries
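The stage cut happens exactly at shuffle ("wide") dependencies. A minimal lineage walk, with hypothetical RDD names (the real DAGScheduler also deduplicates shared ancestor stages, which this toy version skips):

```python
class RDD:
    """Toy RDD: records narrow vs. shuffle parent dependencies."""
    def __init__(self, name, narrow_parents=(), shuffle_parents=()):
        self.name = name
        self.narrow_parents = list(narrow_parents)
        self.shuffle_parents = list(shuffle_parents)

def build_stages(final_rdd):
    """Walk the lineage backwards; start a new stage at every shuffle dependency."""
    stages = []
    def new_stage(rdd):
        stage = []
        stages.append(stage)
        fill(rdd, stage)
    def fill(rdd, stage):
        stage.append(rdd.name)
        for p in rdd.narrow_parents:   # narrow deps stay in the same stage
            fill(p, stage)
        for p in rdd.shuffle_parents:  # shuffle deps cut a new stage
            new_stage(p)
    new_stage(final_rdd)
    return stages

# rdd1.join(rdd2).groupBy(...).filter(...): shuffles at join and at groupBy
rdd1, rdd2 = RDD("rdd1"), RDD("rdd2")
joined = RDD("joined", shuffle_parents=[rdd1, rdd2])
grouped = RDD("grouped", shuffle_parents=[joined])
filtered = RDD("filtered", narrow_parents=[grouped])
print(build_stages(filtered))  # result stage first, then its map stages
```

The result stage contains the filter and the shuffle-read RDD it sits on; each shuffle input becomes its own upstream map stage.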
8. Scheduling Process
RDD Objects → DAGScheduler → TaskScheduler → Worker
(DAG → TaskSet → Task → Task Threads; the TaskScheduler launches tasks through the Cluster Manager, workers store data in the Block Manager)
9. DAGScheduler
• Implements stage-oriented scheduling
• Computes a DAG of stages for each submitted job
• Keeps track of materialized RDD/Stage outputs
• Finds a minimal schedule to run the job
• Stage → TaskSet
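Finding a "minimal schedule" means resubmitting only stages whose outputs are not already materialized; a materialized stage prunes its whole ancestry. A sketch with made-up stage names:

```python
def missing_stages(stage, parents, materialized):
    """Return the set of stages that must actually run: the target stage
    plus any ancestors whose outputs are not yet materialized."""
    need = set()
    def visit(s):
        if s in materialized or s in need:
            return  # output already available (or stage already scheduled)
        need.add(s)
        for p in parents.get(s, ()):
            visit(p)
    visit(stage)
    return need

# Hypothetical linear stage graph: Stage0 -> Stage1 -> Stage2
parents = {"Stage2": ["Stage1"], "Stage1": ["Stage0"]}
print(missing_stages("Stage2", parents, materialized={"Stage0"}))
```

Note that a materialized intermediate stage also shields its parents: if Stage1's output is available, Stage0 never runs even when its own output is gone.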
10. Stage Dependency (diagram of the dependencies among Stage0, Stage1, Stage2, Stage3)
11. TaskSet
• A set of tasks submitted to compute the missing partitions of a particular stage
• A stage can correspond to multiple TaskSets
12. Zombie TaskSet
An active TaskSet becomes a zombie TaskSet on FetchFailure, on being blacklisted, or on other exceptions
13. Stage → TaskSet
A Stage maps to TaskSetManagers: one active TaskSet and possibly several zombie TaskSets
14. Scheduling Process
RDD Objects → DAGScheduler → TaskScheduler → Worker
(DAG → TaskSet → Task → Task Threads; the TaskScheduler launches tasks through the Cluster Manager, workers store data in the Block Manager)
15. TaskScheduler
• The DAGScheduler submits sets of tasks to the TaskScheduler
• Schedules and monitors tasks with a SchedulerBackend
• Returns events to the DAGScheduler:
  • JobSubmitted/JobCancelled
  • MapStageSubmitted/StageCancelled
  • CompletionEvent
16. Scheduling Tasks
• Batch scheduling approach
  • Get all available slots
  • Schedule tasks with locality preference
• Barrier scheduling approach (SPARK-24375)
  • Wait until all tasks in the same TaskSet can be scheduled at the same time
  • Retry all tasks in the TaskSet if any task fails
17. Scheduling TaskSets
• FIFO by default
• What if multiple users run jobs on a single cluster?
  • Long-running tasks may block later tasks
• Use Fair Scheduling to ensure a minimal share of resources for each user
18. Scheduling TaskSets
• FIFO
  • FIFO between TaskSetManagers
  • Order by Priority, then StageId
• Fair Scheduling
  • FS between Pools; FIFO or FS within Pools
  • Try to launch tasks in the TaskSet that is furthest below its minShare
  • If both TaskSets are running above minShare, order by the weighted number of running tasks
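The two orderings on this slide roughly correspond to Spark's FIFOSchedulingAlgorithm and FairSchedulingAlgorithm. A simplified sketch as sort keys (real Spark also breaks remaining ties by pool/TaskSet name, omitted here):

```python
from dataclasses import dataclass

@dataclass
class Schedulable:
    priority: int        # in Spark's FIFO ordering this is the job id
    stage_id: int
    min_share: int = 0
    weight: int = 1
    running_tasks: int = 0

def fifo_key(s):
    """FIFO: earlier jobs (lower priority value) first, then lower stage id."""
    return (s.priority, s.stage_id)

def fair_key(s):
    """Fair: schedulables below their minShare come first (most starved
    first); the rest are ordered by running tasks weighted by `weight`."""
    starved = s.running_tasks < s.min_share
    if starved:
        return (0, s.running_tasks / max(s.min_share, 1))
    return (1, s.running_tasks / s.weight)
```

Sorting a list of Schedulables with either key gives the launch order: under `fair_key`, anything under its minShare jumps ahead of everything running at or above its share.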
19. Scheduling Tasks in a TaskSet
• Try to achieve better locality for each task
  • Less data transfer over the network
  • Higher performance
• Locality has different levels: process locality, node locality, rack locality
20. TaskSetManager
• Schedules tasks within a single TaskSet
• Implements locality-aware scheduling via delay scheduling
21. Delay Scheduling
• Wait for running tasks to finish (rather than killing them) to free slots
• Wait a little extra time to achieve better locality
22. Delay Scheduling
Slot1 is offered to the pending tasks Task1, Task2, ..., Taskn in turn:
• Satisfies the max allowed locality? Yes → launch the task
• No → exceeds the wait timeout? Yes → skip; No → wait
23. Delay Scheduling
when a heartbeat is received from node n:
  if n has a free slot then:
    compute maxAllowedLocality for pending tasks
    if exists task t that can launch on n with locality <= maxAllowedLocality:
      launch t
      update currentLocality
    else if waitTime > maxDelayTime:
      relax maxAllowedLocality to the next level and launch a pending task t on n
    else:
      skip n  // wait for the next round of scheduling
  endif
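The pseudocode above can be turned into a tiny runnable model. This sketch keeps one allowed locality level per TaskSet and relaxes it one step whenever `wait` seconds pass without a launch; it is a simplification of TaskSetManager's per-level waits, and the names and logical clock are illustrative:

```python
LOCALITY_LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

class DelayScheduler:
    """Toy delay scheduler: a task may only launch at or below the current
    allowed locality level; the level is relaxed one step whenever `wait`
    seconds pass without a launch (the caller supplies a logical clock)."""

    def __init__(self, wait=3.0):
        self.wait = wait
        self.level = 0          # index into LOCALITY_LEVELS
        self.last_launch = 0.0  # logical time of the last launched task

    def offer(self, task_level, now):
        """Offer a free slot to a task whose best locality on this slot is
        LOCALITY_LEVELS[task_level]; return True if it launches."""
        if (now - self.last_launch >= self.wait
                and self.level < len(LOCALITY_LEVELS) - 1):
            self.level += 1        # waited too long: relax the constraint
            self.last_launch = now
        if task_level <= self.level:
            self.last_launch = now
            return True
        return False               # keep waiting for a better-placed slot
```

A slot that is only RACK_LOCAL for the task is refused at first, then accepted once enough wait periods have elapsed to relax the level down to RACK_LOCAL.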
24. Delay Scheduling
For more detail, see: Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling.
25. Handle Failures
Task Failure:
● Record the failure count of the task
● Retry the task if failure count < maxTaskFailures
● Abort the stage and corresponding jobs if count >= maxTaskFailures
Fetch Failure:
● Don't count the failure into the task failure count
● Retry the stage if stage failure < maxStageFailures
● Abort the stage and corresponding jobs if stage failure >= maxStageFailures
● Mark executor/host as lost (optional)
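The two counters on this slide can be sketched as follows. The limits mirror spark.task.maxFailures and the stage attempt limit, both of which default to 4 in recent Spark versions; the function names are made up for illustration:

```python
MAX_TASK_FAILURES = 4    # cf. spark.task.maxFailures (assumed default)
MAX_STAGE_FAILURES = 4   # cf. the per-stage attempt limit (assumed default)

def handle_task_failure(task_failures, task_id):
    """Ordinary task failure: counts against the task itself."""
    task_failures[task_id] = task_failures.get(task_id, 0) + 1
    if task_failures[task_id] >= MAX_TASK_FAILURES:
        return "abort stage and jobs"
    return "retry task"

def handle_fetch_failure(stage_failures, stage_id):
    """Fetch failure: counts against the stage, never the task counter."""
    stage_failures[stage_id] = stage_failures.get(stage_id, 0) + 1
    if stage_failures[stage_id] >= MAX_STAGE_FAILURES:
        return "abort stage and jobs"
    return "retry stage"
```

Keeping the two counters separate is the point: a flaky shuffle fetch can retry the whole stage several times without ever burning any single task's retry budget.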
26. Retry Stage
ShuffleMapTask1 → Shuffle01, ShuffleMapTask2 → Shuffle02, ShuffleMapTask3 → Shuffle03; Shuffle11 → ResultTask1, Shuffle12 → ResultTask2
Note: Shuffle01/02/03 are not on the same host as Shuffle11, so they don't need to be retried.
27. Retry Stage
ShuffleMapTask1 → Shuffle01, ShuffleMapTask2 → Shuffle02, ShuffleMapTask3 → Shuffle03; Shuffle11 → ResultTask1, Shuffle12 → ResultTask2
Note: Shuffle02 is on the same host as Shuffle11, so ShuffleMapTask2 needs to be retried.
28. Retry Stage/TaskSet
• Retry Stage
  • Retry parent stages if necessary
  • Only retry tasks that have missing partitions
• Retry TaskSet
  • Mark the current TaskSet as zombie
  • Don't kill running tasks
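Both bullets can be combined into one sketch: resubmission marks the active manager as zombie (its running tasks keep going) and builds a new TaskSet holding only the missing partitions. Toy classes, not Spark's API:

```python
class TaskSetManager:
    """Toy manager: `zombie` means it launches no new tasks, but its
    running tasks are left alone and may still register their output."""
    def __init__(self, stage_id, tasks):
        self.stage_id = stage_id
        self.tasks = tasks
        self.zombie = False

def resubmit_stage(stage_id, num_partitions, available_outputs, managers):
    """Zombie any active manager for the stage, then build a new TaskSet
    containing only the partitions whose output is missing."""
    for m in managers:
        if m.stage_id == stage_id and not m.zombie:
            m.zombie = True        # don't kill its running tasks
    missing = [p for p in range(num_partitions) if p not in available_outputs]
    return TaskSetManager(stage_id, missing)
```

If partitions 0 and 2 of a three-partition stage already have output, the retry TaskSet contains only partition 1, while the old manager lingers as a zombie.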
29. Scheduling Process
RDD Objects → DAGScheduler → TaskScheduler → Worker
(DAG → TaskSet → Task → Task Threads; the TaskScheduler launches tasks through the Cluster Manager, workers store data in the Block Manager)