Deep Dive into the Spark Scheduler

1. Deep Dive: Scheduler in Apache Spark
Xingbo Jiang
Spark Summit | SF | Jun 2018

2. About Me
Xingbo Jiang (GitHub: jiangxb1987)
• Software Engineer at Databricks
• Active contributor to Apache Spark

3. Databricks' Unified Analytics Platform
• Unifies Data Engineers and Data Scientists: collaborative notebooks
• Unifies Data and AI technologies: Databricks Runtime, powered by Delta (SQL, Streaming)
• Eliminates infrastructure complexity: cloud-native service

4. Architecture
SparkContext creates and wires together the DAGScheduler, the TaskScheduler, and a SchedulerBackend (CoarseGrainedSchedulerBackend when running on a cluster, LocalSchedulerBackend in local mode).

5. SparkContext
• Main entry point
• Create SchedulerBackend/TaskScheduler/DAGScheduler on initialization
• submitJob()/cancelJob()
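
A minimal Scala sketch of these entry points (the object name and data are illustrative; submitJob() and cancelJob()/cancelJobGroup() are the public SparkContext APIs the slide refers to):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.concurrent.Await
    import scala.concurrent.duration.Duration

    object SubmitJobDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("scheduler-demo").setMaster("local[4]"))
        val rdd = sc.parallelize(1 to 1000, 8)

        // submitJob() hands the job to the DAGScheduler, which builds the stage
        // DAG; the TaskScheduler then launches the resulting tasks.
        val future = sc.submitJob(
          rdd,
          (it: Iterator[Int]) => it.sum,   // work to run on each partition
          0 until rdd.getNumPartitions,    // which partitions to compute
          (index: Int, sum: Int) => println(s"partition $index -> $sum"),
          ()                               // overall result (unused here)
        )

        // A running job can be cancelled with sc.cancelJob(jobId) or, after
        // sc.setJobGroup(...), with sc.cancelJobGroup(groupId).
        Await.result(future, Duration.Inf)  // FutureAction extends scala.concurrent.Future
        sc.stop()
      }
    }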

6. Scheduling Process
RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task, via the Cluster Manager) → Worker (Threads + Block Manager)

7. RDD → Stage
rdd1.join(rdd2).groupBy(...).filter(...) builds a chain of RDDs (MapPartitionsRDD, ShuffledRDD, ...); the DAG is cut at each ShuffledRDD, yielding Stage0, Stage1, and Stage2.
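
A small sketch of how such a pipeline exposes its stage boundaries (the RDD contents are made up; assumes an existing SparkContext sc). toDebugString prints the lineage, indented at each shuffle:

    val rdd1 = sc.parallelize(1 to 100).map(x => (x % 10, x))
    val rdd2 = sc.parallelize(1 to 100).map(x => (x % 10, x * x))
    val result = rdd1.join(rdd2).groupBy(_._1).filter(_._2.nonEmpty)
    // Each ShuffledRDD in the output marks a stage boundary (Stage0/1/2).
    println(result.toDebugString)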

8. Scheduling Process
RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task, via the Cluster Manager) → Worker (Threads + Block Manager)

9. DAGScheduler
• Implement stage-oriented scheduling
• Compute a DAG of stages for the submitted job
• Keep track of materialized RDD/Stage outputs
• Find a minimal schedule to run the job
• Stage → TaskSet
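
One observable consequence of tracking materialized outputs (sketch assumes an existing sc): once a shuffle's map output exists, a later job over the same RDD reuses it, and the Spark UI reports the map stage as skipped.

    val counts = sc.parallelize(1 to 100, 4).map(x => (x % 5, x)).reduceByKey(_ + _)
    counts.count()    // job 1: runs the ShuffleMapStage and a ResultStage
    counts.collect()  // job 2: ShuffleMapStage is skipped, its output is already materialized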

10. Stage Dependency
(diagram) Stages form a dependency graph: a stage can start only after all of its parent stages have completed (diagram shows Stage0, Stage1, Stage2, Stage3).

11. TaskSet
• A set of tasks submitted to compute missing partitions of a particular stage
• A stage can correspond to multiple TaskSets

12. Zombie TaskSet
(diagram) An active TaskSet becomes a zombie when it hits a FetchFailure, is blacklisted, or fails with other exceptions.

13. Stage → TaskSet
(diagram) A Stage maps to one or more TaskSets, each owned by a TaskSetManager; at most one TaskSetManager is active, earlier ones stay around as zombies.
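
An illustrative Scala model of this bookkeeping (not Spark's actual internals; all names below are invented): at most one attempt per stage is active, and creating a new attempt retires the previous one as a zombie.

    case class TaskSetAttempt(stageId: Int, attempt: Int, var isZombie: Boolean = false)

    class StageAttempts(stageId: Int) {
      private var attempts = List.empty[TaskSetAttempt]

      // Create a new TaskSet attempt for the stage, retiring any earlier one.
      // Zombie attempts keep their running tasks but launch no new ones.
      def newAttempt(): TaskSetAttempt = {
        attempts.foreach(_.isZombie = true)
        val next = TaskSetAttempt(stageId, attempts.size)
        attempts = next :: attempts
        next
      }

      def active: Option[TaskSetAttempt] = attempts.find(!_.isZombie)
    }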

14. Scheduling Process
RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task, via the Cluster Manager) → Worker (Threads + Block Manager)

15. TaskScheduler
• The DAGScheduler submits sets of tasks to the TaskScheduler
• Schedule and monitor tasks with the SchedulerBackend
• Return events to the DAGScheduler:
  • JobSubmitted/JobCancelled
  • MapStageSubmitted/StageCancelled
  • CompletionEvent
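
A hedged sketch of this event protocol (the case-class names mirror Spark's internal DAGSchedulerEvent types, but the code below is illustrative, not Spark source):

    sealed trait SchedulerEvent
    case class JobSubmitted(jobId: Int) extends SchedulerEvent
    case class JobCancelled(jobId: Int) extends SchedulerEvent
    case class CompletionEvent(taskId: Long, success: Boolean) extends SchedulerEvent

    // The DAGScheduler consumes such events from an internal event loop.
    def handle(event: SchedulerEvent): Unit = event match {
      case JobSubmitted(id)       => println(s"build the stage DAG for job $id")
      case JobCancelled(id)       => println(s"cancel the running stages of job $id")
      case CompletionEvent(t, ok) => println(s"task $t finished, success = $ok")
    }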

16. Scheduling Tasks
• Batch scheduling approach
  • Get all available slots
  • Schedule tasks with locality preference
• Barrier scheduling approach (SPARK-24375)
  • Wait until all tasks in the same TaskSet can be scheduled at the same time
  • Retry all tasks in the TaskSet if any task fails
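
Barrier scheduling shipped as the barrier execution mode in Spark 2.4; a minimal sketch of the public API (assumes an existing sc):

    import org.apache.spark.BarrierTaskContext

    val out = sc.parallelize(1 to 100, 4)
      .barrier()                      // request barrier scheduling for this stage
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        ctx.barrier()                 // block until all 4 tasks reach this point
        iter
      }
      .collect()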

17. Scheduling TaskSets
• FIFO by default
• What if multiple users run jobs on a single cluster?
  • Long-running tasks may block later tasks
  • Use Fair Scheduling to ensure a minimal share of resources for each user

18. Scheduling TaskSets
• FIFO
  • FIFO between TaskSetManagers
  • Order by Priority, StageId
• Fair Scheduling
  • FS between Pools, and FIFO or FS within Pools
  • Try to launch tasks in the TaskSet that is further below the minShare
  • If both TaskSets are running above minShare, order by weighted number of running tasks
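
A sketch of turning on Fair Scheduling (the pool name and file path are placeholders; each pool's schedulingMode, weight, and minShare are defined in the allocation file):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("multi-user-app")
      .set("spark.scheduler.mode", "FAIR")    // default is FIFO
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are scheduled in the "production" pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")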

19. Scheduling Tasks in a TaskSet
• Try to achieve better locality for each task
  • Less data transfer over the network
  • Higher performance
• Locality can have different levels
  • Process locality
  • Node locality
  • Rack locality

20. TaskSetManager
• Schedule tasks within a single TaskSet
• Implement locality-aware scheduling via delay scheduling

21. Delay Scheduling
• Wait for tasks to finish (vs. killing running tasks) to assign slots
• Wait a little extra time to achieve better locality
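
The extra wait is tunable through the standard locality properties; a sketch (the values shown match the documented defaults):

    import org.apache.spark.SparkConf

    // How long delay scheduling waits at each locality level before
    // falling back to the next, less local one.
    val conf = new SparkConf()
      .set("spark.locality.wait", "3s")          // base wait, used by all levels
      .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL
      .set("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL
      .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY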

22. Delay Scheduling
(flowchart) For a free slot (Slot1), scan the pending tasks Task1 ... Taskn: if a task satisfies the max allowed locality, launch it; otherwise, if the wait timeout is exceeded, launch at the relaxed locality level; if not, skip the slot for this round.

23. Delay Scheduling

    when a heartbeat is received from node n:
      if n has a free slot then:
        compute maxAllowedLocality for pending tasks
        if exists task t that can launch on n with locality <= maxAllowedLocality:
          launch t
          update currentLocality
        else if waitTime > maxDelayTime:
          relax maxAllowedLocality by one level and launch a pending task t on n
        else:
          skip n and wait for the next round of scheduling

24. Delay Scheduling
For more detail, see: Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys 2010.

25. Handle Failures
Task Failure:
• Record the failure count of the task
• Retry the task if failure count < maxTaskFailures
• Abort the stage and corresponding jobs if count >= maxTaskFailures
Fetch Failure:
• Don't count the failure into task failure count
• Retry the stage if stage failure < maxStageFailures
• Abort the stage and corresponding jobs if stage failure >= maxStageFailures
• Mark executor/host as lost (optional)
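
A sketch of where these two retry budgets live in configuration (to my understanding, maxStageFailures corresponds to the internal spark.stage.maxConsecutiveAttempts setting; both budgets default to 4):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.task.maxFailures", "4")              // maxTaskFailures, per task
      .set("spark.stage.maxConsecutiveAttempts", "4")  // maxStageFailures, per stage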

26. Retry Stage
(diagram: ShuffleMapTask1/2/3 → Shuffle01/02/03 → Shuffle11/12 → ResultTask1/2)
Note: Shuffle01/02/03 are not on the same host as Shuffle11, so they don't need to be retried.

27. Retry Stage
(same diagram) Note: Shuffle02 is on the same host as Shuffle11, so ShuffleMapTask2 is retried.

28. Retry Stage/TaskSet
• Retry Stage
  • Retry parent stages if necessary
  • Only retry tasks that have missing partitions
• Retry TaskSet
  • Mark the current TaskSet as zombie
  • Don't kill running tasks

29. Scheduling Process
RDD Objects → (DAG) → DAGScheduler → (TaskSet) → TaskScheduler → (Task, via the Cluster Manager) → Worker (Threads + Block Manager)
