SPIP: Accelerator-aware scheduling
[This doc is shared publicly.]

[SPARK-24615] SPIP: Accelerator-aware scheduling

Author: Xingbo Jiang

Background and Motivation

GPUs and other accelerators are widely used to accelerate special workloads, e.g., deep learning and signal processing. While users from the AI community use GPUs heavily, they often need Apache Spark to load and process large datasets and to handle complex data scenarios like streaming. YARN and Kubernetes already support GPUs in their recent releases. Although Spark supports those two cluster managers, Spark itself is not aware of the GPUs they expose, and hence Spark cannot properly request GPUs and schedule them for users. This leaves a critical gap in unifying big data and AI workloads and making life simpler for end users.

To make Spark aware of GPUs, we shall make two major changes at a high level:

● At the cluster manager level, we update or upgrade cluster managers to include GPU support. Then we expose user interfaces for Spark to request GPUs from them.
● Within Spark, we update its scheduler to understand the available GPUs allocated to executors and the user task requests, and to assign GPUs to tasks properly.

Based on the work done in YARN and Kubernetes to support GPUs and on some offline prototypes, we could have the necessary features implemented in the next major release of Spark. You can find a detailed scoping doc here, where we listed user stories and their priorities.

Goals

● Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
● No regression on scheduler performance for normal jobs.

Non-goals

● Fine-grained scheduling within one GPU card.
  ○ We treat one GPU card and its memory together as a non-divisible unit.
● Support TPU.
● Support Mesos.
● Support Windows.
Target Personas

● Admins who need to configure clusters to run Spark with GPU nodes.
● Data scientists who need to build DL applications on Spark.
● Developers who need to integrate DL features on Spark.

Implementation Sketch

This section outlines a possible implementation of this proposal. These details may change and are intended to show how the implementation might be integrated.

Spark Scheduling

Within Spark, we update its scheduler to understand the available GPUs allocated to executors and the user task requests, and to assign GPUs to tasks properly.

We shall allow users to specify resource requirements through the RDD/PandasUDF API; these requirements shall be summarized inside DAGScheduler for each Stage. TaskSetManager manages the pending tasks for each Stage attempt; we shall update it to provide a pending task that has GPU requirements when possible. Note that if the submitted job does not require GPUs, the scheduling behavior and efficiency shall remain the same as before.

Currently CPUS_PER_TASK (spark.task.cpus) is a global config with an int value that specifies the number of cores each task shall be assigned. Since we expect the ability to control task resource requirements at the per-stage level, we shall change the config spark.task.cpus to be the default resource requirement for each RDD, which can be overridden. We can also introduce a similar config, spark.task.gpus, to specify the default number of GPUs required by each task.

We shall update ExecutorBackend to accept and manage GPU mappings. It can sync the resource information with SchedulerBackend, so that SchedulerBackend can generate WorkerOffers that contain the available GPU resources. On TaskContext creation, we shall allocate free GPU indices to the context, so that we can avoid collisions.
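The scheduling description above combines two mechanisms: per-stage requirements that fall back to application-wide defaults, and per-executor bookkeeping of which GPU indices are free. The Python sketch below illustrates both under stated assumptions. It is not Spark code: TaskRequirements, resolve_requirements, and ExecutorGpuPool are hypothetical names introduced only for illustration, and the default of 0 for the proposed spark.task.gpus config is an assumption rather than something this proposal specifies.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TaskRequirements:
    """Per-task resource request (hypothetical type); None means 'use the default'."""
    cpus: Optional[int] = None
    gpus: Optional[int] = None


def resolve_requirements(stage_req: TaskRequirements,
                         conf: Dict[str, str]) -> TaskRequirements:
    """Fill unset fields of a stage-level request from the global defaults.

    spark.task.cpus defaults to 1 in Spark; the default of 0 for the proposed
    spark.task.gpus is an assumption made for this sketch.
    """
    cpus = stage_req.cpus
    if cpus is None:
        cpus = int(conf.get("spark.task.cpus", "1"))
    gpus = stage_req.gpus
    if gpus is None:
        gpus = int(conf.get("spark.task.gpus", "0"))
    return TaskRequirements(cpus=cpus, gpus=gpus)


class ExecutorGpuPool:
    """Tracks free GPU indices on one executor (hypothetical helper)."""

    def __init__(self, gpu_indices: List[int]) -> None:
        self._free: List[int] = sorted(gpu_indices)
        self._by_task: Dict[int, List[int]] = {}

    def available(self) -> int:
        """Number of GPUs that could go into the next resource offer."""
        return len(self._free)

    def acquire(self, task_id: int, count: int) -> Optional[List[int]]:
        """Reserve `count` free indices for a task, or return None if too few are free."""
        if task_id in self._by_task:
            raise ValueError(f"task {task_id} already holds GPUs")
        if count > len(self._free):
            return None
        assigned, self._free = self._free[:count], self._free[count:]
        self._by_task[task_id] = assigned
        return assigned

    def release(self, task_id: int) -> None:
        """Return a finished task's indices to the free list."""
        self._free = sorted(self._free + self._by_task.pop(task_id, []))
```

Because every index is either in the free list or owned by exactly one task, two concurrently running tasks can never be handed the same GPU, which is the collision the TaskContext allocation is meant to avoid. A job that requests no GPUs resolves to a count of 0 and never changes the pool's state.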
Resource Manager

Standalone Support

Since we assume homogeneous worker resources, the accelerator resource info can be read from a global conf file. The ExecutorRunner shall carry the allocated GPU resource mappings and pass them to the Executor as parameters.

YARN Support

Users can request GPU resources for the Spark application via spark-submit. An application with GPU resources can be launched using YARN+Docker, so users can easily define the DL environment in the Dockerfile. Spark needs to upgrade to YARN 3.1.2+ to enable GPU support, which provides the following features:

● Auto discovery of GPU resources.
● GPU isolation at process level.
● Placement constraints.
● Heterogeneous device types via node labels.

Kubernetes Support

Users can specify GPU requirements for the Spark application on Kubernetes in one of the following ways:

● spark-submit w/ the same GPU configs used by standalone/YARN.
● spark-submit w/ pod template (new feature for Spark 3.0).
● spark-submit w/ mutating webhook confs to modify pods at runtime.

Users can run Spark jobs on Kubernetes using nvidia-docker to access GPUs. Kubernetes also supports the following features:

● Auto discovery of GPU resources.
● GPU isolation at executor pod level.
● Placement constraints via node selectors.
● Heterogeneous device types via node labels.
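As an illustration of the pod template option above, the Python sketch below builds the kind of pod spec fragment that asks Kubernetes for GPUs. It assumes the NVIDIA device plugin is installed on the cluster, which is what exposes the nvidia.com/gpu extended resource. The function name, the container name, the image name, and the node label are all hypothetical placeholders, and the fragment shows the general shape of a GPU request rather than the exact template Spark would generate.

```python
import json
from typing import Any, Dict, Optional


def executor_pod_template(gpus: int, image: str,
                          node_selector: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Build a minimal pod spec requesting `gpus` whole GPUs (hypothetical helper)."""
    if gpus < 0:
        raise ValueError("gpus must be non-negative")
    container: Dict[str, Any] = {"name": "executor", "image": image}
    if gpus > 0:
        # Kubernetes expects extended resources such as GPUs under `limits`,
        # and only in whole units, matching the non-divisible GPU assumption.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    spec: Dict[str, Any] = {"containers": [container]}
    if node_selector:
        # Node labels let a job target a specific device type.
        spec["nodeSelector"] = dict(node_selector)
    return {"apiVersion": "v1", "kind": "Pod", "spec": spec}


if __name__ == "__main__":
    template = executor_pod_template(
        gpus=1,
        image="example.com/spark-dl:latest",        # hypothetical image
        node_selector={"accelerator": "example-gpu-type"},  # hypothetical label
    )
    print(json.dumps(template, indent=2))
```

Keeping the GPU count and the node selector as separate inputs mirrors the feature list above: the count drives isolation at the executor pod level, while the selector handles placement and heterogeneous device types.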