Guest post by Volcano Maintainers

On June 16, 2022, Apache Spark released version 3.3. The highlight of this release is framework support for customized Kubernetes schedulers, with Volcano used as the default batch scheduler for the first time. Spark users can now move from Hadoop to Kubernetes more easily and run large-scale data analytics with high performance.

Community Efforts

This feature was initiated by developers from Huawei and jointly implemented by contributors from companies such as Apple, Cloudera, Netflix, and Databricks. Users can now use third-party scheduler plug-ins for customized scheduling.

Spark + Volcano: Better Scheduling

A Spark job has one driver and multiple executors, both of which run in pods. Previously, Spark scheduled the driver and executor pods separately, which could lead to resource deadlocks: drivers were scheduled and held resources while their executors could not start. In addition, advanced functions such as queue scheduling, fair-share scheduling, and resource reservation were not available.

Now with Volcano, Spark users can easily cope with the following scheduling scenarios.
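For orientation, here is a minimal sketch of how a Spark 3.3 job is submitted with Volcano as the scheduler. It assumes Volcano is already installed in the cluster; the API server address, image, and jar path are placeholders, and the configuration option names follow the Spark 3.3 "Running on Kubernetes" documentation, so check the docs for your exact version.

```bash
# Sketch: submit a Spark 3.3 job scheduled by Volcano (placeholders in <>).
bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 1000
```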

Job-based Fair-Share Scheduling

Running elastic jobs (for example, streaming jobs) requires fair resource allocation to meet their SLA/QoS. In the worst case, a single job may take up an excessive share of pod resources, causing other jobs to underperform. To avoid allocations that are too small to be useful (for example, a single pod for a job), Volcano allows you to define the minimum number of pods that must be started for an elastic job. Any pods beyond that minimum share cluster resources with other jobs fairly.
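As a sketch of how that minimum is expressed, a Volcano Job carries a minAvailable field: the job starts once that many pods can be scheduled, and the remaining replicas compete fairly for spare capacity. The job name, replica counts, and image below are illustrative assumptions.

```yaml
# Illustrative Volcano Job: at least 2 of the 8 worker pods must be schedulable
# before the job starts; the other replicas share spare resources fairly.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: streaming-analytics        # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 2                  # minimum pods to start this elastic job
  queue: default
  tasks:
    - name: worker
      replicas: 8                  # elastic part: runs with 2 to 8 pods
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox       # placeholder image
              command: ["sh", "-c", "sleep 3600"]
              resources:
                requests:
                  cpu: "1"
```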

Queue Scheduling

Queues enable elastic and batch processing workloads to share cluster resources. Specifically, queues can:

Share resources among tenants or resource pools. For example, each department is mapped to a queue so that multiple departments can dynamically share cluster resources based on their queue weight.

Provide different scheduling policies or algorithms, such as FIFO, Fairness, and Priority, for tenants or resource pools.

Queues are implemented as cluster-scoped custom resources (defined by a CRD) and are decoupled from namespaces, so jobs created in different namespaces can be placed in one shared queue. Queues also support a min setting (the minimum resources reserved so that urgent jobs can jump the queue) and a max setting (the maximum resources the queue is allowed to use). If the max is not reached, the spare resources can be shared among queues to improve overall resource utilization.
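A minimal sketch of such a queue is shown below, with an assumed name and weight. The capability block plays the role of the max described above; the exact field for the reserved minimum depends on your Volcano version, so it is omitted here. Jobs in any namespace can then reference the queue, for example through the queue field shown in the Job sketch earlier.

```yaml
# Illustrative cluster-scoped Queue: departments share the cluster by weight,
# and capability caps the maximum resources this queue may consume.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a            # hypothetical per-department queue
spec:
  weight: 4               # relative share when the cluster is contended
  reclaimable: true       # allow resources borrowed beyond the share to be reclaimed
  capability:             # "max": upper bound for this queue
    cpu: "100"
    memory: 200Gi
```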

Namespace-based, Fair-Share, Cross-Queue Scheduling

Within a queue, jobs are scheduled fairly in each scheduling loop, which means a user with more jobs has a greater chance of having a job scheduled. For example, when a queue has limited resources, and there are 10 pods belonging to user A and 1,000 pods belonging to user B, user A's pods are much less likely to be scheduled to a node.

In this case, a more fine-grained policy is required to balance resource allocation among users. Because Kubernetes supports multi-tenancy, namespaces are used to isolate resources for users, and each namespace can be configured with a weight that determines its priority when resources are allocated.

Preemption & Reclaim

In a fair-share model, some jobs or queues may occupy resources even while they are idle; when further resource requests arrive, the owner of those resources reclaims them. Resources can be shared between queues or between jobs: the reclaim action balances resources between queues, and the preempt action balances resources between jobs.
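Both actions are switched on in the Volcano scheduler configuration. The sketch below follows the shape of the default volcano-scheduler ConfigMap, with preempt and reclaim added to the action list (backfill, discussed in a later section, is also listed); the plugin tiers are illustrative and depend on your deployment.

```yaml
# Illustrative volcano-scheduler configuration with preempt and reclaim enabled.
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, preempt, reclaim, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```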

Minimal Resource Reservation

A job such as Spark has multiple task roles: the driver pod is created and runs first, and it then asks the kube-apiserver to create the executor pods. When resources are insufficient or a massive number of jobs is submitted, driver pods can use up all available resources, leaving nothing for the executors, so no Spark job can run properly. The usual workaround is to dedicate separate nodes to driver and executor pods, which causes resource fragmentation and low utilization. Volcano supports minimal resource reservation, which ensures that executors can obtain resources and improves performance by more than 30%.
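In Spark 3.3 this is wired up through the PodGroup template passed via spark.kubernetes.scheduler.volcano.podGroupTemplateFile (see the spark-submit sketch earlier). A minimal, illustrative template that reserves resources for the whole job before the driver starts might look like the following; the figures are assumptions, not recommendations.

```yaml
# Illustrative PodGroup template: the job is admitted only when the cluster can
# satisfy minResources, so driver pods alone cannot exhaust the cluster.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  queue: default
  minMember: 1          # at least the driver must be schedulable
  minResources:         # reserve enough for driver + executors up front
    cpu: "8"
    memory: 16Gi
```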

Reservation & Backfill

When a huge job requesting a large amount of resources is submitted to Kubernetes, it may be starved or even killed under the current scheduling policy or algorithm if there are too many small jobs. To prevent this, resources should be reserved for such jobs conditionally, for example by setting a timeout period. While resources are reserved, they may sit idle, so to improve utilization the scheduler conditionally backfills smaller jobs onto those reserved resources. Both reservation and backfill are triggered according to the results returned by plug-ins, and Volcano provides APIs for developers and users to decide which jobs are reserved or backfilled.

Future Development

Volcano is adding new algorithms and optimizing its APIs to make algorithm extension and customization easier. More use cases are on the way, such as cross-cloud and cross-cluster scheduling, hybrid deployment, FinOps, intelligent elastic scheduling, and fine-grained resource management.

A detailed technical deep dive into the batch scheduling that Volcano provides in Spark 3.3 is coming soon.

Volcano is CNCF’s first cloud native batch computing project. Open-sourced at Shanghai KubeCon in June 2019, it became an official CNCF project in April 2020. In April 2022, Volcano was promoted to a CNCF incubating project.

Volcano, together with its ecosystem, has been thriving and expanding into more sectors such as AI, big data, gene sequencing, transcoding, and rendering. It has also been adopted by Tencent, iQIYI, Xiaohongshu, Mogujie, Vipshop, Peng Cheng Laboratory, and Ruitian Capital.

Spark 3.3 release notes: https://spark.apache.org/releases/spark-release-3-3-0.html

Volcano: https://volcano.sh/en/

GitHub: https://github.com/volcano-sh/volcano