Guest post by Anne Holler, Chi Su, Travis Addair, Henry Prêcheur, Paweł Bojanowski, Madhuri Yechuri, and Richard Liaw

INTRODUCTION

Deep Learning (DL) has been successfully applied to many fields, including computer vision, natural language, business, and science. The open-source platforms Ray and Ludwig make DL accessible to diverse users, by reducing the complexity barriers to training, scaling, deploying, and serving DL models.

However, DL’s cost and operational overhead present significant challenges. The DL model dev/test/tuning cycle requires intermittent use of substantial GPU resources, which cloud vendors are well-positioned to provide, though at non-trivial prices. Given the expense, managing GPU resources judiciously is critical to the practical use of DL.

Nodeless Kubernetes commoditizes compute for Kubernetes (K8s) clusters.  It provisions just-in-time, right-sized, cost-effective compute for a Kubernetes application when the application starts, and terminates the compute when the application terminates.  There are no autoscaling knobs to configure or maintain and no compute shape decisions (e.g., on-demand/spot/CaaS) to make.

This blog describes running Ray and Ludwig on cloud Kubernetes clusters, using Nodeless K8s as a smart cluster provisioner to add right-sized GPU resources to the K8s cluster when they are needed and to remove them when they are not.  Experiments comparing the cost and operational overhead of Nodeless K8s versus fixed-size Ray clusters running directly on EC2 show sizable improvements in efficiency and usability: elapsed time reduced by 61%, computing cost by 54%, and idle Ray cluster cost by 66%, while retaining the performance quality of the AutoML results and reducing operational complexity.

CLOUD RESOURCE MANAGEMENT EXPERIMENTS

OVERVIEW

Ludwig v0.4.1 was recently released, introducing its AutoML capability [Ludwig AutoML blog].  The functionality was developed by meta-learning from the results of thousands of hours of model training across 12 datasets.  We previously reported [Cloud Native Rejekts 2021] on an initial proof-of-concept applying Nodeless K8s resource management to this heuristic development workload.

In this blog, we describe applying Nodeless K8s resource management to the workload of validating Ludwig’s AutoML capability across an additional three datasets.  Ludwig AutoML utilizes the Ray Tune distributed execution platform to perform data-informed hyperparameter search on GPU-enabled workers.  The validation datasets were run using AutoML with 1-hour, 2-hour, and 4-hour Ray Tune time budgets, and the resulting model accuracy was compared to that of high-performing manually-tuned models.

CONFIGURATIONS AND RESULTS

The baseline configuration we used for Ludwig v0.4.1 AutoML validation was a fixed-size Ray cluster deployed directly on AWS GPU-enabled virtual machines, as shown in Figure 1.

Figure 1: Fixed-sized Ray Cluster deployed on AWS GPU VMs

We compared the validation workload running on this configuration to running on two alternative configurations that use Nodeless K8s resource management on AWS EKS.  The first was a Nodeless K8s cluster with a GPU-enabled Ray head and 0-8 GPU-enabled Ray workers, as shown in Figure 2.

Figure 2: Variable-sized Ray Cluster deployed on K8s with Nodeless, GPU-enabled Ray Head

The second was a Nodeless K8s cluster, with a CPU-only Ray head and 0-9 GPU-enabled Ray workers, as shown in Figure 3.

Figure 3: Variable-sized Ray Cluster deployed on K8s with Nodeless, CPU-only Ray Head

In the following three sections, we discuss the reasons for these configuration choices and compare the configurations running the AutoML validation workload in terms of elapsed time, workload and idle computing cost, operational complexity, and AutoML results quality.

BASELINE

Configuration

Our baseline runs to evaluate AutoML effectiveness at specified time budgets were performed on a fixed-size three-node Ray cluster, with the head and two workers all on g4dn.4xlarge GPU instances.  Our reasons for this configuration choice are as follows:

With this setup, we ran the three 1-hour runs serially (MushroomEdibility, ForestCover, Higgs), then the three 2-hour runs serially (same order), and then the three 4-hour runs serially (same).

AutoML Results

The AutoML model accuracy performance results run on the baseline configuration are shown in Table 1.  The results are competitive with manually-tuned reference models.

| Dataset | Task | Rows | Cols | Reference Score | AutoML Score, 1hr, fixed-size | AutoML Score, 2hr, fixed-size | AutoML Score, 4hr, fixed-size |
|---|---|---|---|---|---|---|---|
| Higgs | bclass | 11,000,000 | 29 | 0.788 | 0.756 | 0.760 | 0.762 |
| ForestCover | mclass | 580,000 | 13 | 0.970 | 0.951 | 0.954 | 0.956 |
| MushroomEdibility | mclass | 8,124 | 23 | 1.000 | 1.000 | 1.000 | 1.000 |
Table 1: AutoML validation runs on baseline configuration

Observations

The per-hour cost of g4dn.4xlarge instances is $1.204.  Our serial baseline took 22.6 hours, which includes the 21 hours (3x1hr + 3x2hr + 3x4hr) budgeted for hyperparameter search, plus an additional 1.6 hours used for dataset load before each search starts and for evaluation of the best model produced per trial after each search ends.  Hence, the overall cost for the baseline workload was $81.631.  The idle cost for this Ray cluster is $3.612/hr.
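As a sanity check, this arithmetic can be reproduced directly from the rates and hours stated above:

```python
# Baseline cost check: 3 x g4dn.4xlarge at $1.204/hr, run serially.
rate_per_hour = 1.204                   # g4dn.4xlarge on-demand $/hr
nodes = 3                               # head + 2 workers, all GPU instances
budget_hours = 3 * 1 + 3 * 2 + 3 * 4    # 21 hours of hyperparameter search
overhead_hours = 1.6                    # dataset load + best-model evaluation
elapsed_hours = budget_hours + overhead_hours    # 22.6 hours
idle_cost_per_hour = nodes * rate_per_hour       # $3.612/hr
total_cost = elapsed_hours * idle_cost_per_hour  # ~ $81.631
```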

We sought ways to reduce the workload elapsed time and cost, and also the cluster idle cost, while reducing operational cost as well.

With respect to reducing the workload elapsed time and the cluster idle cost, we wanted to apply auto-scaling with minimum workers set to 0 and maximum workers set to the number needed by simultaneous runs.  To avoid introducing operational complexity, we also wanted an auto-scaling solution flexible enough to find right-sized available workers automatically, which the combination of the Ray Autoscaler and Nodeless K8s provides.  When the Ray cluster needs more workers, the Ray Autoscaler requests pod placement for scale-out, and Nodeless adds appropriate nodes to the K8s cluster in response; when the Ray cluster needs fewer workers, the Ray Autoscaler requests pod removal for scale-in, and Nodeless removes the associated nodes from the K8s cluster.
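The handshake can be summarized with a minimal sketch (the function name is ours for illustration, not the Nodeless or Ray Autoscaler API):

```python
# Illustrative sketch of the scale-out/scale-in handshake: the Ray
# Autoscaler surfaces worker-pod demand, and Nodeless responds with
# node additions or removals, one right-sized node per worker pod.

def nodeless_node_delta(pending_worker_pods: int, idle_nodes: int) -> int:
    """Node-count change Nodeless would apply: positive = add nodes
    for unschedulable pods, negative = remove nodes whose pods are gone."""
    if pending_worker_pods > 0:
        return pending_worker_pods   # scale out
    return -idle_nodes               # scale in
```

For example, when the Ray Autoscaler places three worker pods at the start of a search, the delta is +3 nodes; when it removes two workers after trials terminate, the delta is -2.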

With respect to reducing the workload running cost, one obvious opportunity is that Ray workers are not needed during the 1.6 hours comprising pre-search dataset load and post-search per-trial best-model evaluation.  A detailed analysis of resource use during the search itself exposed additional opportunities that the Ray Autoscaler plus Nodeless K8s combination can exploit.  Figure 4 displays the lifetimes of the baseline workload trials, showing that there can be significant periods during the Ray Tune search in which not all three workers are needed.  The reason is that Ludwig AutoML specifies that Ray Tune use the async hyperband scheduler, which terminates unpromising trials to avoid wasting resources on them.  Depending on the dataset, many trials are short-lived, and once two or fewer of the maximum 10 trials remain to be explored, only a subset of the three workers is needed.  Figure 5 displays how the three workers were used by the baseline trials over time, showing the idle periods.
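A toy simulation (not Ray Tune itself; trial counts and durations are illustrative) makes the idle-worker pattern concrete: with mostly short-lived trials and a few long promising ones, worker demand falls to two and then one well before the search ends.

```python
import heapq

def schedule(durations, num_workers):
    """Greedy FIFO assignment of trials to workers; returns (start, end) per trial."""
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    intervals = []
    for d in durations:
        start = heapq.heappop(free_at)          # earliest-free worker
        intervals.append((start, start + d))
        heapq.heappush(free_at, start + d)
    return intervals

def busy_workers(intervals, t):
    """Number of workers running a trial at time t."""
    return sum(1 for s, e in intervals if s <= t < e)

# 8 trials stopped early by the scheduler (5 min) plus 2 promising long trials (60 min)
durations = [5] * 8 + [60, 60]
intervals = schedule(durations, num_workers=3)
makespan = max(e for _, e in intervals)
# All 3 workers are busy only early on; later only 2, then 1, remain busy.
```

In the fixed-size baseline those under-used periods are paid for anyway; with autoscaling they translate into node removals.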

In the next two sections, we present the results of applying the Ray Autoscaler with Nodeless K8s to this workload.

NODELESS K8s, GPU Ray Cluster

Configuration

With the goal of reducing the elapsed time and computing cost for the Ludwig AutoML validation workload, while also reducing operational complexity, we next ran the workload on a Nodeless K8s cluster with the Ray Autoscaler enabled.

With this setup, we ran the three 1-hour runs in parallel, then the three 2-hour runs in parallel, and then the three 4-hour runs in parallel.  Hence, in the fully-busy steady state, the Ray cluster was running with 9 nodes (the head and 8 workers).

AutoML Results

The model accuracy performance results, shown in Table 2, are comparable (within 2% noise) to the fixed-size cluster.

| Dataset | Task | AutoML Score, 1hr, fixed-size | 2hr, fixed-size | 4hr, fixed-size | 1hr, nodeless | 2hr, nodeless | 4hr, nodeless |
|---|---|---|---|---|---|---|---|
| Higgs | bclass | 0.756 | 0.760 | 0.762 | 0.756 | 0.760 | 0.768 |
| ForestCover | mclass | 0.951 | 0.954 | 0.956 | 0.947 | 0.957 | 0.955 |
| MushroomEdibility | mclass | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Table 2: AutoML validation runs on Nodeless K8s GPU Ray Cluster w/GPU-enabled Ray Head

Observations

Comparing this run to the baseline, we observe the following:

NODELESS K8s, GPU Ray Cluster w/CPU head

Configuration

While the Nodeless run described above reduced the computing costs and operational overhead of obtaining and freeing right-sized GPU nodes, the cost of the idle GPU resources on the Ray head remained after the workload ended.  So did the operational overhead of manually spinning down the Ray cluster to avoid that GPU cost and redeploying the cluster when it was needed again; it would be operationally simpler to leave the Ray cluster head up.

To remove this remaining idle GPU cost and the operational overhead of avoiding it, we reran the previous workload on a configuration in which the Ray head was deployed on a CPU-only node, with 0-9 workers deployed on GPU-enabled nodes.  We note that in this deployment the AutoML jobs must explicitly request GPU resources (a minor change in the AutoML job parameters), because by default Ludwig AutoML requests GPU resources only if it observes them on the Ray head.  We also note that maxWorkers needed to be set to 9 (not 8), since in this case the Ray head could not run any of the trials.
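The worker-count arithmetic behind maxWorkers is simple; here is an illustrative helper (ours, not a Ludwig or Ray API), under the assumption that each of the three parallel AutoML runs occupies three GPU slots, as in the baseline:

```python
# Why maxWorkers is 8 with a GPU-enabled head but 9 with a CPU-only head:
# three parallel runs need nine GPU slots in total; a GPU-enabled head
# can host one of those slots itself, while a CPU-only head cannot.

def max_workers(parallel_runs: int, gpu_slots_per_run: int, head_has_gpu: bool) -> int:
    slots = parallel_runs * gpu_slots_per_run
    return slots - 1 if head_has_gpu else slots
```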

AutoML Results

The model accuracy performance results, shown in Table 3, were again comparable within noise to those of the fixed-size baseline.

| Dataset | Task | AutoML Score, 1hr, fixed-size | 2hr, fixed-size | 4hr, fixed-size | 1hr, nodeless, CPU head | 2hr, nodeless, CPU head | 4hr, nodeless, CPU head |
|---|---|---|---|---|---|---|---|
| Higgs | bclass | 0.756 | 0.760 | 0.762 | 0.756 | 0.760 | 0.766 |
| ForestCover | mclass | 0.951 | 0.954 | 0.956 | 0.950 | 0.956 | 0.954 |
| MushroomEdibility | mclass | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
Table 3: AutoML validation runs on Nodeless K8s GPU Ray Cluster w/CPU-only Ray Head

Observations

Observations for this run include:

LESSONS LEARNED

The Ray/Ludwig AutoML DL training experiments showed the value of using the Ray Autoscaler with Nodeless K8s.  Table 4 lists key metrics from our experiments.

| Configuration | Workload Elapsed Hours | Workload Total Cost | Idle Ray Cluster Cost per Hr |
|---|---|---|---|
| Baseline Fixed-size GPU Ray Cluster | 22.60 | $81.631 | $3.612 |
| Nodeless K8s GPU Ray Cluster w/GPU-enabled Ray Head | 8.75 | $37.833 | $1.204 |
| Nodeless K8s GPU Ray Cluster w/CPU-only Ray Head | 9.00 | $37.840 | $0.452 |
Table 4: Key Metrics for AutoML Validation Workload Experiments

Use of Nodeless K8s on a GPU Ray cluster with a GPU-enabled Ray head reduced elapsed time by 61%, computing cost by 54%, and idle Ray cluster cost by 66%, while retaining the performance quality of the AutoML results.  Operational complexity was also reduced: manual worker choice and static deployment were replaced by automated selection of less expensive available workers and dynamic scaling coordinated with the Ray Autoscaler.

With one further minor configuration update, using a Ray cluster with a CPU-only head and GPU-enabled workers, the idle cluster cost was reduced by 87% versus the baseline.  This idle cost reduction enabled lower operational complexity by replacing manual undeploy/redeploy of the Ray cluster, previously done to avoid GPU expense, with the convenience of keeping the Ray cluster always on.
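These headline percentages can be recomputed from the raw numbers in Table 4:

```python
# Reductions versus the fixed-size baseline, from Table 4's raw figures.
baseline_hours, baseline_cost, baseline_idle = 22.60, 81.631, 3.612

def reduction_pct(before, after):
    return 100.0 * (before - after) / before

elapsed_red = reduction_pct(baseline_hours, 8.75)         # ~61%
cost_red = reduction_pct(baseline_cost, 37.833)           # ~54%
idle_red_gpu_head = reduction_pct(baseline_idle, 1.204)   # ~66.7%
idle_red_cpu_head = reduction_pct(baseline_idle, 0.452)   # ~87%
```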

These substantial savings in workload elapsed time, execution costs, idle costs, and operational complexity are expected to apply to a variety of DL training use cases.

SUMMARY AND FUTURE WORK

In this blog, we’ve shown how Nodeless K8s resource management can significantly reduce the elapsed time, computing costs, and operational complexity of Deep Learning training using Ludwig and Ray in the Public Cloud relative to manually-deployed fixed-size clusters.  In future work, we plan to focus on demonstrating how Nodeless K8s can also improve these attributes for DL serving, which can have stricter latency requirements and shorter burst periods.

We also plan to share a general comparison of Nodeless K8s resource management with other Kubernetes cluster autoscalers.

If you would like to learn more about Nodeless Kubernetes and how it could help reduce operational complexity and wasted spend of your Kubernetes clusters along with improving multi-tenant security, please contact Madhuri at madhuri@elotl.co for a free trial.

If you would like to learn more about Ludwig and how to bring the benefits of low-code, declarative machine learning to your organization, please contact Travis at travis@predibase.com.

If you would like to learn more about Ray, please check out https://www.ray.io/.

Figure 4: Workload Trial Lifetimes, Baseline Configuration
Figure 5: Ray Workers Usage Per Trial, Baseline Configuration
Figure 6: Workload Trial Lifetimes, Nodeless K8s Cluster with GPU-enabled Ray Head
Figure 7: Ray Worker Usage Per Trial, Nodeless Kubernetes Cluster with GPU-enabled Ray Head
Figure 8: Ray Worker Usage Across Trials, Nodeless Kubernetes Cluster with GPU-enabled Ray Head
Figure 9: Ray Autoscaler Node Lifetimes, Nodeless Kubernetes Cluster with GPU-enabled Ray Head
Figure 10: Ray Worker usage across trials, Nodeless Kubernetes Cluster with CPU-only Ray Head
Figure 11: Ray Autoscaler Node Lifetimes, Nodeless Kubernetes Cluster with CPU-only Ray Head