Advanced Kubernetes pod to node scheduling

By Ben Hirschberg July 27, 2021

Guest post by Ben Hirschberg, VP R&D & Co-founder, ARMO

In Kubernetes, the task of scheduling pods to specific nodes in the cluster is handled by the kube-scheduler. The default behavior of this component is to filter out nodes that cannot satisfy the resource requests of the containers in the created pod. The remaining feasible nodes are then scored to find the best candidate for the pod placement.

In many scenarios, scheduling pods based on resource constraints is a desired behavior. However, in certain use cases Kubernetes administrators want to schedule pods to specific nodes according to other constraints. This is when advanced pod scheduling should be considered.

In this article, I’ll review some of the use cases for advanced pod scheduling in Kubernetes as well as best practices for implementing it in real-world situations. This will be especially helpful to app engineers and K8s administrators looking to implement advanced application deployment patterns involving data locality, pod co-location, high availability, and efficient resource utilization of their K8s clusters.

Use Cases for Manual Pod-to-Node Scheduling

In the production Kubernetes setting, customizing how pods are scheduled to nodes is a necessity. These are some of the most common scenarios when advanced pod scheduling would be desirable:

Running pods on nodes with dedicated hardware: Some Kubernetes apps may have specific hardware requirements. For example, pods running ML jobs require performant GPUs instead of CPUs, while Elasticsearch pods would be more efficient on SSDs than HDDs. Thus, the best practice for any resource-aware K8s cluster management is to assign pods to the nodes with the right hardware.
Pod colocation and codependency: In a microservices setting or a tightly coupled application stack, certain pods should be colocated on the same machine to improve performance and to avoid network latency issues and connection failures. For example, it’s a good practice to run a web server on the same machine as an in-memory cache service or database.
Data Locality: The data locality requirement of data-intensive applications is similar to the previous use case. To ensure faster reads and better write throughput, these applications may require the databases to be deployed on the same machine where the customer-facing application runs.
High Availability and Fault Tolerance: To make application deployments highly available and fault-tolerant, it’s a good practice to run pods on nodes deployed in separate availability zones.

Kubernetes Resources for Advanced Pod Scheduling

Kubernetes provides many API resources and strategies that help implement these use cases. Below, I’ll go over nodeSelector, node affinity, and inter-pod affinity concepts. I’ll also walk you through some examples and show you how to implement them in your K8s cluster.

Manual Scheduling of Pods with the nodeSelector

The simplest way to place pods on particular nodes is the nodeSelector field of the PodSpec, which has been available since early K8s versions. In essence, nodeSelector is a label-based pod-to-node scheduling method: users assign labels to nodes, and a pod is scheduled only onto nodes whose labels match every entry in its nodeSelector field.

For example, let’s assume that one of the node labels is “storage=ssd” to indicate the type of storage on the node.

kubectl describe node host01
Name: host01
Roles: node
Labels: storage=ssd,
……

To schedule pods onto the node with this label, I’ll specify the nodeSelector field with that label in the Pod manifest.

apiVersion: v1
kind: Pod
metadata:
  name: pod
  labels:
    env: dev
spec:
  containers:
  - name: your-container
    image: your-container-image
    imagePullPolicy: IfNotPresent
  nodeSelector:
    storage: ssd

Node selectors are the simplest method of advanced pod scheduling. However, they are not very useful when other rules and conditions should be considered during pod scheduling.
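As a minimal sketch of how this looks in a workload rather than a bare pod, the Deployment below puts nodeSelector in the pod template. The Deployment name, the app label, and the image are hypothetical placeholders. The storage=ssd label comes from the example above, and kubernetes.io/os is a label that the kubelet sets on nodes. When several labels are listed, a node must carry all of them to be eligible.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ssd-app                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ssd-app               # hypothetical label
  template:
    metadata:
      labels:
        app: ssd-app
    spec:
      # Both labels must be present on a node for it to be eligible.
      nodeSelector:
        storage: ssd
        kubernetes.io/os: linux
      containers:
      - name: your-container
        image: your-container-image
```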

Node Affinity

The node affinity feature is a qualitative improvement compared to the manual pod placement approach discussed above. It offers an expressive affinity language using logical operators and constraints that provides fine-grained control over pod placement. It also supports “soft” and “hard” scheduling rules that allow controlling the strictness of node affinity constraints depending on the user requirements.

The manifest below uses node affinity to place pods on nodes in specific availability zones:

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/cp-az-name
            operator: In
            values:
            - cp-1a
            - cp-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 7
        preference:
          matchExpressions:
          - key: custom-key
            operator: In
            values:
            - custom-value
  containers:
  - name: node-affinity
    image: your-container-image
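“Hard” affinity rules are specified under the requiredDuringSchedulingIgnoredDuringExecution field of the nodeAffinity section of the pod manifest. In this example, I told the scheduler to place the pod only on nodes that have a label with the key kubernetes.io/cp-az-name and the value cp-1a or cp-1b. The IgnoredDuringExecution part of the name means the rule is checked only at scheduling time: if a node’s labels change later, pods already running on it are not evicted.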

To achieve this, I used the In operator, which matches when the node’s label value is one of the listed values. Other operators I could use include NotIn, Exists, DoesNotExist, Gt, and Lt.
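The sketch below shows a few of the other operators in one rule. The label keys storage, gpu, and cpu-cores and the pod name are hypothetical and would have to exist on your nodes. Expressions listed under the same matchExpressions entry must all match, while separate nodeSelectorTerms entries are alternatives. Gt and Lt take a single value that is interpreted as an integer.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-affinity-operators    # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Reject nodes labeled storage=hdd.
          - key: storage
            operator: NotIn
            values:
            - hdd
          # Require the gpu label to be present, whatever its value.
          - key: gpu
            operator: Exists
          # Require the cpu-cores label value to be greater than 8.
          - key: cpu-cores
            operator: Gt
            values:
            - "8"
  containers:
  - name: your-container
    image: your-container-image
```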

The “soft” rule is specified under the preferredDuringSchedulingIgnoredDuringExecution field of the spec. In this example, it states that among the nodes that meet the “hard” criteria, I prefer nodes that have a label with the key custom-key and the value custom-value. However, if no such nodes exist, I don’t object to scheduling pods to other candidates as long as they meet the “hard” criteria.

It’s a good practice to construct node affinity rules in a way that incorporates both “hard” and “soft” rules. Following this “best-effort” approach (use a preferred option if it is available, but do not reject scheduling if it is not) makes deployment scheduling more flexible and predictable.
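Several “soft” rules can be ranked against each other with the weight field, which accepts values from 1 to 100. For each feasible node, the scheduler adds up the weights of the preferences that the node satisfies and factors that sum into the node’s score. The following is a sketch only: the pod name and the label keys storage and network are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: weighted-preferences       # hypothetical name
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # Strong preference: nodes with SSD storage.
      - weight: 80
        preference:
          matchExpressions:
          - key: storage
            operator: In
            values:
            - ssd
      # Weaker preference: nodes on the fast network.
      - weight: 20
        preference:
          matchExpressions:
          - key: network
            operator: In
            values:
            - fast
  containers:
  - name: your-container
    image: your-container-image
```

A node that satisfies both preferences scores higher than a node that satisfies only one, but a node that satisfies neither remains a valid candidate.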

Inter-Pod Affinity

Inter-pod affinity in Kubernetes is a feature that allows you to schedule pods based on their relationship to other pods. This feature enables various interesting use cases, such as colocation of pods that are part of the codependent service(s) or the implementation of data locality where data pods run on the same machine as the main service pod.

Inter-pod affinity is defined similarly to node affinity. In this case, however, I’ll use the podAffinity field of the pod spec.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
  containers:
  - name: pod-affinity
    image: your-container

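Similar to node affinity, pod affinity supports match expressions and logical operators. In this case, however, they are applied to the labels of pods that are already running rather than to node labels. If the expression matches the labels of a target pod, the new pod is placed in the same topology domain as that pod. The domain is defined by topologyKey: with a zone key, as in this example, the new pod lands in the same zone as the target pod, while a topologyKey of kubernetes.io/hostname colocates the two pods on the same machine.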
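Pod affinity also has a “soft” form. As a sketch of the web server and cache use case mentioned earlier, the pod below prefers to run on the same node as pods labeled app=cache but can still be scheduled elsewhere. The pod name, the app=cache label, and the image are hypothetical. Note that the preferred form wraps the selector in a podAffinityTerm and adds a weight.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-near-cache             # hypothetical name
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - cache
          # Same node as the matching cache pod.
          topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: your-container-image
```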

Anti-Affinity

In certain cases, it’s preferable to adopt the “blacklist” approach to pod scheduling. In this approach, pods are kept off particular nodes, either because the node repels pods that lack a matching toleration or because the node already runs pods they should not share a machine with. Kubernetes implements this through pod-to-node anti-affinity and inter-pod anti-affinity.

The main use for pod-to-node anti-affinity is with dedicated nodes. To control resource utilization in the cluster, K8s administrators may allocate certain nodes to specific pod types and/or applications.

Interesting use cases for inter-pod anti-affinity include:

Avoiding a single point of failure: This can be achieved by spreading pods of the same service across different machines, which requires preventing pods from being colocated with other pods of the same type.
Preventing inter-service competition for resources: To improve the performance of certain services, avoid placing them with other services that consume a lot of resources.

Pod-to-node anti-affinity can be achieved with taints and tolerations in Kubernetes. Let’s take a closer look at this feature.

Taints and Tolerations

Taints and tolerations can help you control the scheduling of pods to specific nodes without modifying existing pods. A taint is set on a node and repels pods, while a toleration is set on a pod and allows it to be scheduled onto nodes with a matching taint.

By default, pods that don’t have a toleration for a node’s taint are either rejected by the node or evicted from it, depending on the taint effect. This behavior allows for flexible cluster and application deployment patterns, where you don’t need to change the pod spec if you don’t want a pod to run on specific nodes.

Implementing taints and tolerations is quite simple. First, add a taint to a node that needs to apply some non-standard scheduling behavior. For example:

kubectl taint nodes host2 storage=ssd:NoSchedule
node "host2" tainted

The taint format is <taintKey>=<taintValue>:<taintEffect>. The taint effect I’ve used here, NoSchedule, prevents any pod without the matching toleration from being scheduled to this node.

Other supported taint effects include NoExecute and PreferNoSchedule (the “soft” version of NoSchedule). If the PreferNoSchedule taint effect is applied, the kube-scheduler will try not to place the pod without the required toleration onto the tainted node, but this is not required.

Finally, the NoExecute effect not only blocks new pods that lack a matching toleration but also evicts such pods if they are already running on the node. This effect may be useful if you already have pods running on the node and you no longer want them there.
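A pod can also tolerate a NoExecute taint for a limited time by setting tolerationSeconds on its toleration. The sketch below assumes a hypothetical taint maintenance=true:NoExecute; the pod name and image are placeholders as well. After the taint is added to the node, this pod stays bound for 300 seconds and is then evicted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-eviction          # hypothetical name
spec:
  containers:
  - name: your-container
    image: your-container-image
  tolerations:
  - key: "maintenance"             # hypothetical taint key
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    # Keep running for 5 minutes after the taint appears, then evict.
    tolerationSeconds: 300
```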

Creating a taint is only the first part of the configuration. To allow pods to be scheduled on a tainted node, we need to add the toleration:

apiVersion: v1
kind: Pod
metadata:
  name: esearch
spec:
  containers:
  - name: esearch
    image: your-es-container
    resources:
      requests:
        cpu: 0.8
        memory: 4Gi
      limits:
        cpu: 3.0
        memory: 22Gi
  tolerations:
  - key: "storage"
    operator: "Equal"
    value: "ssd"
    effect: "NoSchedule"

In this example, I added the toleration for the above taint using the “Equal” operator, which requires the key, value, and effect to match the taint. I could also use the “Exists” operator, which tolerates any taint with the matching key regardless of its value. In that case, the value field is left out.
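For comparison, a toleration written with the Exists operator could look like the sketch below. It reuses the storage key from the taint above; the pod name and image are hypothetical. Because no value is given, the pod tolerates storage=ssd:NoSchedule as well as any other NoSchedule taint whose key is storage.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: esearch-exists             # hypothetical name
spec:
  containers:
  - name: esearch
    image: your-es-container
  tolerations:
  # No value field: any taint with key "storage" and this effect is tolerated.
  - key: "storage"
    operator: "Exists"
    effect: "NoSchedule"
```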

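With this toleration in place, the pod defined above is allowed onto the node tainted with storage=ssd:NoSchedule. Keep in mind that a toleration only permits placement on the tainted node. It does not force the pod to go there, so the scheduler may still pick another node that has no taints.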
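To dedicate a node to a workload, a common pattern is to pair the toleration with a rule that attracts the pod to the same node. The sketch below assumes that the tainted node also carries the label storage=ssd, which is an assumption about your cluster rather than something the taint provides. The pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: esearch-dedicated          # hypothetical name
spec:
  containers:
  - name: esearch
    image: your-es-container
  # The toleration lets the pod onto the tainted node...
  tolerations:
  - key: "storage"
    operator: "Equal"
    value: "ssd"
    effect: "NoSchedule"
  # ...and the nodeSelector keeps it off every other node.
  nodeSelector:
    storage: ssd
```

The taint keeps other pods away from the node, and the selector keeps this pod from drifting to untainted nodes.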

Pod Anti-Affinity

It’s possible to repel pods from each other via the pod anti-affinity feature. As mentioned above, one of the best practices in Kubernetes is to avoid a single point of failure by spreading pods across different availability zones. I can configure similar behavior in the anti-affinity part of the pod spec. For pod anti-affinity we’ll need two pods:

The first pod:

apiVersion: v1
kind: Pod
metadata:
  name: s1
  labels:
    security: s1
spec:
  containers:
  - name: c1
    image: first-image

Note that the first pod has the label “security: s1.” The second pod:

apiVersion: v1
kind: Pod
metadata:
  name: s2
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-anti-affinity
    image: second-image

The second pod refers to the label selector “security:s1” under the spec.affinity.podAntiAffinity. As a consequence, this pod won’t be scheduled to a node that already hosts any pods with the label “security:s1”.
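The same mechanism can spread the replicas of a single service across nodes, which addresses the single point of failure use case. In the sketch below, the Deployment name, the app=web label, and the image are hypothetical. Each replica carries the label app=web and refuses to share a node with any other pod that has that label.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            # At most one app=web pod per node.
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: your-container-image
```

Because the rule is “hard,” replicas beyond the number of eligible nodes stay Pending. Switching to preferredDuringSchedulingIgnoredDuringExecution relaxes this at the cost of possible colocation.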

Conclusion

Advanced pod scheduling in Kubernetes allows for the implementation of many interesting use cases and best practices for deploying complex applications and microservices on Kubernetes. With pod affinity, you can implement pod colocation and data locality for tightly coupled application stacks and microservices.

Below, you can find a “cheat sheet” summarizing some key data for each resource type.

Resource Type	Use Cases	Pros	Cons	Best Practices
nodeSelector	Assigning pods to nodes with specific labels	Easy to use, small changes to the PodSpec	Does not support logical operators, hard to extend with complex scheduling rules	Use for simple label matching; switch to node affinity when rules need logical operators or “soft” preferences
Node affinity	Implementing data locality, running pods on nodes with dedicated hardware	Expressive syntax with logical operators, fine-grained control over pod placement rules, support for “hard” and “soft” pod placement rules	Requires modification of existing pods to change behavior	Use a combination of “hard” and “soft” rules to cover different use cases and scenarios
Inter-pod affinity	Colocation of pods in the co-dependent service, enabling data locality	The same as for node affinity	Requires modification of existing pods to change behavior	Proper pod label management and documentation of labels used
Pod anti-affinity	Enabling high availability (via pod distribution), preventing inter-service competition for resources	Fine-grained control over inter-pod repel behavior, support for “hard” and “soft” pod anti-affinity rules	Requires modification of existing pods to change behavior	Similar to node affinity
Taints and tolerations	Nodes with dedicated hardware, separation of team resources, etc.	Does not require modification of existing pods, supports automatic eviction of pods without required toleration, supports different taint effects	Does not support expressive syntax using logical operators	Be careful when applying multiple taints to the node, ensure that the pods you need have required tolerations

Table 1: Overview of Kubernetes resources for advanced pod scheduling

Using node anti-affinity and taints, you’re able to run nodes with hardware dedicated to specific applications and services, achieving efficient resource utilization in the cluster. With pod anti-affinity and node anti-affinity, you can also secure high availability of applications and avoid a single point of failure by making different components run on different nodes.

Affinity and anti-affinity are two powerful concepts in Kubernetes that every cluster administrator should master. As you get started, this article can provide guidance on the usage of these concepts in Kubernetes.
