Guest post by Ben Hirschberg, VP R&D & Co-founder, ARMO

In Kubernetes, the task of scheduling pods to specific nodes in the cluster is handled by the kube-scheduler. The default behavior of this component is to filter nodes based on the resource requests and limits of each container in the created pod. Feasible nodes are then scored to find the best candidate for the pod placement.

In many scenarios, scheduling pods based on resource constraints is a desired behavior. However, in certain use cases Kubernetes administrators want to schedule pods to specific nodes according to other constraints. This is when advanced pod scheduling should be considered. 

In this article, I’ll review some of the use cases for advanced pod scheduling in Kubernetes as well as best practices for implementing it in real-world situations. This will be especially helpful to app engineers and K8s administrators looking to implement advanced application deployment patterns involving data locality, pod co-location, high availability, and efficient resource utilization of their K8s clusters.

Use Cases for Manual Pod-to-Node Scheduling

In a production Kubernetes setting, customizing how pods are scheduled to nodes is often a necessity. Common scenarios where advanced pod scheduling is desirable include running pods on nodes with dedicated hardware or software, colocating the pods of co-dependent services, implementing data locality, spreading replicas across nodes and availability zones for high availability, and reserving nodes for specific teams or applications.

Kubernetes Resources for Advanced Pod Scheduling 

Kubernetes provides many API resources and strategies that help implement these use cases. Below, I’ll go over the nodeSelector field, node affinity, inter-pod affinity and anti-affinity, and taints and tolerations. I’ll also walk you through some examples and show you how to implement them in your K8s cluster.

Manual Scheduling of Pods with the nodeSelector

Since the early Kubernetes versions, users have been able to implement manual pod scheduling using the nodeSelector field of the PodSpec. In essence, nodeSelector is a label-based pod-to-node scheduling method: users assign labels to nodes and set the nodeSelector field of the pod to match those labels.

For example, let’s assume that one of the node labels is “storage=ssd” to indicate the type of storage on the node.

kubectl describe node host01
Name:               host01
Roles:              node
Labels:             storage=ssd
                    ……
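
If the node doesn’t carry the label yet, it can be added with kubectl:

kubectl label nodes host01 storage=ssd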

To schedule pods onto the node with this label, I’ll specify the nodeSelector field with that label in the Pod manifest.

apiVersion: v1
kind: Pod
metadata:
  name: pod
  labels:
    env: dev
spec:
  containers:
  - name: your-container
    image: your-container-image
    imagePullPolicy: IfNotPresent
  nodeSelector:
    storage: ssd
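
Assuming the manifest is saved as pod.yaml, you can apply it and confirm the placement by checking the NODE column:

kubectl apply -f pod.yaml
kubectl get pod pod -o wide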

Node selectors are the simplest method of advanced pod scheduling. However, they are not very useful when other rules and conditions should be considered during pod scheduling.

Node Affinity 

The node affinity feature is a qualitative improvement compared to the manual pod placement approach discussed above. It offers an expressive affinity language using logical operators and constraints that provides fine-grained control over pod placement. It also supports “soft” and “hard” scheduling rules that allow controlling the strictness of node affinity constraints depending on the user requirements. 

In the example below, we use node affinity to place pods on nodes in specific availability zones. Let’s look at the manifest:

apiVersion: v1
kind: Pod
metadata:
  name: node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/cp-az-name
            operator: In
            values:
            - cp-1a
            - cp-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 7
        preference:
          matchExpressions:
          - key: custom-key
            operator: In
            values:
            - custom-value
  containers:
  - name: node-affinity
    image: your-container-image

“Hard” affinity rules are specified under the requiredDuringSchedulingIgnoredDuringExecution field of the nodeAffinity section of the pod manifest. In this example, I told the scheduler to place the pod only on nodes with a label that has the key kubernetes.io/cp-az-name and the value cp-1a or cp-1b.

To achieve this, I used the In logical operator, which matches nodes whose label value appears in the supplied list. Other operators I could use include NotIn, Exists, DoesNotExist, Gt, and Lt.
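
For instance, swapping the In expression for Exists would match any node that carries the storage label, regardless of its value (with Exists, the values list is omitted):

matchExpressions:
- key: storage
  operator: Exists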

The “soft” rule is specified under the preferredDuringSchedulingIgnoredDuringExecution field of the spec. In this example, it states that among the nodes that meet the “hard” criteria, I prefer nodes with a label that has the key custom-key and the value custom-value; the weight (an integer between 1 and 100) determines how much this preference contributes to a node’s score. However, if no such nodes exist, I don’t object to scheduling pods onto other candidates as long as they meet the “hard” criteria.

It’s a good practice to construct node affinity rules that combine both “hard” and “soft” rules. Following this “best-effort” approach (use a preferred option if possible, but don’t reject scheduling if it isn’t available) makes deployment scheduling more flexible and predictable.

Inter-Pod Affinity

Inter-pod affinity in Kubernetes is a feature that allows you to schedule pods based on their relationship to other pods. It enables various interesting use cases, such as colocating pods that belong to co-dependent services, or implementing data locality, where data pods run on the same machine as the main service pod.

Inter-pod affinity is defined similarly to node affinity. In this case, however, I’ll use the podAffinity field of the pod spec.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: pod-affinity
    image: your-container

Similar to node affinity, pod affinity supports match expressions and logical operators. In this case, however, they are applied to the labels of pods already running in the cluster. If the specified expression matches the labels of an existing pod, the new pod is colocated with that pod in the same topology domain, as defined by the topologyKey (in this example, the same availability zone).
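
For the affinity rule above to be satisfiable, a pod carrying the security=S1 label must already be running in the target zone. A minimal sketch of such a pod (the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: target-pod
  labels:
    security: S1
spec:
  containers:
  - name: target
    image: target-image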

Anti-Affinity

In certain cases, it’s preferable to adopt the “blacklist” approach to pod scheduling: pods are prevented from being scheduled onto particular nodes unless certain conditions are met. In Kubernetes, this approach is implemented through pod-to-node anti-affinity (taints and tolerations) and inter-pod anti-affinity.

The main usage for pod-to-node anti-affinity is with dedicated nodes. To control resource utilization in the cluster, K8s administrators may allocate certain nodes to specific pod types and/or applications. 

Other interesting use cases for inter-pod anti-affinity include enabling high availability by spreading the replicas of a service across nodes and availability zones, and preventing services from competing with each other for resources on the same node.

Pod-to-node anti-affinity can be achieved with taints and tolerations in Kubernetes. Let’s take a closer look at this feature.

Taints and Tolerations

Taints (conditions) and tolerations can help you control the scheduling of pods to specific nodes without modifying existing pods.

By default, all pods that don’t have tolerations for the node taint will be rejected or evicted from the node. This behavior allows for flexible cluster and application deployment patterns, where you don’t need to change the pod spec if you don’t want a pod to run on specific nodes.

Implementing taints and tolerations is quite simple. First, add a taint to a node that needs to apply some non-standard scheduling behavior. For example: 

kubectl taint nodes host2 storage=ssd:NoSchedule
node "host2" tainted

The taint format is <taintKey>=<taintValue>:<taintEffect>. The NoSchedule effect I’ve used here prevents any pod without a matching toleration from being scheduled to this node.
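
If you need to undo this later, kubectl removes a taint when you append a minus sign to the taint specification:

kubectl taint nodes host2 storage=ssd:NoSchedule-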

Other supported taint effects include NoExecute and PreferNoSchedule (the “soft” version of NoSchedule). If the PreferNoSchedule taint effect is applied, the kube-scheduler will try not to place a pod without the required toleration onto the tainted node, but this is not guaranteed.
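
Applying the “soft” variant to the same node would look like this:

kubectl taint nodes host2 storage=ssd:PreferNoSchedule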

Finally, the NoExecute effect causes the immediate eviction of any pods on the node that don’t tolerate the taint. This effect is useful if pods are already running on the node and you want them evicted from it.
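
For illustration, a NoExecute taint could be applied the same way (the maintenance key here is just a made-up example), and tolerating pods can optionally limit how long they stay on the node after the taint appears by setting tolerationSeconds:

kubectl taint nodes host2 maintenance=true:NoExecute

tolerations:
- key: "maintenance"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  tolerationSeconds: 3600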

Creating a taint is only the first part of the configuration. To allow pods to be scheduled on a tainted node, we need to add the toleration:

apiVersion: v1
kind: Pod
metadata:
  name: esearch
spec:
  containers:
  - name: esearch
    image: your-es-container
    resources:
      requests:
        cpu: 0.8
        memory: 4Gi
      limits:
        cpu: 3.0
        memory: 22Gi
  tolerations:
  - key: "storage"
    operator: "Equal"
    value: "ssd"
    effect: "NoSchedule"

In this example, I added the toleration for the above taint using the "Equal" operator. I could also use the "Exists" operator, which tolerates any taint with a matching key; in that case, no value needs to be specified.

With this toleration in place, the "storage=ssd:NoSchedule" taint no longer prevents the pod defined above from being scheduled onto the node.
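
For reference, the same toleration written with the Exists operator simply omits the value:

tolerations:
- key: "storage"
  operator: "Exists"
  effect: "NoSchedule"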

Pod Anti-Affinity

It’s possible to repel pods from each other via the pod anti-affinity feature. As mentioned above, one of the best practices in Kubernetes is to avoid a single point of failure by spreading pods across different availability zones. I can configure similar behavior in the anti-affinity part of the pod spec. For pod anti-affinity we’ll need two pods:

The first pod:

apiVersion: v1
kind: Pod
metadata:
  name: s1
  labels:
    security: s1
spec:
  containers:
  - name: c1
    image: first-image

Note that the first pod has the label “security: s1.”

The second pod:

apiVersion: v1
kind: Pod
metadata:
  name: s2
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-anti-affinity
    image: second-image

The second pod defines a label selector matching “security: s1” under spec.affinity.podAntiAffinity. As a consequence, this pod won’t be scheduled onto a node that already hosts any pod with the label “security: s1”.
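
In practice, pod anti-affinity is most often placed in the pod template of a Deployment so that the replicas repel each other and spread across nodes. Below is a minimal sketch, assuming a hypothetical app=web label and image; note that with a required anti-affinity rule the cluster needs at least as many nodes as replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: your-web-image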

Conclusion

Advanced pod scheduling in Kubernetes allows for the implementation of many interesting use cases and best practices for deploying complex applications and microservices on Kubernetes. With pod affinity, you can implement pod colocation and data locality for tightly coupled application stacks and microservices.

Below, you can find a “cheat sheet” summarizing some key data for each resource type. 

| Resource Type | Use Cases | Pros | Cons | Best Practices |
| --- | --- | --- | --- | --- |
| nodeSelector | Assigning pods to nodes with specific labels | Easy to use; small changes to the PodSpec | Does not support logical operators; hard to extend with complex scheduling rules | Use only on early K8s versions that predate node affinity |
| Node affinity | Implementing data locality; running pods on nodes with dedicated software | Expressive syntax with logical operators; fine-grained control over pod placement rules; support for “hard” and “soft” placement rules | Requires modification of existing pods to change behavior | Use a combination of “hard” and “soft” rules to cover different use cases and scenarios |
| Inter-pod affinity | Colocation of pods in co-dependent services; enabling data locality | The same as for node affinity | Requires modification of existing pods to change behavior | Proper pod label management and documentation of the labels used |
| Pod anti-affinity | Enabling high availability (via pod distribution); preventing inter-service competition for resources | Fine-grained control over inter-pod repel behavior; support for “hard” and “soft” anti-affinity rules | Requires modification of existing pods to change behavior | Similar to node affinity |
| Taints and tolerations | Nodes with dedicated software; separation of team resources, etc. | Does not require modification of existing pods; supports automatic eviction of pods without the required toleration; supports different taint effects | Does not support expressive syntax using logical operators | Be careful when applying multiple taints to a node; ensure that the pods you need have the required tolerations |
Table 1: Overview of Kubernetes resources for advanced pod scheduling

Using node anti-affinity and taints, you’re able to run nodes with hardware dedicated to specific applications and services, achieving efficient resource utilization in the cluster. With pod anti-affinity and node anti-affinity, you can also secure high availability of applications and avoid a single point of failure by making different components run on different nodes.

Affinity and anti-affinity are two powerful concepts in Kubernetes that every cluster administrator should master. As you get started, this article can provide guidance on the usage of these concepts in Kubernetes.