Guest post by Mohamed Ahmed, originally published on the Magalix Blog

An excellent cloud-native application design should declare any specific resources that it needs to operate correctly. Kubernetes uses those requirements to make the most efficient decisions to ensure maximum performance and availability of the application.

Additionally, knowing the application requirements up front allows you to make cost-effective decisions regarding the hardware specifications of the cluster nodes. In this article, we explore best practices for declaring storage, CPU, and memory resource needs. We also discuss how Kubernetes behaves if you don’t specify some of these dependencies.

Storage dependency

Let’s explore the most common runtime requirement of an application: persistent storage. By default, any modifications made to the filesystem of a running container are lost when the container is restarted. Kubernetes provides two ways to keep data around: emptyDir volumes, which survive container restarts but live only as long as the Pod, and Persistent Volumes, which outlive the Pod itself.
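
If the data only needs to survive container restarts within the same Pod, an emptyDir volume is sufficient. The following is a minimal sketch; the Pod name, image, and mount path are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: emptydir-example
spec:
  containers:
  - name: emptydir-example
    image: alpine
    command: ['sh', '-c', 'sleep 10000']
    volumeMounts:
    - mountPath: /cache
      name: cache-vol
  volumes:
  - name: cache-vol
    # Data in an emptyDir volume is kept only for the lifetime of the Pod
    emptyDir: {}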

Using Persistent Volumes, you can store data that is not deleted even if the whole Pod is terminated or restarted. There are several methods for provisioning backend storage to the cluster, depending on where the cluster is hosted (on-prem or in the cloud, and on the cloud provider). In the following lab, we use the host’s disk as the persistent volume backend storage. Provisioning storage using Persistent Volumes involves two steps:

  1. Creating the Persistent Volume: this is the disk on which Pods claim space. This step differs depending on the hosting environment.
  2. Creating a Persistent Volume Claim: this is where you actually provision the storage for the Pod by claiming space on the Persistent Volume.

In the following lab, we create a Persistent Volume using the host’s local disk. Create a new YAML definition file, PV.yaml, and add the following lines:

apiVersion: v1
kind: PersistentVolume
metadata:
 name: hostpath-vol
spec:
 storageClassName: local
 capacity:
   storage: 1Gi
 accessModes:
   - ReadWriteOnce
 hostPath:
   path: "/tmp/data"

This definition creates a Persistent Volume (PV for short) that uses the host’s disk as the backend storage. The volume is mounted at the /tmp/data directory on the host. We need to create this directory before applying the configuration:

$ mkdir /tmp/data
$ kubectl apply -f PV.yaml
persistentvolume/hostpath-vol created
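
You can verify that the volume was registered with the cluster; until a claim binds to it, the STATUS column should read Available (output abbreviated, exact columns vary by kubectl version):

$ kubectl get pv hostpath-vol
NAME           CAPACITY   ACCESS MODES   ...   STATUS
hostpath-vol   1Gi        RWO            ...   Available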

Now, we can create a Persistent Volume Claim (PVC for short) and make it available to our Pod to store data through a mount point. The following definition file creates both a PVC and a Pod that uses it:


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: my-pvc
spec:
 storageClassName: local
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 100Mi
---
apiVersion: v1
kind: Pod
metadata:
 name: pvc-example
spec:
 containers:
 - image: alpine
   name: pvc-example
   command: ['sh', '-c', 'sleep 10000']
   volumeMounts:
     - mountPath: "/data"
       name: my-vol

 volumes:
   - name: my-vol
     persistentVolumeClaim:
      claimName: my-pvc

Applying this definition file creates the PVC followed by the Pod:

$ kubectl apply -f pvc_pod.yaml
persistentvolumeclaim/my-pvc created
pod/pvc-example created

Any data that gets created or modified on /data inside the container is persisted to the host’s disk. You can check that by logging into the container, creating a file under /data, restarting the Pod, and then verifying that the file still exists in the Pod. You will also notice that files created in /tmp/data on the host are immediately available to the Pod and its containers.
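
A sequence along these lines demonstrates the behavior (the file name and contents are arbitrary):

$ kubectl exec -it pvc-example -- sh -c 'echo persisted > /data/test.txt'
$ kubectl delete pod pvc-example
$ kubectl apply -f pvc_pod.yaml
$ kubectl exec -it pvc-example -- cat /data/test.txt
persisted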

Notice that in the preceding lab we used only one node, so there shouldn’t be any problem scheduling a Pod that requires a PVC to function correctly. However, if we were in a multi-node environment (which is usually the case with Kubernetes), and a given node cannot provision the persistent volume, the Pod will never get scheduled to that node. A worse scenario occurs when none of the nodes in the cluster can provide the requested volume; in that case, the Pod does not get scheduled at all.
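
If you suspect a Pod is stuck because its volume cannot be provisioned, a quick way to check is to list pending Pods and inspect their scheduling events (commands only; the exact event messages vary by cluster):

$ kubectl get pods --field-selector=status.phase=Pending
$ kubectl describe pod <pod-name>

The Events section at the bottom of the describe output usually explains why the Pod could not be scheduled.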

The hostPort dependency

If you are using the hostPort option, you are explicitly allowing the internal container port to be accessible from outside the host. A Pod that uses hostPort cannot have more than one replica on the same host because of port conflicts. If no node can provide the required port (let’s say it is a standard port number like 80 or 443), then the Pod using the hostPort option will never get scheduled. Additionally, this creates a one-to-one relationship between the Pod and its hosting node. So, in a cluster with four nodes, you can only have a maximum of four Pods that use the hostPort option (provided that the port is available on each node).
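
A minimal sketch of a Pod that exposes a container port directly on its node looks like the following; the Pod name and port numbers are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: hostport-example
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
      # The container becomes reachable on port 8080 of the node hosting the Pod
      hostPort: 8080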

Configuration dependency

Almost all applications are designed so that they can be customized through variables. For example, MySQL needs at least the initial root credentials; WordPress needs the database host and name, and so on. Kubernetes provides ConfigMaps for injecting variables into containers inside Pods, and Secrets for supplying confidential variables like account credentials. Let’s have a quick example of how to use a ConfigMap to provision variables to a Pod.

The following definition file creates both a ConfigMap and a Pod that uses the variables defined in this ConfigMap:


kind: ConfigMap
apiVersion: v1
metadata:
 name: myconfigmap
data:
 # Configuration values can be set as key-value properties
 dbhost: db.example.com
 dbname: mydb
---
kind: Pod
apiVersion: v1
metadata:
 name: mypod
spec:
 containers:
   - name: mycontainer
     image: nginx
     envFrom:
       - configMapRef:
           name: myconfigmap

Now, let’s apply this configuration and ensure that we can use the environment variables inside our container:


$ kubectl apply -f pod.yml
configmap/myconfigmap created
pod/mypod created
$ kubectl exec -it mypod -- bash
root@mypod:/# echo $dbhost
db.example.com
root@mypod:/# echo $dbname
mydb

However, this creates a dependency of its own: if the ConfigMap is not available, the container might not work as expected. In our example, if this container runs an application that needs a constant database connection, then failing to obtain the database name and host may prevent it from working at all.

The same holds for Secrets, which must be available before any client containers that consume them are spawned.
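
As a sketch, a Secret holding database credentials could be created and injected in much the same way as the ConfigMap above; the secret name, keys, and values are illustrative:

$ kubectl create secret generic db-credentials \
    --from-literal=dbuser=admin \
    --from-literal=dbpassword=changeme

A Pod can then consume it through envFrom:

apiVersion: v1
kind: Pod
metadata:
  name: secret-example
spec:
  containers:
    - name: secret-example
      image: nginx
      envFrom:
        # Injects every key in the Secret as an environment variable
        - secretRef:
            name: db-credentials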

Resource dependencies


So far, we discussed the different runtime dependencies that affect which node the Pod gets scheduled on (if at all) and the various prerequisites that must be available for the Pod to function correctly. However, you must also take into consideration the capacity requirements of the containers.

Controllable and uncontrollable resources

When designing an application, we need to be aware of the types of resources that the application may consume. Generally, resources can be classified into two main categories:

  1. Shareable resources, such as CPU: if a container tries to use more than its share, it can be throttled without being killed.
  2. Non-shareable resources, such as memory: once allocated, they cannot be throttled back, so a container that exceeds its memory limit gets killed.

Declaring Pod resource requirements

The distinction between the two resource types is crucial for a good design. Kubernetes allows you to declare the amount of CPU and memory the Pod requires to function. There are two parameters that you can use for this declaration:

  1. requests: the minimum amount of CPU and memory the container needs. The scheduler uses this value to decide which node can host the Pod.
  2. limits: the maximum amount of CPU and memory the container is allowed to consume.

Let’s have a quick example of a hypothetical application that needs at least 512 MiB of memory and a quarter of a CPU core (250 millicores) to run. The definition file for such a Pod may look like this:


kind: Pod
apiVersion: v1
metadata:
 name: mypod
spec:
 containers:
   - name: mycontainer
     image: myapp
     resources:
       requests:
         cpu: 250m
         memory: 512Mi
       limits:
         cpu: 500m
         memory: 750Mi

When the scheduler tries to place this Pod, it searches for a node that has at least 512 MiB of memory and 250 millicores of CPU unreserved. If a suitable node is found, the Pod gets scheduled on it; otherwise, the Pod stays pending and never gets deployed. Notice that only the requests field is considered by the scheduler when determining where to place the Pod.
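
To see how much room the scheduler has to work with on a given node, you can inspect its capacity and the requests that are already reserved (replace <node-name> with one of your nodes):

$ kubectl describe node <node-name>

The Allocatable section shows the CPU and memory available to Pods, and the Allocated resources section shows how much of it is already requested by running Pods.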

How are the resource requests and limits calculated?

Memory is measured in bytes, but you are allowed to use units like Mi and Gi to specify the requested amount. Notice that you should not request more memory than the amount available on your largest node; if you do, the Pod will never get scheduled. Additionally, since memory is a non-shareable resource, as we discussed, a container that tries to consume more memory than its limit gets killed. Pods created through a higher-level controller like a ReplicaSet or a Deployment are recreated automatically when they crash or get terminated, so it is always recommended that you create Pods through a controller.

CPU is measured in millicores: 1 core = 1000 millicores. So, if you expect your container to need at least half a core to operate, you set the request to 500m. However, since CPU is a shareable resource, a container that tries to use more CPU than its limit does not get terminated. Instead, the kubelet throttles it, which may negatively affect its performance. It is advisable to use liveness and readiness probes to ensure that your application’s latency does not affect your business requirements.
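
For example, the hypothetical Pod above could be extended with probes like the following sketch; the /healthz and /ready endpoints and port 8080 are assumptions about the application:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: myapp
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          cpu: 500m
          memory: 750Mi
      # The kubelet restarts the container if the liveness probe fails
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      # The Pod is removed from Service endpoints while the readiness probe fails
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5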

What happens when you (not) specify requests and limits?

Most Pod definition examples omit the requests and limits parameters, and you are not strictly required to include them. However, adding or omitting requests and limits affects the Quality of Service (QoS) class that the Pod receives, as follows:

Lowest priority Pods: when you do not specify requests and limits, the kubelet treats your Pod in a best-effort manner (the BestEffort QoS class). The Pod, in this case, has the lowest priority. If the node runs out of non-shareable resources, BestEffort Pods are the first to get killed.

Medium priority Pods: if you define both parameters and set the requests to be less than the limits, Kubernetes places your Pod in the Burstable QoS class. When the node runs out of non-shareable resources, Burstable Pods get killed only once no more BestEffort Pods are running.

Highest priority Pods: your Pod is deemed to be of the highest priority (the Guaranteed QoS class) when you set the requests and limits to equal values. It’s as if you’re saying, “I need this Pod to consume no less and no more than x memory and y CPU.” In this case, if the node runs out of non-shareable resources, Kubernetes does not terminate those Pods until all the BestEffort and Burstable Pods have been terminated.

We can summarize how the Kubelet deals with Pod priority as follows:

QoS class     How it is assigned                         Priority   Eviction order
BestEffort    no requests and no limits specified        Lowest     evicted first
Burstable     requests set lower than limits             Medium     evicted after BestEffort Pods
Guaranteed    requests and limits set to equal values    Highest    evicted last
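
You can confirm which QoS class Kubernetes assigned to a Pod by reading its status. For the Pod defined earlier with requests lower than its limits, this should print Burstable:

$ kubectl get pod mypod -o jsonpath='{.status.qosClass}'
Burstable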

Pod priority and preemption

Sometimes you may need more fine-grained control over which of your Pods get evicted first in the event of resource starvation. You can guarantee that a given Pod gets evicted last by setting its requests and limits to equal values. However, consider a scenario where you have two Pods, one hosting your core application and another hosting its database. You need both to have the highest priority among the other Pods that coexist with them, but you have an additional requirement: you want the application Pods to get evicted before the database ones do. Fortunately, Kubernetes has a feature that addresses this need: Pod Priority and Preemption. The feature is stable as of Kubernetes 1.14 and has been enabled by default since Kubernetes 1.11 (beta release); if your cluster version is older than 1.11, you need to enable it explicitly.

So, back to our example scenario, we need two high-priority Pods, yet one of them is more important than the other. We start by creating a PriorityClass and then a Pod that uses this PriorityClass:


apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
 name: high-priority
value: 1000
---
apiVersion: v1
kind: Pod
metadata:
 name: mypod
spec:
 containers:
 - image: redis
   name: mycontainer
 priorityClassName: high-priority

The definition file creates two objects: the PriorityClass and a Pod. Let’s have a closer look at the PriorityClass object:

Line 1: the API version. As mentioned, PriorityClass is stable as of Kubernetes 1.14.

Line 2: the object type

Line 3 and 4: the metadata where we define the object name.

Line 5: we specify the value on which the priority is calculated relative to other Pods in the cluster. A higher value indicates a higher priority.

Next, we define the Pod that uses this PriorityClass by referencing its name in the priorityClassName field.
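
After applying the definition, you can verify that the class exists and that the priority value was attached to the Pod at admission time:

$ kubectl get priorityclass high-priority
$ kubectl get pod mypod -o jsonpath='{.spec.priority}'
1000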

How do Pods get scheduled given their PriorityClass value?

When we have multiple Pods with different PriorityClass values, the scheduler places pending Pods in its scheduling queue sorted by priority. The highest-priority Pods (those with the highest PriorityClass values) get scheduled first, as long as no other constraints prevent their scheduling.

Now, what happens if no node has enough resources available to schedule a high-priority Pod? The scheduler evicts (preempts) lower-priority Pods from a node to make room for the higher-priority ones, and it continues preempting lower-priority Pods until there is enough room to accommodate the higher-priority Pods.

This feature helps you design the cluster so that the highest-priority Pods (for example, the core application and database) are scheduled first and are evicted only when no other option is possible.
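
To implement the application-versus-database scenario described earlier, you could define two classes with different values; the names and numbers below are illustrative:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: db-priority
# Database Pods: evicted last among the two tiers
value: 1000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-priority
# Application Pods: still high priority, but preempted before database Pods
value: 900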

Things to consider in your design when using QoS and Pod Priority

You may be asking what happens when you use requests and limits (QoS) combined with the PriorityClass parameter. Do they overlap or override each other? Below are some of the essential things to note when influencing scheduling decisions:

TL;DR