Guest post originally published on the Msys Technology blog by Atul Jadhav 

Prologue

A business application that fails to operate 24/7 would be considered inefficient in the market. The idea is that applications should run uninterrupted regardless of a technical glitch, a feature update, or a natural disaster. In today’s heterogeneous environments, where infrastructure is intricately layered, a continuous application workflow is possible through self-healing.

Kubernetes, a container orchestration tool, facilitates the smooth working of applications by abstracting away the underlying physical machines. Moreover, the pods and containers in Kubernetes can self-heal.

Meme: “That’s my secret, Captain. I’m always self-healing” – with the Kubernetes logo

In Avengers, Captain America asks Bruce Banner to get angry to transform into ‘The Hulk’. Bruce replies, “That’s my secret, Captain. I’m always angry.”

You must have understood the analogy here. To simplify: Kubernetes self-heals organically whenever the system is affected.

Kubernetes’s self-healing property ensures that clusters always function in their optimal state. Kubernetes can self-detect two types of status objects: PodStatus and ContainerStatus. Kubernetes’s orchestration capabilities can monitor and replace unhealthy containers as per the desired configuration. Likewise, Kubernetes can fix pods, which are the smallest deployable units encompassing one or more containers.
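Both status objects can be inspected directly from the command line; a minimal sketch, assuming a pod named my-pod (a hypothetical name):

# Pod-level status (PodStatus): Pending, Running, Succeeded, Failed, or Unknown
kubectl get pod my-pod -o jsonpath='{.status.phase}'

# Container-level status (ContainerStatus): waiting, running, or terminated
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[*].state}'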

The three container states are:

  1. Waiting – the container has been created but is not yet running. A container in the Waiting state is still performing operations such as pulling images or applying secrets. To check the pod status, use the command below.
kubectl describe pod [POD_NAME]

Along with the state, a reason and a message are displayed to provide more information.

...
State: Waiting
Reason: ErrImagePull
...
  2. Running – the container is executing without issues. Before a container enters the Running state, the following lifecycle hook (if configured) is executed (see the lifecycle sketch after this list).
postStart

Running pods also display the time at which the container entered the state.

...
  State:          Running
   Started:      Wed, 30 Jan 2019 16:46:38 +0530
...
  3. Terminated – the container either failed or completed its execution; either way, it stands terminated. Before a container is moved to Terminated, the following lifecycle hook (if configured) is executed.
preStop

Terminated pods display the reason for termination, the exit code, and the start and finish times of the container.

...
State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Wed, 30 Jan 2019 11:45:26 +0530
  Finished:     Wed, 30 Jan 2019 11:45:26 +0530
...
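Both lifecycle hooks mentioned above are declared under the container’s lifecycle field. Here is a minimal sketch, assuming an nginx container; the pod name and the postStart command are illustrative, not part of the original examples.

apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo              # hypothetical pod name
spec:
  containers:
  - name: lifecycle-demo-container
    image: nginx
    lifecycle:
      postStart:
        exec:
          # runs right after the container is created, before it is reported as Running
          command: ["/bin/sh", "-c", "echo started > /tmp/postStart"]
      preStop:
        exec:
          # runs before the container is terminated
          command: ["/bin/sh", "-c", "nginx -s quit"]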

Kubernetes’ Self-Healing Concepts – Pod Phase, Probes, and Restart Policy

The pod phase in Kubernetes offers insight into where the pod stands in its lifecycle (Pending, Running, Succeeded, Failed, or Unknown). We can have Kubernetes execute liveness and readiness probes for the pods to check whether they function as per the desired state. The liveness probe checks a container for its running status. If a container fails the probe, Kubernetes terminates it and creates a new container in accordance with the restart policy. The readiness probe checks a container for its ability to serve requests. If a container fails the probe, Kubernetes removes the pod’s IP address from the endpoints of all matching Services.

Liveness probe example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - args:
    - /server
    image: k8s.gcr.io/liveness
    livenessProbe:
      httpGet:
        # when "host" is not defined, "PodIP" will be used
        # host: my-host
        # when "scheme" is not defined, "HTTP" scheme will be used. Only "HTTP" and "HTTPS" are allowed
        # scheme: HTTPS
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      initialDelaySeconds: 15
      timeoutSeconds: 1
    name: liveness
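A readiness probe is declared the same way, under readinessProbe. Here is a minimal sketch, assuming a container that serves a /ready endpoint on port 8080; the pod name, image, and endpoint are illustrative assumptions.

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: readiness-http              # hypothetical pod name
spec:
  containers:
  - name: readiness
    image: nginx                    # illustrative image
    readinessProbe:
      httpGet:
        path: /ready                # assumed endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10             # probe every 10 seconds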

The probes include three types of handlers:

  1. ExecAction – executes a specified command inside the container. The diagnostic succeeds if the command exits with status code 0.
  2. TCPSocketAction – performs a TCP check against the container’s IP address on a specified port. The diagnostic succeeds if the port is open.
  3. HTTPGetAction – performs an HTTP GET request against the container’s IP address on a specified port and path. The diagnostic succeeds if the response status code is between 200 and 399.

Each probe gives one of three results: Success (the container passed the diagnostic), Failure (the container failed the diagnostic), or Unknown (the diagnostic itself failed, so no action is taken).
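To make the handlers concrete, here is a minimal sketch combining a tcpSocket liveness probe with an exec readiness probe; the image, port, and file path are illustrative assumptions.

apiVersion: v1
kind: Pod
metadata:
  name: probe-handlers              # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx                    # illustrative image
    livenessProbe:
      tcpSocket:
        port: 80                    # Success if the kubelet can open a TCP connection
      initialDelaySeconds: 15
      periodSeconds: 20
    readinessProbe:
      exec:
        # Success if the command exits with status code 0, Failure otherwise
        command: ["cat", "/tmp/ready"]
      initialDelaySeconds: 5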

Demo description of Self-Healing Kubernetes – Example 1

We need to set the replica count to trigger the self-healing capability of Kubernetes.

Let’s look at an example Nginx Deployment manifest.

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: nginx-deployment-sample
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 4
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

In the above manifest, replicas: 4 specifies that the total number of pods across the cluster must be 4.

Let’s now deploy the file.

kubectl apply -f nginx-deployment-sample.yaml

Let’s list the pods, using

kubectl get pods -l app=nginx

Here is the output.

[Output: the 4 nginx pods, all Running]

As you see above, we have created 4 pods.

Let’s delete one of the pods.

kubectl delete pod nginx-deployment-test-83586599-r299i

The pod is now deleted. We get the following output:

pod "nginx-deployment-test-83586599-r299i" deleted

Now again, list the pods.

kubectl get pods -l app=nginx

We get the following output.

[Output: 4 nginx pods again, including a freshly created replacement]

We have 4 pods again, despite deleting one.

Kubernetes has self-healed, creating a new pod to maintain the count of 4.
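Under the hood, it is the Deployment’s ReplicaSet that restores the count. The replacement can be confirmed from the ReplicaSet’s events; a minimal sketch, where the ReplicaSet name is whatever kubectl get rs reports in your cluster:

# list the ReplicaSets owned by the Deployment
kubectl get rs -l app=nginx

# the Events section shows the replacement pod being created
kubectl describe rs <replicaset-name>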

Demo description of Self-Healing Kubernetes – Example 2

Get pod details

$ kubectl get pods -o wide

Get first nginx pod and delete it – one of the nginx pods should be in ‘Terminating’ status

$ NGINX_POD=$(kubectl get pods -l app=nginx --output=jsonpath="{.items[0].metadata.name}")
$ kubectl delete pod $NGINX_POD; kubectl get pods -l app=nginx -o wide
$ sleep 10

Get pod details – one nginx pod should be freshly started

$ kubectl get pods -l app=nginx -o wide

Get deployment details and check the events for recent changes

$ kubectl describe deployment nginx-deployment

Halt one of the nodes (node2)

$ vagrant halt node2
$ sleep 30

Get node details – node2 Status=NotReady

$ kubectl get nodes

Get pod details – everything looks fine – you need to wait 5 minutes

$ kubectl get pods -o wide

A pod will not be evicted until it is 5 minutes old (see Tolerations in the ‘describe pod’ output). This prevents Kubernetes from spinning up new containers when it is not necessary.

$ NGINX_POD=$(kubectl get pods -l app=nginx --output=jsonpath="{.items[0].metadata.name}")
$ kubectl describe pod $NGINX_POD | grep -A1 Tolerations
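The 5-minute window comes from the default not-ready and unreachable tolerations that Kubernetes injects into pods. If faster failover is needed, the window can be shortened in the pod template; a minimal sketch with an assumed 30-second value:

# snippet for a Deployment’s pod template (spec.template.spec)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30             # evict after 30s instead of the default 300s
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30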

Sleeping for 5 minutes

$ sleep 300

Get pod details – Status=Unknown/NodeLost and a new container was started

$ kubectl get pods -o wide

Get deployment details – again AVAILABLE=3/3

$ kubectl get deployments -o wide

Power on the node2 node

$ vagrant up node2
$ sleep 70

Get node details – node2 should be Ready again

$ kubectl get nodes

Get pod details – the ‘Unknown’ pods were removed

$ kubectl get pods -o wide

Source: GitHub. Author: Petr Ruzicka

Conclusion

Kubernetes can self-heal applications and containers, but what about healing itself when the nodes are down? For Kubernetes to keep self-healing, it needs a dedicated set of infrastructure with access to self-healing nodes at all times. That infrastructure must be driven by automation and powered by predictive analytics to preempt and fix issues before they escalate. The bottom line is that, at any given point in time, the infrastructure nodes should maintain the required count for uninterrupted services.