Member post originally published on the Middleware blog by Keval Bhogayata, covering the top 10 Kubernetes Troubleshooting Techniques. 

Despite its popularity, there are times when even the most seasoned DevOps engineers must troubleshoot Kubernetes. While it excels at handling containerised applications at scale, it can present unique troubleshooting challenges.

In this post, we’ll explore the top 10 Kubernetes troubleshooting techniques that every DevOps engineer should master. These K8s troubleshooting tips come from real-world scenarios, showing how to solve common and critical Kubernetes issues quickly and reliably.

1. Fixing CrashLoopBackOff Errors in Pods

One of the most common and frustrating issues in Kubernetes is when a pod crashes, restarts, and crashes again in a repeating cycle, a situation known as the CrashLoopBackOff error. This occurs when a container fails to start correctly, and Kubernetes continually attempts to restart it, resulting in a loop of failures.

Step 1: List All Pods

The first step in debugging this error is to get a high-level overview of all pods running in your namespace. You can do this with the following command:

kubectl get pods

This will show the status, restart count, and age of each pod. Pods with a status of CrashLoopBackOff clearly indicate an issue requiring immediate attention.

Step 2: Describe the Affected Pod

Once you’ve identified the problematic pod, use the describe command to inspect its internal details:

kubectl describe pod <pod-name>

This provides configuration information, recent events, and messages that might indicate the reason for the crash. Pay special attention to the Events section, which can help pinpoint issues such as failed image pulls, missing configuration, or permission errors.
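For illustration, the Events section of a crashing pod often looks something like this (the pod and image names are placeholders, and exact messages vary by Kubernetes version):

Events:
  Type     Reason   Age                 From     Message
  ----     ------   ----                ----     -------
  Normal   Pulled   2m (x5 over 6m)     kubelet  Container image "my-webapp:1.0" already present on machine
  Warning  BackOff  30s (x12 over 5m)   kubelet  Back-off restarting failed container

A repeating BackOff event like the one above is the classic signature of a CrashLoopBackOff.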

Step 3: Review Container Logs

Logs are essential for understanding what went wrong inside a container. Use the following command to access a pod’s logs:

kubectl logs <pod-name>

If the pod fails before producing logs, you might find the logs are empty. In such cases, use the --previous flag to inspect the logs from the last failed container instance. This often reveals the root cause, especially if the container exited before any meaningful activity.

If you’re trying to debug pods in real time, don’t miss our detailed guide on tailing logs with kubectl, which walks you through using --tail, -f, and more advanced flags for live troubleshooting.

Example Scenario:

Let’s walk through a practical example:

1. Run kubectl get pods and observe the output:

kubectl get pods

my-webapp-pod   0/1   CrashLoopBackOff   5 (2m ago)

The pod my-webapp-pod is crashing repeatedly.

2. Try describing the pod:

kubectl describe pod my-webapp-pod

If this yields no helpful insight, the issue might be within the container logs.

3. Check the logs of the previous instance:

kubectl logs my-webapp-pod --previous

Error: DATABASE_URL environment variable is not set

Process exiting with code 1

The logs show that the DATABASE_URL environment variable was missing. This explains why the container failed: the application inside expected a configuration value that wasn’t provided.
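Assuming the pod is managed by a Deployment named my-webapp (a placeholder, like the connection string and Secret below), one way to supply the missing variable and trigger a fresh rollout is kubectl set env:

# Set the variable directly (value shown is a placeholder)
kubectl set env deployment/my-webapp DATABASE_URL=postgres://db-host:5432/mydb

# Or, preferably, pull it from an existing Secret instead of hard-coding it
kubectl set env deployment/my-webapp --from=secret/my-db-secret

The Deployment controller then rolls out new pods, and the CrashLoopBackOff should clear once the application can read its configuration.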

CrashLoopBackOff errors often result from misconfigurations such as missing environment variables, incorrect commands, or failed dependencies. By systematically inspecting pods, describing their events, and reviewing logs (including those from previous container instances), you can efficiently identify and resolve the underlying cause.

2. Kubernetes Troubleshooting Deployment Failures: ImagePullBackOff

When Kubernetes cannot pull a container image due to authentication issues or an incorrect image name, it triggers an ImagePullBackOff error.

Step 1: Identify Problematic Deployments

Start by checking the status of your deployments:

kubectl get deployments

This command displays all deployments and their replica counts. Pay close attention to the READY column. For example, a “0/3” status means that none of the pods are starting successfully, suggesting an issue at the pod level rather than with the application itself.

For deeper insight, run the following:

kubectl describe deployment <deployment-name>

This provides detailed deployment data, including the pod template, conditions, and recent events. You may see messages like “ReplicaSet failed to create pods,” which can indicate underlying issues.

Step 2: Monitor Rollout Status and History

To track a deployment rollout in real time:

kubectl rollout status deployment <deployment-name>

This is useful when monitoring deployments during CI/CD pipeline runs.

To view the deployment history:

kubectl rollout history deployment <deployment-name>

Use this to identify which revision introduced the issue and compare it with previous working versions.
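If the history shows that a recent revision introduced the failure, rolling back is often the fastest recovery; a quick sketch (the revision number is only an example):

# Roll back to the immediately previous revision
kubectl rollout undo deployment <deployment-name>

# Or target a specific revision listed in the rollout history
kubectl rollout undo deployment <deployment-name> --to-revision=2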

Step 3: Investigate the Root Cause of ImagePullBackOff

Let’s list the pods and spot the error:

kubectl get pods

my-app-7d4b8c8f-xyz   0/1   ImagePullBackOff   0   2m

Now, inspect the failing pod:

kubectl describe pod my-app-7d4b8c8f-xyz

Failed to pull image "private-registry.com/my-app:v1.2.3":

Error response from daemon: pull access denied for private-registry.com/my-app

This indicates that Kubernetes cannot access the private container registry due to missing or invalid credentials.

Step 4: Fixing the ImagePullBackOff Using Secrets

To resolve this, create a Kubernetes Secret to store your private registry credentials securely:

kubectl create secret docker-registry my-registry-secret \
  --docker-server=private-registry.com \
  --docker-username=myuser \
  --docker-password=mypassword \
  --docker-email=my@email.com

Now, patch your deployment to reference the secret:

kubectl patch deployment my-app -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"my-registry-secret"}]}}}}'

Once patched, Kubernetes will re-trigger the deployment using the correct credentials. You can monitor the new rollout using:

kubectl rollout status deployment my-app
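If you manage manifests declaratively rather than patching live objects, the same fix belongs in the Deployment spec itself. A minimal sketch reusing the names from this example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      imagePullSecrets:
        - name: my-registry-secret   # the secret created above
      containers:
        - name: my-app
          image: private-registry.com/my-app:v1.2.3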

3. Kubernetes Troubleshooting: Fixing NotReady Node Errors

Another frequently faced error is the NotReady status on a node, which blocks pod scheduling and disrupts workloads. When the kubelet on a node cannot communicate with the control plane, or the node fails its health checks, the node is marked NotReady, preventing new pods from being scheduled there and often leading to application downtime.

Step 1: Checking Node Status

First, check the status of all nodes:

kubectl get nodes -o wide

This command lists all cluster nodes, showing their status, container runtime, IP addresses, and OS details. The -o wide flag gives additional context, helping identify node-level issues such as OS mismatches or node location patterns (e.g., subnet-specific failures).

Step 2: Inspecting Node Conditions and Issues

Use the following command to get a detailed view of a node’s resource capacity and health:

kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"

This will show whether the node’s allocatable resources are within limits or are being over-utilized.
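You can also read the node’s health conditions directly instead of grepping the full describe output; for example, with a JSONPath query (the node name is a placeholder):

# Print each condition and its status, e.g. DiskPressure=True or Ready=False
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'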

For example, let’s say you have a node showing a NotReady status:

kubectl get nodes

worker-node-1   NotReady     5d   v1.28.0

Step 3: Investigating the Affected Node

kubectl describe node worker-node-1

Conditions:

Type             Status  LastHeartbeatTime   LastTransitionTime   Reason                 Message

DiskPressure     True    Mon, 01 Jan 2024    Mon, 01 Jan 2024     KubeletHasDiskPressure   kubelet has disk pressure

Fixing Disk Pressure Issues

In this case, the node is reporting disk pressure, typically because its partitions or log files have consumed excessive space.

To resolve this, clear out system logs using:

sudo journalctl --vacuum-time=3d

This will remove logs older than 3 days and free up disk space, helping the kubelet return the node to a Ready state.
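It’s worth confirming that space was actually reclaimed and that the node recovers; for example (partition paths vary by distribution):

# On the node: confirm free space on the root filesystem
df -h /

# From your workstation: watch for the node to flip back to Ready
kubectl get nodes -w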

4. Diagnosing Service and Networking Problems: The Pending Error

When services or pods are stuck in a Pending state, Kubernetes troubleshooting becomes essential. This often indicates a selector mismatch, networking misconfiguration, or issues with DNS resolution.

Service connectivity problems are among the most frustrating issues in Kubernetes. Start by listing all services to verify their configuration:

kubectl get services --all-namespaces

This command returns service types and network details, including IP addresses and ports.

Step 1: Verifying Services

To identify if a service lacks matching pods, use the following command to list endpoints:

kubectl get endpoints

If a service’s endpoint list is empty, no pods match that service’s selector, which is commonly the root cause of connection failures.

You can further inspect the service configuration to verify the labels used to locate pods:

kubectl describe service <service-name>

Compare the selector labels in the service with the actual labels on your pods to ensure they align.
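For instance, if the service’s selector were app=my-service (a hypothetical label), you could confirm whether any pods actually carry it:

# Pods matching the service's selector; an empty result explains the empty endpoints
kubectl get pods -l app=my-service

# All pods with their labels, to spot a typo or mismatch
kubectl get pods --show-labels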

Step 2: Inspecting DNS Issues In The Cluster

If there’s a communication failure between microservices that causes the Pending state, DNS resolution might be the culprit. You can run the following commands from within a pod:

nslookup my-service

nslookup my-service.<namespace>.svc.cluster.local

If the DNS name does not resolve to an IP address, there’s likely an issue with the cluster’s DNS configuration.
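If you don’t have a shell inside an application pod handy, a short-lived utility pod works well for these lookups; a sketch using a public busybox image:

# Run nslookup from inside the cluster, then clean up the temporary pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup my-service

# Or exec into an existing pod that has nslookup available
kubectl exec -it <pod-name> -- nslookup my-service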

Step 3: Testing HTTP Connectivity

To ensure that the service is responding correctly, test its endpoint with:

wget -qO- my-service:80/health

If the request fails, it could indicate a problem with the service configuration, network policies, or incorrect pod selectors.

5. Kubernetes Troubleshooting High Resource Usage: Solving OOMKilled Errors

Monitoring resources is a crucial part of Kubernetes troubleshooting, enabling the maintenance of healthy clusters and ensuring optimal application performance. When a container exceeds its allocated memory limit, Kubernetes forcefully terminates it, resulting in the infamous OOMKilled error. This can cause pod evictions, application downtime, or severe performance degradation.

Step 1: Checking Resource Usage 

To identify memory-intensive nodes and pods:

kubectl top nodes

This command provides real-time resource usage across nodes. If a node is using over 80% of its memory, it may be at risk of triggering OOMKilled errors.

You can also view pod-level resource usage and sort by CPU or memory:

kubectl top pods --all-namespaces --sort-by=cpu

kubectl top pods --all-namespaces --sort-by=memory

These commands help you pinpoint the most resource-hungry pods in the cluster.

Track OOMKilled and other errors in real time with Middleware’s K8s agent.

Step 2: Investigating Resource Quotas, Limits, And Autoscaling

Understanding and monitoring resource limits is essential. To check quotas across all namespaces:

kubectl describe quota --all-namespaces

Memory leaks often appear as a gradual increase in memory usage over time. You can monitor a pod’s live memory consumption every 5 seconds using:

watch -n 5 'kubectl top pod memory-hungry-app'

This helps you spot memory leaks or steadily increasing memory usage.

Avoid memory issues before they crash your pod. See how to detect them early via Kubernetes Monitoring.

Step 3: Checking Pod Resource Requests and Limits

If pods lack memory limits, they can consume excessive resources, potentially affecting other workloads. To inspect resource requests and limits:

kubectl describe pod memory-hungry-app | grep -A 10 "Requests\|Limits"

This is a useful Kubernetes troubleshooting trick to identify which pods lack proper memory restrictions.
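If a pod turns out to have no limits at all, the fix is to declare explicit requests and limits in its container spec. A minimal sketch with placeholder values you would tune for your workload:

apiVersion: v1
kind: Pod
metadata:
  name: memory-hungry-app
spec:
  containers:
    - name: app
      image: my-org/memory-hungry-app:latest   # placeholder image
      resources:
        requests:
          memory: "256Mi"   # the scheduler reserves this much for the pod
          cpu: "250m"
        limits:
          memory: "512Mi"   # exceeding this triggers an OOMKilled termination
          cpu: "500m"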

Step 4: Mitigating Resource-Related Issues

To prevent OOMKilled errors and automatically balance resource usage, you can set up Horizontal Pod Autoscaling (HPA).

Pro Tip: Use autoscaling to manage resource spikes. For example, to autoscale a deployment based on 70% CPU usage:

kubectl autoscale deployment my-app --cpu-percent=70 --min=2 --max=10

You can verify if autoscaling is active and functioning correctly:

kubectl get hpa

kubectl describe hpa my-app
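If you prefer declarative configuration, the same policy can be expressed as an autoscaling/v2 HorizontalPodAutoscaler manifest; a sketch equivalent to the command above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70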

Proactively managing resource limits and enabling autoscaling helps prevent OOMKilled errors and ensures smoother application performance.

Using Monitoring and Tracing Tools

While built-in tools like kubectl are essential for initial debugging, they can fall short when dealing with complex, distributed systems. This is where third-party observability platforms come into play. Middleware, for instance, offers a comprehensive observability solution that provides deeper visibility into your Kubernetes clusters.