In Part 1 of our series, we explored essential Kubernetes troubleshooting techniques that help DevOps engineers diagnose and resolve common cluster and application issues effectively. However, Kubernetes environments are complex, and there’s always more to uncover.

In this Part 2, we’ll dive deeper into additional troubleshooting strategies, covering advanced techniques and real-world scenarios that can save you time and prevent downtime. Whether you’re dealing with persistent pod failures, network issues, or cluster misconfigurations, these tips will equip you with the knowledge to tackle Kubernetes challenges with confidence.

6. Kubernetes Troubleshooting Storage: Resolving PVC Pending Errors

The PersistentVolumeClaim (PVC) Pending status is a common storage issue in Kubernetes, preventing applications from accessing persistent data. This typically results from misconfigured storage classes, missing volume provisioners, or insufficient available storage in the cluster.

Step 1: Inspecting PV and PVC Status

Start by listing all Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) across all namespaces. This command provides an overview of their status, access modes, capacity, and whether they are bound:

kubectl get pv,pvc --all-namespaces

Step 2: Troubleshooting Mounting Issues

To further investigate an unbound PVC stuck in the Pending state, use the following command:

kubectl describe pvc <pvc-name>

Check the Events section at the bottom of the output. It often reveals the root cause, such as a missing StorageClass, a failed or misconfigured provisioner, or insufficient available storage.

Step 3: Verifying Storage Classes

Incorrect or non-existent storage classes are a common culprit. List all available storage classes:

kubectl get storageclass

Then, describe a specific one to inspect details like the provisioner and parameters:

kubectl describe storageclass <storageclass-name>

Ensure that your PVC references a valid, correctly configured storage class. If the specified provisioner does not exist or is misspelled, the volume will fail to provision.
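For reference, a minimal PVC manifest looks like the sketch below; the storageClassName value must exactly match one of the classes listed by kubectl get storageclass (the class name and size here are only illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # must match an existing StorageClass
  resources:
    requests:
      storage: 10Gi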

Step 4: Common Mistakes and Resolution

Suppose you’ve defined a PVC that references a storage class named fast-ssd, but it’s failing to provision. Run:

kubectl describe pvc my-data-pvc

You might see an error like:

Warning  ProvisioningFailed  3m  persistentvolume-controller  storageclass.storage.k8s.io "fast-ssd" not found

Now, list all available storage classes to confirm:

kubectl get storageclass

If fast-ssd is missing, but gp2 or standard exists, update your PVC to use a valid class.
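Note that storageClassName cannot be changed on an existing PVC, so in practice the fix usually means deleting the Pending claim and re-applying it with the corrected class. A short sketch, assuming the updated manifest is saved as my-data-pvc.yaml:

kubectl delete pvc my-data-pvc

kubectl apply -f my-data-pvc.yaml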

7. Using Event and Audit Logs: Deep System Analysis 

Kubernetes provides two powerful tools for debugging: events and audit logs. These help you track what happened, when it happened, and why, giving you a timeline of system activities for root cause analysis.

Step 1: Understanding Kubernetes Events

Events in Kubernetes record what’s happening inside the cluster. You can list events across all namespaces and sort them by their creation time to see the most recent activity at the bottom. This helps correlate issues with recent system behavior.

For a more comprehensive understanding of how Kubernetes handles logs across nodes and clusters, check out our complete guide to Kubernetes logging.

To view all events sorted by time:

kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'

Field selectors don't support time-range comparisons, so to focus on the most recent activity you can instead sort events by when they were last seen:

kubectl get events --sort-by='.lastTimestamp'

To view only warning-type events (which often indicate potential problems):

kubectl get events --field-selector type=Warning

You can also monitor events in real-time using the --watch flag. This is helpful when you’re actively troubleshooting and want to immediately observe what happens after deploying or modifying resources:

kubectl get events --watch

If you’re investigating a specific pod, deployment, or service, you can filter events to focus only on that object. For example:

kubectl get events --field-selector involvedObject.name=my-pod

In case you’re dealing with pods not getting scheduled, you can filter events with a reason set to “FailedScheduling.” This will show why Kubernetes couldn’t place the pod on a node, such as due to insufficient resources or affinity conflicts:

kubectl get events --field-selector reason=FailedScheduling
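When those events point to insufficient resources, it helps to compare each node's allocatable capacity with what is already requested. kubectl describe node prints both, and kubectl top nodes (if the metrics-server add-on is installed) shows live usage:

kubectl describe node <node-name>

kubectl top nodes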

Step 2: Using Audit Logs for In-Depth Troubleshooting

While events help you understand what's happening, audit logs let you see who did what at the API level, which is essential for security investigations or for tracking administrative actions.

Audit logs are not enabled by default. To enable them, you must configure an audit policy. Here’s a sample audit policy configuration that captures detailed logs for core resources like pods, services, deployments, etc.:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: RequestResponse
  resources:
  - group: ""
    resources: ["pods", "services"]
  - group: "apps"
    resources: ["deployments", "replicasets"]
- level: Request
  resources:
  - group: ""
    resources: ["configmaps", "secrets"]

Once configured, the API server records every matching request as a structured audit event like the one below:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "4d2c8b7a-f3e1-4b2a-9c8d-1e3f5a7b9c2d",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/production/pods/web-app-7d4b8c9f-xyz",
  "verb": "delete",
  "user": {
    "username": "admin@company.com",
    "groups": ["system:authenticated"]
  },
  "sourceIPs": ["192.168.1.100"],
  "userAgent": "kubectl/v1.28.0",
  "objectRef": {
    "resource": "pods",
    "namespace": "production",
    "name": "web-app-7d4b8c9f-xyz"
  },
  "responseStatus": {
    "code": 200
  },
  "requestReceivedTimestamp": "2024-01-15T10:30:00.000Z",
  "stageTimestamp": "2024-01-15T10:30:00.123Z"
}

Once audit logging is enabled, each entry captures important details such as who made the request, from which IP address and client, which verb was used, which resource was affected, and exactly when it happened.

Example: An audit log might show that a pod was deleted at a specific time, by a specific admin user, from a certain IP address. This level of transparency is crucial when diagnosing problems caused by accidental or unauthorized changes.
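Because the log backend writes one JSON object per line, a tool such as jq makes it straightforward to pull out the records you care about, for example every delete issued in the production namespace (the log path is the one assumed in the API server flags above):

jq 'select(.verb=="delete" and .objectRef.namespace=="production")' /var/log/kubernetes/audit.log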

8. Using Kubernetes Dashboard and Visual Tools

While command-line tools like kubectl offer powerful ways to inspect your Kubernetes cluster, visual tools simplify cluster management, especially when identifying patterns across metrics, logs, and events.

Step 1: Kubernetes Dashboard Overview

The Kubernetes Dashboard is a web-based user interface that lets you manage cluster resources visually. It provides detailed insights into deployments, resource usage, logs, and events, making it easier to diagnose issues without needing to run multiple CLI commands.

The Dashboard is not installed by default, and many teams avoid running it in production environments due to security concerns. However, it can be deployed manually as follows:

1. Deploy the Dashboard: Run the following command to apply the recommended configuration:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml

2. Create a service account for access.

kubectl create serviceaccount dashboard-admin -n kubernetes-dashboard

kubectl create clusterrolebinding dashboard-admin --clusterrole=cluster-admin --serviceaccount=kubernetes-dashboard:dashboard-admin

3. Generate Access Token:

kubectl create token dashboard-admin -n kubernetes-dashboard
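One simple way to reach the UI without exposing it outside the cluster is through kubectl proxy:

kubectl proxy

Then open http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ in your browser and sign in with the token generated above.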

Once deployed, the Dashboard lets you browse workloads, inspect resource usage, read container logs, and review cluster events from a single interface.

Example Use Case:

Suppose your application experiences intermittent failures. The Dashboard may show that memory usage spikes align with these failures, while the event log shows that pods are being OOMKilled. This kind of pattern is easier to identify visually than by reading raw CLI output.
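To confirm that suspicion from the command line, you can read the container's last terminated state, which records OOMKilled as the reason (the pod name here is just a placeholder):

kubectl get pod web-app-7d4b8c9f-xyz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'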

9. Implementing Health Checks and Probes

Health checks in Kubernetes function similarly to routine medical checkups, helping to detect issues early and ensuring everything is functioning as expected.

Kubernetes uses probes to monitor the health and availability of your application containers. These probes enable the cluster to detect issues and take automated actions, such as restarting containers or stopping traffic routing, when necessary.

Understanding Liveness, Readiness, and Startup Probes

Kubernetes provides three types of probes, each serving a specific role in maintaining container health:

  1. Liveness Probe: Checks if the container is still running. If it fails repeatedly, Kubernetes restarts the container.
  2. Readiness Probe: Checks if the container is ready to accept traffic. If this fails, the container is temporarily removed from the service endpoints.
  3. Startup Probe: Provides containers with additional time to complete their startup logic before other probes begin. This is useful for applications with longer boot times.

Example: Configuring All Three Probes

Below is a configuration example that combines all three types of probes in a single deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-application
  template:
    metadata:
      labels:
        app: web-application
    spec:
      containers:
      - name: web-app
        image: my-app:v1.2.3
        ports:
        - containerPort: 8080
        # Startup probe - gives the app time to initialize
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30  # 30 * 5 = 150 seconds to start
          successThreshold: 1
        # Liveness probe - restarts container if unhealthy
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
        # Readiness probe - removes from service if not ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

How These Probes Work Together

While the container is starting, only the startup probe runs; the liveness and readiness probes are held back until it succeeds, which prevents a slow-booting application from being restarted prematurely. Once the startup probe passes, the liveness probe restarts the container if it becomes unresponsive, while the readiness probe controls whether the pod receives traffic through its Service without triggering restarts.
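After applying a manifest like the one above, failed probes show up as Warning events with the reason Unhealthy, so you can verify that the probes behave as intended with:

kubectl describe deployment web-application

kubectl get events --field-selector reason=Unhealthy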

10. Advanced Debugging Techniques

While standard Kubernetes debugging methods handle many day-to-day issues, there are times when more advanced techniques are needed, especially for diagnosing complex performance bottlenecks, unexpected application behavior, or deep network-level problems that basic tools can’t resolve.

Step 1: Using Ephemeral Containers for Live Debugging

Ephemeral containers are a powerful way to troubleshoot live applications without restarting pods or altering their state. They allow you to temporarily inject a debugging container into a running pod, ideal for production debugging where uptime is critical.

For example, to initiate a basic debugging container within a live pod:

kubectl debug <pod-name> -it --image=busybox --target=<container-name>

For a fuller shell environment with bash (and the ability to install tools like curl or dig), use an image like Ubuntu:

kubectl debug database-pod -it --image=ubuntu --target=postgres -- bash

Practical Example: Network Issue Investigation

Imagine your web application is facing intermittent connectivity issues. You can attach a debugging container with networking tools like netshoot:

kubectl debug web-app-7d4b8c9f-xyz -it --image=nicolaka/netshoot --target=web-app

Inside the debugging container, you can perform several diagnostics:

Check service connectivity:

ping database-service

nslookup database-service

Test open ports:

telnet database-service 5432

Inspect networking interfaces:

ip addr show

ss -tuln

Validate DNS resolution:

dig database-service.default.svc.cluster.local

Monitor network traffic:

tcpdump -i any port 5432

Inspect running processes:

ps aux

Examine the file system:

ls -la /app/

cat /app/config.yaml

This kind of live environment debugging allows for pinpointing issues that might only occur under real production conditions.

Step 2: Leveraging kubectl debug for Broader Scenarios

The kubectl debug command also supports more advanced operations beyond ephemeral containers:

Create a full debug copy of a pod:

kubectl debug web-app-7d4b8c9f-xyz --copy-to=web-app-debug --image=ubuntu --set-image=web-app=ubuntu -- sleep 1d

This creates a copy of the pod running a different image (swapped in via --set-image) and keeps it alive for a day, so you can open an interactive shell inside the copy:

kubectl exec -it web-app-debug -- bash

Debug at the node level: You can launch a privileged pod on a node to investigate node-level issues:

kubectl debug node/worker-node-1 -it --image=ubuntu

Inside the privileged container, you can access the host’s filesystem and services:

chroot /host bash

systemctl status kubelet

journalctl -u kubelet -f
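From that same chrooted session you can also query the container runtime directly, assuming the node uses a CRI runtime and has crictl installed (which kubeadm-based nodes typically do):

crictl ps

crictl logs <container-id>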

Add profiling containers for performance analysis:
If you’re looking into CPU profiling or memory leaks, a container with Go or another profiling tool can help:

kubectl debug web-app-7d4b8c9f-xyz -it --image=golang:1.21 --target=web-app
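As a sketch, if the target is a Go service that already exposes the standard net/http/pprof endpoints on port 8080 (an assumption, not something shown above), the debug container shares the pod's network namespace, so you can capture a 30-second CPU profile over localhost:

go tool pprof 'http://localhost:8080/debug/pprof/profile?seconds=30'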

Why These Techniques Matter

Advanced debugging isn't just about having extra commands; it's about having the flexibility to access low-level details without affecting production workloads. With ephemeral containers, node-level access, and full pod duplication, you can troubleshoot virtually any problem live and in context, minimizing guesswork and downtime.

Conclusion

Effectively troubleshooting Kubernetes relies on knowing when and how to apply the right debugging approach. Tools like kubectl, events, and audit logs are crucial for day-to-day debugging. However, combining these with a dedicated Kubernetes observability platform can enhance visibility, reduce MTTR (mean time to resolution), and ensure smoother operations across your Kubernetes environment.