In Part 1 of our series, we explored essential Kubernetes troubleshooting techniques that help DevOps engineers diagnose and resolve common cluster and application issues effectively. However, Kubernetes environments are complex, and there’s always more to uncover.
In this Part 2, we’ll dive deeper into additional troubleshooting strategies, covering advanced techniques and real-world scenarios that can save you time and prevent downtime. Whether you’re dealing with persistent pod failures, network issues, or cluster misconfigurations, these tips will equip you with the knowledge to tackle Kubernetes challenges with confidence.
6. Kubernetes Troubleshooting Storage: Resolving PVC Pending Errors
The PersistentVolumeClaim (PVC) Pending status is a common storage issue in Kubernetes, preventing applications from accessing persistent data. This typically results from misconfigured storage classes, missing volume provisioners, or insufficient available storage in the cluster.
Step 1: Inspecting PV and PVC Status
Start by listing all Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) across all namespaces. This command provides an overview of their status, access modes, capacity, and whether they are bound:
kubectl get pv,pvc --all-namespaces
Step 2: Troubleshooting Mounting Issues
To further investigate an unbound PVC stuck in the Pending state, use the following command:
kubectl describe pvc <pvc-name>
Check the Events section at the bottom of the output. It often reveals the root cause, such as:
- No matching PersistentVolume available
- Storage class mismatch
- Insufficient capacity
- Missing provisioner
Step 3: Verifying Storage Classes
Incorrect or non-existent storage classes are a common culprit. List all available storage classes:
kubectl get storageclass
Then, describe a specific one to inspect details like the provisioner and parameters:
kubectl describe storageclass <storage-class-name>
Ensure that your PVC references a valid, correctly configured storage class. If the specified provisioner does not exist or is misspelled, the volume will fail to provision.
Step 4: Common Mistakes and Resolution
Suppose you’ve defined a PVC that references a storage class named fast-ssd, but it’s failing to provision. Run:
kubectl describe pvc my-data-pvc
You might see an error like:
Warning ProvisioningFailed 3m persistentvolume-controller
storageclass.storage.k8s.io "fast-ssd" not found
Now, list all available storage classes to confirm:
kubectl get storageclass
If fast-ssd is missing, but gp2 or standard exists, update your PVC to use a valid class.
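For instance, a corrected claim might look like the following. This is a minimal sketch; the gp2 class name only applies if it actually appears in your cluster’s storage class list:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2 # must match a class listed by kubectl get storageclass
  resources:
    requests:
      storage: 10Gi
After applying the change, the PVC should move from Pending to Bound once the provisioner creates the backing volume.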
7. Using Event and Audit Logs: Deep System Analysis
Kubernetes provides two powerful tools for debugging: events and audit logs. These help you track what happened, when it happened, and why, giving you a timeline of system activities for root cause analysis.
Step 1: Understanding Kubernetes Events
Events in Kubernetes record what’s happening inside the cluster. You can list events across all namespaces and sort them by their creation time to see the most recent activity at the bottom. This helps correlate issues with recent system behavior.
For a more comprehensive understanding of how Kubernetes handles logs across nodes and clusters, check out our complete guide to Kubernetes logging.
To view all events sorted by time:
kubectl get events --all-namespaces --sort-by='.metadata.creationTimestamp'
To find events that occurred after a specific time, note that field selectors only support equality matching, so the timestamp comparison has to happen client-side, for example with jq:
kubectl get events --all-namespaces -o json | jq '.items[] | select(.lastTimestamp > "2023-10-01T10:00:00Z")'
To view only warning-type events (which often indicate potential problems):
kubectl get events --field-selector type=Warning
You can also monitor events in real-time using the --watch flag. This is helpful when you’re actively troubleshooting and want to immediately observe what happens after deploying or modifying resources:
kubectl get events --watch
If you’re investigating a specific pod, deployment, or service, you can filter events to focus only on that object. For example:
kubectl get events --field-selector involvedObject.name=my-pod
If pods are not getting scheduled, filter events with the reason set to FailedScheduling. This shows why Kubernetes couldn’t place the pod on a node, such as insufficient resources or affinity conflicts:
kubectl get events --field-selector reason=FailedScheduling
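Field selectors can also be combined with commas. For example, to narrow the search to scheduling failures in a single namespace (the production namespace here is just an illustration):
kubectl get events -n production --field-selector type=Warning,reason=FailedScheduling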
Step 2: Using Audit Logs for In-Depth Troubleshooting
While events help you understand what’s happening, audit logs let you see who did what at the API level, which is essential for security investigations or when tracking administrative actions.
Audit logs are not enabled by default. To enable them, you must configure an audit policy. Here’s a sample audit policy configuration that captures detailed logs for core resources like pods, services, deployments, etc.:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods", "services"]
      - group: "apps"
        resources: ["deployments", "replicasets"]
  - level: Request
    resources:
      - group: ""
        resources: ["configmaps", "secrets"]
Once configured, every matching API request is recorded as a structured JSON event. Here is an example entry capturing a pod deletion:
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "auditID": "4d2c8b7a-f3e1-4b2a-9c8d-1e3f5a7b9c2d",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/production/pods/web-app-7d4b8c9f-xyz",
  "verb": "delete",
  "user": {
    "username": "admin@company.com",
    "groups": ["system:authenticated"]
  },
  "sourceIPs": ["192.168.1.100"],
  "userAgent": "kubectl/v1.28.0",
  "objectRef": {
    "resource": "pods",
    "namespace": "production",
    "name": "web-app-7d4b8c9f-xyz"
  },
  "responseStatus": {
    "code": 200
  },
  "requestReceivedTimestamp": "2024-01-15T10:30:00.000Z",
  "stageTimestamp": "2024-01-15T10:30:00.123Z"
}
Once audit logging is enabled, logs will show important details such as:
- Which user made the API request
- From which IP address
- The HTTP verb used (e.g., GET, POST, DELETE)
- The resource affected (e.g., pod, deployment)
- The timestamp of the action
- Whether the request succeeded
Example: An audit log might show that a pod was deleted at a specific time, by a specific admin user, from a certain IP address. This level of transparency is crucial when diagnosing problems caused by accidental or unauthorized changes.
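Because each entry is a JSON object written one per line, you can query the log file directly. A minimal sketch with jq, assuming the audit log path configured earlier, that surfaces every pod deletion along with who performed it and when:
jq 'select(.verb == "delete" and .objectRef.resource == "pods") | {user: .user.username, pod: .objectRef.name, time: .requestReceivedTimestamp}' /var/log/kubernetes/audit.log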
8. Using Kubernetes Dashboard and Visual Tools
While command-line tools like kubectl offer powerful ways to inspect your Kubernetes cluster, visual tools simplify cluster management, especially when identifying patterns across metrics, logs, and events.
Step 1: Kubernetes Dashboard Overview
The Kubernetes Dashboard is a web-based user interface that lets you manage cluster resources visually. It provides detailed insights into deployments, resource usage, logs, and events, making it easier to diagnose issues without needing to run multiple CLI commands.
The Dashboard is not installed by default, and many teams avoid exposing it in production due to security concerns. However, it can be deployed manually as follows:
1. Deploy the Dashboard: Run the following command to apply the recommended configuration:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
2. Create a service account and grant it access:
kubectl create serviceaccount dashboard-admin -n kubernetes-dashboard
kubectl create clusterrolebinding dashboard-admin --clusterrole=cluster-admin --serviceaccount=kubernetes-dashboard:dashboard-admin
3. Generate an access token:
kubectl create token dashboard-admin -n kubernetes-dashboard
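4. Access the UI: Start a local proxy and open the Dashboard in your browser. The URL below matches the recommended v2.7.0 manifest deployed above:
kubectl proxy
Then browse to http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/ and sign in with the token generated in step 3.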
Once deployed, the Dashboard allows you to:
- Monitor CPU and memory usage over time
- Visualize event timelines
- Explore relationships between Kubernetes resources
- Stream application logs directly in your browser
Example Use Case:
Suppose your application experiences intermittent failures. The Dashboard may show that CPU usage spikes align with these failures, and the events log shows that pods are being OOMKilled. This kind of pattern is easier to identify visually than by reading raw CLI logs.
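You can confirm that pattern from the CLI as well, since a container’s last termination state records the OOM kill. The pod name here is a placeholder:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
If the output is OOMKilled, the container exceeded its memory limit and was terminated by the kernel.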
9. Implementing Health Checks and Probes
Health checks in Kubernetes function similarly to routine medical checkups, helping to detect issues early and ensuring everything is functioning as expected.
Kubernetes uses probes to monitor the health and availability of your application containers. These probes enable the cluster to detect issues and take automated actions, such as restarting containers or stopping traffic routing, when necessary.
Understanding Liveness, Readiness, and Startup Probes
Kubernetes provides three types of probes, each serving a specific role in maintaining container health:
- Liveness Probe: Checks if the container is still running. If it fails repeatedly, Kubernetes restarts the container.
- Readiness Probe: Checks if the container is ready to accept traffic. If this fails, the container is temporarily removed from the service endpoints.
- Startup Probe: Provides containers with additional time to complete their startup logic before other probes begin. This is useful for applications with longer boot times.
Example: Configuring All Three Probes
Below is a configuration example that combines all three types of probes in a single deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-application
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-application
  template:
    metadata:
      labels:
        app: web-application
    spec:
      containers:
        - name: web-app
          image: my-app:v1.2.3
          ports:
            - containerPort: 8080
          # Startup probe - gives the app time to initialize
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 30 # 30 * 5 = 150 seconds to start
            successThreshold: 1
          # Liveness probe - restarts container if unhealthy
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1
          # Readiness probe - removes from service if not ready
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
How These Probes Work Together
- Startup Probe: This is checked first. It runs every 5 seconds, allowing up to 150 seconds for the application to complete startup tasks such as initializing databases or loading configurations. During this time, other probes are paused.
- Liveness Probe: Once the startup probe succeeds, the liveness probe takes over. It ensures the container remains healthy. If the check fails three times in a row, Kubernetes automatically restarts the container.
- Readiness Probe: This ensures the container is prepared to handle incoming traffic. If the check fails (e.g., due to a temporary database outage), Kubernetes temporarily removes the pod from the load balancer without restarting it.
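After deploying, you can verify the probes are behaving as intended. Failed checks are recorded as events with the reason Unhealthy, so the same event-filtering techniques from earlier in this guide apply; the pod name below is a placeholder:
kubectl get events --field-selector reason=Unhealthy --sort-by='.lastTimestamp'
kubectl describe pod <pod-name>
The Events section of the describe output shows each recent probe failure along with the HTTP status or error that caused it.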
10. Advanced Debugging Techniques
While standard Kubernetes debugging methods handle many day-to-day issues, there are times when more advanced techniques are needed, especially for diagnosing complex performance bottlenecks, unexpected application behavior, or deep network-level problems that basic tools can’t resolve.
Step 1: Using Ephemeral Containers for Live Debugging
Ephemeral containers are a powerful way to troubleshoot live applications without restarting pods or altering their state. They allow you to temporarily inject a debugging container into a running pod, ideal for production debugging where uptime is critical.
For example, to initiate a basic debugging container within a live pod:
kubectl debug <pod-name> -it --image=busybox --target=<container-name>
To include specific debugging tools (like bash, curl, dig), use an image like Ubuntu:
kubectl debug database-pod -it --image=ubuntu --target=postgres -- bash
Practical Example: Network Issue Investigation
Imagine your web application is facing intermittent connectivity issues. You can attach a debugging container with networking tools like netshoot:
kubectl debug web-app-7d4b8c9f-xyz -it --image=nicolaka/netshoot --target=web-app
Inside the debugging container, you can perform several diagnostics:
Check service connectivity:
ping database-service
nslookup database-service
Test open ports:
telnet database-service 5432
Inspect networking interfaces:
ip addr show
ss -tuln
Validate DNS resolution:
dig database-service.default.svc.cluster.local
Monitor network traffic:
tcpdump -i any port 5432
Inspect running processes:
ps aux
And examine the file system:
ls -la /app/
cat /app/config.yaml
This kind of live environment debugging allows for pinpointing issues that might only occur under real production conditions.
Step 2: Leveraging kubectl debug for Broader Scenarios
The kubectl debug command also supports more advanced operations beyond ephemeral containers:
Create a full debug copy of a pod:
kubectl debug web-app-7d4b8c9f-xyz --copy-to=web-app-debug --image=ubuntu --set-image=web-app=ubuntu -- sleep 1d
This creates a new pod with the same configuration but a different image. Once the copy is running, open a shell in it to investigate:
kubectl exec -it web-app-debug -- bash
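The copy keeps running after you disconnect (it was started with sleep 1d), so delete it when you’re finished:
kubectl delete pod web-app-debug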
Debug at the node level: You can launch a privileged pod on a node to investigate node-level issues:
kubectl debug node/worker-node-1 -it --image=ubuntu
Inside the privileged container, you can access the host’s filesystem and services:
chroot /host bash
systemctl status kubelet
journalctl -u kubelet -f
Add profiling containers for performance analysis:
If you’re looking into CPU profiling or memory leaks, a container with Go or another profiling tool can help:
kubectl debug web-app-7d4b8c9f-xyz -it --image=golang:1.21 --target=web-app
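From there you could profile the target application directly. A minimal sketch, assuming the app is written in Go and exposes the pprof endpoint on localhost:6060 (an assumption, not something the example above guarantees); because containers in a pod share the network namespace, the ephemeral container can reach it over localhost:
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30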
Why These Techniques Matter
Advanced debugging isn’t just about having extra commands; it’s about having flexibility to access low-level details without affecting production workloads. With ephemeral containers, node-level access, and full pod duplication, you can troubleshoot virtually any problem live and in context, minimizing guesswork and downtime.
Conclusion
Effectively troubleshooting Kubernetes relies on knowing when and how to apply the right debugging approach. Tools like kubectl, events, and audit logs are crucial for day-to-day debugging. However, combining these with a dedicated Kubernetes observability platform can enhance visibility, reduce MTTR (mean time to resolution), and ensure smoother operations across your Kubernetes environment.