Guest post originally published on Logiq’s blog by Ajit Chelat

Kubernetes is one of the most popular choices for container management and automation today. Even a well-run Kubernetes setup generates an enormous number of new metrics every day, which makes monitoring cluster health quite challenging. You might find yourself sifting through many different metrics without being entirely sure which ones are the most insightful and warrant the most attention.

As daunting as this may seem, you can hit the ground running by knowing which of these metrics provide the right kind of insight into the health of your Kubernetes clusters. Observability platforms can help you monitor the right metrics for your clusters, but knowing exactly which ones to watch will help you stay on top of your monitoring needs. In this article, we take you through a few Kubernetes health metrics that top our list.

Crash Loops

A crash loop is the last thing you’d want to go undetected. During a crash loop, your application breaks down as a pod starts, crashes, and restarts over and over. Many different problems can lead to a crash loop, making it tricky to identify the root cause. Being alerted when a crash loop occurs helps you quickly narrow down the list of causes and take emergency measures to keep your application available.
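As a minimal sketch of what an automated check could look like (assuming the official Python `kubernetes` client and a working kubeconfig), you could periodically flag pods stuck in `CrashLoopBackOff`:

```python
# Minimal sketch: flag pods stuck in CrashLoopBackOff.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    for status in pod.status.container_statuses or []:
        waiting = status.state.waiting
        if waiting and waiting.reason == "CrashLoopBackOff":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                  f"container {status.name} restarted {status.restart_count} times")
```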

Cluster State Metrics

Another critical set of metrics to keep an eye on is your cluster state. You should be able to track aggregated resource usage across all the nodes in your cluster, along with node status and the number of desired, current, available, and unavailable pods. Monitoring your cluster state and evaluating the resulting metrics gives you a topline view of your cluster’s overall health and keeps you apprised of issues with your nodes and pods. Based on these state metrics, you can decide whether you need to investigate a larger problem or scale your cluster.

These metrics also let you evaluate how many resources your nodes are using. You’ll see how many nodes you have and how many of them are still available, which in turn tells you exactly what you’re paying for and whether you need to adjust the number and size of the nodes you use.
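For illustration, here’s a rough sketch (again assuming the Python `kubernetes` client) of how you might pull a topline view of node readiness and pod phases straight from the API server:

```python
# Minimal sketch: a topline view of node readiness and pod phases.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from collections import Counter
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

nodes = v1.list_node().items
ready = sum(
    1 for n in nodes
    for c in n.status.conditions
    if c.type == "Ready" and c.status == "True"
)
print(f"Nodes: {ready}/{len(nodes)} ready")

phases = Counter(p.status.phase for p in v1.list_pod_for_all_namespaces().items)
print(f"Pods by phase: {dict(phases)}")  # e.g. Running, Pending, Failed
```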

Disk and Memory Pressure

Disk pressure is a metric that indicates whether your nodes are using disk space too quickly, or too much of it, based on the usage thresholds you’ve set in your configuration. Monitoring this metric helps you determine when you need to add disk space. It can also indicate that your application isn’t functioning as designed and is using more disk space than it should.

Memory pressure is a metric that indicates how much memory a node is using. Monitoring this metric helps keep nodes from running out of memory and helps you identify nodes with over-allocated memory that are needlessly inflating your infrastructure spend. Consistently high memory pressure can also be a sign that your applications are leaking memory.
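Both conditions are reported per node, so a simple sketch like the following (assuming the Python `kubernetes` client) can surface nodes under pressure:

```python
# Minimal sketch: surface nodes reporting DiskPressure or MemoryPressure.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions:
        if cond.type in ("DiskPressure", "MemoryPressure") and cond.status == "True":
            print(f"{node.metadata.name}: {cond.type} ({cond.reason}: {cond.message})")
```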

Network Unavailable

You’d want to know immediately when something is wrong with your network. After all, your nodes and applications need network connectivity to function. This metric lets you know when issues are hampering the network connectivity of your nodes, whether because of improper network configuration or a physical connection problem with your hardware.
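One lightweight way to catch this, sketched below with the Python `kubernetes` client’s watch helper, is to stream node updates and alert whenever the `NetworkUnavailable` condition turns true:

```python
# Minimal sketch: watch node updates and report NetworkUnavailable conditions.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

for event in watch.Watch().stream(v1.list_node, timeout_seconds=300):
    node = event["object"]
    for cond in node.status.conditions or []:
        if cond.type == "NetworkUnavailable" and cond.status == "True":
            print(f"ALERT {node.metadata.name}: network unavailable ({cond.reason})")
```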

CPU Utilization

Knowing how many CPU cycles your nodes use is vital to ensuring that your nodes use their allocated CPU resources judiciously. If your applications or nodes exhaust their allocated processing resources, you’ll have to increase the CPU allocation or add nodes to your cluster. If your nodes or applications are using fewer CPU cycles than you’re paying for, you should re-evaluate the CPU allocation and downgrade if necessary. Monitoring CPU utilization helps you stay on top of both scenarios and keeps your deployments running efficiently.
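As a rough sketch, assuming metrics-server is installed (it serves the `metrics.k8s.io` API used below) and you’re using the Python `kubernetes` client, you could compare each node’s CPU usage against its allocatable CPU:

```python
# Minimal sketch: compare per-node CPU usage against allocatable CPU.
# Assumes the official `kubernetes` Python client and that metrics-server
# is installed, so the metrics.k8s.io API is available.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
metrics = client.CustomObjectsApi().list_cluster_custom_object(
    "metrics.k8s.io", "v1beta1", "nodes"
)

def cpu_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('250m', '2', '1500000000n') to cores."""
    if quantity.endswith("n"):
        return int(quantity[:-1]) / 1e9
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1e3
    return float(quantity)

allocatable = {n.metadata.name: cpu_cores(n.status.allocatable["cpu"])
               for n in v1.list_node().items}

for item in metrics["items"]:
    name = item["metadata"]["name"]
    usage = cpu_cores(item["usage"]["cpu"])
    print(f"{name}: {usage:.2f} / {allocatable[name]:.2f} cores "
          f"({usage / allocatable[name]:.0%})")
```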

Job Failures

Kubernetes Jobs are controllers that ensure pods run to completion and are retired once they’ve served their intended purpose. Jobs sometimes fail to complete successfully, whether because of nodes rebooting, pods going into crash loops, or resource exhaustion. Either way, you’d want to know about job failures as soon as they occur.

Job failures don’t necessarily mean that your application is inaccessible, but ignoring them can lead to more significant issues for your deployments down the line. Monitoring job failures closely helps you recover quickly and avoid these issues in the future.
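A simple sketch along these lines (assuming the Python `kubernetes` client) lists every Job that reports failed pods:

```python
# Minimal sketch: list Jobs that report failed pods.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

for job in batch.list_job_for_all_namespaces().items:
    failed = job.status.failed or 0
    if failed > 0:
        print(f"{job.metadata.namespace}/{job.metadata.name}: {failed} failed pod(s)")
```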

DaemonSets

DaemonSets ensure that every node in your Kubernetes cluster runs a copy of a specific pod of your choosing. They’re especially useful when you want to run a monitoring service pod on all your existing nodes and on any new nodes added to the cluster.

Monitoring DaemonSets helps you understand the health of your clusters. Ideally, the number of DaemonSet pods observed in a cluster should match the number desired. If these numbers aren’t identical, at least one of your DaemonSets has likely failed.
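A quick way to check this, sketched with the Python `kubernetes` client, is to compare each DaemonSet’s desired pod count against the number of pods actually ready:

```python
# Minimal sketch: compare desired vs. ready pods for each DaemonSet.
# Assumes the official `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

for ds in apps.list_daemon_set_for_all_namespaces().items:
    desired = ds.status.desired_number_scheduled
    ready = ds.status.number_ready or 0
    if ready != desired:
        print(f"{ds.metadata.namespace}/{ds.metadata.name}: {ready}/{desired} pods ready")
```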

Monitoring Kubernetes Health Metrics

Staying on top of all Kubernetes health metrics is crucial to ensure early detection, prevention, and timely diagnosis of issues that can bring down your clusters. Arming yourself with the right monitoring strategy, knowledge of which Kubernetes health metrics to focus on, and the right set of monitoring tools is the best way to ensure that your production environment is always up and running. 

We at LOGIQ have built a monitoring tool that monitors Kubernetes clusters of all sizes, ensures that nothing goes undetected, and keeps costs to a bare minimum while providing Kubernetes observability like no one else does. Talk to us about your Kubernetes infrastructure and what you’re looking to monitor. We can get you set up in under five minutes and walk you through how LOGIQ can be the key pillar for your monitoring needs.