Guest post by Mohammed Naser, CEO of VEXXHOST

As we slowly continue our migration to Prometheus alerting with AlertManager, our strategy has been to build out a broad set of alerts first and then build more accurate, narrowed-down ones once we see them firing. In this case, I want to share an example of how we noticed an issue, built more precise alarms to pinpoint the source and reduce resolution time, and then fixed said issue.

We currently use all of the rules provided by the kubernetes-monitoring/kubernetes-mixin repository. In our case, we started seeing KubeDaemonSetRolloutStuck firing, which meant that certain pods were reporting that they were not ready. We ran some basic PromQL queries and noticed that all of these pods were up, running, and functional; their Ready condition status, however, was not True.
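The discrepancy can be reproduced with a couple of basic queries against the kube-state-metrics series (the exact checks we ran aren't shown in the post; these are a plausible sketch):

```promql
# Pods whose Ready condition is anything other than True
kube_pod_status_ready{condition="true"} == 0

# ...even though the same pods report a Running phase
kube_pod_status_phase{phase="Running"} == 1
```

If both queries return series for the same pod, it is running but stuck in an unready state, which is exactly what the DaemonSet rollout alert reacts to.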

At this stage, we figured the next step would be to find a way to identify the exact affected pods more reliably. That way, we could alert on those specific pods instead of on the DaemonSet in general, and avoid all the research of figuring out which pods aren't working, for a much faster resolution time. We started with something as simple as this:

kube_pod_status_ready{condition="false"} == 1

We then proceeded to add a few more bits that are necessary for monitoring our infrastructure and joined them to both the pod info (to get the node) and pod owner metrics. Afterward, we reshuffled some labels so that we could adequately do inhibitions inside our monitoring infrastructure.
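The post doesn't show the exact expression, but the join can be sketched roughly as follows, pulling the node from kube_pod_info and the owning workload from kube_pod_owner. Multiplying the right-hand side by 0 keeps the left-hand value while copying over the extra labels; matching on namespace as well as pod avoids collisions between identically named pods (the label choices here are assumptions):

```promql
(
  (kube_pod_status_ready{condition="false"} == 1)
  + on(namespace, pod) group_left(node) (0 * kube_pod_info)
)
+ on(namespace, pod) group_left(owner_kind, owner_name) (0 * kube_pod_owner)
```

The result is one series per unready pod, annotated with the node it runs on and the DaemonSet (or other controller) that owns it.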

Once we implemented that, we went from a few firing alerts for DaemonSets to around 50 firing (with the initial alert inhibited). While there are more alerts overall, they're all grouped by DaemonSet *and* include pod information, so it still fired only 6 alerts in our monitoring system. But this time, we had complete information about the affected nodes and pods, so we could start gathering everything necessary to resolve things.
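The inhibition and grouping mentioned above might look something like this in the Alertmanager configuration (the pod-level alert name and label set here are hypothetical, and the matcher syntax assumes a recent Alertmanager release):

```yaml
route:
  # One notification per DaemonSet, however many pods are unready.
  group_by: ['namespace', 'daemonset']

inhibit_rules:
  # Silence the broad mixin alert when a pod-level alert
  # for the same DaemonSet is already firing.
  - source_matchers:
      - alertname = KubeDaemonSetPodNotReady   # hypothetical pod-level alert
    target_matchers:
      - alertname = KubeDaemonSetRolloutStuck
    equal: ['namespace', 'daemonset']
```

For this to work, the pod-level alert must carry the same namespace and daemonset labels as the mixin's alert, which is why the label reshuffling step matters.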

At this point, using the following PromQL query, we noticed that the issues were all affecting specific nodes within the infrastructure for those DaemonSets, which helped us start narrowing down the most common issue across these nodes:

sum(
  (kube_pod_status_ready{condition="false"} == 1)
  + on(pod) group_left(node) (0 * kube_pod_info)
) by (node)

We started by deleting all of the affected pods to see if they would come back cleanly. In our architecture, killing and restarting these pods has no overall effect on the health of the system. Once we deleted all of the pods that were in an unready state, we saw them coming back up with the Ready condition set to True, which meant that everything was okay and all the alarms cleared out.
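A quick way to confirm the cleanup worked is to watch the original expression drop to nothing; the `or vector(0)` fallback (an illustrative addition, not from the original post) makes the query return an explicit zero instead of an empty result:

```promql
count(kube_pod_status_ready{condition="false"} == 1) or vector(0)
```

Once this reads 0, every pod is reporting Ready again and the pod-level alerts resolve on their own.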

Now, we hit a point where we know that everything is back up and running, but we still don't know exactly what happened. It's time to start digging into the precise root cause of this issue. Solving mysteries by intense testing and deduction brings me joy, but that's for another small post that goes through the step-by-step process!