What a two-person SRE team learned building an AI investigation pipeline. Spoiler: the runbooks mattered more than the model.

Why we built this

At STCLab, our SRE team supports multiple Amazon EKS clusters running high-traffic production workloads. We’ve got the full observability stack in place: OpenTelemetry feeding into Mimir, Loki, and Tempo. Robusta OSS enriches Prometheus alerts with error logs, Grafana links, and team mentions before dropping them into Slack.

So the data was never the problem. The problem was what happened next. Every alert meant the same drill: check the pod, query Prometheus, dig through Loki, pull traces, try to correlate. Fifteen to twenty minutes, every single time. We wanted that first pass to happen automatically and show up in the same Slack thread.

HolmesGPT: Letting the LLM decide what to investigate

A cooler image (flow chart) of the SRE Alerts Pipeline

Alerts Pipeline

We went with HolmesGPT (CNCF Sandbox) because of how it works: the ReAct pattern. The LLM reads an alert, picks a tool, reads the result, then decides what to check next. If a pod restarts, it might start with the exit code, pull Loki logs across clusters through VPC peering, then look at CPU pressure in Prometheus. The path isn’t scripted it depends on what the model actually finds.

That matters in our case, because not every namespace looks the same. Some have the full picture: centralized logs, distributed traces, the works. Multi-tenant workloads often have none of that; for those namespaces, it’s kubectl and Prometheus only. We capture these differences in markdown runbooks, each with a metadata header:

  1. ## Meta
  2. scope: namespace=<target> only
  3. tools: kubectl, prometheus, loki, tempo
  4. caution: some containers excluded from log collection → use kubectl logs

Holmes calls fetch_runbook early in its investigation. The metadata tells it which tools are available and which ones to skip.

Making it work with Robusta

Our custom playbook is about 200 lines of Python. It covers what HolmesGPT doesn’t.

Robusta posts the alert to Slack before Holmes is done investigating, so our playbook has to find the right thread after the fact and post results as a reply. When Prometheus fires one alert per pod during a rollout, the playbook fingerprints at the workload level and suppresses repeats for 30 minutes. And since Robusta routes to different Slack channels by namespace, the playbook replicates that mapping to find where to post.

Runbooks changed everything

We started by focusing on model selection. What actually determined investigation quality was the runbooks.

Without runbooks, the model just guesses. It might check Istio metrics in namespaces that have no sidecars, or query Loki where nothing is being collected. Eventually it loops, burns through its step budget, and comes back with “I need more information.”

What fixed this wasn’t a better model. It was telling the model what not to do. Once we added exclusion rules to our runbooks (“no Loki, no Tempo, no Istio here; use kubectl and PromQL only”), wasted tool calls dropped from 16 to 2 per investigation.

We ran a controlled comparison to confirm this: the same ClickHouse handshake alert, tested four ways. With runbooks, Holmes matched the known error pattern in 3 to 4 tool calls and used the rest of its budget to verify. Without runbooks, it chased three entirely different hypotheses (proxy scaling, schema mismatch, port misconfiguration) and burned through 20+ steps before reaching a conclusion. Same model, same alert. The runbook didn’t hand it the answer. It just narrowed the search space enough that a 12-step budget was plenty.

We now maintain seven runbooks, organized by namespace and alert type. When an investigation comes back wrong, the first question we ask is “does the runbook cover this?” Not “do we need a better model?”

Screenshots: Same alert, same model. Left: with runbook guidance. Right: without runbooks.

Same alert, same model. Left: with runbook guidance. Right: without runbooks.

The model journey

We tested seven models across self-hosted and managed hosting.

Self-hosted came first, running on Spot GPUs managed by KubeAI (CNCF). The 7B model couldn’t produce valid tool calls. The 9B model’s thinking mode clashed with the ReAct loop and returned empty responses. A 14B looked promising, but Spot evictions kept killing our runs, and cold starts took 5 to 8 minutes while Karpenter spun up nodes.

Then we tried managed APIs through VPC endpoints, which keeps cluster data inside our infrastructure. Most models didn’t work; several choked on HolmesGPT’s prompt caching markers. Only one model family passed everything we needed: Korean output, Slack formatting, runbook retrieval, and cross-cluster log correlation. We also contributed a three-line upstream fix for pod identity authentication (PR #1850, merged).

Today we run a hybrid setup: self-hosted in staging, managed API in production. Switching between them is one YAML block:

modelList:

 primary:

   model: "provider/model-name"  # swap provider and model ID

   api_base: "https://endpoint"  # managed API or self-hosted

   temperature: 0

Cost comes out to about $0.04 per investigation, roughly $12 a month. Pipeline, playbook, runbooks — all unchanged regardless of backend.

What actually mattered

Some numbers. Workload-level deduplication takes around 40 raw daily alerts down to about 12 unique investigations. Engineers read a threaded summary in under two minutes instead of spending 15 to 20 on manual triage. Roughly 40% of investigations resolve on their own: OOMKilled, ImagePullBackOff, and other known patterns where Holmes matches a runbook and the root cause is obvious.

Here’s what we’d tell another team starting this.

Runbooks over models. We ran a controlled test where the same model scored 4.6 out of 5 with runbooks and 3.6 without, on the exact same alert. The exclusion rules we wrote into our runbooks moved the needle more than any model swap ever did.

Glue code is real work. That 200-line playbook handles timing, dedup, routing, and thread matching. HolmesGPT handles reasoning. You need both.

Design for model migration. We’ve swapped backends three times now without touching the pipeline. The playbook is the stable core. The model is the part you replace.

What’s next: we’re looking at Inspektor Gadget (CNCF) to feed eBPF-level network metrics — TCP retransmits, connection latency — into the same pipeline through Prometheus. The architecture stays the same. Holmes just gets better data to work with.

About STCLab: STCLab builds high-performance online traffic management software. NetFUNNEL (Virtual Waiting Room) and BotManager (bot mitigation) serve millions of concurrent users across 200 countries.

Questions? grace@stclab.com | stclab.com