If you’ve ever debugged a production incident, you know that the hardest part often isn’t the fix, it’s finding where to begin. Most on-call engineers end up spending hours piecing together clues, fighting time pressure, and trying to make sense of scattered data. You’ve probably run into one or more of these challenges: 

The challenges require a solution that can look across signals, recall patterns from past incidents, and guide you toward the most likely cause. 

This is where HolmesGPT, a CNCF Sandbox project, could help. 

 
HolmesGPT was accepted as a CNCF Sandbox project in October 2025. It’s built to simplify the chaos of production debugging – bringing together logs, metrics, and traces from different sources, reasoning over them, and surfacing clear, data-backed insights in plain language. 

What is HolmesGPT?

HolmesGPT is an open-source AI troubleshooting agent built for Kubernetes and cloud-native environments. It combines observability telemetry, LLM reasoning, and structured runbooks to accelerate root cause analysis and suggest next actions. 

Unlike static dashboards or chatbots, HolmesGPT is agentic: it actively decides what data to fetch, runs targeted queries, and iteratively refines its hypotheses – all while staying within your environment. 

Key benefits:

How it works 

When you run:

holmes ask “Why is my pod in crash loop back off state” 

HolmesGPT: 

  1. Understands intent → it recognizes you want to diagnose a pod restart issue 
  2. Creates a task list → breaks down the problem into smaller chunks and executes each of them separately  
  3. Queries data sources → runs Prometheus queries, collects Kubernetes events or logs, inspects pod specs including which pod 
  4. Correlates context → detects that a recent deployment updated the image   
  5. Explains and suggests fixes → returns a natural language diagnosis and remediation steps. 

Here’s a simplified overview of the architecture:

HolmesGPT architecture

Extensible by design 

HolmesGPT’s architecture allows contributors to add new components: 

Example of a simple custom tool: 

holmes:
  toolsets:
    kubernetes/pod_status:
      description: "Check the status of a Kubernetes pod."
      tools:
        - name: "get_pod"
          description: "Fetch pod details from a namespace."
          command: "kubectl get pod {{ pod }} -n {{ namespace }}"

Getting started

  1. Install Holmesgpt 

There are 4-5 ways to install Holmesgpt, one of the easiest ways to get started is through pip

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt

The detailed installation guide has instructions for helm, CLI and the UI. 

  1. Setup the LLM (Any Open AI compatible LLM) by setting the API Key  

In most cases, this means setting the appropriate environment variable based on the LLM provider.

  1. Run it locally 
holmes ask "what is wrong with the user-profile-import pod?" --model="anthropic/claude-sonnet-4-5" 

        

  1. Explore other features  

How to get involved 

HolmesGPT is entirely community-driven and welcomes all forms of contribution: 

Area How you can help 
Integrations Add new toolsets for your observability tools or CI/CD pipelines. 
Runbooks Encode operational expertise for others to reuse. 
Evaluation Help build benchmarks for AI reasoning accuracy and observability insights. 
Docs and tutorials Improve onboarding, create demos, or contribute walkthroughs. 
Community Join discussions around governance and CNCF Sandbox progression. 

All contributions follow the CNCF Code of Conduct

Further Resources