I’ve been using AI coding agents as part of my daily engineering workflow and wanted to understand how well they actually perform on real-world bugs.

To test this, I ran a series of structured experiments using bug reports from the Kubernetes repository, evaluating whether agents could produce correct, complete fixes without guidance in a large, multi-million-line codebase. My initial assumption was simple: success would largely depend on retrieval. Whether via retrieval-augmented generation (RAG) or filesystem search, a model that finds the right code should be able to generate the right fix.

That assumption didn’t fully hold. Even when agents surfaced the right files, they often failed to connect changes across them, misidentified the true scope of the issue, or produced fixes that were locally plausible but globally incorrect. The bottleneck wasn’t just finding code; it was reasoning over it in context.

*A flow chart showing the journey from input, through agent sessions, to output*

The setup

I took open pull requests from the kubernetes/kubernetes GitHub repo. Real bugs, actively being fixed by real contributors. I extracted just the issue description (not the PR description, not the diff, nothing that would leak the solution) and gave each issue to three different agent configurations:

- **RAG**: queries a semantic index of the repository and generates from the retrieved snippets, with no filesystem access.
- **Hybrid**: makes RAG queries first, then follows up with targeted filesystem exploration.
- **Local**: filesystem tools only (read, grep, directory listings), with no index.

Each agent ran in a completely isolated session. Same model (Claude Opus 4.6), same timeout (5 minutes), same output format. The only variable was how they could see code.

One critical constraint: the RAG and Hybrid agents were prompted to make RAG queries before producing any fix. Early experiments showed that without this mandate, hybrid agents would skip RAG entirely when the issue mentioned specific filenames, defeating the whole point of the comparison.

The test cases (Benchmark)

I used a set of real, in-flight bug fixes from the Kubernetes repository as a benchmark. Each test case is an open pull request where the agent is given only the issue description and asked to produce a patch. They span kubelet, scheduler, networking, storage, and apps, ranging from a one-line guard clause to a 900-line multi-file refactor.

| PR | Bug | SIG | Size | Files |
|---|---|---|---|---|
| #134540 | SubPath volume mount race condition (EEXIST) | sig/storage | XS | 3 |
| #138211 | Image pull record poisoning (preloaded images) | sig/node | M | 5 |
| #138244 | NUMA topology stall on GB200 systems (O(2^n)) | sig/node | L | 2 |
| #136013 | OOM score panic on zero memory capacity | sig/node | S | 2 |
| #138269 | Scheduler SchedulingGates missing Pod/Update event | sig/scheduling | S | 2 |
| #138158 | ServiceCIDR deletion race in allocator | sig/network | M | 3 |
| #135970 | Missing initializer in StatefulSet slice builder | sig/apps | M | 4 |
| #138000 | Windows kube-proxy stale remote endpoint cleanup | sig/network | XL | 5 |
| #138191 | Container status sort by attempt (clock rollback) | sig/node | M | 5 |

Each generated diff was compared against the corresponding pull request diff and graded across five dimensions, scored 0–4 each.

This evaluation uses the PR diff as the reference implementation. That isn’t a perfect ground truth. PRs reflect iterative discussion and may not represent the only or final correct solution. Here, it serves as a proxy for “what human contributors converged on under review constraints.”

Timing

Every session had a 5-minute ceiling. Here’s how long each actually took:

| PR | RAG | Hybrid | Local |
|---|---|---|---|
| #134540 | 55s | 3m00s | 5m00s |
| #138211 | 2m30s | 3m30s | 2m30s |
| #138244 | 41s | 44s | 2m00s |
| #136013 | 31s | 55s | 41s |
| #138269 | 44s | 1m12s | 1m23s |
| #138158 | 1m41s | 1m48s | 52s |
| #135970 | 1m23s | 2m43s | 1m32s |
| #138000 | 1m28s | 4m03s | 4m57s |
| #138191 | 1m34s | 3m54s | 2m52s |
| Average | 1m16s | 2m25s | 2m24s |

RAG is consistently the fastest. It doesn’t read files, navigate directories, or wait for grep results. It asks the index, gets snippets, and generates. Average wall-clock: 1 minute 16 seconds. This suggests retrieval cuts exploration overhead, but at the cost of structural exploration depth.

Hybrid is the slowest on average at 2 minutes 25 seconds. The mandatory RAG-first phase adds overhead before local exploration even begins. On complex PRs like #138000 and #138191, Hybrid ran close to or past 4 minutes. The tool-switching loop dominates latency more than actual file access.

Local matches Hybrid’s average, but for different reasons. Hybrid spends time on RAG queries followed by targeted file reads; Local burns it on broad exploration, grepping, finding files, and iterating over directories. Exploration cost is highly sensitive to how well the issue text anchors the search.

Token economics

Token usage isn’t just about how much context the agent reads; it’s dominated by how many times it calls the model.

Claude’s API is stateless, so every call replays the full conversation. That makes call count the primary cost multiplier.

A few terms used below:

- **Calls**: model invocations per session.
- **New tokens**: input tokens the model hadn’t seen earlier in that session.
- **Output**: tokens the model generated.
- **Cached (replay)**: previously sent conversation tokens re-read on each call.

Averages across all runs:

| Approach | Calls | New Tokens | Output | Cached (replay) | Total |
|---|---|---|---|---|---|
| RAG | 4 | 44k | 2k | 141k | 187k |
| Hybrid | 8 | 27k | 3k | 234k | 264k |
| Local | 6 | 23k | 4k | 162k | 189k |

Three patterns stand out:

- Hybrid makes the most calls (8 on average) and pays for each one in replayed context, ending with the highest total (264k).
- RAG pulls in the most new tokens (44k) but needs the fewest calls (4), which keeps its total the lowest (187k).
- Output tokens are tiny (2–4k) everywhere; cached replay dominates every total.

Across all runs, call count was the biggest driver of both cost and latency.

What I actually learned

Agents fix locally, not systemically

Agents are good at fixing the problem directly in front of them, but struggle to reason about the broader system.

PR #134540 is a good example. The correct fix required wrapping the error with `%w` so the caller could handle it, and then updating the caller to tolerate that specific case. Every agent instead swallowed the error at the source. Functionally similar, but architecturally wrong.

This shows up consistently: agents fix symptoms in isolation, but miss the contracts between components.

Scope discovery is the real bottleneck

The biggest failure mode wasn’t incorrect fixes; it was incomplete ones.

Agents routinely fixed the “main” bug but missed the adjacent changes.

The common pattern: agents don’t ask “what else needs to change?” They stop once the immediate issue appears resolved. This is most visible in multi-file changes, where agents fix the local bug but fail to propagate required updates across integration points.

Retrieval changes discovery, not reasoning

RAG meaningfully affects how agents find code, but not how they reason about it.

Forcing RAG usage improved results in some cases. On #138211, mandatory retrieval pushed the agent to discover the policy evaluation layer before implementing a fix, leading to a better architectural choice.

But the limitation remains: once the relevant code is found, the agent still reasons locally. Retrieval helps with navigation, not with understanding system-wide implications.

Agents prefer adding over reusing

When given a choice, agents tend to introduce new abstractions rather than reuse existing ones.

On #138191, the correct fix used the existing `RestartCount` field. All agents instead introduced a new `Attempt` field. Functionally correct, but architecturally heavier.

This pattern showed up repeatedly. Agents don’t reliably recognize when the system already contains the concept they need. They solve the problem in isolation rather than integrating with existing design. This also shows up in tests. Agents tend to add new tests rather than update existing ones, even when behavior changes require modifying assertions.

Issue quality dominates everything

Well-specified issues flatten the differences between approaches.

On PRs like #136013 and #138269, the issue descriptions named the exact file, function, and expected behavior. All approaches converged to high scores, with Local slightly ahead simply due to speed and direct access.

When the issue effectively acts as a spec, retrieval method matters far less. When the issue is vague, performance diverges significantly.

The bottom line

These results come from large codebases (millions of lines), where correctness depends on system-wide understanding rather than local edits.

They are not about raw model capability, but about how agent workflows behave under scale.

Indexing helps, but doesn’t solve scope

Semantic search (RAG) improves discovery in large repositories, but does not ensure coverage of all affected components.

Workflow matters more than model choice

Hybrid setups perform best only when retrieval is enforced. Otherwise, agents default to local reasoning and miss broader context.

Scope discovery is the core bottleneck

Agents reliably fix visible issues, but fail to consistently identify all dependent changes across the system.

This is the dominant failure mode in multi-file and integration-heavy changes.

Issue quality is the strongest lever

Well-defined bug reports reduce performance variance more than tooling or retrieval strategy.

Skills can help, but they don’t remove the system problem

This experiment did not use any explicit coding “skills” or curated agent playbooks. Only prompts and tool access.

In principle, structured skills (such as repo exploration strategies or architectural summarization) could improve performance by guiding agents toward better system-level reasoning.

However, in large, evolving codebases, these skills would need to be continuously maintained and updated to remain aligned with the actual repository structure. That makes them less like a one-time improvement and more like an additional system to operate and maintain.

As a result, while skills may improve outcomes, they do not remove the core bottleneck observed in this study: scope discovery and system-level reasoning under scale.

The benchmark used Claude Opus via OpenClaw with KAITO RAG Engine indexing the kubernetes/kubernetes repository at HEAD. All sessions ran in isolated subagents with no shared state, April 2026. Ground truth was the actual PR diff from each open pull request.