Diagnosing and Recovering etcd: Practical Tools for Kubernetes Operators
When Kubernetes clusters experience serious issues, the symptoms are often vague but the impact is immediate. Control plane requests slow down. API calls begin to time out. In the worst cases, clusters stop responding altogether.
More often than not, etcd sits at the center of these incidents.
Because etcd is both small and critical, even minor degradation can cascade quickly. And when something goes wrong, operators are frequently left piecing together logs, metrics, and tribal knowledge under pressure. The goal of the recent work around etcd diagnostics and recovery is simple: help platform teams move faster from symptom to signal and only reach for recovery when it’s truly necessary.
This post walks through the motivation behind that work, introduces the etcd-diagnosis tooling, and explains how it fits into real-world Kubernetes operations, including environments like vSphere Kubernetes Service (VKS).
Why etcd incidents are so hard to reason about
etcd failures rarely announce themselves clearly. Instead, operators tend to encounter messages like:
apply request took too long
etcdserver: mvcc: database space exceeded
These errors don’t immediately tell you why the system is unhealthy. Is it disk I/O? Network latency between members? Resource pressure? Why did etcd exhaust its space quota? Or some combination of all of the above?
Historically, diagnosing these issues has required:
- Deep familiarity with etcd internals
- Knowing which metrics matter and where to find them
- Manually collecting evidence that upstream maintainers will eventually ask for anyway
That gap between “something is wrong” and “here’s what’s actually happening” is where most time is lost during an incident.
From symptoms to clarity with etcd-diagnosis
The etcd-diagnosis [1] tool was designed to close that gap.
At its core, the tool provides a single command—etcd-diagnosis report—that generates a comprehensive diagnostic report describing the state of an etcd cluster at a point in time. Rather than asking operators to guess which signals matter, the report gathers the data that consistently proves useful during real production incidents.
This includes:
- Cluster health and membership status
- Disk I/O latency, including WAL fsync behavior
- Network round-trip times between members
- Resource pressure signals (memory, disk usage)
- Relevant etcd metrics that often require manual scraping
The output serves two equally important purposes:
- Local triage: helping operators quickly understand whether an issue is related to storage, networking, or resource pressure.
- Escalation readiness: providing a concrete artifact that can be shared upstream without repeated back-and-forth.
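In practice, capturing the report as a timestamped artifact makes it easy to attach to an escalation. The sketch below shows one hypothetical way to do that; the `report` subcommand comes from the post, but any connection flags the tool needs are not shown here, so consult the etcd-diagnosis README. The file-naming scheme is an assumption, and the invocation is guarded so it is copy-paste safe on machines without the tool installed.

```shell
# Hypothetical capture workflow: save the diagnostic report with a timestamp
# so it can be shared during escalation. Naming scheme is illustrative.
REPORT_FILE="etcd-report-$(date +%Y%m%d-%H%M%S).txt"

# Guard: only run if the tool is actually installed on this machine.
if command -v etcd-diagnosis >/dev/null 2>&1; then
  etcd-diagnosis report > "$REPORT_FILE"
fi
```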
Quick checks vs. deep diagnostics
Not every issue requires a full diagnostic report. For initial triage, standard etcdctl commands are often sufficient to answer basic questions:
- Are all members healthy?
- Is quorum intact?
- Are Raft indexes and applied indexes progressing?
Commands like:
etcdctl endpoint status --cluster
etcdctl endpoint health --cluster
etcdctl member list
can quickly confirm whether the cluster is fundamentally functional.
In VKS environments, it’s worth noting that etcdctl may not be available directly on the host VM. In those cases, running etcdctl inside the etcd container provides equivalent visibility without additional tooling.
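A sketch of that pattern is below. The pod name, namespace, and certificate paths are assumptions: kubeadm-style control planes keep etcd certificates under /etc/kubernetes/pki/etcd, but VKS layouts may differ, so adjust to match your environment. The block is guarded so it is harmless on machines without kubectl.

```shell
# Hypothetical: exec etcdctl inside the etcd static pod when the binary is
# not available on the host VM. Substitute your actual pod name.
ETCD_POD="etcd-control-plane-1"

if command -v kubectl >/dev/null 2>&1; then
  kubectl -n kube-system exec "$ETCD_POD" -- sh -c '
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint health --cluster'
fi
```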
When these commands fail, or when symptoms persist despite healthy-looking output, that’s the signal to move beyond surface checks and generate a full diagnostic report.
Understanding common etcd failure modes
Two classes of issues show up repeatedly in production environments and are explicitly addressed by the diagnostic tooling.
Database space exhaustion
The error “mvcc: database space exceeded” indicates that etcd has reached its storage quota, which defaults to 2GiB. While compaction and defragmentation are often necessary, they are not the first question operators should ask.
The more important question is: what data is consuming the space?
The diagnostic workflow emphasizes identifying high-volume keys and understanding why they exist. Even when an etcd instance is down, tools like iterate-bucket can inspect the on-disk database and surface which prefixes are driving growth, information that is critical for preventing repeat incidents.
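Once the source of growth is understood and addressed, reclaiming space follows the standard maintenance sequence documented for etcd: compact old revisions, defragment each member, then clear the NOSPACE alarm. The endpoint and TLS flags are omitted below for brevity and must be supplied for your cluster; the block is guarded so it does nothing on machines without etcdctl.

```shell
# The default storage quota mentioned above, for reference: 2 GiB.
QUOTA_DEFAULT_BYTES=$((2 * 1024 * 1024 * 1024))

# Illustrative maintenance sequence; add your own --endpoints/TLS flags.
if command -v etcdctl >/dev/null 2>&1; then
  # Extract the latest revision from endpoint status JSON.
  rev=$(etcdctl endpoint status --write-out=json \
        | grep -o '"revision":[0-9]*' | grep -o '[0-9]*' | head -1)

  etcdctl compact "$rev"   # drop history older than the current revision
  etcdctl defrag           # release free pages to the filesystem (run per member)
  etcdctl alarm disarm     # clear the NOSPACE alarm once space is reclaimed
fi
```

Note that defragmentation blocks reads and writes on the member while it runs, so it should be performed one member at a time.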
“Apply request took too long”
This message typically points to performance degradation rather than functional failure.
Common root causes include:
- Disk I/O latency, often visible through slow WAL fsync operations
- Network latency between etcd members
- Resource pressure, such as CPU saturation or memory contention
Rather than forcing operators to manually correlate logs and metrics, these signals are already captured in the diagnostic report, making it easier to distinguish between environmental issues and etcd-specific behavior.
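If you want to spot-check these signals by hand, the relevant histograms are exposed on etcd’s /metrics endpoint; the metric names below are stable in upstream etcd, while the endpoint address and certificate paths are illustrative and will differ per environment. The block is guarded and tolerant of an unreachable endpoint so it is safe to paste anywhere.

```shell
# Metric names from upstream etcd; these are what the diagnostic report
# collects for disk fsync latency and peer round-trip time.
WAL_METRIC="etcd_disk_wal_fsync_duration_seconds"
RTT_METRIC="etcd_network_peer_round_trip_time_seconds"

# Illustrative endpoint and cert paths; adjust for your control plane.
if command -v curl >/dev/null 2>&1; then
  curl -s \
    --cacert /etc/kubernetes/pki/etcd/ca.crt \
    --cert   /etc/kubernetes/pki/etcd/server.crt \
    --key    /etc/kubernetes/pki/etcd/server.key \
    https://127.0.0.1:2379/metrics \
  | grep -E "$WAL_METRIC|$RTT_METRIC" \
  || true   # tolerate an unreachable endpoint in this sketch
fi
```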
Recovery is a last resort, and that’s intentional
When an etcd cluster truly loses quorum or becomes unrecoverable through normal means, etcd-recovery [2] exists to rebuild the cluster safely from persisted data.
However, recovery is intentionally framed as a last resort.
If a single member fails but quorum is still intact, automated systems, such as Cluster API or Kubernetes control plane reconciliation, are typically responsible for replacing the failed node. Recovering the entire cluster prematurely can introduce more risk than it removes.
This distinction is critical: diagnostics help operators decide whether recovery is warranted at all. In many cases, the right action is to fix the underlying infrastructure issue and allow the system to heal itself.
Building calmer, more predictable operations
The real value of this work isn’t just faster recovery; it’s fewer unnecessary recoveries in the first place.
By giving operators better visibility into etcd behavior, the diagnostics tooling helps replace guesswork with evidence. Incidents become easier to reason about, escalations become more productive, and recovery actions are taken deliberately rather than under panic.
For platforms like vSphere Kubernetes Service, where reliability and operational clarity matter at scale, that shift, from reactive heroics to disciplined diagnosis, is a meaningful step forward.