TL;DR: Precision Recovery from etcd in 5 Steps

  1. Restore Snapshot to a local etcd data directory.
  2. Spin up a local etcd instance with the restored data.
  3. Query the specific key (e.g., ConfigMap) using etcdctl.
  4. Decode with auger to extract clean YAML.
  5. Apply with kubectl to restore into your live cluster.
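
Condensed into a single terminal session, the whole flow looks roughly like this (a sketch using the example names from the walkthrough below; adjust paths, namespaces, and endpoints to your environment):

# 1. Restore the snapshot into a throwaway data directory
etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd

# 2. Run a local etcd against the restored data
etcd --data-dir=recovery-etcd \
  --listen-client-urls=http://localhost:2379 \
  --advertise-client-urls=http://localhost:2379 &

# 3 + 4. Pull the key and decode it to clean YAML
etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config \
  --print-value-only | auger decode > app-config.yaml

# 5. Restore it into the live cluster
kubectl apply -f app-config.yaml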

This guide walks you through surgical resource recovery from etcd snapshots without triggering a full cluster restore. Whether you’re recovering from an accidental deletion or doing forensic debugging, this lightweight, targeted approach keeps downtime minimal.

Introduction: 🩺 Emergency in the Kubernetes Operating Room

etcd is the beating heart of your Kubernetes cluster—a distributed key-value store that faithfully maintains the state of every object inside your system. But what happens when just one resource—like a ConfigMap, Secret, or Deployment—gets deleted or corrupted?

Initiating a full cluster restore feels like performing open-heart surgery to fix a paper cut. It’s disruptive, risky, and often unnecessary.

That’s where surgical precision comes in.

Imagine your production environment is in crisis—an essential ConfigMap disappears, pods crash, and users are staring at error pages. A full rollback would cause more damage than the issue itself. What you need is a surgeon’s approach: restore just what’s broken, and nothing else.

In this blog, you’ll learn how to:

  1. Restore an etcd snapshot into a throwaway local etcd instance.
  2. Locate and decode a single resource with etcdctl and auger.
  3. Re-apply just that resource to your live cluster with kubectl.

Perfect for DevOps engineers, SREs, and Kubernetes administrators who value minimal-impact recoveries.

Prerequisites 🔧

To follow along, make sure you have:

  - etcdctl (etcd v3 API) for snapshot and key operations
  - auger, to decode Kubernetes’ protobuf-encoded values into YAML
  - kubectl with access to the target cluster
  - yq (optional), used in the cross-namespace scenario later on
  - An etcd snapshot of your cluster, or the access needed to take one

Always work in a staging environment first. Start with a clean snapshot:

etcdctl snapshot save live-cluster-snapshot.db
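
Against a real control plane, the save usually needs endpoint and TLS flags as well; here is a sketch assuming kubeadm’s default certificate paths (adjust for your distribution):

ETCDCTL_API=3 etcdctl snapshot save live-cluster-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key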

The Surgical Recovery Process 🏥

Let’s say a critical ConfigMap app-config in the production namespace was accidentally deleted. Here’s how to surgically bring it back:

🧬 Step 1: Prepare the Snapshot

If compressed, decompress your snapshot:

gunzip live-cluster-snapshot.db.gz

Then restore it:

etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd
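
Optionally, sanity-check the snapshot itself; etcdctl reports its hash, revision, and key count (on newer etcd releases the same subcommand lives under etcdutl):

etcdctl snapshot status live-cluster-snapshot.db -w table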

🩻 Step 2: Launch a Local etcd Instance

etcd --data-dir=recovery-etcd --listen-client-urls=http://localhost:2379 --advertise-client-urls=http://localhost:2379

Verify:

etcdctl --endpoints=localhost:2379 endpoint status
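
To confirm the restored data actually contains your cluster state, you can count the objects under /registry (a quick check against the local endpoint):

# count the Kubernetes objects captured in the restored snapshot
etcdctl --endpoints=localhost:2379 get /registry/ --prefix --keys-only | grep -c '^/registry'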

🔍 Step 3: Locate and Extract the Resource

Kubernetes stores ConfigMaps at keys like /registry/configmaps/<namespace>/<name>. List keys in the production namespace:

etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only

You’ll see something like:

/registry/configmaps/production/app-config

Extract and decode the ConfigMap:

etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config --print-value-only | auger decode > app-config.yaml

The resulting app-config.yaml might look like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  api-url: "https://api.example.com"
  log-level: "debug"
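
If a decode ever produces garbage, peek at the raw value first; Kubernetes prefixes protobuf-encoded objects with the magic bytes k8s followed by a null byte (a quick check, assuming xxd is available):

etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config \
  --print-value-only | head -c 16 | xxd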

💉 Step 4: Restore to the Cluster

Test the restoration with a dry-run:

kubectl apply -f app-config.yaml --dry-run=server

If all checks out, apply it:

kubectl apply -f app-config.yaml

Output:

configmap/app-config created
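
Then confirm the restored object is back and populated:

kubectl get configmap app-config -n production -o yaml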

🧹 Step 5: Clean Up

pkill etcd
rm -rf recovery-etcd app-config.yaml

Key etcd Paths Cheat Sheet 📋

| Resource Type | etcd Key Pattern |
| --- | --- |
| ConfigMaps | /registry/configmaps/<namespace>/<name> |
| Secrets | /registry/secrets/<namespace>/<name> |
| Deployments | /registry/deployments/<namespace>/<name> |
| Pods | /registry/pods/<namespace>/<name> |
| ServiceAccounts | /registry/serviceaccounts/<namespace>/<name> |
| Custom resources | /registry/<group>/<resource>/<namespace>/<name> |
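
The same extract-and-decode pattern works for any of these paths; for example, a Secret (a sketch, with app-credentials as a placeholder name):

etcdctl --endpoints=localhost:2379 get /registry/secrets/production/app-credentials \
  --print-value-only | auger decode > app-credentials.yaml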

Advanced Scenarios 🔍

💠 Restoring Across Namespaces

cat app-config.yaml | yq eval '.metadata.namespace = "dev"' | kubectl apply -f -

🔐 Encrypted Clusters (KMS)

If your cluster encrypts resources at rest (for example, Secrets via a KMS or aescbc provider), the values stored in etcd are ciphertext, and auger cannot decode them directly. Decrypt them first using the cluster’s EncryptionConfiguration (and KMS access, where applicable), as described in the Kubernetes encryption-at-rest documentation.

📦 Bulk Recovery

etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --print-value-only | auger decode > all-cm.yaml
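
If piping every value through a single auger invocation doesn’t decode cleanly, a per-key loop is a safe alternative (a bash sketch against the same local endpoint):

# decode each ConfigMap in the namespace into one multi-document YAML file
for key in $(etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only); do
  echo "---" >> all-cm.yaml
  etcdctl --endpoints=localhost:2379 get "$key" --print-value-only | auger decode >> all-cm.yaml
done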

Real-World Example ⏱️

Picture this: a developer mistakenly runs:

kubectl delete configmap app-config -n production

Or your CI/CD pipeline wipes it out in a bad deployment.
Instead of rolling back the entire cluster or restoring outdated backups, you perform a 5-minute precision restore—and your app is back online before the next Prometheus alert fires.

Troubleshooting Tips 🛠️

| Issue | Possible Fix |
| --- | --- |
| etcd not starting | Ensure no other etcd instance is running |
| Connection refused | Confirm etcd is listening on localhost:2379 |
| YAML not applying | Validate schema and resource references |
| Conflicts on apply | Try kubectl replace or delete+apply |
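
For the conflict case in particular, a replace, or a delete followed by a fresh apply, usually clears it:

kubectl replace -f app-config.yaml
# or, if the existing object is in a bad state:
kubectl delete configmap app-config -n production && kubectl apply -f app-config.yaml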

Final Thoughts 💡

Kubernetes admins often prepare for catastrophic failures but overlook the value of precision recovery. Being able to surgically extract and restore a single resource:

  - Keeps downtime to minutes instead of hours
  - Avoids the risk and disruption of a full cluster restore
  - Limits the blast radius to exactly the object that broke

Next time disaster strikes, you won’t scramble—you’ll operate.