TL;DR: Precision Recovery from etcd in 5 Steps
- Restore the snapshot to a local etcd data directory.
- Spin up a local etcd instance with the restored data.
- Query the specific key (e.g., ConfigMap) using etcdctl.
- Decode with auger to extract clean YAML.
- Apply with kubectl to restore into your live cluster.
This guide walks you through surgical resource recovery from etcd snapshots without triggering a full cluster restore. Whether you’re recovering from an accidental deletion or doing forensic debugging, this lightweight, targeted approach keeps downtime minimal.
Introduction: 🩺 Emergency in the Kubernetes Operating Room
etcd is the beating heart of your Kubernetes cluster—a distributed key-value store that faithfully maintains the state of every object inside your system. But what happens when just one resource—like a ConfigMap, Secret, or Deployment—gets deleted or corrupted?
Initiating a full cluster restore feels like performing open-heart surgery to fix a paper cut. It’s disruptive, risky, and often unnecessary.
That’s where surgical precision comes in.
Imagine your production environment is in crisis—an essential ConfigMap disappears, pods crash, and users are staring at error pages. A full rollback would cause more damage than the issue itself. What you need is a surgeon’s approach: restore just what’s broken, and nothing else.
In this blog, you’ll learn how to:
- Isolate and extract specific resources from an etcd snapshot
- Restore only what’s needed, directly into your live Kubernetes cluster
- Avoid unnecessary downtime and preserve cluster stability
Perfect for DevOps, SREs, and Kubernetes administrators who value minimal impact recoveries.
Prerequisites 🔧
To follow along, make sure you have:
- etcd v3.4+ — the etcd server binary (available from the etcd releases page)
- etcdctl — CLI to interact with etcd
- auger — CLI tool to decode etcd’s binary payloads into YAML
- kubectl — CLI to apply resources to your Kubernetes cluster
- Snapshot file — e.g., live-cluster-snapshot.db
Always work in a staging environment first. Start with a clean snapshot:
etcdctl snapshot save live-cluster-snapshot.db
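If your live etcd is TLS-secured (as with kubeadm-provisioned clusters, whose certificates typically live under /etc/kubernetes/pki/etcd; adjust the paths for your setup), the save command also needs endpoint and certificate flags, for example:
etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save live-cluster-snapshot.db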
The Surgical Recovery Process 🏥
Let’s say a critical ConfigMap app-config in the production namespace was accidentally deleted. Here’s how to surgically bring it back:
🧬 Step 1: Prepare the Snapshot
If compressed, decompress your snapshot:
gunzip live-cluster-snapshot.db.gz
Then restore it:
etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd
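Note that on etcd v3.5 and newer, the restore subcommand is deprecated in etcdctl in favor of the separate etcdutl binary; the equivalent command there is:
etcdutl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd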
🩻 Step 2: Launch a Local etcd Instance
etcd --data-dir=recovery-etcd --listen-client-urls=http://localhost:2379 --advertise-client-urls=http://localhost:2379
Verify:
etcdctl --endpoints=localhost:2379 endpoint status
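Optionally, a quick sanity check that the restored data really contains Kubernetes state:
etcdctl --endpoints=localhost:2379 get /registry --prefix --keys-only | head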
🔍 Step 3: Locate and Extract the Resource
Kubernetes stores ConfigMaps at keys like /registry/configmaps/<namespace>/<name>. List keys in the production namespace:
etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only
You’ll see something like:
/registry/configmaps/production/app-config
Extract and decode the ConfigMap:
etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config --print-value-only | auger decode > app-config.yaml
The resulting app-config.yaml might look like:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  api-url: "https://api.example.com"
  log-level: "debug"
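Depending on the object, the decoded YAML may also carry server-populated metadata such as uid, resourceVersion, creationTimestamp, or managedFields that you should strip before re-applying. A minimal sketch using yq (the same tool used in the cross-namespace example later); adjust the fields to whatever actually appears in your file:
yq eval -i 'del(.metadata.uid) | del(.metadata.resourceVersion) | del(.metadata.creationTimestamp) | del(.metadata.managedFields)' app-config.yaml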
💉 Step 4: Restore to the Cluster
Test the restoration with a dry-run:
kubectl apply -f app-config.yaml --dry-run=server
If all checks out, apply it:
kubectl apply -f app-config.yaml
Output:
configmap/app-config created
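Then confirm the object is back and looks right:
kubectl get configmap app-config -n production -o yaml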
🧹 Step 5: Clean Up
A quick caution: pkill etcd matches any etcd process on the machine, so run this only on the workstation or jump host you used for the recovery, never on a node hosting a real etcd member.
pkill etcd
rm -rf recovery-etcd app-config.yaml
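A slightly safer pattern, assuming a bash-like shell, is to capture the PID when launching the recovery instance in Step 2 and kill only that process:
etcd --data-dir=recovery-etcd --listen-client-urls=http://localhost:2379 --advertise-client-urls=http://localhost:2379 &
ETCD_PID=$!
# ...run the recovery steps...
kill "$ETCD_PID"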
Key etcd Paths Cheat Sheet 📋
| Resource Type | etcd Key Pattern |
| --- | --- |
| ConfigMaps | /registry/configmaps/<namespace>/<name> |
| Secrets | /registry/secrets/<namespace>/<name> |
| Deployments | /registry/deployments/<namespace>/<name> |
| Pods | /registry/pods/<namespace>/<name> |
| ServiceAccounts | /registry/serviceaccounts/<namespace>/<name> |
| Custom resources (CRs) | /registry/<group>/<resource>/<namespace>/<name> |
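If the resource you need isn’t covered by the cheat sheet, grepping the full key list from the recovery instance is a quick way to locate it, for example:
etcdctl --endpoints=localhost:2379 get /registry --prefix --keys-only | grep app-config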
Advanced Scenarios 🔍
💠 Restoring Across Namespaces
cat app-config.yaml | yq eval '.metadata.namespace = "dev"' | kubectl apply -f -
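If the target namespace might not exist yet, create it first:
kubectl get namespace dev || kubectl create namespace dev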
🔐 Encrypted Clusters (KMS)
If your cluster encrypts resources at rest (for example Secrets with the aescbc or KMS providers), the values stored in etcd are ciphertext and auger cannot decode them directly. You need the API server’s encryption configuration (or access to the KMS plugin) to decrypt them first, or you can restore the snapshot behind an API server configured with the same keys; see the Kubernetes encryption-at-rest documentation.
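A quick way to tell whether a stored value is encrypted (my-secret is a placeholder name): encrypted values begin with a k8s:enc:<provider>: prefix instead of the plain k8s protobuf header:
etcdctl --endpoints=localhost:2379 get /registry/secrets/production/my-secret --print-value-only | head -c 30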
📦 Bulk Recovery
Recovering every ConfigMap in a namespace works the same way, but each stored value has to be decoded individually: piping an entire --prefix dump into a single auger decode call is unreliable because the binary values are not self-delimiting. Loop over the keys instead, as sketched below.
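A minimal bash sketch that decodes each ConfigMap under the prefix and writes them as a multi-document YAML file:
for key in $(etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only); do
  echo "---"
  etcdctl --endpoints=localhost:2379 get "$key" --print-value-only | auger decode
done > all-cm.yaml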
Real-World Example ⏱️
Picture this: a developer mistakenly runs:
kubectl delete configmap app-config -n production
Or your CI/CD pipeline wipes it out in a bad deployment.
Instead of rolling back the entire cluster or restoring outdated backups, you perform a 5-minute precision restore—and your app is back online before the next Prometheus alert fires.
Troubleshooting Tips 🛠️
| Issue | Possible Fix |
| --- | --- |
| etcd not starting | Ensure no other etcd instance is running |
| Connection refused | Confirm etcd is listening on localhost:2379 |
| YAML not applying | Validate schema and resource references |
| Conflicts on apply | Try kubectl replace or delete+apply |
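For the conflict case in the last row, kubectl replace can swap the object in one step (use --force with care, since it deletes and recreates the resource):
kubectl replace --force -f app-config.yaml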
Final Thoughts 💡
Kubernetes admins often prepare for catastrophic failures but overlook the value of precision recovery. Being able to surgically extract and restore a resource:
- Reduces downtime
- Prevents collateral damage
- Builds confidence in your incident response
Next time disaster strikes, you won’t scramble—you’ll operate.