TL;DR: Precision Recovery from etcd in 5 Steps

  1. Restore Snapshot to a local etcd data directory.
  2. Spin up a local etcd instance with the restored data.
  3. Query the specific key (e.g., ConfigMap) using etcdctl.
  4. Decode with auger to extract clean YAML.
  5. Apply with kubectl to restore into your live cluster.
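
Condensed into a single terminal session, the whole flow looks roughly like this (a sketch using the example names from the walkthrough below; adjust paths, namespaces, and endpoints to your environment):

# 1. Restore the snapshot into a throwaway data directory
etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd

# 2. Run a local etcd against the restored data
etcd --data-dir=recovery-etcd \
  --listen-client-urls=http://localhost:2379 \
  --advertise-client-urls=http://localhost:2379 &

# 3 + 4. Pull the key and decode it to clean YAML
etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config \
  --print-value-only | auger decode > app-config.yaml

# 5. Restore it into the live cluster
kubectl apply -f app-config.yaml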

This guide walks you through surgical resource recovery from etcd snapshots without triggering a full cluster restore. Whether you’re recovering from an accidental deletion or doing forensic debugging, this lightweight, targeted approach keeps downtime minimal.

Introduction: 🩺 Emergency in the Kubernetes Operating Room

etcd is the beating heart of your Kubernetes cluster—a distributed key-value store that faithfully maintains the state of every object inside your system. But what happens when just one resource—like a ConfigMap, Secret, or Deployment—gets deleted or corrupted?

Initiating a full cluster restore feels like performing open-heart surgery to fix a paper cut. It’s disruptive, risky, and often unnecessary.

That’s where surgical precision comes in.

Imagine your production environment is in crisis—an essential ConfigMap disappears, pods crash, and users are staring at error pages. A full rollback would cause more damage than the issue itself. What you need is a surgeon’s approach: restore just what’s broken, and nothing else.

In this blog, you’ll learn how to:

  1. Restore an etcd snapshot into a throwaway local etcd instance.
  2. Locate and decode a single resource with etcdctl and auger.
  3. Re-apply just that resource to your live cluster with kubectl.

Perfect for DevOps engineers, SREs, and Kubernetes administrators who value minimal-impact recoveries.

Prerequisites 🔧

To follow along, make sure you have:

  - etcdctl (etcd v3 API) for snapshot and key operations
  - auger, to decode Kubernetes’ protobuf-encoded values into YAML
  - kubectl with access to the target cluster
  - yq (optional), used in the cross-namespace scenario later on
  - An etcd snapshot of your cluster, or the access needed to take one

Always work in a staging environment first. Start with a clean snapshot:

etcdctl snapshot save live-cluster-snapshot.db
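
Against a real control plane, the save usually needs endpoint and TLS flags as well; here is a sketch assuming kubeadm’s default certificate paths (adjust for your distribution):

ETCDCTL_API=3 etcdctl snapshot save live-cluster-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key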

The Surgical Recovery Process 🏥

Let’s say a critical ConfigMap app-config in the production namespace was accidentally deleted. Here’s how to surgically bring it back:

🧬 Step 1: Prepare the Snapshot

If compressed, decompress your snapshot:

gunzip live-cluster-snapshot.db.gz

Then restore it:

etcdctl snapshot restore live-cluster-snapshot.db --data-dir=recovery-etcd
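
Optionally, sanity-check the snapshot itself; etcdctl reports its hash, revision, and key count (on newer etcd releases the same subcommand lives under etcdutl):

etcdctl snapshot status live-cluster-snapshot.db -w table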

🩻 Step 2: Launch a Local etcd Instance

etcd --data-dir=recovery-etcd --listen-client-urls=http://localhost:2379 --advertise-client-urls=http://localhost:2379

Verify:

etcdctl --endpoints=localhost:2379 endpoint status
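
To confirm the restored data actually contains your cluster state, you can count the objects under /registry (a quick check against the local endpoint):

# count the Kubernetes objects captured in the restored snapshot
etcdctl --endpoints=localhost:2379 get /registry/ --prefix --keys-only | grep -c '^/registry'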

🔍 Step 3: Locate and Extract the Resource

Kubernetes stores ConfigMaps at keys like /registry/configmaps/<namespace>/<name>. List keys in the production namespace:

etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only

You’ll see something like:

/registry/configmaps/production/app-config

Extract and decode the ConfigMap:

etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config --print-value-only | auger decode > app-config.yaml

The resulting app-config.yaml might look like:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  api-url: "https://api.example.com"
  log-level: "debug"
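
If a decode ever produces garbage, peek at the raw value first; Kubernetes prefixes protobuf-encoded objects with the magic bytes k8s followed by a null byte (a quick check, assuming xxd is available):

etcdctl --endpoints=localhost:2379 get /registry/configmaps/production/app-config \
  --print-value-only | head -c 16 | xxd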

💉 Step 4: Restore to the Cluster

Test the restoration with a dry-run:

kubectl apply -f app-config.yaml --dry-run=server

If all checks out, apply it:

kubectl apply -f app-config.yaml

Output:

configmap/app-config created
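
Then confirm the restored object is back and populated:

kubectl get configmap app-config -n production -o yaml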

🧹 Step 5: Clean Up

pkill etcd
rm -rf recovery-etcd app-config.yaml

Key etcd Paths Cheat Sheet 📋

| Resource Type | etcd Key Pattern |
| --- | --- |
| ConfigMaps | /registry/configmaps/<namespace>/<name> |
| Secrets | /registry/secrets/<namespace>/<name> |
| Deployments | /registry/deployments/<namespace>/<name> |
| Pods | /registry/pods/<namespace>/<name> |
| ServiceAccounts | /registry/serviceaccounts/<namespace>/<name> |
| Custom resources | /registry/<group>/<resource>/<namespace>/<name> |
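
The same extract-and-decode pattern works for any of these paths; for example, a Secret (a sketch, with app-credentials as a placeholder name):

etcdctl --endpoints=localhost:2379 get /registry/secrets/production/app-credentials \
  --print-value-only | auger decode > app-credentials.yaml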

Advanced Scenarios 🔍

💠 Restoring Across Namespaces

cat app-config.yaml | yq eval '.metadata.namespace = "dev"' | kubectl apply -f -

🔐 Encrypted Clusters (KMS)

If your cluster encrypts resources at rest (for example, Secrets via a KMS or aescbc provider), the values stored in etcd are ciphertext, and auger cannot decode them directly. Decrypt them first using the cluster’s EncryptionConfiguration (and KMS access, where applicable), as described in the Kubernetes encryption-at-rest documentation.

📦 Bulk Recovery

etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --print-value-only | auger decode > all-cm.yaml
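
If piping every value through a single auger invocation doesn’t decode cleanly, a per-key loop is a safe alternative (a bash sketch against the same local endpoint):

# decode each ConfigMap in the namespace into one multi-document YAML file
for key in $(etcdctl --endpoints=localhost:2379 get --prefix "/registry/configmaps/production" --keys-only); do
  echo "---" >> all-cm.yaml
  etcdctl --endpoints=localhost:2379 get "$key" --print-value-only | auger decode >> all-cm.yaml
done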

Real-World Example ⏱️

Picture this: a developer mistakenly runs:

kubectl delete configmap app-config -n production

Or your CI/CD pipeline wipes it out in a bad deployment.
Instead of rolling back the entire cluster or restoring outdated backups, you perform a 5-minute precision restore—and your app is back online before the next Prometheus alert fires.

Troubleshooting Tips 🛠️

| Issue | Possible Fix |
| --- | --- |
| etcd not starting | Ensure no other etcd instance is running |
| Connection refused | Confirm etcd is listening on localhost:2379 |
| YAML not applying | Validate schema and resource references |
| Conflicts on apply | Try kubectl replace or delete+apply |
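
For the conflict case in particular, a replace, or a delete followed by a fresh apply, usually clears it:

kubectl replace -f app-config.yaml
# or, if the existing object is in a bad state:
kubectl delete configmap app-config -n production && kubectl apply -f app-config.yaml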

Final Thoughts 💡

Kubernetes admins often prepare for catastrophic failures but overlook the value of precision recovery. Being able to surgically extract and restore a single resource:

  - Keeps downtime to minutes instead of hours
  - Avoids the risk and disruption of a full cluster restore
  - Limits the blast radius to exactly the object that broke

Next time disaster strikes, you won’t scramble—you’ll operate.