Guest post originally published on Rookout’s blog by Gilad Weiss, Full Stack Developer at Rookout

The first lesson you learn when you start working in the DevOps field is that optimism is not a virtue. While that might seem overly pessimistic, let me explain. We plan our architecture to fit our needs, deal with edge cases, scale our applications up and out as we see fit, and with all of that, we still expect the unexpected to happen. As engineers, we are expected to deal with that unexpected. We need to plan ahead and set up as many layers of insurance as possible to combat any unforeseen (or foreseen) circumstance.

Short of a potential zombie apocalypse, you and your developers should be prepared for the worst of the worst and know how to make sure that your company’s most precious assets – such as running apps in various environments and your data – are protected. And by protected I mean that your data is still accessible and intact and your infrastructure is entirely up and running when disaster strikes. Because if something goes wrong, it’s not just your team that suffers. Your customers suffer too. And unhappy customers? Well, that’s a top-tier disaster.

You might be reading this and thinking to yourself, “I’m well aware of the worst-case scenario, my apps are running in an elastic cloud environment and there’s a daily backup for the database”. Whether you rely on your cloud’s DR capabilities or know that your dynamic backups are working, it’s all fine and dandy. However, I want you to ask yourself this one question:

Does my DevOps team know how to act in the time of disaster?

Knowing that you have backups and recovery options for your running apps is only half the battle. You also need to minimize the risk of human error and poor documentation when it comes to disaster recovery. It may seem obvious, but companies often skip basic planning procedures. Here are some ideas from our own experience on how you can avoid the obvious pitfalls. So grab that coffee and sit back, because we’re going on a Disaster Recovery Plan journey.

Put your hard hat on

In case you’re unfamiliar with the various terms of disaster recovery, here’s a refresher:

“In simple terms: RTO (Recovery Time Objective) is defined as the time it takes for an organization’s IT infrastructure to come back online and be fully functional post a disaster. RPO (Recovery Point Objective) reflects the number of transactions lost from the time of the event up to the full recovery of the IT infrastructure.” (Source: sungardas.com)
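To make the two terms concrete, here is a minimal sketch that measures a single incident (or dry-run) against both objectives. It assumes the RPO is expressed as a time window of lost data, which is a common way to state it, rather than as a transaction count. The function names and the timeline are hypothetical.

```python
from datetime import datetime, timedelta

def measure_recovery(last_backup: datetime, disaster: datetime, restored: datetime):
    """Return (data_loss_window, downtime) for a single incident or dry-run."""
    data_loss_window = disaster - last_backup  # work done since the last backup is lost
    downtime = restored - disaster             # how long the service was unavailable
    return data_loss_window, downtime

def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True when both measured values stay within the agreed objectives."""
    return data_loss_window <= rpo and downtime <= rto

# Hypothetical timeline: nightly backup at 02:00, outage at 14:30, service back at 16:00.
loss, down = measure_recovery(
    datetime(2024, 1, 10, 2, 0),
    datetime(2024, 1, 10, 14, 30),
    datetime(2024, 1, 10, 16, 0),
)
print(loss, down)  # 12:30:00 1:30:00
print(meets_objectives(loss, down, rpo=timedelta(hours=24), rto=timedelta(hours=1)))  # False
```

In this made-up timeline the daily backup keeps the data loss within a 24-hour RPO, but the 90-minute recovery misses a 1-hour RTO.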

Your software product – whether hosted on-prem or on cloud – is vulnerable to various types of system failures, such as cyber attacks, power outages, or even physical hazards.

In order to have minimum RTO in those cases, you need to make sure that all your saved data is backed up and your running apps have an alternative hosting area.

There are dozens of options in the IT market for DR abilities, whether it’s cloud-internal like AWS’s CloudEndure, or external like Acronis, Rackware and other DRaaS services that let you back up your apps and data and switch to them quickly when you need to. So how do you start?

The first step of the plan is to identify the various components in your infrastructure and the treatment they need in terms of backups:

  1. Should I use online backup or offline backup for my DB?
  2. Are the availability zones enough for my Kubernetes deployment?
  3. Should I move my data to an on-site unit in the worst case scenario?
  4. What are the options my cloud provider has for me?
  5. Is moving my pods to a new cluster enough?
  6. What about secrets rotation?

The answers to these questions define your DRP and what you need to do when you need to evacuate your data. Often, the chosen practice is deploying your infrastructure to DR dedicated hosts, restoring DB backups at a secondary location and ensuring routing to those new locations.
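One way to keep track of the answers to the questions above is a small inventory that records, per component, the backup treatment and the DR location you settled on. The sketch below is only an illustration: the `Component` fields, the component names and the strategy labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Component:
    name: str
    backup_strategy: Optional[str] = None  # e.g. "online" or "offline"; None means undecided
    dr_location: Optional[str] = None      # where it runs or is restored during DR

def undecided(components):
    """Names of components that still lack a backup strategy or a DR location."""
    return [c.name for c in components
            if c.backup_strategy is None or c.dr_location is None]

# Hypothetical inventory; replace with your own components.
inventory = [
    Component("database", backup_strategy="online", dr_location="secondary-region"),
    Component("k8s-workloads", backup_strategy="manifests-in-git"),
    Component("secrets"),
]
print(undecided(inventory))  # ['k8s-workloads', 'secrets']
```

Anything the function returns is a gap in the plan that still needs a decision.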

Keep it warm

As I previously mentioned, having a plan is only half the battle. Since the DevOps team is the one that operates the DRP and they’re only human (shocking, I know), you’ll want to minimize the chances of human error. To make that happen, your team should practice executing the plan as a dry-run on a regular basis. That will ensure you and your team are knowledgeable and fully prepared for the real-world scenario.

Here at Rookout, we have an annual dry-run day in which we test out our DRP. It also serves as a good opportunity for seniors on the team to sharpen their skills and for the juniors to make sure they are familiar with the infrastructure itself. After each dry-run of the DRP, the plan is updated, so that when a real disaster occurs the team won’t face any unexpected obstacles. Dealing with those obstacles is crucial for minimizing production downtime and the RPO.

Documentation and rewriting are the essence of the annual dry-run. We check the relevancy of each step along the way in order to have the simplest and most coherent plan. During the execution of the DRP we periodically ask ourselves: Is this image still used? Is the version of the Helm chart supported? How can I test that the deployment went well? How can I check that the backup is up-to-date?
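A lightweight way to support that review is to record when each step was last checked and flag the ones that have not been touched since the previous dry-run. This is a minimal sketch, assuming you keep a review date per step. The step names, dates and the one-year threshold are hypothetical.

```python
from datetime import date, timedelta

def stale_steps(last_reviewed, today, max_age=timedelta(days=365)):
    """Return step names whose last review is older than max_age, oldest first."""
    stale = [(reviewed, name) for name, reviewed in last_reviewed.items()
             if today - reviewed > max_age]
    return [name for reviewed, name in sorted(stale)]

# Hypothetical review log: step name -> date it was last verified.
reviews = {
    "restore database backup": date(2023, 3, 1),
    "deploy Helm charts": date(2021, 11, 15),
    "switch DNS routing": date(2022, 1, 10),
}
print(stale_steps(reviews, today=date(2023, 6, 1)))
# ['deploy Helm charts', 'switch DNS routing']
```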

More than once, we found old images and obsolete steps that could have really hurt our RPO and RTO had the plan been executed for real.

This is actually our desired state. Be cool when the other shoe drops.

Some practical suggestions

In your knowledge-shared environment, have a DRP-related file with all the steps that one needs to run in order to activate your DR plan. The steps should be easy to follow and cover all the work that needs to be done in order to achieve a working, full-system environment in a new “location”.
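The steps in such a file can also be mirrored in a small script, so that each step carries its own check. Below is a minimal sketch, not a description of how Rookout runs its plan: `Step` and `run_plan` are hypothetical names, and the actions are stand-ins that only flip values in a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], None]  # performs the step
    verify: Callable[[], bool]  # returns True if the step succeeded

def run_plan(steps: List[Step]) -> List[str]:
    """Run steps in order; stop at the first one whose verification fails."""
    log = []
    for step in steps:
        step.action()
        if not step.verify():
            log.append(f"FAILED: {step.name}")
            break
        log.append(f"ok: {step.name}")
    return log

# Stand-in state instead of real infrastructure calls.
state = {"db": False, "apps": False}
plan = [
    Step("restore database", lambda: state.update(db=True), lambda: state["db"]),
    Step("deploy apps", lambda: None, lambda: state["apps"]),  # action deliberately does nothing
    Step("switch routing", lambda: None, lambda: True),
]
print(run_plan(plan))  # ['ok: restore database', 'FAILED: deploy apps']
```

Stopping at the first failed verification keeps later steps, such as switching routing, from running against a half-built environment.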

A new location might need to be configured with a new load balancer, APM components, an SSL enforcer, cyber-defense related components, and the list goes on. A full system means a full package.

Set an annual date for the dry-run DRP so the responsible engineers will know to prepare for it. I also suggest having a private or internal repository, hosted in your source control management system, that holds all the scripts that need to be run for the plan.

In the dry-run DRP, the DevOps team should follow all steps and make sure that each step is up-to-date and relevant. The plan should be able to run alongside the production deployment and not interfere with customers’ actions.

Don’t forget to test out the new location! Is it accessible? Is it ready to withstand scale? Unit testing, by itself, isn’t enough for testing a DRP. If you have periodically running tests, run them against your new deployment environment as well to make sure all is as it should be. Even better, you could take down your pre-prod environment and replace it with the DR deployment.
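The accessibility question can be partly automated with a simple smoke test against the new location. The sketch below uses only the Python standard library. The base URL and paths are hypothetical, and the `fetch` parameter exists so the check can be exercised with a stub instead of a live endpoint. It says nothing about scale, which needs proper load testing.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, timeout, ...

def smoke_test(base_url, paths, fetch=http_status):
    """Map each path to True if it answered with a 2xx status."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = status is not None and 200 <= status < 300
    return results

# Stubbed responses standing in for a real DR deployment.
fake_responses = {
    "https://dr.example.test/healthz": 200,
    "https://dr.example.test/api/status": 503,
}
print(smoke_test("https://dr.example.test/", ["/healthz", "/api/status", "/login"],
                 fetch=fake_responses.get))
# {'/healthz': True, '/api/status': False, '/login': False}
```

Dropping the `fetch` argument makes the same call go over the network through `http_status`.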

Where to go from here?

Having a DRP is a must for any software that needs to run indefinitely. Building one starts with research and writing up the steps, followed by a regular, non-skippable dry-run that keeps both the scripts and the people in shape.

The DevOps team might see this as a chore – rebuilding the already existing infrastructure, just in a different place – but it shouldn’t be seen that way. Practicing the plan without the stress of a real disaster ensures that when s*** hits the fan, you and your team will be able to keep your company and your customers happy. And don’t forget: “Practice Makes Perfect”.