Member post originally published on the InfraCloud blog by Ruturaj Kadikar

Microservices architecture is a popular choice for businesses today due to its scalability, agility, and continuous delivery. However, microservices architectures are not immune to outages. Outages can be caused by a variety of factors, including network communication, inter-service dependencies, external dependencies, and scalability issues.

Several well-known companies, such as Slack, Twitter, Robinhood Trading, Amazon, Microsoft, Google, and many more, have recently experienced outages that caused significant downtime costs. These incidents highlight the diverse sources of outages in microservices architectures, which can range from configuration errors and database issues to infrastructure scaling failures and code problems.

To minimize the impact of outages and improve system availability, businesses should prioritize resiliency principles in the design, development, and operation of microservices architectures. In this article, we will learn how to improve the resiliency of a system with the help of chaos engineering to minimize system outages. I recently spoke at Chaos Carnival on the same topic; you can also watch my talk here.

What is Chaos Engineering?

Chaos engineering is a method for testing the resiliency and reliability of complex systems by intentionally introducing controlled failures into them. The goal of chaos engineering is to identify and highlight faults in a system before they can cause real-world problems such as outages, data loss, or security breaches.

This is done by simulating various failure scenarios, such as network outages, server failures, or unexpected spikes in traffic, and observing how the system responds. By intentionally inducing failure in a controlled environment, chaos engineering enables teams to better understand the limits and failure domains of their systems and develop strategies to mitigate or avoid such failures in the future.
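The cycle described above (confirm a steady state, inject a fault, observe, compare) can be sketched in a few lines of Python. Everything in this sketch is hypothetical: `FakeService`, its `healthy` counter, and the 99% threshold are illustrative stand-ins and not part of any chaos tool.

```python
from dataclasses import dataclass


@dataclass
class FakeService:
    """Toy stand-in for a real system: it serves requests while any replica is healthy."""
    healthy: int = 3

    def success_rate(self, requests: int = 100) -> float:
        # With at least one healthy replica every request succeeds; with none, all fail.
        served = requests if self.healthy > 0 else 0
        return served / requests

    def kill_replica(self) -> None:
        # The injected fault: take one replica out of service.
        self.healthy = max(0, self.healthy - 1)


def run_experiment(service: FakeService, threshold: float = 0.99) -> bool:
    """Return True if the steady-state hypothesis still holds after the fault."""
    if service.success_rate() < threshold:
        raise RuntimeError("system is not in steady state; do not inject chaos")
    service.kill_replica()
    return service.success_rate() >= threshold


print(run_experiment(FakeService(healthy=3)))  # redundant replicas absorb the fault
print(run_experiment(FakeService(healthy=1)))  # a single replica is a failure domain
```

The second call is the interesting one: the experiment exposes a single point of failure before a real outage does.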

Many big companies like Netflix, Amazon, Google, and Microsoft emphasize chaos engineering as a crucial part of site reliability. Netflix introduced tools such as Chaos Monkey, Chaos Kong, and ChAP to test chaos at different infrastructure levels and maintain their SLAs. Amazon incorporated the concept of Gamedays in the AWS Well-Architected Framework, wherein various teams collaborate and test chaos in their environment to educate themselves and reinforce system knowledge, increasing overall reliability.

What is Resiliency Testing?

Resiliency testing is primarily concerned with evaluating a system’s ability to recover from disruptions or failures and continue to function as intended. The goal of resiliency testing is to improve the overall reliability and availability of a system and minimize the impact of potential disruptions or failures. By identifying and addressing potential vulnerabilities or weaknesses in system design or implementation, resiliency testing can help ensure that the system continues to function in the face of unexpected events or conditions.

Why should I test Resiliency?

Resiliency testing is essential for a number of reasons. Here are a few examples:

In general, testing resiliency is important to ensure that your system is reliable, available, and able to recover quickly from failures or outages. By identifying and fixing potential failure points, you can build a more robust and resilient system that provides a better user experience and meets regulatory requirements.

Why should I test Resiliency in Kubernetes?

Testing resiliency in Kubernetes is important because Kubernetes is a complex and distributed system designed for large-scale, mission-critical applications. Kubernetes provides many features to ensure resiliency, such as automatic scaling, self-healing, and rolling updates, but it’s still possible for a Kubernetes cluster to experience glitches or failures.

Here are the top reasons why we should test resiliency in Kubernetes:

Considering all of this, testing resiliency in Kubernetes is important to ensure that your application can handle interruptions and continue to function as intended.

Chaos vs Resiliency vs Reliability

Chaos, resiliency, and reliability are related concepts, but they aren’t interchangeable. Here you’ll find an overview of each concept:

In a nutshell, chaos engineering is a way to inject failures into your system to test resiliency, which is the ability of a system to recover from failures, while reliability is a measure of the consistent and predictable performance of a system over time. All three concepts are important for building and maintaining robust and trustworthy systems, and each plays a different role in ensuring the overall quality of a system.

Chaos engineering

What Tools are Available to Test System Resiliency?

Litmus, Gremlin, Chaos Mesh, and Chaos Monkey are all popular tools used for chaos engineering. As we will be using AWS cloud infrastructure, we will also explore AWS Fault Injection Simulator (FIS). While they share the same goals of testing and improving the resilience of a system, there are some differences between them. Here are some comparisons:

Scope               Chaos Mesh   Chaos Monkey   Litmus           Gremlin     AWS FIS
Kubernetes-native   Yes          Yes            Yes              Yes         No
Cloud-native        No           No             Yes              Yes         Yes (AWS)
Baremetal           No           No             No               Yes         No
Built-in Library    Basic        Basic          Extensive        Extensive   Basic
Customization       Using YAML   Using YAML     Using Operator   Using DSL   Using SSM docs
Dashboard           No           No             Yes              Yes         No
OSS                 Yes          Yes            Yes              Yes         No

The bottom line is that while all of these tools share similar features, we chose Litmus because it provides the flexibility to leverage AWS SSM documents to execute chaos in our AWS infrastructure. Now let’s see how we can use Litmus to execute chaos such as terminating pods in Kubernetes and EC2 instances in AWS.

Installing Litmus in Kubernetes

First, we will see how to install Litmus in Kubernetes to execute chaos in an environment.

Here are the basic installation steps for LitmusChaos:

  1. Set up a Kubernetes cluster: LitmusChaos requires a running Kubernetes cluster. If you don’t already have one set up, you can use a tool like kubeadm or kops to set up a cluster on your own infrastructure or use a managed Kubernetes service like GKE, EKS, or AKS. For this article, we will use k3d.

     k3d cluster create

     $ kubectl cluster-info
     Kubernetes control plane is running at https://0.0.0.0:38537
     CoreDNS is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
     Metrics-server is running at https://0.0.0.0:38537/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy

     To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

  2. Install Helm: Helm is a package manager for Kubernetes that you’ll need to use to install Litmus. You can install Helm by following the instructions on the Helm website.

  3. Add the LitmusChaos chart repository: Run the following command to add the LitmusChaos chart repository:

     helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/

  4. Install LitmusChaos: Run the following command to install LitmusChaos:

     helm install litmuschaos litmuschaos/litmus --namespace=litmus

     This will install the LitmusChaos control plane in the litmus namespace. You can change the namespace to your liking.

  5. Verify the installation: Run the following command to verify that LitmusChaos is running:

     kubectl get pods -n litmus

     Output:

     $ kubectl get pods -nlitmus
     NAME                                       READY   STATUS    RESTARTS   AGE
     chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          6m22s
     chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          6m22s
     chaos-litmus-server-585786dd9c-l6xj7       1/1     Running   0          6m22s

     This should show the LitmusChaos control plane pods running.

  6. Log in to the Litmus portal using port-forwarding.

     kubectl port-forward svc/chaos-litmus-frontend-service -nlitmus 9091:9091

     Litmus chaos installed

     Once you log in, a webhook will install the litmus-agent (called self-agent) components in the cluster. Verify it.

     Output:

     $ kubectl get pods -n litmus
     NAME                                       READY   STATUS    RESTARTS   AGE
     chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running   0          9m6s
     chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running   0          9m6s
     chaos-litmus-server-585786dd9c-l6xj7       1/1     Running   0          9m6s
     subscriber-686d9b8dd9-bjgjh                1/1     Running   0          9m6s
     chaos-operator-ce-84bc885775-kzwzk         1/1     Running   0          92s
     chaos-exporter-6c9b5988cd-lwmpm            1/1     Running   0          94s
     event-tracker-744b6fd8cf-rhrfc             1/1     Running   0          94s
     workflow-controller-768b7d94dc-xr6vv       1/1     Running   0          92s

With these steps, you should have LitmusChaos installed and ready to use on your Kubernetes cluster.

Experimenting with chaos

Experimenting with chaos within a cloud-native environment typically involves using a chaos engineering tool to simulate various failure scenarios and test the resilience of the system. Most of the cloud-native application infrastructure consists of Kubernetes and corresponding Cloud components. For this article, we will see chaos in Kubernetes and in the cloud environment i.e. AWS.

Chaos in Kubernetes

In order to evaluate the resilience of a Kubernetes cluster we can test the following failure scenarios:

By testing these failure scenarios, you can identify potential vulnerabilities in the cluster’s resilience and improve the system to ensure high availability and reliability.

Scenario: Killing a Pod

In this experiment, we will kill a pod using Litmus. We will use an Nginx deployment for a sample application under test (AUT).

kubectl create deploy nginx --image=nginx -nlitmus

Output:

$ kubectl get deploy -nlitmus | grep nginx

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
nginx  1/1 	1        	1       	109m

Go to the Litmus portal, and click on Home.

Click on Schedule a Chaos Scenario and select Self Agent.

Schedule Chaos Scenario

Next, select chaos experiment from ChaosHubs.

Select chaos experiment from ChaosHubs

Next, name the scenario as ‘kill-pod-test’.

Next, click on ‘Add a new chaos Experiment’.

Choose the generic/pod-delete experiment.

Choose experiment

Tune the experiment parameters to select the correct deployment labels and namespace.

Tune the selected chaos scenario
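The parameters tuned in the portal end up as fields of a ChaosEngine resource. As a rough sketch, the snippet below builds such a manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field names follow the ChaosEngine schema as we understand it, while the engine name, the service account, and the duration values are assumptions chosen for illustration; check them against the Litmus documentation for your version.

```python
import json


def pod_delete_engine(namespace: str, app_label: str) -> dict:
    """Build a ChaosEngine manifest for the pod-delete experiment as a plain dict."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Hypothetical engine name, chosen only for this example.
        "metadata": {"name": "nginx-pod-delete", "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Selects the application under test by namespace, label, and kind.
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            # Assumption: a service account with chaos permissions exists under this name.
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Illustrative values, in seconds.
                                {"name": "TOTAL_CHAOS_DURATION", "value": "30"},
                                {"name": "CHAOS_INTERVAL", "value": "10"},
                                {"name": "FORCE", "value": "false"},
                            ]
                        }
                    },
                }
            ],
        },
    }


# `kubectl create deploy nginx` labels the deployment's pods with app=nginx.
engine = pod_delete_engine(namespace="litmus", app_label="app=nginx")
print(json.dumps(engine, indent=2))
```

Generating the manifest from code rather than editing it by hand makes it easier to reuse the same experiment against several deployments.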

Enable Revert Schedule and click Next.

Assign the required weight for the experiment. For now, we will keep it at 10 points.

Click Schedule Now and then Finish. The execution of the Chaos Scenario will start.

To view the Chaos Scenario, click on ‘Show the Chaos Scenario’.

View Chaos scenario

You will see the Chaos Scenario and experiment CRDs getting deployed and the corresponding pods getting created.

Once the Chaos Scenario is completed, you will see that the existing Nginx pod is deleted and a new pod is up and running.

New Pod
$ kubectl get pods -nlitmus
NAME                                       READY   STATUS      RESTARTS   AGE
chaos-litmus-frontend-6ffc95c884-x245z     1/1     Running     0          32m
chaos-mongodb-68f8b9444c-w2kkm             1/1     Running     0          32m
chaos-litmus-auth-server-b8dcdf66b-v8hf9   1/1     Running     0          32m
chaos-litmus-server-585786dd9c-l6xj7       1/1     Running     0          32m
subscriber-686d9b8dd9-bjgjh                1/1     Running     0          24m
chaos-operator-ce-84bc885775-kzwzk         1/1     Running     0          24m
chaos-exporter-6c9b5988c4-lwmpm            1/1     Running     0          24m
event-tracker-744b6fd8cf-rhrfc             1/1     Running     0          24m
workflow-controller-768f7d94dc-xr6vv       1/1     Running     0          24m
kill-pod-test-1683898747-869605847         0/2     Completed   0          9m36s
kill-pod-test-1683898747-2510109278        2/2     Running     0          5m49s
pod-delete-tdoklgkv-runner                 1/1     Running     0          4m29s
pod-delete-swkok2-pj48x                    1/1     Running     0          3m37s
nginx-76d6c9b8c-mnk8f                      1/1     Running     0          4m29s

You can review the series of events to understand the entire process. Some of the main events are shown below: the experiment pod is created, the nginx pod (AUT) is deleted, a new nginx pod is created, and the experiment completes successfully.

$ kubectl get events -nlitmus
66s   Normal    Started            pod/pod-delete-swkok2-pj48x                  Started container pod-delete-swkok2
62s   Normal    Awaited            chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
58s   Normal    PreChaosCheck      chaosengine/pod-delete-tdoklgkv              AUT: Running
58s   Normal    Killing            pod/nginx-76d6c9b8c-c8vv7                    Stopping container nginx
58s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-mnk8f
44s   Normal    Killing            pod/nginx-76d6c9b8c-mnk8f                    Stopping container nginx
44s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-kqtgq
43s   Normal    Scheduled          pod/nginx-76d6c9b8c-kqtgq                    Successfully assigned litmus/nginx-76d6c9b8c-kqtgq to k3d-k3s-default-server-0
12s   Normal    PostChaosCheck     chaosengine/pod-delete-tdoklgkv              AUT: Running
8s    Normal    Pass               chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
8s    Normal    Summary            chaosengine/pod-delete-tdoklgkv              pod-delete experiment has been Passed
3s    Normal    Completed          job/pod-delete-swkok2                        Job completed
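Reading the events by eye works for a single run; for repeated runs it may help to extract the verdict programmatically. The following is a minimal sketch that parses text in the default `kubectl get events` column layout (age, type, reason, object, message). The sample text is abridged and embedded in the script, the helper names are our own, and we assume a failing run would emit a `Fail` reason on the chaosresult object.

```python
import re

# Abridged sample in the shape of `kubectl get events` output.
SAMPLE_EVENTS = """\
62s   Normal    Awaited            chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Awaited
58s   Normal    Killing            pod/nginx-76d6c9b8c-c8vv7                    Stopping container nginx
58s   Normal    SuccessfulCreate   replicaset/nginx-76d6c9b8c                   Created pod: nginx-76d6c9b8c-mnk8f
8s    Normal    Pass               chaosresult/pod-delete-tdoklgkv-pod-delete   experiment: pod-delete, Result: Pass
"""

# Four whitespace-separated columns followed by the free-text message.
EVENT_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$")


def parse_events(text: str) -> list[dict]:
    """Turn each event line into a dict; lines that do not match are skipped."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            age, kind, reason, obj, message = match.groups()
            events.append(
                {"age": age, "type": kind, "reason": reason, "object": obj, "message": message}
            )
    return events


def experiment_passed(events: list[dict], chaosresult: str) -> bool:
    """True if the named chaosresult reported Pass and never reported Fail."""
    target = f"chaosresult/{chaosresult}"
    reasons = {e["reason"] for e in events if e["object"] == target}
    return "Pass" in reasons and "Fail" not in reasons


events = parse_events(SAMPLE_EVENTS)
print(experiment_passed(events, "pod-delete-tdoklgkv-pod-delete"))
```

In practice the text would come from the cluster rather than from an embedded sample, and structured output from kubectl would be more robust than column parsing.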

Chaos in AWS

Here are some potential problems that can be simulated to assess the ability of an application running on AWS to recover from failures:

Scenario: Terminate EC2 instance

In this scenario, we will run one chaos experiment that terminates an EC2 instance. Litmus leverages AWS SSM documents for executing experiments in AWS. For this scenario, we will require two manifest files: one for a ConfigMap containing the script for the SSM document, and the other containing the complete workflow of the scenario. Both these manifest files can be found here.

Apply the configMap first in the ‘litmus’ namespace.

kubectl apply -f https://raw.githubusercontent.com/rutu-k/litmus-ssm-docs/main/terminate-instance-cm.yaml

Then, go to the Litmus portal, and click on Home.

Click on Schedule a Chaos Scenario and select Self Agent. (Refer to the Installation and Chaos in Kubernetes sections.)

Now, instead of selecting a chaos experiment from ChaosHubs, we will select Import a Chaos Scenario using YAML and upload our workflow manifest.

Click Next and Finish.

To view the Chaos Scenario, click on Show the Chaos Scenario.

Chaos scenario

You will see the Chaos Scenario and experiment CRDs getting deployed and the corresponding pods getting created.

Verify the logs of the experiment pods. It will show the overall process and status of each step.

$ kubectl logs aws-ssm-chaos-by-id-vSoazu-w6tmj -n litmus -f
time="2023-05-11T13:05:10Z" level=info msg="Experiment Name: aws-ssm-chaos-by-id"
time="2023-05-11T13:05:14Z" level=info msg="The instance information is as follows" Chaos Namespace=litmus Instance ID=i-0da74bcaa6357ad60 Sequence=parallel Total Chaos Duration=960
time="2023-05-11T13:05:14Z" level=info msg="[Info]: The instances under chaos(IUC) are: [i-0da74bcaa6357ad60]"
time="2023-05-11T13:07:25Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:07:26Z" level=info msg="The ssm command status is Success"
time="2023-05-11T13:07:28Z" level=info msg="[Wait]: Waiting for chaos interval of 120s"
time="2023-05-11T13:09:28Z" level=info msg="[Info]: Target instanceID list, [i-0da74bcaa6357ad60]"
time="2023-05-11T13:09:28Z" level=info msg="[Chaos]: Starting the ssm command"
time="2023-05-11T13:09:28Z" level=info msg="[Wait]: Waiting for the ssm command to get in InProgress state"
time="2023-05-11T13:09:28Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:30Z" level=info msg="The ssm command status is InProgress"
time="2023-05-11T13:09:32Z" level=info msg="[Wait]: waiting for the ssm command to get completed"
time="2023-05-11T13:09:32Z" level=info msg="[Status]: Checking SSM command status"
time="2023-05-11T13:09:32Z" level=info msg="The ssm command status is Success"
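If you want to track how the SSM command progressed without scanning the whole log, the status lines can be pulled out with a small script. This is only a sketch: it assumes log lines in the `time="..." level=... msg="..."` shape with plain ASCII double quotes and the status wording seen in this run, and the helper name `ssm_statuses` is our own.

```python
import re

# Abridged sample in the shape of the experiment pod's log.
SAMPLE_LOG = '''\
time="2023-05-11T13:09:28Z" level=info msg="[Chaos]: Starting the ssm command"
time="2023-05-11T13:09:30Z" level=info msg="The ssm command status is InProgress"
time="2023-05-11T13:09:32Z" level=info msg="The ssm command status is Success"
'''

LINE_RE = re.compile(r'time="(?P<time>[^"]+)" level=(?P<level>\w+) msg="(?P<msg>[^"]*)"')
STATUS_PREFIX = "The ssm command status is "


def ssm_statuses(log_text: str) -> list[tuple[str, str]]:
    """Return (timestamp, status) pairs for every SSM status line, in log order."""
    statuses = []
    for match in LINE_RE.finditer(log_text):
        msg = match.group("msg")
        if msg.startswith(STATUS_PREFIX):
            statuses.append((match.group("time"), msg[len(STATUS_PREFIX):]))
    return statuses


for timestamp, status in ssm_statuses(SAMPLE_LOG):
    print(timestamp, status)
```

The last pair in the returned list is the final state the experiment observed for the command.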

Once the Chaos Scenario is completed, you will see that the SSM document is executed.

SSM document is executed

You can verify that the EC2 instance is being terminated.

What to do next?

Design a Resiliency Framework

Resiliency Framework

A resiliency framework refers to a structured approach or set of principles and strategies leveraging chaos engineering to build resilience and ensure overall reliability. The following is a detailed description of the typical steps or lifecycle involved in the resiliency framework:

Assign a Resiliency Score

The resiliency score is a metric used to measure and quantify the level of resiliency or robustness of a system (refer). The Resiliency Score is typically calculated based on various factors, including the system’s architecture, its mean time to recover (MTTR), its mean time between failures (MTBF), redundancy measures, availability, scalability, fault tolerance, monitoring capabilities, and recovery strategies. It varies from system to system and organization to organization depending upon their priorities and requirements.
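Two of the inputs mentioned above, MTBF and MTTR, combine into the standard steady-state availability estimate, availability = MTBF / (MTBF + MTTR). The short sketch below shows the arithmetic; the function name and the sample figures are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given mean time between failures and mean time to recover."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures: one failure roughly every 999 hours, one hour to recover.
print(f"{availability(999, 1):.3%}")
# Halving the recovery time raises availability without changing the failure rate.
print(f"{availability(999, 0.5):.3%}")
```

This is why recovery-focused practices matter: shortening MTTR improves availability even when failures are just as frequent.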

The resiliency score helps organizations evaluate their system’s resiliency posture and identify areas that need improvement. A higher resiliency score indicates a more resilient system, capable of handling failures with minimal impact on its functionality and user experience. Organizations can track their progress in improving system resiliency over time by continuously measuring and monitoring the resiliency score.
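One simple way to turn experiment results into a single number is a weighted average: each experiment contributes its success percentage, scaled by the weight assigned to it. The sketch below assumes that scheme; it is in the spirit of the per-experiment weights assigned in the Litmus portal, but it is not necessarily the exact calculation any tool performs, and the sample numbers are made up.

```python
def resiliency_score(results: list[tuple[int, float]]) -> float:
    """Weighted average of success percentages.

    Each item is (weight, success_percentage), with the percentage in the range 0-100.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight <= 0:
        raise ValueError("at least one experiment with a positive weight is required")
    return sum(weight * pct for weight, pct in results) / total_weight


# Hypothetical run: a critical experiment (weight 10) fully passed,
# a less critical one (weight 5) passed half of its checks.
print(round(resiliency_score([(10, 100.0), (5, 50.0)]), 2))
```

Because the weights express priorities, the same raw results can yield different scores in different organizations, which matches the point that the score varies with requirements.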

Gamedays

Gamedays are controlled and planned events where organizations simulate real-world failure scenarios and test their system’s resiliency in a safe and controlled environment. During a Gameday, a team deliberately introduces failures or injects chaos into the system to observe and analyze its behavior and response.

Organizations should practice Gamedays because they give teams a chance to practice and improve their incident response and troubleshooting skills. Gamedays enhance team collaboration, communication, and coordination during high-stress situations, which are valuable skills when dealing with real-world failures or incidents. They also demonstrate an organization’s proactive approach to ensuring that the system can endure unexpected events and continue operating without significant disruptions.

Overall, Gamedays serve as a valuable practice to improve system resiliency, validate recovery mechanisms, and build a culture of preparedness and continuous improvement within organizations.

Incorporate Resiliency Checks in CI/CD Pipelines

Integrating resiliency checks into CI/CD pipelines offers several advantages, helping to enhance the overall robustness and reliability of software systems. Here are some key benefits of incorporating resiliency checks in these pipelines:
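As an illustration of what such a check could look like, the sketch below implements a small gate that a pipeline step might call after the chaos experiments finish. The function name, the verdict strings, and the threshold of 80 are assumptions made for this example; how the verdicts and score are collected depends on your tooling.

```python
import sys


def gate(verdicts: dict[str, str], score: float, min_score: float = 80.0) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 fails it."""
    failed = sorted(name for name, verdict in verdicts.items() if verdict != "Pass")
    if failed:
        print(f"chaos gate: experiments did not pass: {', '.join(failed)}", file=sys.stderr)
        return 1
    if score < min_score:
        print(f"chaos gate: score {score:.1f} is below the minimum {min_score:.1f}", file=sys.stderr)
        return 1
    return 0


# Hypothetical results collected earlier in the pipeline.
exit_code = gate({"pod-delete": "Pass", "ec2-terminate": "Pass"}, score=92.5)
print(exit_code)
```

In a real pipeline step, the script would end with `sys.exit(exit_code)` so that a non-zero code blocks the release.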

Improve Observability Posture

It is presumed that the system is actively monitored, and that relevant metrics, logs, traces, and other events are captured and analyzed before inducing chaos in the system. Make sure your observability tools and processes provide visibility into the system’s health, performance, and potential issues, triggering alerts or notifications when anomalies or deviations from the steady state are detected.

If an issue that occurs during the chaos is not visible, you have to improve your observability measures accordingly. By running several different chaos experiments, you can identify missing observability data and add it to your system.
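One lightweight way to find such gaps is to write down, per experiment, which signals you expect to see, and compare that with what actually fired during the run. The sketch below does exactly that; the experiment names and alert names are hypothetical placeholders for whatever your monitoring stack emits.

```python
def missing_signals(expected: dict[str, set[str]], observed: set[str]) -> dict[str, set[str]]:
    """For each experiment, list the expected signals that were never observed."""
    gaps = {}
    for experiment, signals in expected.items():
        absent = signals - observed
        if absent:
            gaps[experiment] = absent
    return gaps


# Hypothetical expectations and observations from one chaos run.
expected = {
    "pod-delete": {"PodRestartAlert", "HighErrorRateAlert"},
    "ec2-terminate": {"NodeNotReadyAlert"},
}
observed = {"PodRestartAlert", "NodeNotReadyAlert"}

print(missing_signals(expected, observed))
```

Each entry in the result points to a place where an alert rule, dashboard, or log source is missing for a failure you have already shown can happen.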

Conclusion

In this article, we learned what chaos engineering, resiliency, and reliability are and how all three relate to each other. We saw which tools are available for executing chaos and why we chose Litmus for our use case.

Further, we explored what types of chaos experiments we can execute in Kubernetes and AWS environments. We also saw a demo of chaos experiments executed in both environments. In addition to this, we have learned how we can design a resiliency framework and incorporate resiliency scoring, gamedays, and resiliency checks to improve the overall observability of a platform.

Thanks for reading! Hope you found this blog post helpful. If you have any questions or suggestions, please do reach out to Ruturaj.

References