LitmusChaos 3.0 Beta Rolls On With Multiple Enhancements

Posted on April 21, 2023


During KubeCon Detroit 2022, the maintainers of LitmusChaos announced the start of 3.0 Beta, with several planned enhancements to make the chaos platform more robust, leaner and more developer-friendly; in short, more effective and helpful for teams trying to improve the resilience of their cloud-native applications. There have been 5 releases in the interim, during which the project team has gained valuable feedback from the early adopters/beta test group. Some of the requested changes have already made it into the platform, while others will be included in the upcoming releases. In this post, let’s take a look at some of the important features delivered as of 3.0 Beta5 and take a quick peek at those coming over the next few releases, before 3.0 goes GA at KubeCon Chicago 2023.

What’s Delivered

Chaos Agent (Delegate) Helm Charts

A popular ask in the Litmus community, the agent helm chart simplifies the process of setting up the execution plane and reduces the turnaround time in getting started with experiments. Those automating their testbed setup or building self-service platforms that bootstrap users/developers with their own functional chaos instances can leverage this chart.

Experiment Scalability With Just-In-Time Per-Node Helpers

One of the differentiating features of Litmus is that it launches its privileged pods just in time and limits their execution to the duration of chaos, as opposed to running them continuously as part of the agent. These pods, commonly called “helpers” in Litmus parlance, perform operations such as manipulating cgroups and injecting network rules. In the past, however, one helper pod was launched for each target pod, which increased resource utilization, especially when injecting chaos on tens of targets simultaneously. This has been optimized with a per-node helper model, in which a single helper carries out fault injection on all target pods on a given node while still retaining its transient nature.
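To make the difference between the two helper models concrete, here is a minimal Python sketch. It is not Litmus code: the pod names, node names and both function names are hypothetical, and the only idea it illustrates is grouping target pods by the node they run on so that one helper per node is planned instead of one per pod.

```python
from collections import defaultdict

# Hypothetical target pods and the nodes they are scheduled on.
targets = [
    ("checkout-0", "node-a"),
    ("checkout-1", "node-a"),
    ("cart-0", "node-a"),
    ("payments-0", "node-b"),
    ("payments-1", "node-b"),
]


def helpers_per_pod(targets):
    """Older model (as described above): plan one helper for each target pod."""
    return [(pod, node) for pod, node in targets]


def helpers_per_node(targets):
    """Per-node model: plan one helper per node, covering every target pod on it."""
    plan = defaultdict(list)
    for pod, node in targets:
        plan[node].append(pod)
    return dict(plan)


if __name__ == "__main__":
    print("per-pod helpers: ", len(helpers_per_pod(targets)))
    print("per-node helpers:", len(helpers_per_node(targets)))
```

With the sample data above, the per-pod plan needs five helpers while the per-node plan needs two, and the gap widens as more targets land on the same node.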

Improved Debuggability With Error Outputs & Probe Verdicts

Among the feedback received from the community around experiment execution, one point was that the interpretation of overall experiment results is often subjective and doesn’t typically fall under the black-and-white categorization of pass/fail, which is what the verdict attribute in the ChaosResult status presented to them. Instead, the verdict is more applicable to a probe, whose result can be binary: the hypothesis was either proved or disproved. A follow-on requirement, then, was that each probe status should carry enough information about a failure to explain its verdict, so that users do not have to dig through logs. Oftentimes, users want to parse this out of the result object and display it elsewhere. Here again, errors should be easy to relate to (think ERROR_CODES).

All this resulted in enriching the ChaosResult resource, especially the probe schema, adding in more debuggability and better support for interpretation/integrations.
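As a rough illustration of the kind of integration this enables, the Python sketch below pulls the failed probes and their descriptions out of a result object. Everything in it is an assumption made for the example: the field names (`probeStatuses`, `verdict`, `description`), the probe names and the error text are hypothetical and only loosely modeled on the idea of a per-probe verdict, so check them against the actual ChaosResult schema of your Litmus version before relying on them.

```python
# Hypothetical ChaosResult-like object; field names and values are assumptions,
# not the authoritative schema.
chaos_result = {
    "status": {
        "probeStatuses": [
            {
                "name": "check-frontend-http",
                "type": "httpProbe",
                "status": {"verdict": "Passed", "description": ""},
            },
            {
                "name": "check-db-cmd",
                "type": "cmdProbe",
                "status": {
                    "verdict": "Failed",
                    # Placeholder text, not a real Litmus error code.
                    "description": "HYPOTHETICAL_ERROR_CODE: command exited with non-zero status",
                },
            },
        ]
    }
}


def failed_probes(result):
    """Return (probe name, description) pairs for probes whose verdict is not 'Passed'."""
    probes = result.get("status", {}).get("probeStatuses", [])
    return [
        (p.get("name", "<unnamed>"), p.get("status", {}).get("description", ""))
        for p in probes
        if p.get("status", {}).get("verdict") != "Passed"
    ]


if __name__ == "__main__":
    for name, description in failed_probes(chaos_result):
        print(f"{name}: {description}")
```

Using `.get` with defaults keeps the helper tolerant of missing fields, which is useful when the same parsing code has to handle results produced by different versions.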

Sidecar Support for Experiment Pods

Another ask from advanced users of the LitmusChaos platform was the ability to include sidecars in the experiment pod resources, mostly for the purpose of shipping logs to custom sinks. The ChaosEngine schema has been enhanced to take inputs from users to inject sidecars into the chaos-runner, experiment and helper pods, with necessary guardrails to avoid race conditions in container startup/termination.

Application Chaos Experiments For Spring Boot

While LitmusChaos has stayed focused on infrastructure chaos (which continues to be the most-needed type of chaos) until now, there have been requests to provide ALFI (Application Level Fault Injection) support in the platform. We’ve taken the first steps in this direction by adding support for Spring Boot chaos (latency, exceptions, CPU/memory stress and app kill), which is now available on the ChaosHub.

What’s Coming

While we have addressed some important items from our 3.0 goals over the past few releases, we will be adding other important capabilities in the next couple of quarters, including:

A completely refactored and simplified UI with a resilience-score driven UX, which has provisions for logical grouping of test environments
An updated ChaosHub with easy-to-use instructions for leveraging workflows for experiments
An updated Chaos-CI-Lib along with GitLab templates and GitHub Actions that work with Chaos Workflows (with references to the workflow dashboard on Chaos-Center) instead of the ChaosEngine CRs
An improved SDK that would help integrations and custom experiment building exercises
Provisions to perform a “lean” setup of the execution plane without auxiliary components such as the event-tracker and exporter

As always, this would not have been possible without the amazing contributions from the community, in the form of issues, code, Slack conversations, attendance on the sync-up calls, etc.

A huge shout-out to you all! The maintainers look forward to working with you to help improve your chaos engineering experience.