LitmusChaos 3.0: making chaos engineering robust, lean, and developer-centric

Posted on November 7, 2023 by Saranya Jena

Project post by Saranya Jena, LitmusChaos Maintainer

The idea behind building LitmusChaos 3.0 as a community was to create a more robust, leaner, and developer-centric chaos engineering platform that caters to how resiliency is perceived in today’s ecosystem.

At KubeCon + CloudNativeCon North America 2022 in Detroit, the maintainers announced a roughly year-long journey toward making chaos engineering more inclusive and effective. The community promised to bring 3.0 to general availability by KubeCon Chicago 2023, and here we are in November with 3.0 already live and warmly embraced by the LitmusChaos community.

LitmusChaos 3.0 was announced in early October this year. This blog covers the release in detail and sheds light on what’s coming next for the community.

The release of LitmusChaos 3.0 is a milestone cherished by the chaos engineering community, signifying a remarkable advancement for the project since becoming a CNCF project. This release is packed with exciting new features and enhancements that will empower users to take their chaos experiments to the next level.

Let’s dive into what you can expect from LitmusChaos 3.0:

Getting Started

Litmus 3.0 introduces many new features and improvements and, as a result, is not backward compatible with older versions. To get started with Litmus 3.0, you have two installation options. The first is through Helm; the comprehensive installation guide is in the Litmus Docs. Alternatively, you can install directly on Kubernetes by following the steps outlined in the readme.

Once you’ve successfully installed ChaosCenter, the next step is to configure your Chaos Infrastructure so you can begin running chaos experiments. You can set up Chaos Infrastructure either in cluster-wide mode or with a namespaced scope. Detailed instructions for both options can be found in the Litmus Documentation.

Revamped User Experience

In LitmusChaos 3.0, we’ve completely revamped the user experience (UX). Our goal was to provide users with a sleek and intuitive interface, and to achieve this we moved from the conventional Material UI to the [Harness UIcore library](https://uicore.harness.io/). Adopting Harness UIcore has made the user journey smoother and gives us a rich set of plug-and-play components, such as the Step Wizard and Pipeline Studio. These components have improved the visualization of chaos experiments and made the overall user experience significantly better.

Screenshot showing Litmus dashboard "overview" page

Environments for Chaos Infrastructure Organization

Managing Chaos Infrastructures, especially when running chaos experiments across different clusters and environments, can be a daunting task. LitmusChaos 3.0 introduces a groundbreaking feature – Environments. This feature empowers users to streamline and organize their chaos infrastructure more effectively. Whether it’s your production or non-production environment, Environments simplify the entire process, ensuring that you have a clear and structured view of your chaos experiments. Say goodbye to chaos infrastructure clutter and hello to organized chaos engineering.

Screenshot showing the new environment page on Litmus, with "qa-test-environment" entered as the Environment Name, "QA" and "testing" as tags, and Pre-Production selected as the environment type

Screenshot showing environments page with 3 environments (prod-env1, qa-test-environment, dem-environment)

Screenshot showing [demenvironment] Chaos Infrastructures page on Litmus

Chaos Studio for Simplified Experiment Tuning

LitmusChaos 3.0 introduces Chaos Studio, a tool that simplifies fine-tuning the parameters and configuration of your chaos experiments. It’s a one-stop solution, offering both a visual experiment builder and a YAML editor based on the Monaco Editor, so users can construct experiments precisely the way they want. One of the standout features of Chaos Studio is the ability to add resilience probes through the UI, alongside fault configurations. Chaos Studio follows an easy three-step process to create, schedule, and run chaos experiments, replacing the more complex setup procedures of earlier versions.

Screenshot showing Faults Library on Litmus

Screenshot showing pod-delete-16s page on Litmus

Screenshot showing select a probe on Litmus

Screenshot showing Chaos Experiments page on Litmus

Resilience Probes as First-Class Citizens

Recognizing the pivotal role of resilience probes in chaos engineering, LitmusChaos 3.0 has elevated them to “first-class citizens.” Resilience probes now follow a plug-and-play architecture: users create a probe once and reuse it across multiple experiments. This support caters to steady-state validation, enhancing system resilience. As of now, Litmus 3.0 supports four types of resilience probes: HTTP, CMD, K8s, and Prometheus (Prom) probes. Users can configure the same probe multiple times, ensuring flexibility and robustness. What’s more, users can easily track which faults reference each probe and how it is used. Adding a resilience probe is now a mandatory step in the experiment creation process, embracing a resilience-driven approach to chaos engineering.
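To make the "create once, reuse everywhere" idea concrete, here is a minimal Python sketch of a probe registry that also tracks which faults reference each probe. This is an illustration only and not Litmus code: the `ProbeRegistry` class, its method names, and the sample probe and fault names are all hypothetical; only the four probe types come from the text above.

```python
# Illustrative sketch only: ProbeRegistry and its methods are hypothetical,
# not part of the LitmusChaos API.

class ProbeRegistry:
    # The four probe types named in the post.
    KINDS = ("http", "cmd", "k8s", "prom")

    def __init__(self):
        self._kinds = {}       # probe name -> probe type
        self._references = {}  # probe name -> set of fault names using it

    def create(self, name, kind):
        """Define a probe once so that it can be reused later."""
        if kind not in self.KINDS:
            raise ValueError(f"unsupported probe type: {kind}")
        if name in self._kinds:
            raise ValueError(f"probe already exists: {name}")
        self._kinds[name] = kind
        self._references[name] = set()

    def attach(self, probe_name, fault_name):
        """Reference an existing probe from a fault in some experiment."""
        if probe_name not in self._kinds:
            raise KeyError(f"unknown probe: {probe_name}")
        self._references[probe_name].add(fault_name)

    def references(self, probe_name):
        """List the faults that reference a probe, sorted for stable output."""
        return sorted(self._references[probe_name])


registry = ProbeRegistry()
registry.create("frontend-health", "http")
# The same probe is attached to faults in two different experiments.
registry.attach("frontend-health", "pod-delete")
registry.attach("frontend-health", "pod-network-loss")
print(registry.references("frontend-health"))
```

The point of the sketch is the separation of concerns: the probe definition lives in one place, and faults only hold a reference to it, which is what makes usage tracking straightforward.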

Screenshot showing select your probe type on Litmus page

Screenshot showing Resilience Probes on Execution History page on Litmus

Support for High Availability of MongoDB

A major highlight of LitmusChaos 3.0 is its out-of-the-box support for MongoDB replicas. Users can now install MongoDB replicas via Helm using the Bitnami MongoDB chart. This support has unlocked the use of transactions in the codebase, eliminating inconsistencies and ensuring high availability for MongoDB.

Terminology Changes

We believe in clarity and consistency. In this release, we’ve refined our terminology to reflect the functionality better:

Chaos Agents/Delegates are now referred to as Chaos Infrastructures.
Chaos Scenarios/Workflows are now known as Chaos Experiments.
Chaos Experiments have been rebranded as Chaos Faults.
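
Because "Chaos Experiments" is both an old term and a new one, updating existing docs or scripts with naive chained find-and-replace can rename the same phrase twice. The Python sketch below shows one way to apply the renames in a single pass. It is an illustration, not an official migration tool: the `to_v3_terms` helper is hypothetical, it assumes "Agents/Delegates" and "Scenarios/Workflows" expand to the four plural phrases listed, and it only handles those exact spellings.

```python
import re

# Old term -> 3.0 term, taken from the list above. Splitting "Agents/Delegates"
# and "Scenarios/Workflows" into separate phrases is an assumption.
RENAMES = {
    "Chaos Agents": "Chaos Infrastructures",
    "Chaos Delegates": "Chaos Infrastructures",
    "Chaos Scenarios": "Chaos Experiments",
    "Chaos Workflows": "Chaos Experiments",
    "Chaos Experiments": "Chaos Faults",
}

# One combined pattern so every match is rewritten from the ORIGINAL text.
_PATTERN = re.compile("|".join(re.escape(old) for old in RENAMES))


def to_v3_terms(text):
    """Rewrite pre-3.0 terminology in a single pass (hypothetical helper)."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(0)], text)


print(to_v3_terms("Chaos Workflows run Chaos Experiments on Chaos Agents"))
```

Replacing one term at a time would first turn "Chaos Workflows" into "Chaos Experiments" and then, on a later pass, into "Chaos Faults"; the single-pass substitution avoids that.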

API Refactors and Improved Code Architecture

Behind the curtain, we’ve made substantial investments to improve the developer experience. Extensive API refactoring, the addition of backend unit tests, and code architecture enhancements have made it easier for developers to contribute to the LitmusChaos ecosystem. We invite developers and chaos engineering enthusiasts to explore these changes and get involved in our ever-evolving community.

Helm Charts for Chaos Agents (Delegates)

A widely requested feature within the Litmus community, the Helm chart for Chaos Agents simplifies the setup of the execution plane, significantly reducing the time needed to get started with experiments. This is particularly beneficial for those who are automating their testbed configuration or developing self-service platforms, allowing users and developers to create their own functional chaos instances using this chart.

Scaling Experiments with Just-In-Time Per-Node Helpers

A distinctive feature of Litmus is that it launches privileged pods, often referred to as “helpers” in the Litmus community, just in time and limits how long they run. These helpers perform actions such as manipulating cgroups and injecting network rules during chaos. In the past, a helper pod was launched for each target pod, which increased resource consumption, especially when running chaos experiments on multiple targets simultaneously. To optimize this, Litmus has introduced a per-node helper model in which a single helper carries out fault injection on all target pods within a given node while still remaining transient.
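
The difference between the two models can be sketched in a few lines of Python. This is a simplified illustration rather than the actual Litmus implementation: the function names, the dictionary shape, and the pod and node names are all made up for the example.

```python
from collections import defaultdict


def plan_helpers_per_pod(targets):
    """Older model: one helper for every target pod."""
    return [{"node": node, "pods": [pod]} for pod, node in targets]


def plan_helpers_per_node(targets):
    """Per-node model: one helper per node, covering all target pods on it."""
    pods_by_node = defaultdict(list)
    for pod, node in targets:
        pods_by_node[node].append(pod)
    return [{"node": node, "pods": pods} for node, pods in pods_by_node.items()]


# Hypothetical targets as (pod name, node name) pairs.
targets = [
    ("cart-0", "node-a"),
    ("cart-1", "node-a"),
    ("pay-0", "node-b"),
    ("pay-1", "node-a"),
]

print(len(plan_helpers_per_pod(targets)))   # one helper per target pod
print(len(plan_helpers_per_node(targets)))  # one helper per distinct node
```

With four target pods spread over two nodes, the per-pod plan needs four helpers while the per-node plan needs two, and the saving grows with the number of targets that share a node.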

Enhanced Debugging with Error Outputs and Probe Verdicts

Community feedback revealed that interpreting experiment results can be subjective and often doesn’t fit into the binary pass/fail categorization provided by the ChaosResult status resource. Users found that the verdict attribute was more applicable to the probe, which can yield binary results – either proving or disproving a hypothesis. Consequently, there was a need for each probe’s status to provide more information about failures for better understanding of the verdict, without the need to sift through logs. Users frequently sought to extract and display this information elsewhere. This led to the enrichment of the ChaosResult resource, particularly the probe schema, which now offers improved debugging capabilities and better support for interpretation and integrations.

Support for Sidecars in Experiment Pods

Advanced users of the LitmusChaos platform requested the ability to include sidecars in experiment pod resources, primarily for the purpose of forwarding logs to custom sinks. The ChaosEngine schema has been enhanced to allow users to inject sidecars into chaos-runner, experiment, and helper pods. This feature includes necessary safeguards to prevent race conditions during container startup and termination.

Application Chaos Experiments for Spring Boot

While LitmusChaos has traditionally focused on infrastructure chaos (which remains a crucial type of chaos), there have been growing requests for ALFI (Application Level Fault Injection) support within the platform. In response to this demand, Litmus has taken the initial steps in that direction by introducing support for Spring Boot chaos, encompassing latency, exceptions, CPU/memory stress, and application termination. These new features are now available on the ChaosHub, expanding the platform’s capabilities to address application-level chaos.

Enhancement in Litmus SDK

Litmus 3.0 brings forth an enhanced Software Development Kit (SDK) designed to streamline integrations and enable seamless custom experiment-building activities. This SDK empowers users to create, modify, and tailor their chaos experiments according to their specific needs, further expanding the versatility of the platform. Moreover, Litmus 3.0 offers extensive support for bootstrapping experiments on a wide range of platforms, including Azure, GCP, AWS, and VMware. This multi-platform compatibility ensures that users can carry out their chaos engineering efforts across diverse environments, making it even more accessible and adaptable to their unique infrastructure requirements.

What Lies on the Horizon

Our commitment to a monthly release cadence means that in the forthcoming updates, we’ll be bringing you a host of improvements, bug fixes, and exciting new features. Here’s a sneak peek into what you can expect:

Stopping Chaos Experiment Runs: With the 3.0 release, we received valuable feedback from our community, suggesting the addition of an option to stop ongoing experiment runs. This feature is particularly useful for experiments that run for extended durations, providing users with greater control and flexibility.
Toggling Cron Experiments: Another user-driven enhancement on our radar is the ability to toggle the enabling and disabling of cron-based experiments. This feature, stemming from community input, will be incorporated into the Stop Experiment functionality, making it more comprehensive and user-friendly.
Strengthening Codebase Stability: We recognize the importance of ensuring code stability, and to achieve this, we will be introducing a series of unit tests for both the backend and frontend components. Specifically, we will focus on enhancing unit tests for the authentication module and the frontend, bolstering the reliability of our codebase.

In addition to these highlighted features, you can expect a slew of other enhancements and bug fixes in our upcoming releases. We’re dedicated to providing you with a robust and continuously evolving Chaos Engineering platform. Stay tuned for more exciting updates!

We can’t wait to see the incredible experiments and innovations that this release will inspire. Thank you for being a part of the LitmusChaos community!

Are you excited about the upcoming features?

Join the Litmus community (#litmus channel on the Kubernetes Slack) and share your thoughts with us!

Release notes: https://github.com/litmuschaos/litmus/releases/tag/3.0.0

Hyderabad, India