As a critical component of many production systems, including Kubernetes, the etcd project’s first priority is reliability. Ensuring consistency and data safety requires our project contributors to continuously improve testing methodologies. In this article, we describe how we use advanced simulation testing to uncover subtle bugs, validate the robustness of our releases, and increase our confidence in etcd’s stability. We’ll share our key findings and how they have improved etcd.

Enhancing etcd’s Robustness Testing

Many critical software systems depend on etcd to be correct and consistent, most notably as the primary datastore for Kubernetes. After some issues with the v3.5 release, the etcd maintainers developed a new robustness testing framework to better test for correctness under various failure scenarios. To further enhance our testing capabilities, we integrated a deterministic simulation testing platform from Antithesis into our workflow.

The platform works by running the entire etcd cluster inside a deterministic hypervisor. This specialized environment gives the testing software complete control over every source of non-determinism, such as network behavior, thread scheduling, and system clocks. This means any bug it discovers can be perfectly and reliably reproduced.
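To make the reproducibility claim concrete, here is a minimal sketch of the idea in Go. It is not the Antithesis hypervisor or its API; `simulate` and its toy three-node "cluster" are hypothetical names invented for illustration. The sketch assumes that every scheduling decision, network delay, and clock tick is drawn from one seeded generator, so replaying a seed replays the exact run.

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulate runs a toy "cluster" of three nodes. Every source of
// non-determinism (which node runs next, how long a message is delayed,
// and the clock) comes from a single seeded generator.
// Hypothetical illustration only; this is not the Antithesis API.
func simulate(seed int64, steps int) []string {
	rng := rand.New(rand.NewSource(seed)) // the only source of randomness
	clock := 0                            // simulated milliseconds, never the wall clock
	trace := make([]string, 0, steps)
	for i := 0; i < steps; i++ {
		node := rng.Intn(3)   // scheduling decision
		delay := rng.Intn(50) // network delay in simulated ms
		clock += delay + 1
		trace = append(trace, fmt.Sprintf("t=%dms node%d delay=%dms", clock, node, delay))
	}
	return trace
}

// sameTrace reports whether two traces are identical, step by step.
func sameTrace(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := simulate(42, 5)
	replay := simulate(42, 5)
	fmt.Println("replay identical:", sameTrace(first, replay))
}
```

Because nothing in the run reads real time or real randomness, a failing seed is a complete bug report: rerunning it walks through the same interleaving again.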

Within this simulated environment, the testing methodology shifts away from traditional, scenario-based tests. Instead of writing tests imperatively with strict assertions for one specific outcome, this approach uses declarative, property-based assertions about system behavior. These properties are high-level invariants about the system that must always hold true, such as “data consistency is never violated” or “a watch event is never dropped.”
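As an illustration of what such a property can look like in code, the sketch below checks “a watch event is never dropped” against a recorded history. The `Event` type and `checkNoDroppedEvents` function are hypothetical simplifications, not the types used in etcd's robustness tests, and the check assumes the watch has been fully drained before it runs.

```go
package main

import "fmt"

// Event is a hypothetical, simplified record of one change to a key.
type Event struct {
	Revision int64
	Key      string
}

// checkNoDroppedEvents states the property "a watch event is never dropped":
// every event the store committed must appear, in order, in what the watcher
// observed. It returns an error describing the first violation, or nil.
// Assumption: the watcher has finished receiving before the check runs.
func checkNoDroppedEvents(committed, observed []Event) error {
	for i, want := range committed {
		if i >= len(observed) {
			return fmt.Errorf("event at revision %d was never delivered", want.Revision)
		}
		if observed[i] != want {
			return fmt.Errorf("position %d: want %+v, got %+v", i, want, observed[i])
		}
	}
	return nil
}

func main() {
	committed := []Event{{1, "a"}, {2, "b"}, {3, "a"}}
	observed := []Event{{1, "a"}, {3, "a"}} // revision 2 is missing
	if err := checkNoDroppedEvents(committed, observed); err != nil {
		fmt.Println("property violated:", err)
	}
}
```

Note that the check says nothing about which operations ran or which faults occurred. It only describes what must be true of any history, which is what lets the same assertion be reused across every scenario the platform explores.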

The platform then treats these properties not as passive checks, but as targets to break. It combines automated exploration with targeted fault injection, actively searching for the precise sequence of events and failures that will cause a property to be violated. This active search for violations is what allows the platform to uncover subtle bugs that result from complex combinations of factors. Antithesis refers to this approach as Autonomous Testing.
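The idea of treating a property as a target can be sketched with a toy example. Everything here is an assumption made for illustration: the store, its planted bug (a crash right after a compaction rolls the revision back), and the blind enumeration of seeds in `findViolation`. The real platform's exploration is guided rather than blind, and its internals are not described in this article.

```go
package main

import (
	"fmt"
	"math/rand"
)

// runOnce drives a toy store with a deliberately planted bug: a crash that
// lands immediately after a compaction rolls the revision back by one.
// It returns false if the property "revision never decreases" is violated.
// The operation mix and the bug are invented for this sketch.
func runOnce(seed int64, steps int) bool {
	rng := rand.New(rand.NewSource(seed))
	revision, lastSeen := int64(0), int64(0)
	justCompacted := false
	for i := 0; i < steps; i++ {
		switch rng.Intn(10) {
		case 0: // injected fault: crash and restart
			if justCompacted {
				revision-- // planted bug
			}
			justCompacted = false
		case 1: // compaction
			justCompacted = true
		default: // ordinary write
			revision++
			justCompacted = false
		}
		if revision < lastSeen {
			return false // property broken
		}
		lastSeen = revision
	}
	return true
}

// findViolation tries seeds in order and returns the first one whose run
// breaks the property. A naive stand-in for guided exploration.
func findViolation(maxSeeds int64, steps int) (int64, bool) {
	for seed := int64(0); seed < maxSeeds; seed++ {
		if !runOnce(seed, steps) {
			return seed, true
		}
	}
	return 0, false
}

func main() {
	if seed, found := findViolation(1000, 50); found {
		fmt.Println("violating seed:", seed)
		fmt.Println("replay still fails:", !runOnce(seed, 50))
	}
}
```

The bug only appears when two specific events happen back to back, which is the kind of combination a hand-written scenario test is unlikely to cover. Because each run is determined by its seed, the violating seed can be replayed as often as needed while debugging.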

This builds upon etcd’s existing robustness tests, which also use a property-based approach. However, without a deterministic environment or automated exploration, the original framework resembled throwing darts while blindfolded and hoping to hit the bullseye. A bug might be found, but the process relies heavily on random chance and is difficult to reproduce. Antithesis’s deterministic simulation and active exploration remove the blindfold, enabling a systematic and reproducible search for bugs.

How We Tested

Our goals for this testing effort were to:

  1. Validate the robustness of etcd v3.6.
  2. Improve etcd’s software quality by finding and fixing bugs.
  3. Enhance our existing testing framework with autonomous testing.

We ran our existing robustness tests on the Antithesis simulation platform, testing both 3-node and 1-node etcd clusters against a variety of injected faults.

We tested older versions of etcd with known bugs to validate the testing methodology, as well as our stable releases (3.4, 3.5, 3.6) and the main development branch. In total, we ran 830 wall-clock hours of testing, which simulated 4.5 years of usage.
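For a sense of scale, the two figures above imply a ratio of simulated time to wall-clock time of roughly 47 to 1. The short sketch below works through that arithmetic, assuming 365-day years. The article does not say how the platform accumulates simulated time, so this is only the ratio implied by the numbers, and `simulatedToWallClockRatio` is a hypothetical helper.

```go
package main

import "fmt"

// simulatedToWallClockRatio converts simulated years to hours (assuming
// 365-day years) and divides by the wall-clock hours spent testing.
func simulatedToWallClockRatio(simulatedYears, wallClockHours float64) float64 {
	const hoursPerYear = 365 * 24 // 8760
	return simulatedYears * hoursPerYear / wallClockHours
}

func main() {
	// 4.5 years is 39,420 hours; divided by 830 wall-clock hours.
	fmt.Printf("%.1fx\n", simulatedToWallClockRatio(4.5, 830))
}
```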

What We Found

The simulation testing found every known bug we tested for and also uncovered several new issues in our main development branch.

Here are some of the key findings:

Issues in the Main Development Branch

| Description | Report Link | Status | Impact | Details |
|---|---|---|---|---|
| Watch on future revision might receive old events | Triage Report | Fixed in 3.6.2 (#20281) | Medium | New bug discovered by Antithesis |
| Watch on future revision might receive old notifications | Triage Report | Fixed in 3.6.2 (#20221) | Medium | New bug discovered by both Antithesis and robustness tests |
| Panic when two snapshots are received in a short period | Triage Report | Open | Low | Previously discovered by robustness |
| Panic from db page expected to be 5 | Triage Report | Open | Low | New bug discovered by Antithesis |
| Operation time based on watch response is incorrect | Triage Report | Fixed test on main branch (#19998) | Low | Bug in robustness tests discovered by Antithesis |
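The first two findings above concern a watch opened at a future revision. The property they violate can be sketched as follows: a watch that starts at a given revision should only deliver events at or after it. `WatchEvent` and `checkWatchStartRevision` are hypothetical simplifications written for this article, not the actual checker in the robustness tests.

```go
package main

import "fmt"

// WatchEvent is a hypothetical, simplified view of one delivered watch event.
type WatchEvent struct {
	ModRevision int64
}

// checkWatchStartRevision verifies that a watch opened at startRevision only
// delivered events whose revision is at or after that starting point.
// It returns an error for the first event that is too old, or nil.
func checkWatchStartRevision(startRevision int64, events []WatchEvent) error {
	for _, ev := range events {
		if ev.ModRevision < startRevision {
			return fmt.Errorf("watch from revision %d received old event at revision %d",
				startRevision, ev.ModRevision)
		}
	}
	return nil
}

func main() {
	// The watch was opened at revision 10, ahead of the store's current revision.
	events := []WatchEvent{{ModRevision: 8}, {ModRevision: 10}, {ModRevision: 11}}
	if err := checkWatchStartRevision(10, events); err != nil {
		fmt.Println("property violated:", err)
	}
}
```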

Known Issues

Antithesis also found and reproduced these known issues in older releases. These were the “Brown M&Ms” set by the etcd maintainers: bugs we already knew about, included to confirm that the testing methodology would catch them.

| Description | Report Link |
|---|---|
| Watch dropping an event when compacting on delete | Triage Report |
| Revision decreasing caused by crash during compaction | Triage Report |
| Watch progress notification not synced with stream | Triage Report |
| Inconsistent revision caused by crash during defrag | Triage Report |
| Watchable runlock bug | Triage Report |

Conclusion

The integration of this advanced simulation testing into our development workflow has been a success. It has allowed us to find and fix critical bugs, improve our existing testing framework, and increase our confidence in the reliability of etcd. We will continue to leverage this technology to ensure that etcd remains a stable and trusted distributed key-value store for the community.