Testing asynchronous workflows using OpenTelemetry and Istio

Posted on October 9, 2025 by Arjun Iyer, SignaDot


Learn how to test complex asynchronous workflows in cloud native applications using OpenTelemetry for context propagation and Istio for traffic routing. Explore cost-effective approaches to isolate test environments without duplicating infrastructure.

Introduction

Asynchronous architectures have become a cornerstone of modern cloud native applications, enabling services to operate independently while maintaining system resilience and scalability. These architectures typically rely on message queues and event-driven communication patterns to decouple services, allowing them to handle varying loads and failures gracefully.

Popular message systems in the cloud native ecosystem include Apache Kafka, RabbitMQ, Redis Streams, Google Cloud Pub/Sub, AWS SQS, and Azure Service Bus. Each offers unique capabilities for different use cases, from high-throughput streaming to reliable message delivery. Regardless of which system you choose, testing end-to-end workflows that span multiple services and asynchronous boundaries presents unique challenges that traditional testing approaches struggle to address effectively.

This article explores how two CNCF projects—OpenTelemetry for distributed tracing and context propagation, and Istio for traffic management—can work together to create cost-effective, scalable testing environments for asynchronous workflows without the overhead of duplicating entire infrastructure stacks.

Challenges in testing asynchronous systems

Testing asynchronous systems introduces several complex challenges not found in synchronous, request-response architectures:

Environment setup complexity: Asynchronous systems require multiple coordinated components—brokers, producers, consumers, and often additional infrastructure like schema registries or monitoring tools. Setting up these components correctly with proper security, replication, and partitioning requires significant expertise and time.
Test isolation: Unlike synchronous calls where requests can be easily isolated, asynchronous messages in shared systems can interfere with each other. Ensuring that test data from one scenario doesn’t impact another requires careful coordination.
Resource costs: Traditional approaches often require duplicating the entire message infrastructure for each test environment, so costs grow in direct proportion to the number of environments as teams scale their testing practices.
Timing and ordering: Asynchronous systems introduce timing dependencies and message ordering considerations that make tests more complex to design and more prone to flaky behavior.
Environment drift: Maintaining multiple isolated environments often leads to configuration drift, where test environments diverge from production, reducing test confidence.

In this context, we define a test tenant as an isolated testing context that needs to run test scenarios without interfering with other concurrent tests or production traffic.

Three approaches to test environment isolation

There are three primary approaches to achieving test isolation in asynchronous systems, each with different trade-offs in terms of cost, complexity, and isolation guarantees:

1. Infrastructure-level isolation

The most straightforward approach is to create completely separate infrastructure for each test tenant. This means deploying independent instances of message brokers, databases, and all supporting services for every test scenario.

Advantages: Complete isolation between test environments with no risk of cross-tenant interference. Tests can modify any infrastructure configuration without affecting others.

Considerations: This approach becomes prohibitively expensive as the number of concurrent tests grows. Managing numerous independent environments creates significant operational overhead. Without automation, environments quickly become stale and drift from production configurations, reducing test reliability.

2. Resource-level isolation

This approach shares core infrastructure (like message brokers) between tenants but creates isolated resources within that infrastructure. For example, using dedicated topics in Kafka, separate queues in RabbitMQ, or isolated namespaces in Redis for each test tenant.

Advantages: Significant cost savings compared to full duplication while maintaining good isolation. Shared infrastructure components reduce operational complexity.

Considerations: Still requires duplicating and reconfiguring all producer and consumer services for each tenant. Complex automation is needed to create and tear down resources dynamically. Configuration management becomes challenging as the number of tenants grows.

3. Request-level isolation

The most cost-effective approach establishes a shared baseline environment with all infrastructure and services running in production-like configurations. Test isolation is achieved through dynamic routing of requests and messages based on tenant context, rather than physical resource separation.

In this model, each test tenant has a unique identifier that gets propagated through all synchronous and asynchronous communication. Services use this identifier to determine whether they should process a given request or message. When testing specific service versions, only those services under test are deployed as separate instances—everything else uses the shared baseline.

Advantages: Minimal infrastructure duplication leads to dramatic cost savings (often 85%+ reduction). No environment drift since the baseline is continuously updated through existing CI/CD pipelines. Fastest environment creation (seconds vs. minutes/hours). Lowest operational overhead.

Considerations: Requires instrumenting services for context propagation and implementing selective message processing. May not be suitable for scenarios requiring infrastructure-level isolation or testing infrastructure changes themselves.

The remainder of this article focuses on implementing this request-level isolation approach using OpenTelemetry and Istio.

Implementing request-level isolation

Implementing request-level isolation requires two key capabilities: propagating tenant context across service boundaries, and routing traffic based on that context. OpenTelemetry and Istio provide complementary solutions to these challenges.

The architecture works by assigning each test tenant a unique identifier with mappings to specific service versions being tested. This tenant ID gets propagated through both synchronous (HTTP/gRPC) and asynchronous (message queue) communication paths, enabling dynamic routing decisions at every hop.

Central RouteService for tenant mapping

A critical component in this architecture is a central RouteService that maintains the authoritative mapping of tenant IDs to specific service versions under test. This service acts as the source of truth that all consumers consult to determine whether they should process a given message.

When multiple versions of a consumer service are running (baseline and under-test versions), each instance receives all messages from the shared message queue. However, only one consumer should process each message based on the tenant context. The RouteService solves this consumer contention problem by providing a centralized decision point:

Centralized configuration: All tenant-to-service mappings are stored in one place, making it easy to create, modify, and delete test scenarios without coordinating across multiple services.
Consumer coordination: Multiple consumer instances can safely coexist, with each consulting the RouteService to determine if they should process a message based on the tenant ID and their own service version.
Dynamic updates: Test scenarios can be created and destroyed without restarting consumers—they simply get updated routing information from the RouteService.
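To make the decision logic concrete, here is a minimal in-memory sketch of such a mapping. The class name, method names, and the "baseline" default are assumptions made for illustration; the article does not prescribe an API, and a real RouteService would be a networked service with persistent storage.

```python
from threading import Lock


class RouteService:
    """In-memory stand-in for the central tenant-to-version mapping (hypothetical API)."""

    def __init__(self, baseline_version="baseline"):
        self.baseline_version = baseline_version
        self._routes = {}  # tenant_id -> {service_name: version_under_test}
        self._lock = Lock()

    def set_route(self, tenant_id, service, version):
        """Register that `tenant_id` should be served by `version` of `service`."""
        with self._lock:
            self._routes.setdefault(tenant_id, {})[service] = version

    def delete_tenant(self, tenant_id):
        """Tear down a test scenario; its traffic falls back to the baseline."""
        with self._lock:
            self._routes.pop(tenant_id, None)

    def target_version(self, tenant_id, service):
        """Version that should handle this tenant's messages for `service`."""
        with self._lock:
            return self._routes.get(tenant_id, {}).get(service, self.baseline_version)

    def should_process(self, tenant_id, service, my_version):
        """True only for the single consumer version that owns this message."""
        return self.target_version(tenant_id, service) == my_version
```

Because every unmapped tenant (including messages with no tenant ID at all) resolves to the baseline version, exactly one consumer version claims each message, which is the property the contention problem requires.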

Using OpenTelemetry for context propagation

OpenTelemetry provides the foundation for propagating tenant context across service boundaries through its powerful context propagation capabilities. The key mechanisms are:

Baggage for custom context: OpenTelemetry’s Baggage feature lets you attach custom key-value pairs (like tenant IDs) to the active context, where they propagate across service calls alongside the trace context. This works for HTTP/gRPC requests and can be extended to asynchronous messaging systems.

Automatic header propagation: OpenTelemetry’s auto-instrumentation libraries for languages like Java, Node.js, Python, and Go automatically handle header propagation for popular frameworks, reducing the implementation burden on development teams.

Message queue integration: OpenTelemetry provides specific guidance and libraries for propagating context through message queues. For example, when publishing messages to Kafka, producers automatically inject trace context (including baggage) into message headers. Consumers can then extract this context to make routing decisions.

On the consumer side, services can extract the tenant ID from the propagated context and decide whether to process the message based on their tenant mappings.
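The producer and consumer steps described above can be sketched with the standard library alone. This is a hand-rolled illustration of the W3C `baggage` header format, not OpenTelemetry's own propagator, which a real service would use instead. The `tenant-id` key, the function names, and the list-of-(name, bytes) header shape (common in Kafka clients) are assumptions.

```python
from urllib.parse import quote, unquote

BAGGAGE_HEADER = "baggage"  # header name defined by the W3C Baggage specification
TENANT_KEY = "tenant-id"    # hypothetical baggage key, not mandated by the article


def inject_baggage(entries, headers):
    """Producer side: append a baggage header to Kafka-style message headers."""
    value = ",".join(f"{key}={quote(str(val), safe='')}" for key, val in entries.items())
    headers.append((BAGGAGE_HEADER, value.encode("utf-8")))
    return headers


def extract_baggage(headers):
    """Consumer side: parse the baggage header back into a dict."""
    entries = {}
    for name, raw in headers:
        if name.lower() != BAGGAGE_HEADER:
            continue
        for member in raw.decode("utf-8").split(","):
            key, sep, val = member.partition("=")
            if sep:
                # Drop optional ";property" metadata that may follow the value.
                entries[key.strip()] = unquote(val.split(";", 1)[0].strip())
    return entries


def should_process(headers, my_version, tenant_versions, baseline="baseline"):
    """Process the message only if this consumer version owns the tenant."""
    tenant = extract_baggage(headers).get(TENANT_KEY)
    return tenant_versions.get(tenant, baseline) == my_version
```

Here `tenant_versions` stands in for whatever tenant mapping the consumer has obtained; messages without a tenant ID fall through to the baseline consumer.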

Leveraging Istio for traffic routing

While OpenTelemetry handles context propagation, Istio provides sophisticated traffic routing capabilities that complement this by enabling dynamic routing of HTTP/gRPC requests based on headers, including the tenant context propagated by OpenTelemetry.

Virtual services for header-based routing: Istio Virtual Services can route requests to different service versions based on header values. This allows you to direct traffic from specific tenants to the service versions under test while sending all other traffic to the baseline services.
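As an illustration of header-based routing, the sketch below builds a VirtualService manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The host, subset names, `tenant-id` baggage key, and resource name are hypothetical. The `networking.istio.io/v1beta1` API version is assumed and may need adjusting for your Istio release.

```python
import json
import re


def tenant_virtual_service(host, tenant_id, test_subset, baseline_subset="baseline"):
    """Route requests carrying one tenant's baggage entry to a test subset."""
    if not re.fullmatch(r"[A-Za-z0-9-]+", tenant_id):
        raise ValueError("tenant_id must be alphanumeric or hyphens to embed in a regex")
    # Istio regex matches apply to the whole header value, so surrounding
    # baggage entries and ";property" suffixes are allowed for explicitly.
    pattern = rf"(.*,)?\s*tenant-id={tenant_id}\s*([,;].*)?"
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-tenant-routing"},
        "spec": {
            "hosts": [host],
            "http": [
                {   # Tenant traffic goes to the version under test.
                    "match": [{"headers": {"baggage": {"regex": pattern}}}],
                    "route": [{"destination": {"host": host, "subset": test_subset}}],
                },
                {   # Everything else stays on the shared baseline.
                    "route": [{"destination": {"host": host, "subset": baseline_subset}}],
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(tenant_virtual_service("orders", "tenant-42", "pr-1234"), indent=2))
```

Rule order matters: Istio evaluates HTTP routes top to bottom, so the tenant match must precede the catch-all baseline route.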

Destination rules for service subsets: Istio Destination Rules define the different versions of services (baseline vs. under-test) and their load balancing policies.
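A matching DestinationRule can be generated the same way. In this sketch the subset names, the `version` pod label, and the resource name are assumptions; subsets select pods by label, so the labels must match whatever your deployments actually carry.

```python
def destination_rule(host, subset_versions, lb_policy="ROUND_ROBIN"):
    """Build a DestinationRule with one subset per deployed service version.

    subset_versions maps subset name -> value of the pods' `version` label.
    """
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": f"{host}-versions"},
        "spec": {
            "host": host,
            # Load balancing policy applied to every subset unless overridden.
            "trafficPolicy": {"loadBalancer": {"simple": lb_policy}},
            "subsets": [
                {"name": name, "labels": {"version": version}}
                for name, version in subset_versions.items()
            ],
        },
    }
```

For example, `destination_rule("orders", {"baseline": "v1", "pr-1234": "pr-1234"})` declares the two subsets that a header-based VirtualService route could then refer to by name.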

Key benefits of Istio integration: Transparent routing without application code changes, sophisticated traffic shaping capabilities (weighted routing, fault injection, timeouts), and comprehensive observability of routing decisions through built-in metrics and tracing integration.

Implementation considerations

While this approach offers significant benefits, there are several important considerations for successful implementation:

Non-request-scoped workflows: Some asynchronous workflows aren’t triggered by external requests—batch jobs, scheduled tasks, or event-driven processes that start from database changes or time-based triggers. These scenarios require alternative approaches, such as embedding tenant context in data sources or using separate scheduling mechanisms for test scenarios.

Distributed cache management: When services cache tenant mappings for performance, cache invalidation becomes critical. Implementing proper cache coherence mechanisms (like Redis pub/sub notifications or short TTL values) ensures that services quickly recognize new tenants or mapping changes.
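The short-TTL option can be sketched as a small local cache in front of the mapping lookup. The class name, the `fetch` callback, and the five-second default are illustrative assumptions; a pub/sub notification handler could call `invalidate` to pick up changes sooner than the TTL allows.

```python
import time


class TTLRouteCache:
    """Caches tenant -> version lookups for a short, bounded time."""

    def __init__(self, fetch, ttl_seconds=5.0, clock=time.monotonic):
        self._fetch = fetch      # callable(tenant_id) -> version, e.g. a RouteService call
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # tenant_id -> (expires_at, version)

    def get(self, tenant_id):
        now = self._clock()
        cached = self._entries.get(tenant_id)
        if cached and cached[0] > now:
            return cached[1]
        # Missing or expired: refresh from the source of truth.
        version = self._fetch(tenant_id)
        self._entries[tenant_id] = (now + self._ttl, version)
        return version

    def invalidate(self, tenant_id=None):
        """Drop one tenant's entry, or the whole cache when no tenant is given."""
        if tenant_id is None:
            self._entries.clear()
        else:
            self._entries.pop(tenant_id, None)
```

The TTL bounds how long a consumer can act on a stale mapping, so it also bounds how long a newly created tenant might be handled by the wrong version.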

Consumer group lifecycle: Message queue consumer groups require careful lifecycle management. Create unique consumer group names for test services, ensure proper offset handling (often starting from the baseline consumer’s current offset), and clean up consumer groups when tests complete to avoid resource leaks.

Data isolation strategies: Different levels of data isolation may be needed depending on your use case. Options include tenant-prefixed keys in databases, separate database schemas, or carefully designed test data that doesn’t conflict with baseline data.

Observability and debugging: Implement comprehensive logging and metrics around tenant routing decisions. OpenTelemetry’s distributed tracing helps track how tenant context flows through the system and identify where routing decisions are made.

Security considerations: Ensure tenant IDs can’t be spoofed or manipulated by unauthorized actors. Consider using signed tokens or validating tenant context against authorized test scenarios to prevent accidental or malicious cross-tenant access.
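One way to approach the signed-token idea is an HMAC over the tenant ID, checked at the edge before the ID is trusted for routing. This is a minimal sketch: the `tenant.signature` token layout and function names are assumptions, and key distribution, rotation, and expiry are out of scope here.

```python
import hashlib
import hmac


def sign_tenant(tenant_id, secret):
    """Return 'tenant_id.signature' using HMAC-SHA256 with a shared secret (bytes)."""
    mac = hmac.new(secret, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{tenant_id}.{mac}"


def verify_tenant(token, secret):
    """Return the tenant ID if the signature is valid, otherwise None."""
    tenant_id, sep, mac = token.rpartition(".")
    if not sep:
        return None  # no signature part present
    expected = hmac.new(secret, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    if hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return tenant_id
    return None
```

A gateway or test harness would call `verify_tenant` on the incoming value and fall back to baseline routing whenever it returns `None`.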

Conclusion

Testing asynchronous workflows in cloud native applications presents unique challenges that traditional testing approaches struggle to address cost-effectively. The combination of OpenTelemetry for context propagation and Istio for traffic routing provides a powerful foundation for implementing request-level isolation that dramatically reduces infrastructure costs while maintaining high-quality testing practices.

The vendor-neutral nature of both OpenTelemetry and Istio makes this solution broadly applicable across different cloud providers, message systems, and application architectures. Success with this approach requires careful planning for edge cases, such as batch workflows, and proper lifecycle management of shared resources, but the benefits—reduced costs, faster feedback loops, and elimination of environment drift—make it a compelling choice for teams serious about scaling their testing practices in cloud native environments.