Project post originally published on the Linkerd Blog by Flynn

This blog post is based on a workshop that I recently delivered at Buoyant’s Service Mesh Academy. If this seems interesting, check out the full recording!

Linkerd 2.13 adds two long-requested features to Linkerd: dynamic request routing and circuit breaking.

While Linkerd 2.12 has been able to do some dynamic request routing, Linkerd 2.13 expands quite a bit on the feature. Circuit breaking is completely new in Linkerd 2.13.

Dynamic Request Routing

In Linkerd 2.11 and earlier, the only mechanism for any sort of dynamic routing used the TrafficSplit extension and the linkerd-smi extension to support a coarse-grained routing behavior based on the service name and the desired percentage of traffic to be split. For example:

Linkerd 2.12 introduced support for basic header-based routing, using the HTTPRoute CRD from the Gateway API. This allowed for routing based on the value of a header, but it didn’t support weighted routing at the same time.

Dynamic request routing in Linkerd 2.13 brings these two worlds together using the HTTPRoute CRD, and expands it further by supporting weighted routing based on request headers, verbs, or other attributes of the request (though not the request body). This is much more powerful than what was possible with 2.12 and earlier. For example:

The Gateway API is a Kubernetes SIG-Networking project started in 2020, primarily to address the challenges related to the proliferation of annotations in use on the Ingress resource. In 2022, the Gateway API project began the GAMMA (Gateway API for Mesh Management and Administration) initiative to explore how to use the Gateway API for mesh networking. Linkerd is an active participant in both efforts: the power and flexibility of the Gateway API makes it easier to expand Linkerd’s capabilities while maintaining its overall best-in-class operational simplicity.

One important caveat, though, is that since the Gateway API was originally designed to manage ingress traffic – traffic from outside the cluster coming in – its conformance tests are not yet well-suited to service meshes, so Linkerd can’t yet be fully conformant with the Gateway API. For this reason, Linkerd uses the HTTPRoute resource in the policy.linkerd.io APIGroup, rather than the official Gateway API APIgroup. There’s work actively underway to improve this situation.

Dynamic Request Routing Examples

First, a simple canary example. This example does a 5050 split for requests to the color Service, routing half to the endpoints being the actual color Service and half to those behind the color2 Service.

apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
 name: color-canary
 namespace: faces
spec:
 parentRefs:
   - name: color
     kind: Service
     group: core
     port: 80           # Match port numbers with what’s in the Service resource
 rules:
 - backendRefs:
   - name: color
     port: 80
     weight: 50         # Adjust the weights to control balancing
 - backendRefs:
   - name: color2
     port: 80
     weight: 50

I’m being careful here about the distinction between a Service and the endpoints behind the Service, because the HTTPRoute acts on requests sent to a particular service, routing them to endpoints behind a service. This is why having color in the parentRefs stanza and also in one of the backendRefs stanzas works, without creating a loop.

Here’s an example of A/B testing. Here, requests sent to the smiley Service with the header

X-Faces-User: testuser

get routed to endpoints behind the smiley2 Service, while other requests continue on to endpoints behind the smiley Service.

apiVersion: policy.linkerd.io/v1beta2
kind: HTTPRoute
metadata:
 name: smiley-a-b
 namespace: faces
spec:
 parentRefs:
   - name: smiley
     kind: Service
     group: core
     port: 80
 rules:
 - matches:
   - headers:
     - name: "x-faces-user"   # X-Faces-User: testuser goes to smiley2
       value: "testuser"
   backendRefs:
     - name: smiley2
       port: 80
 - backendRefs:
   - name: smiley
     port: 80

One critical point about the A/B test: Linkerd can do dynamic request routing anywhere, but of course if you want to route on a header, you need to make sure that header is present at the place you want to use it for routing! This may mean that you need to be careful to propagate headers through the various workloads of your application.

You can find more details about dynamic request routing in its documentation, at https://linkerd.io/2/tasks/configuring-dynamic-request-routing/.

Circuit Breaking

Circuit breaking is new to Linkerd 2.13, but it’s been long requested by users. It’s a mechanism to try to avoid overwhelming a failing workload endpoint with additional traffic:

In Linkerd 2.13, circuit breaking is a little limited:

Circuit breakers in Linkerd are expected to gain functionality rapidly, so keep an eye out as new releases happen (and the annotation approach should be supplanted with Gateway API CRDs).

Circuit Breaking Example

To break the circuit after four consecutive request failures, apply these annotations to a Service:

balancer.linkerd.io/failure-accrual: consecutive
balancer.linkerd.io/failure-accrual-consecutive-max-failures: 4

The failure-accrual: consecutive annotation switches on circuit breaking, and sets it to the “consecutive failure” mode (which is the only supported mode in 2.13).

All configuration for the “consecutive failure” mode of circuit breaking uses annotations that start with failure-accrual-consecutive-; the failure-accrual-consecutive-max-failures annotation sets the number of consecutive failures after which the circuit breaker will open.

Try reenabling traffic after 30 seconds:

balancer.linkerd.io/failure-accrual-consecutive-min-penalty: 30s

(This is for the first attempt. After that, the delay grows exponentially.)

Don’t ever wait more than 120 seconds between retries:

balancer.linkerd.io/failure-accrual-consecutive-max-penalty: 120s

More information on circuit breaking is available in its documentation, at https://linkerd.io/2/tasks/circuit-breakers/.

Gotchas

The biggest gotcha of them all is that in Linkerd 2.13, ServiceProfiles do not compose with dynamic request routing and circuit breaking.

Getting specific, this means that when a ServiceProfile defines routes, it takes precedence over other HTTPRoutes with conflicting routes, and it also takes precedence over circuit breakers associated with the workloads referenced in the ServiceProfile. This is expected to be the case for the foreseeable future, to minimize surprises when upgrading from a version of Linkerd without the new features.

The challenge here, of course, is that there are still several things that require ServiceProfiles in Linkerd 2.13 (for example, retries and timeouts). The Linkerd team is actively working to quickly make all of this better, with a particular short-term focus on rapidly bringing HTTPRoutes to feature parity with ServiceProfiles.

Debugging Dynamic Request Routing and Circuit Breaking

The most typical failure you’ll see when trying to use these new features is to enable a new feature and see that it doesn’t seem to be active. There are some simple rules of thumb for debugging:

Dynamic Request Routing and Circuit Breaking

Taken together, dynamic request routing and circuit breaking are two important new additions to Linkerd 2.13. While still a bit limited in 2.13, keep an eye out: we have big plans for these features as Linkerd’s development continues.

If you want more on this topic, check out the Circuit Breaking and Dynamic Request Routing Deep Dive Service Mesh Academy workshop for hands-on exploration of everything I’ve talked about here! And, as always, feedback is always welcome – you can find me as @flynn on the Linkerd Slack.