Linkerd Is Apester’s ‘Safety Net’ Against Cascading Failure from Forgotten Timeouts
An interactive content platform with a microservice architecture, Apester experienced a few outages when developers forgot to put timeouts on service-to-service requests, causing cascading failures for the entire platform. “It’s critical for us to enable timeouts at the service level as config instead of code,” says Apester SRE Or Elimelech.
Elimelech implemented Linkerd service profiles for setting low timeouts. “We’re able to enforce timeouts via Linkerd, even if you forgot about it,” says Elimelech.
MTTR has been shortened by a factor of 2. “Developers are able to detect the bad service in the mesh themselves,” says Elimelech. “It’s easier to spot bad deployments that have a spike in latency, something like memory leaks that are small enough that we wouldn’t see without Linkerd.”
When Site Reliability Engineer Or Elimelech arrived at Apester two and a half years ago, he rebuilt the infrastructure to enable DevOps. The company moved to GKE and adopted cloud native technologies including Kubernetes and Prometheus. “I think most organizations that adopt Prometheus and whitebox monitoring need to actively push developers to understand Prometheus and learn the Prometheus client,” he says. “We use a few programming languages, so I needed the visibility and common metric system that I can rely on for my production use.”
In search of a service mesh, Elimelech first experimented with Conduit while it was in alpha (“I tried the new edge software weekly instead of the stable one in my production environment!”), and when it was merged into Linkerd under the CNCF umbrella, he adopted it for Apester. “I just needed metrics at first, and now I’m enjoying all of the other parts,” he says.
Those other parts include the solution for a major pain point the company had been experiencing. With its microservice architecture, Apester experienced a few outages when developers forgot to set timeouts on service-to-service requests. “The timeout defaults to 60 seconds, and if one service goes down, it causes cascading failure for the entire infrastructure when all the services that depend on this service get stuck because of the timeouts,” says Elimelech.
And for Apester and its customers—not to mention everyone unable to find out their score on a quiz—that’s a disaster. “We get all the traffic that the publisher gets because we are embedded on their site,” he says, and the numbers are enormous: more than 20 billion requests per month. “It’s critical for us to enable timeouts at the service level as config instead of code.”
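The failure mode Elimelech describes can be sketched in a few lines of Python. This is only an illustration, not Apester's code: the function names and the durations (a 2-second hang standing in for the 60-second default, a 100 ms deadline) are assumptions chosen for the example. A caller with no deadline waits as long as the dependency hangs, while a caller with a deadline fails fast and frees itself to serve other requests.

```python
import asyncio
import time


async def slow_dependency(hang_seconds: float) -> str:
    """Hypothetical stand-in for a downstream service that has stopped responding promptly."""
    await asyncio.sleep(hang_seconds)
    return "ok"


async def call_with_deadline(hang_seconds: float, timeout_seconds: float) -> str:
    """Call the dependency, but give up after timeout_seconds instead of waiting it out."""
    try:
        return await asyncio.wait_for(
            slow_dependency(hang_seconds), timeout=timeout_seconds
        )
    except asyncio.TimeoutError:
        # The caller is released quickly and can return an error or a fallback.
        return "timed out"


def demo() -> tuple[str, float]:
    """Run one call against a hanging dependency and report how long the caller was blocked."""
    start = time.monotonic()
    result = asyncio.run(call_with_deadline(hang_seconds=2.0, timeout_seconds=0.1))
    return result, time.monotonic() - start


if __name__ == "__main__":
    result, elapsed = demo()
    print(f"{result} after {elapsed:.2f}s")
```

The catch is that every caller, in every language, has to remember to add that deadline, which is the gap Apester wanted to close with configuration.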
At his previous company, Elimelech had written a library similar to those used by the Finagle RPC framework, on which Linkerd 1.x is built. “It was for all developers to just import and use, so it’s transparent,” he says. “It gives you service discovery, timeouts, and retries instead of breaking all the things. When you get to a certain scale, you have to adopt something like this, because people forget to add this stuff over and over again. It’s tedious. We believe in automation.”
“Linkerd is shifting the developer mind to focus on the business logic and not on some technicality. It just makes it easier and faster to deploy and develop, and you have a safety net.”
— Or Elimelech, Site Reliability Engineer at Apester
So to solve Apester’s problem, Elimelech implemented Linkerd service profiles for setting low timeouts. “Now a new developer just adds the API call and doesn’t need to add logic for retries and timeouts,” he says. “We’re able to enforce timeouts via Linkerd, even if you forgot about it. Instead of deploying a new version, you just change the profile and apply it, and you get it immediately without deployment, without code change.”
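For readers unfamiliar with service profiles, a per-route timeout looks roughly like the manifest generated below. This is a minimal sketch, not Apester's actual configuration: the service name, namespace, route, and the 300 ms value are all hypothetical, and the field layout follows the Linkerd 2.x ServiceProfile resource as commonly documented, so it should be checked against the Linkerd version in use.

```python
import json


def service_profile(service_fqdn: str, namespace: str, timeout: str) -> dict:
    """Build a Linkerd ServiceProfile manifest with a timeout on one route.

    All argument values used below are illustrative assumptions.
    """
    return {
        "apiVersion": "linkerd.io/v1alpha2",
        "kind": "ServiceProfile",
        # A ServiceProfile is named after the fully qualified service it describes.
        "metadata": {"name": service_fqdn, "namespace": namespace},
        "spec": {
            "routes": [
                {
                    "name": "GET /api/quiz",  # hypothetical route
                    "condition": {"method": "GET", "pathRegex": "/api/quiz"},
                    # Requests on this route that exceed the timeout are failed by the proxy.
                    "timeout": timeout,
                }
            ]
        },
    }


profile = service_profile("quiz.default.svc.cluster.local", "default", "300ms")
print(json.dumps(profile, indent=2))
```

Saving the printed JSON to a file and applying it with `kubectl apply -f` changes the timeout without rebuilding or redeploying the service, which is the "config instead of code" workflow described above.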
Linkerd has led to time savings on multiple fronts.
The service mesh “helps you separate the networking logic, so there are shorter iterations and developers can focus on the feature itself,” Elimelech says. “It’s shifting the developer mind to focus on the business logic and not on some technicality. It just makes it easier and faster to deploy and develop, and you have a safety net.”
Because Linkerd is language agnostic, “you don’t need to provide the same library for the Java stack and Node and Go and Python,” he adds. “You just have one sidecar container that takes over all the networking and does it in a transparent way without writing over and over again the same libraries in many languages.”
Plus, MTTR is shorter by a factor of 2. “Developers are able to detect a bad service in the mesh themselves,” Elimelech says. “It’s easier to spot bad deployments that have a spike in latency, something like memory leaks that are small enough that we wouldn’t see without Linkerd.”
“We’re able to enforce timeouts via Linkerd. Even if you forgot about it, you get it immediately without deployment, without code change.”
— Or Elimelech, Site Reliability Engineer at Apester
Next up, Elimelech is starting to implement canary deployments using Linkerd and the Kubernetes operator Flagger. “They will be very helpful in deploying new software and testing it with production traffic without actually switching the new deployment on for everyone,” he says.
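The idea behind a canary deployment can be sketched as a loop that shifts traffic to the new version in small steps and backs out as soon as the metrics look bad. The code below is a simplified illustration of that idea only, not Flagger's actual algorithm or API; the function name, step size, weight cap, and success threshold are assumptions made for the example.

```python
def run_canary(success_rates, step_weight=10, max_weight=50, min_success=0.99):
    """Shift traffic to the new version step by step; roll back on a bad reading.

    success_rates: observed canary success rate at each analysis interval.
    Returns a (status, canary_weight_percent) tuple.
    """
    weight = 0
    for rate in success_rates:
        if rate < min_success:
            # A bad reading sends all traffic back to the stable version.
            return "rolled back", 0
        weight += step_weight
        if weight >= max_weight:
            # The canary survived every step, so it takes over all traffic.
            return "promoted", 100
    # Not enough intervals observed yet to promote or roll back.
    return "in progress", weight


if __name__ == "__main__":
    print(run_canary([0.999, 0.999, 0.999, 0.999, 0.999]))
    print(run_canary([0.999, 0.95]))
```

In this sketch only a fraction of production traffic ever reaches a faulty release, because the first reading below the threshold returns the canary's weight to zero.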
Asked if he had any best practices to share, Elimelech said simply, “It’s such an easy-to-use product. Move to Linkerd and you’ll be very happy.”
Indeed, Apester has been. With all those billions of requests per month going through Linkerd, Elimelech says, “we’ve gone six months without a timeout, and counting.”