How Linkerd tamed gRPC for Entain Australia

By Steve Gray, Head of Feeds, and Steve Reardon, DevOps Engineer at Entain  

Entain is a leading global sports betting and gaming operator. For us, speed is everything: latency literally costs us money. When LeBron scores, the data needs to be processed by our pricing management systems within milliseconds.

To tackle this immense challenge, Entain Australia built our platform on modern, cloud-native technologies like Kubernetes, gRPC, containers, and microservices. These tools allowed us to build a high-performance, reliable, and scalable system—but it wasn’t perfect. In our initial iteration, the interaction between gRPC and Kubernetes around load balancing caused some parts of the system to run “hot” while other parts ran “cold”. This disrupted our customers’ experience, created financial risk, and resulted in too many sleepless nights for the team.

To address these issues, we adopted Linkerd. Within several months, we had reduced max server load by over 50% and increased our request volume by 10x, all while getting many more nights of uninterrupted sleep. In this post, we’ll share what inspired us to get started with a service mesh, why we selected Linkerd, and our experience running it in production.

Backdrop

Entain may not be a household name, but we operate some of the industry’s most iconic brands including Ladbrokes, Coral, BetMGM, bwin, Sportingbet, Eurobet, partypoker, partycasino, Gala, and Foxy Bingo. We are a FTSE 100 company, have licenses in more than 20 countries, and employ a workforce of more than 24,000 across five continents.

The Feeds Team at Entain Australia is responsible for one part of this mission: to ingest the immense amount of data that comes into the business, manage pricing, and make the results available to the rest of the platform as quickly as possible.

Hosted on Kubernetes, our proprietary online trading platform consists of approximately 300 microservices with over 3,000 pods per cluster in multiple geographic regions. To say it’s critical to the business would be an understatement. Our platform handles thousands of requests per second as users bet on live events. Any outage impacts revenue, user experience, and the accuracy of the prices being offered. Even a latency hit can have huge consequences. Because Entain relies on sports prices (based on probability of an outcome) to generate revenue, the user experience must be as real-time as possible. Imagine placing a bet on an event that you already know the result of!

Striving for high standards of performance, scale, and reliability didn’t come easy

Despite running a 24×7 environment, our engineers use AWS spot instances to keep costs low. We also use chaos engineering tools and practices to ensure we build a resilient platform and applications. Together, this created a constantly changing environment: nodes and pods come and go, and we needed a way to ensure reliable application performance despite all these changes under the hood.

Although our developers were quick to seize on the efficiency and reliability gains provided by microservices and gRPC, the default load balancing in Kubernetes didn’t give us the best performance out of the box. Because gRPC uses long-lived HTTP/2 connections and Kubernetes balances traffic at the connection level rather than per request, all requests from one pod in a service would end up going to a single pod in another service. This “worked”, but had two negative effects on the platform: 1) servers had to be exceptionally large to accommodate large traffic volumes, and 2) we couldn’t make use of horizontal scaling to process a larger number of requests. Because of this, we were unable to take advantage of many of the available spot instances or process a high volume of requests in a timely manner.
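To make that concrete, here is a minimal Go sketch of the failure mode from a client’s point of view. The service name, port, and stub mentioned in the comments are hypothetical, not part of our platform:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// "pricing" is a made-up in-cluster Service name used only for
	// illustration. Dialing the Service address establishes one long-lived
	// HTTP/2 connection, and kube-proxy pins that connection to a single
	// backend pod at the moment it is opened.
	conn, err := grpc.Dial(
		"pricing.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Every RPC issued through conn (e.g. via a generated stub such as
	// pb.NewPricingClient(conn)) is multiplexed over that single connection,
	// so all of this client's requests land on the same pod no matter how
	// many replicas the Deployment has. Kubernetes never gets another chance
	// to balance, because it only balances when connections are opened.
	doWorkWith(conn)
}

// doWorkWith stands in for the application's real request loop.
func doWorkWith(conn *grpc.ClientConn) { _ = conn }
```

With a Linkerd proxy sitting next to a client like this, the proxy intercepts the outbound traffic and balances the individual HTTP/2 requests across the backend pods, which is why no application code has to change.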

A lack of intelligent routing was another issue. Although we operate in one region, Australia, Entain spans individual clusters across multiple AWS availability zones (AZs) to hit our ambitious availability targets. This ensures no one availability zone is a single point of failure for the platform. For more casual Kubernetes users this isn’t a problem, but at Entain’s scale and request volume, cross-AZ traffic began to represent a tangible source of both latency and cost. Transactions that needlessly crossed an AZ boundary would slow down the platform’s performance and incur additional charges from AWS.

The technology and business gains came quickly and easily

Once Linkerd had finished rolling out to all our pods, the Linkerd proxy took over routing requests between instances. This allowed our platform to immediately route traffic away from pods that were failing or being spun down.

We also saw immediate improvements in load balancing. Linkerd’s gRPC-aware load balancing fixed the issues we had seen with gRPC on Kubernetes and started balancing requests properly across all destination pods.

This allowed us to realize two major business gains: we increased the volume of requests the platform could handle by over tenfold, and we could use horizontal scaling to add more, smaller pods to a service. The latter gave us access to a broader range of AWS spot instances, so we could further drive down our compute costs—while delivering better performance to our users.

In that same vein, we realized an unexpected side benefit. Kubernetes’ load balancing naively selects endpoints in a round-robin fashion, essentially rotating through a list to distribute load between them. This arbitrary process means that a connection could go to any pod in the cluster regardless of latency, saturation, or proximity to the calling service.

With Linkerd, each proxy looks at its potential endpoints and selects the “best” target for its traffic based on an exponentially weighted moving average, or EWMA, of latency. When we introduced Linkerd to our clusters, we began to see faster response times and lower cross-AZ traffic costs. It turns out that Linkerd’s built-in EWMA routing algorithm would automatically keep more traffic inside an AZ and, in so doing, cut bandwidth costs to the tune of thousands of dollars a day. All without our platform team needing to configure anything!
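To illustrate the idea, here is a toy Go sketch of latency-aware endpoint selection in the same spirit: it keeps an EWMA of observed latency per endpoint and picks between two random candidates. This is only a sketch of the concept, not Linkerd’s implementation, and the addresses, latencies, and smoothing factor are made up:

```go
package main

import (
	"fmt"
	"math/rand"
)

// endpoint tracks an exponentially weighted moving average (EWMA) of
// observed request latency for one backend, plus how often it was picked.
type endpoint struct {
	addr   string
	ewma   float64 // smoothed latency estimate in milliseconds (0 = no data yet)
	picked int
}

// observe folds a new latency sample into the moving average. The smoothing
// factor is an arbitrary choice for this sketch.
func (e *endpoint) observe(latencyMs float64) {
	const alpha = 0.3
	if e.ewma == 0 {
		e.ewma = latencyMs
		return
	}
	e.ewma = alpha*latencyMs + (1-alpha)*e.ewma
}

// pick samples two endpoints at random and keeps the one with the lower
// latency estimate ("power of two choices" over EWMA scores).
func pick(eps []*endpoint) *endpoint {
	a := eps[rand.Intn(len(eps))]
	b := eps[rand.Intn(len(eps))]
	if a.ewma <= b.ewma {
		return a
	}
	return b
}

func main() {
	eps := []*endpoint{
		{addr: "10.0.1.12:50051"}, // pretend this pod is in the caller's AZ
		{addr: "10.0.2.34:50051"}, // pretend this pod is in another AZ
	}

	// Simulate traffic: the same-AZ pod answers in ~2ms, the cross-AZ pod in
	// ~8ms, so the cross-AZ pod's EWMA stays higher and it is chosen less often.
	for i := 0; i < 10000; i++ {
		target := pick(eps)
		target.picked++
		latency := 2.0 + rand.Float64()
		if target == eps[1] {
			latency = 8.0 + rand.Float64()
		}
		target.observe(latency)
	}

	for _, e := range eps {
		fmt.Printf("%s picked=%d ewma=%.1fms\n", e.addr, e.picked, e.ewma)
	}
}
```

Because same-AZ endpoints tend to report lower latency, a latency-weighted balancer like this naturally steers most traffic toward them, which is where the cross-AZ bandwidth savings came from.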

The difference was night and day. Our applications began spreading the load evenly between instances and response times went down. We could see it right away in our monitoring. The bandwidth decreased and the CPU just leveled out everywhere—an improvement across the board. We were so excited, we were literally jumping around. With Linkerd, we went from a fraction of our servers going flat out at their line speed (20 Gbit/s) to an even balance, with no server sustaining more than 9 Gbit/s. Linkerd really made a difference.

Our journey to a service mesh approach

To help address these issues we started looking into a service mesh approach.

Like many organizations, we considered Istio. But our research led to the conclusion that we would need a team of developers just to run it. It was too complicated, requiring ongoing, active attention—it’s not fire-and-forget. While Istio’s numerous features are nice to have, when you’re working with an application day-to-day, you don’t have time to tweak and tune a mesh. It’s almost too customizable, and for us, that meant too much overhead.

We looked at other solutions and ended up with a shortlist of half a dozen different options, but the one that stood out was Linkerd.

Why Linkerd? Turnkey, easy, and it just works

Linkerd was our choice for several reasons.

First, we needed a mesh that was Kubernetes-native and would work with our current architecture without the need to introduce large numbers of new custom resource definitions or force a restructuring of our applications or environments.

Second, as an engineering team, we didn’t have the time or bandwidth to learn and run a complicated system. Linkerd was ideal because it is easy to get up and running and requires little overhead. It took us five to six hours to install, configure, and migrate 300 services to the mesh. It’s just the go-get command, then the install process, and the job is done! It’s that simple. We installed it and have rarely had to touch it since—it just works.

We also found the Linkerd Slack community and Linkerd’s docs to be extremely helpful as we moved into production—a process that happened overnight.

Linkerd day-to-day—runs like a utility service

Within a week of trying Linkerd, we were able to take it into large-scale and highly performant Kubernetes clusters. We now run 3,000 pods in a single namespace in Kubernetes—a massive deployment that usually requires a team to manage. Yet, only one person manages our entire infrastructure. 

We’ve adopted Linkerd without needing to become experts in service meshes or any particular proxy. In fact, Linkerd just sits in the background and does its job. It’s like the electricity or water supply: when it’s working well, we don’t really think about it and just enjoy the benefits.

As well as being easy, Linkerd has met all our needs. It has solved our gRPC load balancing problems and augmented standard Kubernetes constructs in a way that allowed us to move forward without reconfiguring our applications. It solved our problems, including some we didn’t even realize we had.

Now we never stop taking bets (and we can sleep again)

Everything that our team does revolves around availability, reliability, cost reduction or control, and scalability. These are the things that matter and by which everything is measured. Every time Entain stops taking bets because of a service or application failure, there is a direct financial impact on the business. We have to make sure that we’re available, reliable, and rock-solid, and Linkerd is a part of our strategy for achieving that.

Because of Linkerd, we’ve been able to easily increase the volume of requests by over 10x, reduce operating costs, and better hit our availability targets—while continuing to iterate on our business differentiators.

And from a personal standpoint, Linkerd has helped us sleep again. Not being woken up in the middle of the night is perhaps the most valuable feature of the service mesh.