Entain

How a Global Sports Betting and Gaming Company Realized 10x Throughput While Driving Down Costs – Powered by Linkerd

Challenge

Entain is one of the world’s leading sports betting and gaming companies. Hosted on Kubernetes, Entain’s Australian sports trading platform consists of approximately 300 microservices with more than 3,000 pods per cluster in multiple geographic regions. As users bet on live events, the platform handles thousands of requests per second. Outages impact user experience, revenue, and pricing accuracy. Even a small latency hit can have enormous consequences.

Solution

Entain Australia’s Trading Solutions team faced multiple performance challenges, including problems with gRPC load balancing and a lack of intelligent routing. All of these had the potential to disrupt the customer experience, create financial risk, and cause a few too many sleepless nights, so the team began to explore a service mesh approach. They considered various options but settled on Linkerd, the lightweight, ultra-fast CNCF service mesh. “Although our developers were quick to seize on the efficiency and reliability gains provided by microservices and gRPC, the default load balancing in Kubernetes didn’t give us the best performance out-of-the-box,” said Steve Gray, Entain Australia’s Head of Trading Solutions.

Impact

Quick to install and easy to understand, Linkerd was up and running in a matter of hours, and the benefits came quickly. Within several hours, the team had reduced maximum server load by more than 50% and increased the request volume the platform could support tenfold, while getting many more nights of uninterrupted sleep. “We have to make sure that we’re available, reliable, and rock-solid, and Linkerd is a part of our strategy for achieving that,” said Gray.

Challenges:

Latency

Industry:

Betting

Location:

Australia

Published:

July 15, 2021

By the numbers

300

Microservices Hosted

Realized 10x throughput

While driving down costs

5-6 hours

To install, configure and migrate 300 services to the mesh

For Entain Australia, speed is everything. Latency literally costs money.

Entain, a leading global sports betting and gaming company, operates iconic brands, including Ladbrokes, Coral, BetMGM, bwin, Sportingbet, Eurobet, partypoker, partycasino, Gala, and Foxy Bingo. The company is listed on the FTSE 100, has licenses in more than 20 countries, and employs a global workforce of more than 24,000 people.

When Lewis Hamilton scores points for his F1 team or Messi takes a critical corner kick, Entain’s pricing management systems must process the data within milliseconds.

To tackle this immense challenge at Entain Australia, the company’s Trading Solutions team built a platform on modern, cloud native technologies including Kubernetes, gRPC, and containers. These tools allowed the company to build a high-performance, reliable, and scalable system—but it wasn’t perfect.

At first, the interaction between Kubernetes and gRPC around load balancing left some parts of the system running “hot” while other parts ran “cold”. This introduced latency, which created financial risk, and it also prompted out-of-hours calls from internal users of the system reporting inconsistent performance.

Entain Australia’s Trading Solutions team is responsible for:

Handling the immense amount of data coming into the business
Managing the company’s proprietary online trading platform (hosted on Kubernetes and consisting of 300 microservices and 3,000 pods in multiple geographic regions)
Making the results of the processing available to the rest of the business as swiftly as possible

The platform processes thousands of requests per second while users bet on live sports events. Any outage has consequences on pricing accuracy, revenue, and ultimately the user experience. Even a small latency hit can have enormous consequences. Entain relies on sports prices based on the probability of an outcome to generate revenue, so the data has to be as close to real-time as possible.

A critical challenge for the Trading Solutions team was that its 24×7 environment is constantly changing. The team uses a variety of practices to keep the data flowing while keeping an eye on efficiency and cost. This means using AWS spot instances to keep costs low, combined with chaos engineering tools and practices to ensure a resilient platform and applications that survive software and hardware failures gracefully.

“With nodes coming and going, we needed a way to ensure reliable application performance, despite all the changes under the hood.”
STEVE GRAY, HEAD OF TRADING SOLUTIONS, ENTAIN AUSTRALIA

“Although our developers were quick to seize on the efficiency and reliability gains provided by microservices and gRPC, the default load balancing and service discovery in Kubernetes didn’t give us the best performance out-of-the-box and left us in a position where all requests from one pod in service would end up going to a single pod on another service,” explained Gray.

While this approach worked, it had two negative effects on the platform. First, servers had to be exceptionally large to accommodate substantial traffic volumes. Second, the team couldn’t make efficient use of horizontal scaling. This prevented Entain Australia from taking advantage of the available spot instances or processing the high volume of requests in a timely manner.
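
The effect Gray describes can be illustrated with a small simulation. This is a minimal sketch, not Entain’s code: it assumes each client pod opens one long-lived connection (as gRPC over HTTP/2 typically does) and that a backend is chosen only once, when that connection is established. The pod counts and function names are hypothetical.

```python
import random
from collections import Counter


def connection_level(clients, backends, requests_per_client, seed=0):
    """Each client picks ONE backend when it connects and reuses that connection."""
    rng = random.Random(seed)
    load = Counter({b: 0 for b in backends})
    for _ in range(clients):
        pinned = rng.choice(backends)  # chosen once, at connect time
        load[pinned] += requests_per_client
    return load


def request_level(clients, backends, requests_per_client):
    """A proxy picks a backend for every request (plain rotation, for simplicity)."""
    load = Counter({b: 0 for b in backends})
    for i in range(clients * requests_per_client):
        load[backends[i % len(backends)]] += 1
    return load


# Hypothetical sizes: 4 calling pods, 10 destination pods, 1,000 requests each.
backends = [f"pod-{n}" for n in range(10)]
pinned = connection_level(clients=4, backends=backends, requests_per_client=1000)
spread = request_level(clients=4, backends=backends, requests_per_client=1000)

print("busiest pod, connection-level:", max(pinned.values()))
print("busiest pod, request-level:   ", max(spread.values()))
print("idle pods, connection-level:  ", sum(1 for v in pinned.values() if v == 0))
```

With four callers and ten destination pods, connection-level balancing can never use more than four pods, so at least six sit idle while the busy ones each absorb 1,000 or more requests. Request-level balancing gives every pod 400 requests. That is why the pinned setup needs very large servers and gains little from adding more pods.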

Another issue was a lack of intelligent routing. To hit its ambitious availability targets, the company spans multiple clusters across various AWS availability zones (AZ), making sure no AZ can become a single point of failure for the platform.

“For more casual Kubernetes users, this isn’t a problem. But at Entain’s scale and request volume, cross-AZ traffic began to represent a tangible source of both latency and cost. Transactions that needlessly crossed an AZ boundary slow down performance of the platform and incur additional charges in the cloud.”
STEVE GRAY, HEAD OF TRADING SOLUTIONS, ENTAIN AUSTRALIA

To help address these issues, Gray’s team began looking into a service mesh approach. They considered a variety of options but settled on the Linkerd service mesh.

Linkerd stood out for several reasons.

First, Gray’s team needed a mesh that was Kubernetes-native and would work with their current architecture without introducing large numbers of new concepts, such as custom resource definitions, and, importantly, without forcing a restructuring of applications or environments.

Second, the team lacked time to learn and run a complex system. “Linkerd was ideal because it is easy to get up and running and requires little overhead,” said Gray.

“It took us five to six hours to install and configure Linkerd and migrate 300 services to the mesh. It’s just the go-get-command and then the install process, and job done! It’s that simple. We installed it and have rarely had to touch it since—it just works.”
STEVE GRAY, HEAD OF TRADING SOLUTIONS, ENTAIN AUSTRALIA

Gray’s team also found the Linkerd Slack community and Linkerd’s docs to be extremely helpful as they moved into production—a process that happened literally overnight.

With Linkerd, the technology and business gains came quickly. Once Linkerd finished rolling out to all the Kubernetes pods, the Linkerd proxy took over routing requests between instances. This allowed the Entain Australia trading platform to immediately route traffic away from pods that were failing or being spun down.

The team also realized immediate improvements in load balancing. Linkerd’s gRPC-aware load balancing fixed the issues with gRPC load balancing on Kubernetes, and started properly balancing requests across all destination pods.

This allowed the team to achieve two significant business gains. “We increased the volume of requests the platform could handle by over tenfold and used horizontal scaling to add more smaller pods to a service,” explained Gray. “The latter gave us access to a broader range of AWS spot instances so that we could further drive down our compute costs—while delivering better performance to their users.”

In that same vein, Gray realized an unexpected side benefit. Kubernetes’ native load balancing selects endpoints in a round-robin fashion, essentially rotating through a list to distribute the load between them. Because this selection takes no account of endpoint state, requests could go to any node in a cluster regardless of its latency, saturation, or proximity to the calling service.

With Linkerd, each micro-proxy looks at the available endpoints and selects the “best” target for its traffic based on an exponentially weighted moving average (EWMA) of latency. When the team introduced Linkerd to its clusters, they began to see faster response times and lower cross-AZ traffic costs. The built-in EWMA routing algorithm automatically keeps more traffic inside an AZ and cuts bandwidth costs by thousands of dollars a day.
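
To make the contrast with round-robin concrete, here is a minimal sketch of latency-aware endpoint selection. It is an illustration under stated assumptions, not Linkerd’s actual implementation: the class name, the smoothing factor `alpha`, and the endpoint names and latencies are all hypothetical, and the sketch simply picks the endpoint with the lowest EWMA.

```python
import itertools


class EwmaBalancer:
    """Tracks an EWMA of observed latency per endpoint (hypothetical sketch)."""

    def __init__(self, endpoints, alpha=0.3):
        self.alpha = alpha  # weight of the newest sample (assumed value)
        self.ewma = {e: None for e in endpoints}  # None = no sample yet

    def observe(self, endpoint, latency_ms):
        """Fold a new latency sample into the endpoint's moving average."""
        prev = self.ewma[endpoint]
        if prev is None:
            self.ewma[endpoint] = latency_ms
        else:
            self.ewma[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        """Try unmeasured endpoints first, then prefer the lowest EWMA."""
        for endpoint, value in self.ewma.items():
            if value is None:
                return endpoint
        return min(self.ewma, key=self.ewma.get)


# Hypothetical latencies: one pod in the caller's AZ, two pods in other AZs.
true_latency_ms = {"same-az": 1.0, "other-az-1": 4.0, "other-az-2": 5.0}

balancer = EwmaBalancer(true_latency_ms)
ewma_hits = {e: 0 for e in true_latency_ms}
for _ in range(300):
    target = balancer.pick()
    balancer.observe(target, true_latency_ms[target])
    ewma_hits[target] += 1

# Round-robin for comparison: rotate through the list, ignoring latency.
round_robin = itertools.cycle(true_latency_ms)
rr_hits = {e: 0 for e in true_latency_ms}
for _ in range(300):
    rr_hits[next(round_robin)] += 1

print("EWMA:       ", ewma_hits)
print("round-robin:", rr_hits)
```

In this toy setup, round-robin sends 100 of the 300 requests to each endpoint, so two thirds of the traffic crosses an AZ boundary. The EWMA sketch probes each endpoint once and then sends the remaining 298 requests to the low-latency, same-AZ pod. A production balancer would also need to account for the load on each endpoint so that the fastest one is not overwhelmed; this sketch leaves that out.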

“The difference was night and day,” said Gray. “Our applications began spreading the load evenly between application instances and response times went down. We could see it right away in our monitoring. The bandwidth usage decreased, and the CPU just leveled out everywhere—an improvement across the board. We were so excited; we were literally jumping around. With Linkerd, we went from a fraction of our servers going flat out at their line speed (20 Gbit/s) to an even balance, with no server hitting above 9 Gbit/s sustained. Linkerd really made a difference.”

Within a few days of trying Linkerd, Entain was able to take it into large-scale and highly performant Kubernetes clusters. The Trading Solutions team alone runs more than 3,000 pods and hundreds of services in Kubernetes—an extensive deployment that generally requires a team to manage. But in Entain’s case, only one person manages the entire infrastructure.

“We’ve adopted Linkerd without needing to become experts in service meshes or any particular proxy. In fact, Linkerd just sits in the background and does its job. It’s like with the electricity or water supply. When it’s working well, we really don’t think about it and just enjoy the benefits of having electricity.”
STEVE GRAY, HEAD OF TRADING SOLUTIONS, ENTAIN AUSTRALIA

Linkerd was easy and met all Entain Australia’s needs. It solved the gRPC load balancing problems and augmented standard Kubernetes constructs in a way that allowed Gray and his team to move forward without reconfiguring applications. “It solved our problems, including some of those we didn’t realize we had.”

Everything that Gray’s team does revolves around availability, reliability, cost reduction or control, and scalability. These things really matter, and everything is measured against them. If Entain stops taking bets because of a service or application failure, there is a direct financial impact. “We have to make sure that we’re available, reliable, and rock-solid, and Linkerd is a part of our strategy for achieving that,” said Gray.

“Because of Linkerd, we’ve been able to easily increase the volume of requests by more than 10x, reduce operating costs, and better hit our availability targets—while continuing to iterate on our business differentiators.”

Additionally, Linkerd has helped the team sleep again. “No more being woken up in the middle of the night is perhaps the most valuable feature of the service mesh.”