Guest post by Abereham Wodajie and Chris Voss, Software Development Engineers at Xbox Cloud Gaming, Microsoft

Xbox Cloud Gaming is Microsoft’s game streaming service, with a catalog of hundreds of games available in 26 markets around the world. So far, more than 10 million people have streamed games through Xbox Cloud Gaming.

In this blog post and upcoming KubeCon + CloudNativeCon talk, we will share how we leveraged service meshes and their ability to secure in-transit communication to scale our services to millions of players around the world, while reducing costs and saving our team significant engineering hours.  

The Xbox Cloud Gaming infrastructure 

The services that power Xbox Cloud Gaming are massive. We have 26+ Kubernetes clusters across several Azure regions, each running 50+ microservices and 700 to 1,000 pods, for a total of 22k pods secured with Linkerd. If you have ever played a game on xbox.com/play (like Fortnite, now available to stream in your browser for free with Xbox Cloud Gaming), you were using Linkerd.

This scale did not come overnight; it is the result of years of hard work incrementally maturing our infrastructure. One early improvement we pursued was progressive deployments as a way to release our services more reliably. The subsequent investigation into how to implement canary releases began our journey with service meshes.

Evaluating a Service Mesh 

Our team first came across the concept of a service mesh at KubeCon + CloudNativeCon NA (North America) 2018. Shortly after, we evaluated the state of the various service mesh projects, as well as the maturity of our own infrastructure, and quickly realized that these projects and concepts were still in the early stages of development and adoption. We believed a service mesh in Kubernetes was a step in the right direction, but with all the moving pieces and extra setup required, the team decided that adopting one would not yet be a straightforward process.

We then shifted our focus to getting familiar with the problems service meshes were intended to solve and how they could give us better control and visibility over the communication inside our clusters. The goal was to be more knowledgeable, and in a better position, when the time came to reevaluate service meshes and decide which one met our requirements and use cases.

In 2019, Xbox Cloud Gaming launched its first public preview, and our team set out to continue improving the reliability of the infrastructure powering these new gaming experiences. We knew we wanted a solution that could facilitate progressive deployments and began investigating options. Our team attended KubeCon + CloudNativeCon NA 2019 in San Diego, where we learned more about the roadmap for service meshes and the definition of the new Service Mesh Interface, which gave all the projects a clear direction on the scope and feature set expected of them. We also got insight into how a service mesh could facilitate traffic splitting for progressive deployments, along with other features like mutual TLS (mTLS) and service-level metrics.

Adopting Linkerd 

Fast forward to 2020: by this time there were multiple service mesh solutions available, and our selection process was driven by requirements. Specifically, we were looking for a solution that was compliant with the Service Mesh Interface, efficient in resource utilization (CPU is particularly important for large deployment stamps), well documented, and well known within the open source community. The team’s approach was to prototype and evaluate the most well-known solutions available; our list of candidates included Istio, Linkerd, and Consul Connect, among others. In the end, the team decided to adopt Linkerd because its feature set most closely matched our requirements, and it let us focus our attention on just the subset of features the team wanted to leverage.

Easy mTLS that just works 

We had built our own in-house mutual TLS (mTLS) solution to ensure that traffic between services was secured even when it never left the cluster. But maintaining it was increasingly time-consuming, and the team realized we were trying to solve problems that had already been solved, with existing solutions that were not only more robust but also more broadly adopted, which meant better support. Adopting Linkerd allowed us to save valuable engineering time, improve reliability, and focus the team’s efforts on more features for our products.

With 50+ microservices and a deployment infrastructure that had to be redesigned and restructured, preparing to make the switch took some time, but it was worth the effort. Once the services were ready, installing and configuring Linkerd was incredibly simple, thanks to the extensive documentation available. We used Kustomize to configure some of the Linkerd components, including Prometheus, the Linkerd controller, and Linkerd destination. Setting up proxy auto-injection was also simple, and the mesh can easily be enabled or disabled for specific services whenever needed, as sketched below.
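
As an illustration of how lightweight that opt-in is, here is a minimal sketch of the standard `linkerd.io/inject` annotation that Linkerd’s proxy auto-injector looks for on a workload’s pod template. The Deployment name and image are hypothetical; only the annotation itself is Linkerd’s.

```yaml
# Hypothetical Deployment snippet: the linkerd.io/inject annotation on the
# pod template tells Linkerd's admission webhook to add the sidecar proxy.
# Setting it to "disabled" (or removing it) keeps the workload off the mesh.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service              # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        linkerd.io/inject: enabled   # flip to "disabled" to opt out
    spec:
      containers:
        - name: example-service
          image: example.azurecr.io/example-service:1.0.0  # hypothetical image
          ports:
            - containerPort: 8080
```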

More CNCF projects: Flagger, Prometheus, Fluentd, and Grafana

Besides Linkerd, we use other CNCF projects like Flagger, Prometheus, and Grafana. Currently, we have two Prometheus instances: one dedicated to Linkerd and another for our own custom metrics, which we eventually intend to consolidate into a single instance to simplify management. We use Grafana for on-demand or live troubleshooting and, of course, Flagger for canary deployments, which was the initial goal of the project.
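
As a rough sketch of what that consolidation could look like, Linkerd supports federating its metrics into another Prometheus via the standard `/federate` endpoint. The job names and the in-cluster address below assume a stock Linkerd installation and are illustrative only.

```yaml
# Hypothetical scrape config for the "custom metrics" Prometheus: pull the
# Linkerd proxy and control-plane series from the Linkerd-dedicated instance
# so that a single Prometheus can eventually serve both sets of metrics.
scrape_configs:
  - job_name: linkerd-federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="linkerd-proxy"}'
        - '{job="linkerd-controller"}'
    static_configs:
      - targets:
          - linkerd-prometheus.linkerd.svc.cluster.local:9090  # assumed address
```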

Both Linkerd and Flagger use the Service Mesh Interface (SMI) API, which enables advanced progressive deployment behavior within Kubernetes. By using Linkerd’s traffic splitting together with Flagger, the team has been able to automate canary deployments to our AKS (Azure Kubernetes Service) clusters. During this process, Flagger manages traffic routing between the primary and canary deployments, evaluates the success rate, and slowly migrates traffic once all signals look healthy, significantly reducing the risk of downtime or customer impact. This allows our team to confidently test and roll out new features faster and more frequently.
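
To make those mechanics concrete, below is a minimal sketch of a Flagger `Canary` resource targeting a Linkerd-meshed Deployment. The service name, port, and thresholds are hypothetical; `request-success-rate` and `request-duration` are Flagger’s built-in metrics for the Linkerd provider, and Flagger maintains the underlying SMI `TrafficSplit` automatically.

```yaml
# Hypothetical Canary: Flagger clones the target Deployment into a -primary
# copy and, on each new rollout, shifts traffic to the canary in stepWeight
# increments via an SMI TrafficSplit, rolling back automatically if the
# Linkerd-derived metrics breach their thresholds.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: example-service
spec:
  provider: linkerd
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service
  service:
    port: 8080
  analysis:
    interval: 1m          # how often metrics are evaluated
    threshold: 5          # failed checks tolerated before rollback
    stepWeight: 10        # percentage of traffic shifted per step
    maxWeight: 50         # cap before full promotion
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # minimum success rate, in percent
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500        # latency ceiling, in milliseconds
        interval: 1m
```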

Saving thousands of dollars per month on monitoring and more 

With Linkerd, the technological gains have had a significant impact on the team’s success. We were able to secure traffic between 22k pods with zero-config mTLS. Considering the time and effort we had invested in developing and maintaining our in-house mTLS solution, Linkerd offloading that work saves the team valuable engineering hours, which translates into more time to focus on feature work and the quality of our service. Additionally, because mTLS now happens at the proxy level, instead of some requests calling out to a dedicated mTLS service, we reduced latency by 100 ms on average and saved thousands of dollars per month on pods and container monitoring.

In addition to service resource utilization metrics (CPU and memory), Linkerd’s telemetry and monitoring gave us out-of-the-box visibility into areas like request volume, success/failure rates, and latency distributions, per API and per service. We further leveraged Linkerd metrics to build our own dashboards, replacing our in-house solution. Today, we push Linkerd-generated metrics from Prometheus to our own stores. It took some time to set up, but once done, it is essentially free.
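
One common way to implement that push, offered here as a sketch of the general setup rather than an exact copy of ours, is Prometheus `remote_write` with a relabel rule that forwards only the Linkerd proxy series worth keeping long term. The endpoint URL is hypothetical; `request_total`, `response_total`, and `response_latency_ms_bucket` are real Linkerd proxy metric names.

```yaml
# Hypothetical remote_write fragment: forward a curated subset of Linkerd
# proxy metrics to an external long-term store instead of keeping them only
# in the short-retention in-cluster Prometheus.
remote_write:
  - url: https://metrics-store.example.com/api/v1/write  # hypothetical endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: request_total|response_total|response_latency_ms_bucket
        action: keep    # drop everything else to control storage cost
```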

We started our journey into service meshes a few years ago, and thanks to our engagements at KubeCon + CloudNativeCon and the work done by the team, we are glad to have found a great solution that enables us to continue creating amazing experiences. If you are attending KubeCon + CloudNativeCon Europe 2022 in Valencia, join our talk (virtually or in person). We will cover our service mesh journey in more detail and look forward to chatting, sharing experiences, and answering any questions.