Guest post by Nick Calibey, Senior Cloud Engineer, Timescale

When we launched Timescale Cloud in 2020, our team supported a single cloud in a single region. As we grew, it became clear that we wouldn’t be able to support customers across the globe with that setup. Something needed to change, or we would lose business. To avoid that, we migrated the platform to a multi-cloud setup. In this blog post, I will share how we did that, how it unlocked our revenue potential, and how Linkerd has improved the process going forward.

The Engineering Team and Platform

With over 160 employees across 25 countries, Timescale has a remote, geographically distributed workforce. The cloud engineering team consists of over 20 developers, seven of whom work on the platform infrastructure. I am part of that platform group, responsible for maintaining and improving the infrastructure that Timescale Cloud runs on.

Our team’s primary focus is Timescale Cloud, a managed high-performance PostgreSQL database for mission-critical time-series and analytics applications. We built it to provide the performance, scalability, and reliability required to store relentless streams of time-series data for complex data analysis.

As a cloud service provider, we rely heavily on Cloud Native Computing Foundation (CNCF) projects. All Timescale Cloud microservices (and databases) run in Kubernetes and are deployed using Helm, and a large share of our services communicate via gRPC. We monitor our services with Prometheus and collect traces with OpenTelemetry and Jaeger. Timescale Cloud would not exist without the tireless efforts of the CNCF community!

Over the past two years, our community has grown 7x, with tens of thousands of organizations using TimescaleDB. We currently run hundreds of microservices and databases in Kubernetes across five geographic regions: Virginia (USA), Oregon (USA), Dublin (Ireland), Frankfurt (Germany), and Sydney (Australia). While that is exciting, it also comes with challenges.

Going Multi-Cluster

Initially, Timescale Cloud only supported the us-east-1 region. That quickly became a problem when we acquired clients running workloads in different US regions and countries. We had to migrate to a multi-cluster setup or lose business opportunities. 

When researching the best multi-cluster approach, it became clear that a service mesh would be a key part of that strategy. Abstracting away the east-west communication between our services would allow our engineers to focus on business logic rather than worry about communication patterns. We had originally tried container network interfaces (CNIs) such as Cilium (which we’re still big fans of), but they required too many changes to our infrastructure to be a feasible lift. Given our time constraints, we ended up implementing a “homegrown” solution that allowed our clusters to talk to each other, but little more than that. Not long after, we officially launched our next two regions, us-west-2 (Oregon, USA) and eu-west-1 (Dublin, Ireland).

Our initial mesh used a “hub-and-spoke” topology. In this setup, us-east-1 acted as the “hub” that coordinated user requests to the other “spoke” regions. For example, if a user requested to create a database in eu-west-1, the request would first hit our API Gateway in us-east-1, where it would call other services to handle validation and other business concerns. The request would then be routed to eu-west-1, where another service would spin up the instance. We continue to use this topology because it is easy to reason about, and since only a subset of services has to live in each cluster, it’s rather easy to set up new regions. To implement this, we used Terraform to allocate load balancers for each cluster. We then manually created DNS records via external-dns to direct traffic to the correct cluster.
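To make that routing concrete, here is a rough sketch of how a spoke-region service could be exposed to the hub in a setup like ours. The service name, namespace, hostname, and port are all hypothetical; the annotation shown is the standard external-dns hostname annotation rather than our exact configuration:

```yaml
# Illustrative sketch only; names, hostname, and port are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: db-provisioner
  namespace: platform
  annotations:
    # external-dns watches this annotation and creates the matching DNS record,
    # which the hub cluster (us-east-1) then uses to reach this spoke region.
    external-dns.alpha.kubernetes.io/hostname: db-provisioner.eu-west-1.ts.internal
spec:
  type: LoadBalancer
  selector:
    app: db-provisioner
  ports:
    - name: grpc
      port: 8080
      targetPort: 8080
```

Every cross-cluster hop needed a record like this, which is exactly the manual bookkeeping that Linkerd’s service mirroring later removed for us.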

Enter Linkerd

Although our homegrown mesh worked, we knew it was a temporary solution. Ultimately, we wanted to take advantage of modern service mesh features. One of the potential solutions we had discovered earlier, but lacked time to investigate, was Linkerd. It seemed to be a less intrusive solution than a service mesh like Istio and didn’t require the infrastructure changes needed to get Cilium to work. Also, the Linkerd proxy is written in Rust — a big plus, as we’re all big fans of that programming language here at Timescale!

Within a day, we had Linkerd running in our dev cluster, and within the week we had a proof of concept showing it was a feasible solution. It was easy and did the job, so we migrated to Linkerd as our service mesh.

Thanks to Linkerd’s excellent Helm charts, we were able to adapt Linkerd to work with the previously described infrastructure. By default, the Linkerd gateways allocate new load balancers (LBs) for cross-cluster communication. We didn’t want that, since we already had our network load balancers (NLBs) defined in Terraform. So instead, we changed the gateways to be NodePorts attached to our NLBs via target groups. This allowed our workloads to communicate across clusters while keeping traffic in our private network.
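As a sketch, the override looks something like the following values for the linkerd-multicluster chart. The exact value keys vary between chart versions and the node ports here are made up, so treat this as a shape rather than a drop-in configuration:

```yaml
# Sketch of a linkerd-multicluster values override; key names and ports are
# assumptions. Check the chart's values.yaml for your Linkerd version.
gateway:
  # Expose the gateway via NodePort instead of provisioning a new LoadBalancer,
  # so an existing Terraform-managed NLB target group can forward to the nodes.
  serviceType: NodePort
  nodePort: 30143       # hypothetical node port wired into an NLB listener
  probe:
    nodePort: 30191     # hypothetical node port for the gateway readiness probe
```

The NLBs themselves stay in Terraform; the only Linkerd-side change is that the gateway Service no longer asks the cloud provider for a load balancer of its own.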

We also wanted to ensure that all the Linkerd control-plane and multi-cluster pods lived on our system EC2 nodes so they wouldn’t interfere with customer workloads. While most of this could be done with nodeSelectors and other mechanisms, we used Kustomize to ensure that the components created for each Link object were scheduled on the appropriate instances. Most of this setup was quite easy, except for figuring out the cross-cluster trust-anchor rotation (for which Linkerd’s tutorial was a huge help).
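For the pieces that `linkerd multicluster link` generates, a Kustomize patch along these lines does the pinning. The deployment name, namespace, and node label are illustrative and depend on how the link and node groups are named:

```yaml
# kustomization.yaml (sketch); names and labels are illustrative.
resources:
  - link.yaml   # output of `linkerd multicluster link --cluster-name us-west-2 ...`
patches:
  - target:
      kind: Deployment
      name: linkerd-service-mirror-us-west-2
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: linkerd-service-mirror-us-west-2
        namespace: linkerd-multicluster
      spec:
        template:
          spec:
            nodeSelector:
              node-group: system   # assumed label on the system EC2 nodes
```

The control-plane pods themselves can be pinned through the nodeSelector values that Linkerd’s Helm charts expose, which is the “other mechanisms” part mentioned above.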

Looking Forward

We recently ported all of our microservices to Linkerd and have had no issues. Enabling communication is now as easy as mirroring a service across the mesh instead of manually allocating DNS records via external-dns. Whereas before we had clunky fully qualified domain names to worry about when connecting services, we now point a service at something as simple as foo-service-us-east-1. It also means we no longer have to manage our homegrown solution!
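For context, exporting a service to the mesh is just a label, and the mirrored copy shows up in the other cluster under the original name suffixed with the link name. The service below is hypothetical:

```yaml
# Sketch: a Service in us-east-1 exported for mirroring; names are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: foo-service
  namespace: platform
  labels:
    # Default export selector used by `linkerd multicluster link`; services
    # carrying this label are mirrored into linked clusters.
    mirror.linkerd.io/exported: "true"
spec:
  selector:
    app: foo-service
  ports:
    - name: grpc
      port: 8080
```

In a linked cluster, the mirror appears as foo-service-us-east-1, and traffic sent to it flows through the Linkerd gateways described earlier.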

But DNS simplification isn’t the only benefit we’ve gotten out of Linkerd. Thanks to linkerd-viz, we now have insights into our cross-cluster traffic that we previously lacked, and the ease of mirroring services across the mesh has played a key role in centralizing all of our various observability pieces across our clusters. Linkerd’s mTLS support has also helped us secure our clusters, and since we were already using cert-manager (another CNCF project!), we’re able to automatically manage all of Linkerd’s certificates.
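The cert-manager integration follows the pattern from Linkerd’s documentation: the trust anchor lives in a secret, and cert-manager keeps the identity issuer certificate rotated from it. A trimmed sketch, with durations and names that may differ from what we run in production:

```yaml
# Sketch based on Linkerd's cert-manager docs; durations and secret names
# are illustrative, not our production values.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: linkerd-trust-anchor
  namespace: linkerd
spec:
  ca:
    secretName: linkerd-trust-anchor   # the shared trust anchor
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  isCA: true
  commonName: identity.linkerd.cluster.local
  dnsNames:
    - identity.linkerd.cluster.local
  duration: 48h        # short-lived issuer cert, rotated automatically
  renewBefore: 25h
  privateKey:
    algorithm: ECDSA
  issuerRef:
    name: linkerd-trust-anchor
    kind: Issuer
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
```

Because the clusters share a trust anchor, this is also where the cross-cluster trust-anchor rotation mentioned earlier comes into play.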

By adding Linkerd to our infrastructure stack, we’ve gained a lot of benefits, with more to come. Typically, when one adds a new piece of infrastructure, the maintenance costs can weigh against (or outweigh) the benefits provided. But with Linkerd, we need to do less maintenance than before because of how easily it solves our problems. It’s a win-win! As such, Linkerd has become a pivotal part of Timescale Cloud’s infrastructure, and we’re quite happy about that.