mTLSing services with Linkerd at scale without impacting developer productivity
Guest post by Christian Hüning, Director Cloud Technologies & Switchkit at Finleap Connect
At Finleap Connect, we operate multiple high-density bare-metal Kubernetes clusters with up to 5,000 pods. Keeping our customers’ highly sensitive financial data safe is business-critical, so security is paramount.
In this blog post, I’ll share how Connect’s cloud team migrated its entire platform to a cloud native architecture while mTLSing all services to comply with strict regulatory requirements. It was a huge undertaking, especially considering our tight deadline of five months! To our surprise, the mTLS aspect was fairly easy. Linkerd was installed in an hour and running in production within a week without impacting our developer team.
Enabling next-gen financial services
I head the cloud team at Finleap Connect, the leading independent European open banking platform. Our full-stack platform enables organizations across banking, accounting, and lending to provide next-gen, mobile-first financial services to their customers. Services include data and analytics enrichment, default financial data accessibility, seamless payments across a range of applications, and much more.
At Connect, we understand how customers transact and interact, and that know-how is embedded into our platform. Additionally, the platform allows our clients to compliantly access their customers’ financial transactions and enrich that data with analytics tools, all while providing high-quality digital banking products and services.
The engineering team
Our team includes a handful of cloud engineers and around 60 engineers spread across multiple smaller teams. My team, the cloud team, is responsible for 50+ microservices spread across ten Kubernetes clusters, distributed in three geographic regions across GCP, AWS, and a bare-metal private cloud. Our largest cluster runs 52 nodes with 5,000 pods.
To operate Connect cloud, our cloud-agnostic private setup, we use SAP Gardener. Linkerd is deployed across all clusters and represents an integral part of our infrastructure. Linkerd, including its metrics, is centrally managed through Buoyant Cloud.
(By the way, we are hiring. If you’re interested in joining a dynamic engineering team that is constantly pushing the boundaries of what tech can do, check out our jobs board — we are always looking for great talent!)
mTLS across all services, a regulatory requirement
Connect operates in a highly regulated environment and, as such, important considerations have to be taken into account. After all, we are dealing with highly sensitive financial data. In 2018, that meant implementing mTLS across all services in our clusters, independent of the actual business code (i.e., solving it at a different layer).
To address that challenge, we evaluated a variety of available solutions. One of the options was Istio. We installed it on our test cluster and, while it worked fine, it also required a fair amount of configuration. When we realized we’d need a configuration for each service, Istio quickly became less feasible. This was back in 2018 (the early days). We were migrating our entire stack to a cloud native architecture on a very ambitious roadmap (the migration had to be finalized within five months). Our development teams were already dealing with a good amount of transformation tasks. We concluded that Istio would become an additional configuration burden and decided against it.
The other service mesh we looked at was Linkerd (back then known as Conduit). Its focus on simplicity, its roadmap, and the various Slack discussions we had with the project maintainers gave us the confidence we needed to move forward with this project.
Linkerd: installed within an hour, in prod within a week
The Linkerd installation took less than an hour; the overall production setup took about a week. Some updates, including contributing to Linkerd’s certificate management feature, took a few additional weeks (more on that in a minute).
Overall, the entire experience with Linkerd has been great. While we ran into a bit of trouble when we started running Linkerd at scale, it mostly boiled down to scaling the Linkerd components up and out.
There were a few things that were on the Linkerd roadmap but not yet implemented. Certificate management was one of them. Certificates expire after a year, so we had one year to address that issue. We decided to contribute to the Linkerd project to help develop that feature. Today, certificate rotation is fully automated and nothing we need to worry about. Support for applications using server-speaks-first protocols was another example, and that has been available since Linkerd 2.10.
End-to-end encryption with minimal impact on developer productivity
We were able to implement mTLS across all services at scale while minimizing the impact on developer productivity. The entire process was fairly quick and allowed us to meet our initial critical deadline to go live with our new platform. Without Linkerd, we wouldn’t have been able to achieve that.
Additionally, Linkerd’s four golden signal metrics are particularly useful for uniform and generic platform-level debugging and service health observability. These metrics provide us with immediate insights when migrating our workloads to Kubernetes. The cloud team gets all these insights without having to dig too deep into application specifics. That is a big time-saver for the team and a great way to get started with cloud native application management for new developments.
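To make the idea of application-agnostic health metrics concrete, here is a minimal sketch of how success rate, request rate, and latency percentiles could be derived from a window of observed requests. It is an illustration only and not how Linkerd computes its metrics; the `Request` record, the `golden_metrics` function, and the nearest-rank percentile method are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one observed request (not a Linkerd type)."""
    latency_ms: float
    success: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_metrics(requests, window_seconds):
    """Summarize a window of requests without knowing anything about the app."""
    if not requests:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_ms for r in requests)
    successes = sum(1 for r in requests if r.success)
    return {
        "success_rate": successes / len(requests),
        "requests_per_second": len(requests) / window_seconds,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
    }


if __name__ == "__main__":
    window = [Request(10, True), Request(20, True),
              Request(30, False), Request(40, True)]
    print(golden_metrics(window, window_seconds=2))
```

Because the inputs are only latency and success per request, the same summary applies to every service, which is what makes this kind of metric useful for platform-level debugging.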
We have also implemented canary deployments through Linkerd and Flagger and can now deliver features faster and with a lot more confidence.
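As a rough illustration of the idea behind metric-driven canary deployments, the sketch below simulates a progressive rollout: traffic shifts to the new version in steps while its success rate stays healthy, and the rollout is rolled back after repeated failed checks. This is a simplified model, not Flagger's implementation or our configuration; the function `run_canary` and all of its parameters and default values are hypothetical.

```python
def run_canary(success_rates, step_weight=10, max_weight=50,
               min_success_rate=0.99, failure_threshold=2):
    """Simulate a progressive rollout from a sequence of metric checks.

    success_rates: observed canary success rate at each check interval.
    Returns a (status, canary_traffic_percent) tuple.
    All parameter names and defaults are illustrative assumptions.
    """
    weight = 0
    failures = 0
    for rate in success_rates:
        if rate < min_success_rate:
            # Failed check: hold the current weight and count the failure.
            failures += 1
            if failures >= failure_threshold:
                return ("rolled_back", 0)
            continue
        # Healthy check: shift another slice of traffic to the canary.
        weight += step_weight
        if weight >= max_weight:
            return ("promoted", 100)
    return ("in_progress", weight)


if __name__ == "__main__":
    print(run_canary([1.0, 1.0, 1.0, 1.0, 1.0]))
    print(run_canary([1.0, 0.90, 0.90]))
```

The point of the sketch is that the promotion decision depends only on mesh-level metrics such as success rate, so no application-specific instrumentation is needed.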
All this was almost automatically enabled by deploying and activating Linkerd across our applications. Linkerd helped us avoid more complex TLS setups for certain services, saving my team lots of backlog time. This is all pretty neat and one of the reasons I’ve been so outspoken about this project.
The Linkerd community
The Linkerd community is the best! Everyone is incredibly welcoming. The Slack channel is a great way to get valuable input and collaborate with others. You can literally find solutions to any kind of problem. In fact, you can find me there regularly. I’ve been active in the community and, because I enjoy jumping on any opportunity to help educate others, I was invited to become a Linkerd Ambassador along with some other fantastic Linkerd end users.
At Connect, we are heavily invested in CNCF projects and open source
Over the years, we have become heavily invested in CNCF open source solutions and use them across all our stacks. We use Emissary-Ingress (formerly Ambassador), Flagger for canary deployments, cert-manager for all certificate handling, Prometheus and Grafana for platform monitoring, NATS, and Rook and Ceph for storage in our private cloud clusters.
Other non-CNCF open source projects include CockroachDB for all SQL databases, NGINX Ingress, HashiCorp Vault, and RabbitMQ.
Zero-trust doesn’t have to be hard
For companies like us, operating in the fintech industry, zero-trust is a requirement. But zero-trust is increasingly becoming a must-have across industries. In a microservices world, a firewall won’t do the trick anymore. I hear from a lot of peers who are concerned about the complexity they might be adding to an already complex system. But zero-trust doesn’t have to be hard. I hope this blog post serves as proof.
We were able to mTLS all services within five months while minimizing the impact on developer productivity. Additionally, the platform metrics have proven to be a real time-saver when debugging and keeping a pulse on service health.
We chose Linkerd because the configuration is minimal while providing key features almost automatically. The features that were not yet available at the time, we helped build. Today, there is no reason not to mTLS all your services.