Building a fault-tolerant application stack on top of a dynamic foundation

Posted on January 6, 2022 by Mark Swarbrick

Guest post by Mark Swarbrick, Head of Infrastructure at Bink

Powering digital loyalty transactions of some of the biggest banks in the UK with Linkerd

Bink, a fintech company based in the UK, has made it their mission to reimagine loyalty programmes. With a lot less fuss, rewards should be easier for everyone — banks, shops, and customers who love to shop. That’s why we built a tool that recognises loyalty points every time people shop, connecting purchases with reward programmes with a simple tap.

To achieve that, we needed to build an extensible platform that banks would accept from a security and accountability perspective. With the luxury of a greenfield site, we leveraged multiple CNCF projects including Kubernetes, Linkerd, Fluentd, Prometheus, and Flux to build a technology stack that is performant, scalable, reliable, and secure, while reducing application issues caused by transient network problems. In this blog post, I’d like to share how Bink achieved that.

Bink’s backstory

Everyone at Bink has years of first-hand experience in banking, retail, and loyalty programmes. We understand banking opportunities, in particular when it comes to transforming loyalty programmes while ensuring they have a positive impact on retail.

In 2019, Barclays recognised the immense potential in our offering and committed to a significant investment in our business. Thanks to this partnership, Bink is now accessible to millions of Barclays’ customers in the UK!

As at most startups, our infrastructure was built mostly by a single individual. Budgets were tight, but we knew we had to build something that could grow with the company. Fast forward three years and a team of three supports our in-house platform, which is capable of processing millions of transactions per day: a true testament to the amazing technology of the cloud native ecosystem!

The Bink team, infrastructure, and platform

I’m on the platform team along with Chris Pressland and Nathan Read. Today, we operate six Kubernetes clusters. Two are dedicated to production in a multi-cluster setup with each cluster carrying approximately 57 different in-house built microservices. All operations run in Microsoft Azure with multiple production and test environments.

Initially, Bink had three web servers on bare metal Ubuntu 14.04 instances running a handful of uWSGI applications load balanced behind NGINX instances — no automation of any kind was in place.

In 2016, Chris was hired to begin converting Bink’s applications to Docker containers and to move away from the existing approach of SFTPing code onto the production servers and restarting uWSGI pools. To enable this, we built a container orchestration utility in Chef that dynamically assigned host ports to containers and updated NGINX’s proxy_pass blocks to pass traffic through. This worked well enough until we realized that Docker caused many kernel panics and other issues on our aging Ubuntu 14.04 infrastructure.
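To make the idea concrete, here is a minimal sketch of that pattern: hand each container a free host port, then render an NGINX block that proxies to it. The original utility was written in Chef; Python is used here purely for illustration, and the port range, function names, and URL layout are all assumptions rather than details of Bink’s tool.

```python
from typing import Dict, Iterable

# Hypothetical pool of host ports reserved for containers.
PORT_RANGE = range(20000, 20100)


def assign_ports(containers: Iterable[str], in_use: Iterable[int] = ()) -> Dict[str, int]:
    """Give each container the lowest host port in PORT_RANGE that is still free."""
    taken = set(in_use)
    free = (port for port in PORT_RANGE if port not in taken)
    assignments: Dict[str, int] = {}
    for name in containers:
        try:
            assignments[name] = next(free)
        except StopIteration:
            raise RuntimeError("no free host ports left") from None
    return assignments


def render_location(name: str, port: int) -> str:
    """Render an NGINX location block that proxies /<name>/ to the container's host port."""
    return (
        f"location /{name}/ {{\n"
        f"    proxy_pass http://127.0.0.1:{port}/;\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Port 20000 is pretended to be occupied already.
    for name, port in assign_ports(["api", "worker"], in_use=[20000]).items():
        print(render_location(name, port))
```

Even in this toy form, the bookkeeping (tracking which ports are taken, regenerating the proxy configuration, reloading NGINX) hints at why maintaining a home-grown orchestrator becomes a burden as the number of services grows.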

Around the same time, we got formal approval for evaluating a migration from our data center to the cloud — our needs were far outgrowing what the data center could offer. As a Microsoft 365 customer, we decided to go with Microsoft Azure. We also quickly concluded that maintaining our own container orchestrator wasn’t sustainable in the long run and decided to move to Kubernetes. At the time, Microsoft didn’t have a Kubernetes offering that met our requirements, so we picked up Chef again and wrote our own.

Moving to Kubernetes and looking for a service mesh

The first few years of running our in-house Kubernetes distribution were painful, to say the least. This was mostly due to Microsoft’s networking infrastructure, which was unstable at the time. It got considerably better over time but, for the first few years, we saw large numbers of random TCP disconnects, UDP connections just going missing, and other random faults.

Around 2017, we started looking into service meshes, hoping they could help solve, or at least mitigate, some of these issues. At the time, Monzo engineers gave a KubeCon talk on a recent outage they had experienced and the role Linkerd played. Not everyone is transparent about these things, and I really appreciated them sharing what happened so the community could learn from their failure. A big shout-out to the Monzo team for doing that! That’s when we began looking into Linkerd. The maintainers were about to release a newer Kubernetes-native version called Conduit, soon to be renamed Linkerd2.

We also briefly considered Istio as much of the industry seemed to be leaning towards Envoy-based service meshes. However, after a fairly short experiment, we realized Linkerd was really easy to implement. There was no need to write application code to deal with transient network failures and the latency was so small it was worth the additional hop in the stack — what it gave us in traceability was invaluable, too! In short, Linkerd was a perfect fit for our use case.

And lo and behold, as soon as we started experimenting with Linkerd in our non-production clusters, network faults caused by the underpinning Azure instabilities dropped significantly. That gave us the confidence to add Linkerd to our production workloads where we saw similar results.

The service mesh difference with Linkerd

Migrating our software stack onto a cloud native platform was a no-brainer. However, parts of the architecture weren’t as performant or stable as we had hoped. Linkerd enabled us to implement connection and retry logic at the right level of the stack, giving us the resilience and reliability we needed. Suddenly, the questions over whether we could use our software stack in the cloud without significant uplift disappeared. Linkerd showed that placing the logic in the connection layer was the right approach and allowed us to focus on product innovation rather than having to worry about network or connection instability. That really helped reduce operational development costs and time to market!
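For a sense of what this saves, below is a minimal sketch of the kind of retry wrapper each service would otherwise have to carry in its own code. It is not taken from Bink’s codebase: the function name, defaults, and backoff policy are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait a random time between 0 and base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

Every outbound call in every microservice would need wrapping like this, along with decisions about which requests are safe to repeat. Moving that concern into the mesh keeps it in one place and out of application code.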

Ultimately, Linkerd significantly improved things from a technology and business standpoint. We had just begun conversations with Barclays and needed to prove we could scale to meet their needs. Linkerd gave us the confidence to adopt a scalable cloud-based infrastructure knowing it would be reliable, because any network instability was now handled by Linkerd. This in turn allowed us to agree to an SLA based on latency and success rate. Linkerd was the right place in the stack for us to monitor internal SLOs and track the performance of our software stack.
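As a rough illustration of what checking such an objective involves, the sketch below computes a success rate and a 99th-percentile latency from a batch of request records and compares them with targets. The thresholds, the `Request` record, and the function names are all hypothetical; the actual SLA figures are not given in this post.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def success_rate(requests: List[Request]) -> float:
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to evaluate")
    return sum(r.ok for r in requests) / len(requests)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no values to evaluate")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(
    requests: List[Request],
    min_success: float = 0.999,  # hypothetical target
    p99_ms: float = 250.0,       # hypothetical target
) -> bool:
    """True when both the success-rate and p99 latency targets are met."""
    latencies = [r.latency_ms for r in requests]
    return success_rate(requests) >= min_success and percentile(latencies, 99) <= p99_ms
```

In practice the mesh exports these figures as metrics, so the calculation happens in the monitoring stack rather than in hand-written code; the sketch only shows what the numbers mean.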

We didn’t want to grow an on-prem infrastructure or refactor our code to add retry logic. Linkerd spared us both and provided the metrics we needed to meet these commitments quickly, while helping us track down platform bottlenecks in a cloud native environment.

The power of cloud native

Across the entire stack, cloud native technologies, and CNCF projects in particular, allowed us to build a cloud-agnostic platform that can scale as we need it whilst keeping a close eye on performance and stability. We’ve tested our platform under load and can perform full disaster recovery in under 15 minutes, recover easily from transitory network issues, and perform root cause analysis of problems quickly and efficiently.

Ready to shake up UK banking and loyalty programmes

At Bink, we had to rapidly grow and mature our infrastructure to meet banking security, stability, and throughput requirements. Cloud native technologies enabled us to confidently support and monitor contractual SLA and internal SLO requirements. This positioned Bink well to capitalise on the retail space opening back up and to help retailers enable payment-linked loyalty in partnership with some of the biggest names in UK banking.