How a $4 billion retailer built an enterprise-ready Kubernetes platform powered by Linkerd

Posted on February 19, 2021 by Henry Hagnäs and Fredrik Klingenberg

Guest post by Henry Hagnäs, Enterprise Cloud Architect at Elkjop Nordic AS, and Fredrik Klingenberg, Senior Consultant at Aurum AS

In this article, we discuss how Elkjøp, the largest electronics retailer in the Nordics, built an internal Kubernetes platform that is now successfully hosting over 200 microservices in production to increase development speed—without compromising security or visibility. The platform enabled the organization to reduce hosting costs by around 80%.

To gain visibility into service health and encrypt all service-to-service communication, we adopted Linkerd, the CNCF service mesh. We, Henry Hagnäs, Elkjøp’s Cloud Solution Architect, and Fredrik Klingenberg, a Senior Consultant at Aurum AS, wrote this blog post to share the thoughts and considerations that shaped our adoption of Linkerd.

Backdrop

With over 400 retail locations and 12,000 employees across Norway, Sweden, Finland, Denmark, and franchises in Iceland, Greenland, and the Faroe Islands, Elkjøp is the largest electronics retailer in the Nordics. It also has a large e-commerce presence in all these markets.

Elkjøp’s history starts in 1962, and the company has long had a strong brick-and-mortar retail business and logistical infrastructure. For a long time, the IT department was small and focused mostly on integrating third-party products and externally developed solutions. Six years ago, this strategy started to change when microservices were introduced to provide shared functionality between systems. These included an advanced payment API used by both the e-commerce platform and the in-store point-of-sale systems.

As it turned out, microservices developed in-house were a great way to increase development velocity, and they quickly became the preferred alternative to expensive enterprise software packages. The first hundred or so microservices were hosted in individual Azure Web Apps, but a new approach was needed as the environment grew and the APIs became the central, and soon the only, way to conduct sales.

After a brief feasibility study, the Elkjøp team decided to bring in Aurum AS, a Norwegian IT consulting firm, to help build a new, modern, and enterprise-ready hosting platform for microservices.

The platform ended up hosting almost two hundred microservices, all upgraded and tuned for the increased requirements of a 24/7 seamless purchasing experience for Elkjøp’s customers, whatever channel they’re on—online, social media, mobile, or in-store.

Platform principles

From the onset, the platform had specific goals and principles:

The first platform iteration needed to offer at least the same features as Azure App Service, at a lower cost, and be easier to operate.
It also needed to make it simpler to scale applications, automatically or manually, and to manage them centrally.
Developer productivity should increase after moving a workload to the platform or starting a new application on it.

We also wanted to move cross-cutting concerns away from libraries and onto the platform. We needed to provide our developers with a base set of functionality and security available by just deploying the application onto the platform. In other words, we wanted to embrace a more aspect-oriented model. We needed to provide individual teams with full autonomy yet ensure there were boundaries to protect against mistakes.

Our goal was to create and support a Site Reliability Engineering (SRE) practice and, through tooling and best practices, move towards a DevOps culture where developers think more about operational concerns.

We needed to provide developers with a platform that gave them creative freedom, while providing guardrails that stop them from making costly mistakes, such as shipping weak security or spending too much time on non-functional details. As we will see, Linkerd made that a lot easier.

No service mesh, no network insights, no encryption

We started the migration by dockerizing and deploying our applications onto Kubernetes. However, we quickly realized that we missed the metrics and insight provided out-of-the-box by a PaaS offering such as Azure App Service. Additionally, since we terminated TLS at the ingress controller, all communication between the applications was unencrypted. We needed to introduce something that would solve both problems—that’s what a service mesh does.

We selected Linkerd, the lightweight, ultra-fast Cloud Native Computing Foundation (CNCF) service mesh. At a high level, Linkerd injects an ultra-lightweight micro-proxy as a sidecar for each application. The proxy can offload many cross-cutting concerns such as encryption and provide valuable metrics—precisely the problem we needed to solve.

The micro-proxies form the data plane, whereas the other components of Linkerd constitute the control plane. The Linkerd control plane exposes an API that you can use through its CLI or dashboard. It also includes components that issue certificates for mTLS, a proxy injector that adds the proxies to Kubernetes deployments, and an application called “tap” that sets up streams to watch live traffic from the proxies, to name a few.

Why Linkerd?

One of our guiding principles was introducing technology that solves specific problems with minimal added complexity. After all, Kubernetes is complicated enough. From a high level of abstraction, service meshes range from full API management to a micro sidecar proxy on the individual applications. We were leaning towards the latter.

Linkerd was our choice for several reasons.

Linkerd’s proxy is written in Rust and has a low performance overhead. It beats the other service meshes when it comes to memory and CPU usage. While we didn’t think the additional performance cost of, let’s say, Istio would have been a problem for us, we take all the performance gains we can get.

Additionally, we wanted a project backed by the CNCF with all of its benefits.

Microsoft introduced the Service Mesh Interface (SMI) at KubeCon Barcelona 2019. In short, the SMI is an open-source specification of the interfaces that a service mesh can implement. We liked the idea of having a committee evaluating the specification. Since Linkerd implements the SMI, it gives us the flexibility to switch to a different service mesh in the future while keeping the same interface.

Once we started implementing and using Linkerd, we learned that the community’s documentation and response to questions were second to none.

Linkerd is easy to install and get up and running. It auto-injects the micro-proxies, which means that the platform team can annotate the namespaces to bring deployments onto the service mesh automatically. Developers don’t need to worry about Linkerd. If they do need to override the defaults, they can easily do that with annotations.
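To make the opt-in concrete, here is a minimal sketch in Python that builds a namespace manifest with Linkerd’s injection annotation, plus a set of per-workload annotations that override the proxy’s resource requests. The namespace name "pos-backend" and the resource values are hypothetical and are not taken from our setup; the annotation keys are the ones Linkerd documents for injection and proxy configuration. The output is JSON, which kubectl accepts just like YAML.

```python
import json


def injected_namespace(name):
    """Return a Namespace manifest whose pods get the Linkerd proxy injected."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            # Linkerd's proxy injector acts on this annotation.
            "annotations": {"linkerd.io/inject": "enabled"},
        },
    }


def proxy_overrides(cpu_request, memory_request):
    """Return pod-template annotations that override the proxy defaults."""
    return {
        "config.linkerd.io/proxy-cpu-request": cpu_request,
        "config.linkerd.io/proxy-memory-request": memory_request,
    }


if __name__ == "__main__":
    # "pos-backend", "100m" and "64Mi" are illustrative values only.
    print(json.dumps(injected_namespace("pos-backend"), indent=2))
    print(json.dumps(proxy_overrides("100m", "64Mi"), indent=2))
```

The platform team would own the first manifest, while a development team would add the second set of annotations to its own deployment only when the defaults do not fit.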

Metrics and our SRE approach

With Linkerd, we get the request Rate, Errors, and Duration (RED) metrics by default. There is no need for developers or vendors to instrument their applications. While it would occasionally be helpful if they did, Linkerd alone gives us the most critical insight and the observability we need.
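For readers new to RED, the following sketch shows how the three numbers can be derived from a batch of observed requests. It is an illustration of the idea only, not how the Linkerd proxy computes its metrics: the Request record, the 5xx-means-failure rule, and the nearest-rank 95th percentile are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to serve the request


def red_summary(requests, window_seconds):
    """Summarize requests seen in a time window as rate, errors, duration."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    # Assumption for this sketch: any 5xx response counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile, using integer math for ceil(0.95 * total).
    rank = (95 * total + 99) // 100
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": latencies[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(200, 12.0), Request(200, 15.0), Request(503, 240.0)]
    print(red_summary(sample, window_seconds=1))
```

The point of getting these from the mesh is that the same three numbers exist for every service, whether or not the application was instrumented.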

We get the metrics by letting Linkerd set up an in-memory Prometheus server that scrapes all of the data plane proxies and an in-memory Grafana instance with dashboards during installation. Prometheus is our tool of choice for metrics. We augmented this setup with Prometheus federation and used Thanos for long-term metrics retention.
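Federation works by having one Prometheus server scrape selected series from another server’s /federate endpoint, chosen with repeated match[] parameters. The sketch below builds such a URL. The host name and the two job selectors are assumptions for illustration and should be checked against the actual Linkerd installation; only the /federate path and the match[] parameter come from Prometheus itself.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a Prometheus /federate URL selecting the given series."""
    # Each selector becomes its own match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical in-cluster address and job names, for illustration only.
    url = federate_url(
        "http://linkerd-prometheus.linkerd.svc.cluster.local:9090",
        ['{job="linkerd-proxy"}', '{job="linkerd-controller"}'],
    )
    print(url)
```

In a real setup the same selectors would normally go into the scrape configuration of the central Prometheus server rather than being requested by hand.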

Network confidence

If you’ve operated complex workloads, you probably know that people tend to blame the network every time something goes wrong. Service meshes like Linkerd provide detailed telemetry that reduces the mean-time-to-detection (MTTD) and mean-time-to-resolution (MTTR) of the root cause. While, in some cases, the network may indeed be the issue, it often falls victim to scapegoating. A service mesh puts a definite end to that.

Linkerd resolved many of these discussions quickly by providing us with the necessary network insights throughout different systems. It was rarely the network, by the way.

Saved by Linkerd metrics

As we moved our new point of sales backend to Kubernetes, we expected performance testing to be a formality. The backend worked fine when hosted on Linux servers, so what could go wrong? To our surprise, at even modest loads, we would see a drop in the success rate. We set out to trace the failing requests to make the system healthy and stable.

With the Linkerd metrics, we could monitor the number of TCP connections, along with their duration and failure rate, inbound and outbound, for each application and for the POS backend in particular. Ultimately, those metrics led us to discover that the sales portal application was not reusing sockets as it should, which exhausted the SNAT ports available for outbound connections. SNAT port exhaustion is hard to identify without detailed network telemetry. Knowing this, we increased the number of available outbound ports and later fixed the application. It is a perfect example of reducing MTTD and MTTR using Linkerd.
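The difference between reusing sockets and not reusing them is easy to demonstrate. The sketch below is not our application code; it is a self-contained illustration using only the Python standard library. It starts a local HTTP server that counts accepted TCP connections, then sends the same number of requests twice: once opening a new connection per request, and once over a single kept-alive connection. Every new outbound connection needs its own source port, which is what eventually runs out behind SNAT.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class OkHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive connections

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


class CountingServer(ThreadingHTTPServer):
    daemon_threads = True

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.connections = 0

    def get_request(self):
        self.connections += 1  # one accept() is one new TCP connection
        return super().get_request()


def fetch_without_reuse(host, port, n):
    for _ in range(n):
        conn = http.client.HTTPConnection(host, port)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()  # a fresh socket, and source port, every time


def fetch_with_reuse(host, port, n):
    conn = http.client.HTTPConnection(host, port)
    for _ in range(n):
        conn.request("GET", "/")
        conn.getresponse().read()  # drain the body so the socket can be reused
    conn.close()


def count_connections(fetch, n):
    """Run fetch against a local server and return the TCP connections used."""
    server = CountingServer(("127.0.0.1", 0), OkHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        fetch("127.0.0.1", server.server_address[1], n)
    finally:
        server.shutdown()
        server.server_close()
    return server.connections


if __name__ == "__main__":
    print("without reuse:", count_connections(fetch_without_reuse, 20))
    print("with reuse:   ", count_connections(fetch_with_reuse, 20))
```

At production request rates, the first pattern multiplies the number of outbound connections, which is the kind of signal the per-application TCP metrics made visible.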

Service mesh fine-tuning

Getting off the ground quickly with a service mesh has little merit if day-two operations then become too tricky. We needed to set Linkerd up in high availability (HA) mode and also fine-tune the proxy resources. The documentation was clear, and we got swift responses whenever we reached out to the community. It was easy to go from the initial setup to HA and do the fine-tuning by setting annotations or flags during Linkerd upgrades. We were surprised by how few resources Linkerd consumed, and we have yet to experience it being the bottleneck.

We have upgraded Linkerd several times without downtime or any issues.

One downside of shifting to a more aspect-oriented programming model is that you get very dependent on the component providing that model, in our case Linkerd for mTLS and metrics. However, it has not been a problem; Linkerd has been rock solid.

As our Kubernetes platform’s usage spreads across more and more stores, we will probably choose to use multiple geo-redundant clusters. It was, therefore, excellent to see Linkerd introducing support for just that.

As the “App Service Replacement” is done, we are also now looking at more advanced features, such as traffic splitting and canary releases using Linkerd and Flagger.
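As an illustration of what a canary release does with traffic splitting, the sketch below computes the sequence of primary and canary weights for a stepwise rollout. It is a simplified model under stated assumptions, not Flagger’s implementation: the function name and parameters are hypothetical, and a real rollout would also check metrics such as success rate before each step and roll back on failure.

```python
def canary_schedule(step_weight, max_weight):
    """Return (primary, canary) percentage weights for each rollout step."""
    if not 0 < step_weight <= max_weight <= 100:
        raise ValueError("expected 0 < step_weight <= max_weight <= 100")
    schedule = []
    canary = 0
    while canary < max_weight:
        # Shift another slice of traffic to the canary, capped at max_weight.
        canary = min(canary + step_weight, max_weight)
        schedule.append((100 - canary, canary))
    return schedule


if __name__ == "__main__":
    for primary, canary in canary_schedule(step_weight=10, max_weight=50):
        print(f"primary={primary}% canary={canary}%")
```

Each pair corresponds to one set of weights that a traffic split would apply between the stable and the new version of a service.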

What’s next for Elkjøp

At the time of this writing, we are rolling out the new point-of-sale system in all countries. By the end of 2021, every sale, 40 billion NOK ($4.7 billion) per year, will be going through these new Kubernetes environments. We are trusting Linkerd to help us keep the company running and help our customers enjoy amazing technology.
