Guest post by Gianluca Arbezzano, Software Engineer at Equinix Metal and CNCF Ambassador, and Alex Palesandro, Research Assistant at Polytechnic of Turin

Kubernetes clusters are growing in number and size inside organizations. This proliferation has various causes: scalability issues, geographical constraints, multi-provider strategies, and more. Unfortunately, existing multi-cluster approaches have significant limitations in pod placement, cluster setup, and compatibility with new APIs. Moreover, they require a lot of manual configuration.

During the first CNCF Turin Meetup, Alex and Mattia discussed multi-cluster management problems, highlighting the limitations of current approaches. They discussed possible technical choices to overcome these limitations and presented a possible implementation in Liqo, a project that dynamically creates a “big cluster” by transparently aggregating multiple existing clusters. At the end of the presentation, they showed a demo of Liqo in action in a cloud-bursting scenario.

Introduction – The Good and the Bad of Multi-clusters

Kubernetes clusters spread across multiple data centers and regions are now a reality. Since the container “revolution”, Kubernetes has become the de facto standard for infrastructure management. On the one hand, K8s is ubiquitous in the cloud: more and more providers are building and delivering managed clusters as a service. On the other hand, K8s is also popular on-premises, where the rich Kubernetes ecosystem can reduce the “catalog” distance from public clouds. Besides, edge setups are becoming popular: an increasing number of projects focus on bringing Kubernetes to lightweight and geographically sparse infrastructures [1].

Despite all the added complexity, ubiquitous multi-cluster topologies open up exciting new possibilities that go beyond the simple static application orchestration over multiple clusters explored so far. In fact, multi-cluster topologies can be useful to orchestrate applications across various locations and unify access to the infrastructure. Among other things, this introduces the exciting possibility of migrating an application from cluster to cluster, transparently and quickly. Moving workloads can be practical when dealing with cluster disasters, critical infrastructure interventions, scaling, or placement optimization.

A partial Taxonomy

Multi-cluster topologies introduce primarily two classes of challenges:

  1. They require a form of synchronization between cluster control planes.
  2. They require a form of interconnection that makes services accessible in different clusters.

Many projects address the multi-cluster problem; here, we summarize the most common approaches.

Multi-cluster Control Plane

Dedicated API Server

The official Kubernetes Cluster Federation (a.k.a. KubeFed) [2] represents an example of this approach, which “allows you to coordinate the configuration of multiple Kubernetes clusters from a single set of APIs in a hosting cluster” [2]. To do so, KubeFed extends the traditional Kubernetes APIs with new semantics for expressing which clusters should be selected for a specific deployment (through “Overrides” and “Cluster Selectors”).
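To make the idea of cluster selectors and overrides more concrete, here is a minimal Python sketch. It is a toy model only: the cluster names, labels, and helper functions are hypothetical, and it does not use or reproduce the actual KubeFed API.

```python
from copy import deepcopy

# Hypothetical cluster inventory: name -> labels (not real KubeFed objects).
CLUSTERS = {
    "eu-west": {"region": "eu", "tier": "prod"},
    "us-east": {"region": "us", "tier": "prod"},
    "lab": {"region": "eu", "tier": "dev"},
}


def select_clusters(clusters, selector):
    """Return the sorted names of clusters whose labels match every selector pair."""
    return sorted(
        name
        for name, labels in clusters.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


def render_for_cluster(template, overrides, cluster):
    """Copy the base template and apply the field overrides of one cluster."""
    spec = deepcopy(template)
    spec.update(overrides.get(cluster, {}))
    return spec


template = {"image": "nginx:1.21", "replicas": 2}
overrides = {"us-east": {"replicas": 5}}  # per-cluster deviation from the template

targets = select_clusters(CLUSTERS, {"tier": "prod"})
plan = {cluster: render_for_cluster(template, overrides, cluster) for cluster in targets}
```

In this sketch the selector decides where the workload goes, while the overrides decide how it differs per cluster. Note that both are fixed when the deployment is described.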

GitOps 

GitOps is a well-established framework to orchestrate CI/CD workflows. The basic idea is to use a git repository as the single source of truth for application deployment and to update the corresponding objects in the cluster accordingly. When facing multi-cluster topologies, GitOps can act as an elementary multi-cluster control plane. Examples of GitOps tools include FluxCD, Fleet, and ArgoCD.

In such a scenario, the applications are templated with the correct values for each target cluster and then deployed there. This approach, combined with proper network interconnection tools, allows you to obtain multi-cluster orchestration without dealing with the complexity of extra APIs.

However, GitOps approaches lack dynamic pod placement across the multi-cluster topology. They do not support any active disaster recovery strategy or cross-cluster bursting. In particular, there is no way to automatically migrate workloads across clusters to respond to unexpected failures or to deal rapidly with unexpected load peaks.
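The templating step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how FluxCD, Fleet, or ArgoCD work internally: the cluster names, the values, and the truncated manifest are all hypothetical.

```python
from string import Template

# Deliberately truncated Deployment fragment (no selector or pod template),
# used only to show per-cluster substitution.
MANIFEST_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $app
  labels:
    region: $region
spec:
  replicas: $replicas
""")

# Hypothetical per-cluster values, as they might be stored in a git repository.
CLUSTER_VALUES = {
    "cluster-a": {"replicas": 3, "region": "eu"},
    "cluster-b": {"replicas": 1, "region": "us"},
}


def render_all(template, app, cluster_values):
    """Render one manifest per target cluster from a shared template."""
    return {
        cluster: template.substitute(app=app, **values)
        for cluster, values in cluster_values.items()
    }


rendered = render_all(MANIFEST_TEMPLATE, "web", CLUSTER_VALUES)
```

Each rendered manifest would then be applied to its own cluster by the GitOps tool. The placement is decided by what is committed to the repository, not by a scheduler at run time.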

Virtual-kubelet-based approaches 

Virtual Kubelet (VK) is a “Kubernetes kubelet implementation that masquerades as a kubelet to connect Kubernetes to other APIs” [3]. Initial VK implementations modeled a remote service as a node of the cluster, used as a placeholder to introduce serverless computing in Kubernetes clusters. Later, VK gained popularity in the multi-cluster context: a VK provider can map a remote cluster to a local cluster node. Several projects, including Admiralty, Tensile-kube, and Liqo, adopt this approach.

This approach has several advantages compared to a dedicated API Server. First, it introduces multi-cluster without requiring extra APIs, and it is transparent to applications. Second, it flexibly adds the resources of remote clusters to those available to the scheduler: users can schedule pods in the remote cluster as if they were local. Third, it enables decentralized governance. More precisely, the VK may not require privileged access on the remote cluster to schedule pods and other K8s objects, which supports multiple ownerships.
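The second point can be illustrated with a toy scheduler in Python. This is only a sketch: the real Kubernetes scheduler uses a much richer filtering and scoring pipeline, and the node names and sizes below are hypothetical. The point is that a virtual node is just one more candidate in the same list.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    virtual: bool = False  # True when the node stands in for a remote cluster


def pick_node(nodes, cpu_millis, mem_mib):
    """Return the feasible node with the most free CPU, or None if nothing fits."""
    feasible = [
        node
        for node in nodes
        if node.free_cpu_millis >= cpu_millis and node.free_mem_mib >= mem_mib
    ]
    return max(feasible, key=lambda node: node.free_cpu_millis, default=None)


nodes = [
    Node("worker-1", free_cpu_millis=500, free_mem_mib=1024),
    Node("worker-2", free_cpu_millis=250, free_mem_mib=512),
    Node("remote-cluster", free_cpu_millis=8000, free_mem_mib=16384, virtual=True),
]

# A pod too large for any local worker lands on the virtual node.
chosen = pick_node(nodes, cpu_millis=1000, mem_mib=2048)
```

No multi-cluster logic appears in `pick_node`: the scheduling code is unaware that one of the nodes represents a whole remote cluster.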

Network Interconnection Tools

Network interconnection represents the second important aspect of multi-cluster topologies. Pods should be able to communicate seamlessly with pods and services on other clusters. Inter-cluster connectivity can be provided by extensions of the CNI, the component responsible for cluster connectivity, or by dedicated tools.

The critical design choices for interconnection tools involve mainly three aspects: (1) interoperability with different cluster configurations, (2) compatibility between the network parameters used in the different clusters, and (3) how to deal with services exposed on all clusters.

CNI-provided interconnection

CiliumMesh represents an example of a CNI implementing multi-cluster interconnection. More precisely, CiliumMesh extends the capabilities of the popular Cilium CNI to “federate” multiple Cilium instances on different clusters (ClusterMesh). Cilium supports Pod IP routing across multiple Kubernetes clusters via tunneling or direct routing, without requiring any gateways or proxies. Moreover, it provides transparent service discovery with standard Kubernetes services and CoreDNS.

The main drawback of approaches like CiliumMesh is the strict dependency on a given CNI, namely Cilium, which must be adopted in all the interconnected clusters. Furthermore, Cilium has some critical requirements in terms of Pod CIDR uniqueness across clusters.
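The Pod CIDR uniqueness requirement mentioned above can be checked ahead of time. The following Python sketch uses the standard `ipaddress` module to find clusters whose Pod CIDRs overlap; the cluster names and address ranges are hypothetical examples.

```python
import ipaddress
from itertools import combinations


def overlapping_pod_cidrs(cidrs_by_cluster):
    """Return the pairs of cluster names whose Pod CIDRs overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in cidrs_by_cluster.items()
    }
    return [
        (first, second)
        for first, second in combinations(sorted(networks), 2)
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical Pod CIDRs: "a" and "b" collide, "c" is disjoint from both.
conflicts = overlapping_pod_cidrs(
    {
        "a": "10.244.0.0/16",
        "b": "10.244.128.0/17",
        "c": "10.245.0.0/16",
    }
)
```

Any pair reported here would have to be re-addressed before the clusters can be interconnected by a tool that routes Pod IPs directly.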

CNI-Agnostic Interconnection

Submariner enables direct networking between Pods and Services in different Kubernetes clusters, either on-premises or in the cloud. Submariner is entirely open-source and designed to be network plugin (CNI) agnostic. Submariner has a centralized architecture based on a broker that collects information about cluster configurations and sends back parameters to use.

Submariner does not support services having endpoints spread across multiple clusters (multi-cluster services). Instead, it provides a more straightforward mechanism for discovering remote services whose back-end pods are all in the same location.

Skupper is a layer-7 multi-cluster service interconnection tool. Skupper enables secure communication across Kubernetes clusters by defining an ad-hoc virtual networking substrate. Unlike Submariner and Cilium, Skupper does not introduce a cluster-wide interconnection but only connects a specific set of namespaces. Skupper implements multi-cluster services in namespaces exposed in the Skupper network. When a service is exposed, Skupper creates dedicated endpoints, making them available on the entire cluster.

Service Mesh 

Service mesh frameworks are dedicated infrastructure layers that simplify the management and configuration of microservice-based applications. Service meshes introduce a sidecar container as a proxy to provide multiple features (e.g., secure connections with mutual TLS, circuit breaking, canary deployments).

Some of the most popular service mesh architectures (Istio, Linkerd) have multi-cluster support to embrace multi-cluster microservice applications. The interconnection among different clusters uses a dedicated proxy to route traffic from the mesh of one cluster to another. Both Istio and Linkerd can create an ad-hoc mutual TLS tunnel across clusters and provide primitives to expose services across the clusters, enabling features such as cross-cluster traffic splitting.

In general, multi-cluster support in service-mesh frameworks provides a wide range of features. However, setting up the topology requires many steps and the configuration of several new, specific APIs.

Focus: Liqo

The categories of approaches mentioned above have several limitations. First, for many of them (KubeFed and GitOps), pod placement is static, and no fine-grained cluster optimization is possible. Second, those projects handle either the network or the control plane, so third-party tools are required to deal with the other half. Overall, this separated approach makes it hard to quickly plug clusters into, or remove them from, existing topologies. By contrast, as we will discuss later, Liqo's integrated approach enables CNI-agnostic multi-cluster service support, where service endpoints are added to K8s with the correct IP address (i.e., considering NAT rules and network topology).

The idea behind Liqo is to make building a multi-cluster topology a single-step operation for cluster administrators. This is obtained by combining a virtual-kubelet-based approach to deal with the multi-cluster control plane and a dedicated network fabric that is agnostic to CNIs and IP configuration. In a nutshell, Liqo offers unified access to clusters, sparing Kubernetes users from having to know the multi-cluster topology in advance. Kubernetes administrators can change the topology by adding or removing clusters without impacting their users and potentially without affecting the running workloads.

On the one hand, Liqo is dynamic: clusters can be added to and removed from the topology on the fly through “peering”. On the other hand, Liqo is transparent to applications: pod offloading and multi-cluster services are realized by a dedicated virtual-kubelet provider.

The main differences between Liqo and other projects are presented in Tables 1 and 2.

Table 1 – Comparison of Control-Plane Multi-cluster projects

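| Criteria | Liqo | Admiralty | Tensile-Kube | Kubefed | ArgoCD | Fleet | FluxCD |
|---|---|---|---|---|---|---|---|
| Seamless Scheduling | Yes | Yes | Yes | No | No | No | No |
| Support for Decentralized Governance | Yes | Yes | Yes | No | Yes | Yes | Yes |
| No Need for application extra APIs | Yes | Yes | Yes | No | No | No | No |
| Dynamic Cluster Discovery | Yes | No | No | No, using Kubefed CLI | No | No | No |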

Table 2 – Comparison of Network Interconnection projects

| Criteria | Liqo | Cilium ClusterMesh | Submariner | Skupper | Istio Multi-Cluster | Linkerd Multi-cluster |
|---|---|---|---|---|---|---|
| Architecture | Overlay Network and Gateway | Node to Node traffic | Overlay Network and Gateway | L7 Virtual Network | Gateway-based | Gateway-based |
| Interconnection Set-Up | Peer-To-Peer, Automatic | Manual | Broker-based, Manual | Manual | Manual | Manual |
| Secure Tunnel Technology | Wireguard | No | IPSec (Strongswan, Libreswan), Wireguard | TLS | TLS | TLS |
| CNI Agnostic | Yes | No | Yes | Yes | Yes | Yes |
| Multi-cluster Services (“East-West”) | Yes | Yes | Limited | Yes | Yes, with traffic management | Yes, with traffic splitting |
| Seamless Cluster Extension | Yes | Yes | Yes | No | No | No |
| Seamless Support for Overlapped IPs | Yes | No | No, it relies on global IPs networking | Yes | Yes | Yes |
| Support for more than 2 clusters | In Progress | Yes | Yes | Yes | Yes | Yes |

Main abilities of Liqo

Liqo Discovery and Peering

Liqo implements a mechanism to discover a cluster and peer with it in a single step. The discovery relies on DNS SRV records (as in SIP), LAN mDNS, and manual insertion as a last resort. When a new cluster is discovered, an administrative interconnection between the clusters, called peering, is established, mimicking the similar process used to interconnect different Internet operators. The peering process relies on a P2P protocol that enables the exchange of parameters between two clusters to define a secure network configuration.
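As an illustration of SRV-based discovery in general, the Python sketch below parses the data part of DNS SRV records (priority, weight, port, target) and picks a preferred endpoint. It is not Liqo's discovery code: the record contents are hypothetical, no DNS query is performed, and the weighted random choice among equal-priority records described in RFC 2782 is replaced by a deterministic "highest weight" rule for simplicity.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(rdata):
    """Parse SRV record data of the form 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


def preferred_endpoint(rdatas):
    """Pick the lowest priority value, breaking ties by highest weight (simplified)."""
    records = [parse_srv(rdata) for rdata in rdatas]
    best = min(records, key=lambda record: (record.priority, -record.weight))
    return f"{best.target}:{best.port}"


# Hypothetical answers for a peering service advertised under some domain.
endpoint = preferred_endpoint(
    [
        "10 50 6443 b.example.com.",
        "5 10 6443 a.example.com.",
        "5 80 443 c.example.com.",
    ]
)
```

The appeal of this kind of discovery is that an administrator only needs to know a domain name: the records published under it tell the client where the remote cluster can be contacted.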

Liqo Network Fabric

The fundamental principle of Liqo networking is to maintain the basic functioning of single-cluster networking. In particular, the main focus is on preserving direct pod-to-pod traffic, enabling a transparent extension from the user’s perspective. Pods discovered as service endpoints can be reached even if they are on another cluster or their address collides with the “home” cluster pod address space.

Under the hood, cluster interconnection is established using an overlay network to route traffic to remote clusters. Liqo leverages a “gateway” pod that connects to the remote peer using Wireguard. This architecture removes the requirement, present in CiliumMesh, that all the nodes of the participating clusters be directly reachable from the other cluster. In addition, Liqo also handles possibly overlapping pod IP addresses by means of double NAT.

Liqo is largely independent of the CNIs of the connected clusters and does not require compatible Pod CIDRs. CNIs can be chosen independently and, moreover, Liqo also supports managed clusters (e.g., AKS, GKE) and their networking architecture.
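The address translation idea can be sketched as follows. This is a simplified model, not Liqo's implementation: it assumes that each cluster sees the remote pods through a remapped range of the same size, and all the CIDRs shown are hypothetical.

```python
import ipaddress


def remap_address(addr, original_cidr, remapped_cidr):
    """Translate addr from original_cidr into remapped_cidr, keeping the host offset."""
    original = ipaddress.ip_network(original_cidr)
    remapped = ipaddress.ip_network(remapped_cidr)
    if original.prefixlen != remapped.prefixlen:
        raise ValueError("both ranges must have the same prefix length")
    ip = ipaddress.ip_address(addr)
    if ip not in original:
        raise ValueError(f"{addr} is not part of {original_cidr}")
    offset = int(ip) - int(original.network_address)
    return str(remapped.network_address + offset)


# Both clusters use the same Pod CIDR, so neither can use the other's real addresses.
POD_CIDR = "10.244.0.0/16"

# Hypothetical remapped ranges: how each cluster "sees" the pods of the other one.
remote_pod_seen_from_home = remap_address("10.244.3.7", POD_CIDR, "10.71.0.0/16")
home_pod_seen_from_remote = remap_address("10.244.9.1", POD_CIDR, "10.72.0.0/16")
```

In a packet crossing the gateways, both the source and the destination would be rewritten this way, which is why the technique is called double NAT.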

Liqo Resource Sharing

After the peering has occurred, a new node is added to the cluster. This virtual node describes the amount of CPU and memory of the other cluster that is available for scheduling. The vanilla Kubernetes scheduler can directly assign pods to this node. The peering process defines the size of the node and de facto introduces the possibility of decentralized governance: cluster admins can tune the amount of resources exposed to the other clusters.

Using Liqo, nothing changes in the way users interact with Kubernetes. For example, when a user deploys an application in a Liqo-labelled namespace, the namespace content is reflected in a “twin” namespace on the other cluster. More precisely, most of the K8s objects in the namespace are replicated in the remote twin. This enables a pod to be executed remotely in a transparent way and still access its configuration objects.

This is particularly interesting for service reflection, which implements “East-West” multi-cluster services. Pods can access services anywhere in the multi-cluster topology. Under the hood, service endpoints are manipulated by the Liqo VK, which crafts them to also take NAT translation into account.

Finally, Liqo pod offloading is split-brain tolerant. When pods are offloaded to the remote cluster, they are wrapped in a ReplicaSet object. This way, the state of the offloaded pods continues to be correctly reconciled on the remote cluster even if the connection with the originating cluster is lost.
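A minimal Python sketch of how the size of the virtual node could be derived is shown below. It assumes, purely for illustration, that a cluster advertises a fixed percentage of its free resources; the percentage, the resource names, and the function are hypothetical and do not describe Liqo's actual policy.

```python
def advertised_resources(free, share_percent):
    """Return the slice of free resources a cluster exposes to a peer.

    `free` maps a resource name to an integer quantity (for example CPU in
    millicores or memory in MiB); results are rounded down.
    """
    if not 0 <= share_percent <= 100:
        raise ValueError("share_percent must be between 0 and 100")
    return {name: quantity * share_percent // 100 for name, quantity in free.items()}


# Hypothetical free capacity of the remote cluster and a 30% sharing policy.
virtual_node_capacity = advertised_resources(
    {"cpu_millis": 8000, "memory_mib": 16384},
    share_percent=30,
)
```

The resulting quantities are what the virtual node would report as allocatable, so the local scheduler never places more on it than the remote administrator agreed to share.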
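The wrapping step can be pictured with the following Python sketch, which builds a ReplicaSet manifest (as a plain dictionary) around a pod definition. The manifest layout follows the standard `apps/v1` ReplicaSet structure, but the helper function and the origin label are hypothetical and are not taken from Liqo's code.

```python
from copy import deepcopy


def wrap_in_replicaset(pod, origin_cluster):
    """Build a single-replica ReplicaSet manifest around a pod manifest."""
    labels = dict(pod["metadata"].get("labels", {}))
    labels["example.io/origin-cluster"] = origin_cluster  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "ReplicaSet",
        "metadata": {
            "name": pod["metadata"]["name"],
            "namespace": pod["metadata"]["namespace"],
        },
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": deepcopy(pod["spec"]),
            },
        },
    }


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web", "namespace": "demo", "labels": {"app": "web"}},
    "spec": {"containers": [{"name": "web", "image": "nginx:1.21"}]},
}

replicaset = wrap_in_replicaset(pod, "home-cluster")
```

Because the ReplicaSet lives on the remote cluster, its controller can recreate the pod there if it fails, without needing to reach the originating cluster.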

Future Work

Liqo has recently seen its second major release, v0.2. The features planned for the next release (v0.3, expected mid-July 2021) are the following:

  • Support for deployments spanning across more than two clusters: the seamless cluster integration proposed by Liqo will embrace more complex topologies, enabling a deployment (e.g., a microservice-based application) to run across three or more clusters.
  • Support for Amazon Elastic Kubernetes Service (EKS)
  • Support for more granular permission control over remote cluster resources: so far, Liqo does not provide permission management to limit the permissions of an offloaded workload on a remote cluster.

Conclusion

With the rising number of clusters, multi-cluster topologies will become increasingly popular.

Liqo proposes an interesting approach to this problem: a virtual cluster abstraction that provides a unified and coherent view of your clusters, hence simplifying the creation and management of multi-cluster topologies.

References:

[1] https://www.infoq.com/articles/kubernetes-trends-and-challenges/

[2] https://github.com/kubernetes-sigs/kubefed

[3] https://github.com/virtual-kubelet/virtual-kubelet