Guest post by KubeEdge Maintainers

Service Architecture Evolution

  1. Introduction to China Mobile Online Marketing Service Center

The Center is a secondary unit of China Mobile Communications Group. It operates and manages China Mobile's online service resources and channels. With 44,000 agents, 900 million users, and 53 customer service centers, the Center runs the world's largest call center and is a leader in customer service technologies and services.

China Mobile Online Marketing Service Center milestones
China Mobile to cloud native roadmap
  1. Way to Cloud Native

Cloud-Edge Synergy Solution Selection

  1. Service requirements and pain points

The convergent customer service system features centralized traffic control and nearby media access. In this architecture, a small number of servers need to be deployed in the equipment rooms of each provincial branch, forming a star topology with the two centers at its core and 31 provincial branches as access points. In addition, multimedia service systems such as the video customer service and all-voice portal need to be deployed in local branches to ensure low network latency, faster streaming data transmission, and a better user experience. Pain points of the branch resource pools:

Service requirements and pain points diagram using control centers and media nodes
  1. Cloud-Edge Synergy Solution Selection (1)

How should we manage the nodes deployed in local branches, that is, the nodes at the edge? The edge can be deployed as either clusters or individual nodes.

Diagram showing Cloud-Edge Synergy Solution Selection

To deploy an edge cluster, a full set of management components needs to be deployed alongside the cluster. This incurs a large resource overhead, which is undesirable, especially for services that are expected to deliver high performance and availability.

Deploying edge nodes requires far fewer resources than deploying an edge cluster. The nodes are managed by a Kubernetes master in the cloud, so services at the edge can be controlled directly from the cloud, and the solution inherently achieves cloud-edge synergy. Two capabilities need to be in place before we can settle on this solution. The first is offline autonomy, which gives the architecture resilience: the edge may be disconnected from the cloud or connected over a poor network, and in that case kubelet (or the edge components that replace it) must still be able to start and run service pods, and pods must come back up after the host restarts even when the connection is poor. The second is that the edge nodes must be fully compatible with the Kubernetes APIs.

Considering that the nodes in each local branch are not deployed at a large scale, managing them as edge nodes incurs lower resource overhead. Therefore, the edge node solution is chosen.

  1. Cloud-Edge Synergy Solution Selection (2)

KubeEdge, OpenYurt, and SuperEdge are the mainstream open-source projects for edge nodes.

They are evaluated and compared in the following three aspects:

In addition, the number of edge components is also considered: the more components there are, the harder O&M becomes. Counting kubelet, KubeEdge requires only one edge component, OpenYurt three, and SuperEdge five. KubeEdge is clearly in the lead.

Table showing comparison between KubeEdge, OpenYurt and SuperEdge
Bar Chart showing community activeness between KubeEdge, OpenYurt and SuperEdge where results show KubeEdge has the highest number.
  1. Cloud-Edge Synergy Framework

A cloud-edge synergy architecture consisting of two centers and multiple edges is built based on KubeEdge. This solution extends container cloud computing capabilities to the edge nodes in local branches, which enables unified resource scheduling and management.

Cluster management by a unified container management platform

Edge node management based on KubeEdge

Optimized offline autonomy by adding edge proxy on EdgeCore

Higher service release efficiency by using the CI/CD capabilities of the management platform

Synergy of management, O&M, services and data diagram

A platform for data, management, and O&M synergy

Problems and Solutions in Practice

  1. Edge Networking

Diverse networking models are available for edge nodes. The following solution is chosen to ensure local traffic can be processed within local branches.

Table of edge networking solutions

  1. Edge Resource Management

Normally, quotas are used to allocate tenant cluster resources. However, this mechanism can result in severe resource fragmentation. It does not fit the elastic scaling capability of the container cloud and is not suitable for enterprise-grade, large-scale applications. Kubernetes taints are used to address this problem. Multiple resource pools are built to suit different services, as shown below:

Edge Resource Management using Kubernetes taints diagram

Resource pools:

Advantages:
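To make the taint-based pooling concrete, here is a minimal sketch (our own illustration, not the production tooling) that uses client-go to taint a node into a hypothetical video-service pool. The kubeconfig path, node name, and taint key/value are assumptions for this example; the same effect can be achieved with `kubectl taint nodes`.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load cluster credentials (path is a placeholder for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	node, err := client.CoreV1().Nodes().Get(ctx, "edge-branch01-node1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Taint the node so that only pods tolerating pool=video-service are
	// scheduled onto it, carving it into a dedicated resource pool.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "pool",
		Value:  "video-service",
		Effect: corev1.TaintEffectNoSchedule,
	})
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("node added to the video-service pool")
}
```

Pods belonging to a pool then declare a matching toleration (and usually a node selector or affinity rule) in their spec, so each service lands only on its dedicated nodes without the rigid carve-outs that quotas would impose.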

  1. Edge Service Access | Deploying Capabilities at the Edge

KubeEdge extends container cloud computing capabilities to the edge. EdgeMesh enables simple service exposure. However, layer-7 load balancing is not supported. Therefore, cloud capabilities need to be deployed at the edge.

Edge Service Access within and outside the cluster diagram

A cluster handles both intra-cluster and outbound traffic. For intra-cluster traffic, kube-proxy and CoreDNS, which originally run in the cloud, are moved to the edge. When pod1 talks to pod2, pod1 resolves the branch's dedicated domain name through the local CoreDNS. Once the cluster IP address is obtained, the traffic is forwarded to pod2 according to the iptables or IPVS rules generated by kube-proxy.

For outbound traffic, a customized Ingress controller is deployed for service exposure.
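To illustrate the intra-cluster path described above, the following sketch traces what pod1 effectively does. The branch-specific domain name and port are made up for this example; the key point is that name resolution is answered by the CoreDNS instance running at the edge, and the connection to the returned cluster IP is then redirected to pod2 by the local kube-proxy rules.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Hypothetical branch-specific Service domain resolved by the edge CoreDNS.
	const svcDomain = "video-svc.customer-service.svc.branch01.local"

	// CoreDNS at the edge answers this query even when the cloud is unreachable.
	addrs, err := net.LookupHost(svcDomain)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println("cluster IP:", addrs[0])

	// Traffic to the cluster IP is redirected to the backing pod by the
	// iptables/IPVS rules maintained by the edge kube-proxy.
	resp, err := http.Get("http://" + svcDomain + ":8080/healthz")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```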

  1. Edge Service Access | Service Grouping and Traffic Distribution
Edge service access with Kubernetes cluster diagram
  1. Hierarchical Image Repositories

Networks at the edge can be of poor quality. With a single, unified image repository in the cloud, image pulls put a heavy burden on the network. A hierarchical repository architecture is therefore used to relieve this burden and speed up image pulls.

Active and standby repositories: The active image repository is connected to OSS. The standby repository is deployed in non-HA mode and uses local disks.

HA: The active repository uses Keepalived and VIP to provide service.

Independent deployment: An independent image repository is deployed at each branch. Edge nodes pull images from the nearest image repository.

HA: Image repositories at the edge are deployed in non-HA mode, but they and the repositories at the centers form an active/standby mechanism.

Centralized build: Images are built centrally using the CI capability of the management platform.

Unified push: Branch service images are pushed to the central and edge repositories in a unified manner.

Management platform architecture
  1. Federated Monitoring

A center + edge federated monitoring system is built based on Prometheus, AlertManager, and Grafana.

Federated monitoring architecture

Non-HA Prometheus is deployed in branches to build cluster federation with Prometheus on the cloud.

Data is aggregated to the HQ Prometheus cluster and is displayed using the Grafana panel in a unified manner.

Data is analyzed by Prometheus at the HQ. AlertManager is connected for centralized alarm reporting.

Prometheus at the edge can collect data based on customized metrics.
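For reference, the federation relies on Prometheus' standard /federate endpoint: the HQ Prometheus scrapes each branch instance and pulls only the series selected by match[] expressions. The sketch below imitates such a scrape by hand; the branch address and metric selectors are illustrative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed address of a branch (edge) Prometheus instance.
	edge := "http://prometheus.branch01.example.com:9090"

	// Select which series the HQ Prometheus pulls from the branch.
	q := url.Values{}
	q.Add("match[]", `{__name__=~"node_.*"}`)       // node metrics collected at the edge
	q.Add("match[]", `{job="edge-custom-metrics"}`) // hypothetical custom-metric job

	resp, err := http.Get(edge + "/federate?" + q.Encode())
	if err != nil {
		fmt.Println("federate scrape failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("pulled %d bytes of samples from the branch Prometheus\n", len(body))
}
```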

  1. Offline IP Address Retaining

Native CNI components generally depend on cloud capabilities (API Server/etcd), so IP addresses cannot be allocated while the edge is offline. IP address snapshots ensure that a pod retains its original IP address when it is restarted offline.

IP address allocation is controlled by CNI. How can a pod retain its original IP address when it is restarted?

An IP snapshot is added to the result returned by CNI. When the network is normal, the IPAM component is called to allocate IP addresses. When the edge is offline, the original IP addresses are restored from the local disk.

Diagram showing before and after result of adding IP snapshot

No intrusion into the overall CNI design; the approach can flexibly interoperate with other CNI implementations.

DaemonSets are used to distribute binaries.

Only simple adjustments to your existing CNI configuration file are required.
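The snapshot idea can be sketched roughly as follows. This is a simplified illustration rather than the actual plugin code: the ipamAllocate callback stands in for the real IPAM delegate call, and the snapshot directory is an assumed local path.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// ipResult mirrors the part of the CNI result that gets snapshotted.
type ipResult struct {
	IP      string `json:"ip"`
	Gateway string `json:"gateway"`
}

const snapshotDir = "/var/lib/cni/ip-snapshots" // assumed local path

// allocateWithSnapshot tries the normal IPAM path first; if the cloud-backed
// IPAM is unreachable (edge offline), it falls back to the last snapshot
// recorded for this pod so the restarted pod keeps its old IP address.
func allocateWithSnapshot(podKey string, ipamAllocate func() (*ipResult, error)) (*ipResult, error) {
	res, err := ipamAllocate()
	if err == nil {
		_ = saveSnapshot(podKey, res) // best effort: remember the result locally
		return res, nil
	}
	// Offline path: reuse the previously allocated address from the local disk.
	if snap, serr := loadSnapshot(podKey); serr == nil {
		return snap, nil
	}
	return nil, fmt.Errorf("ipam failed and no snapshot available: %w", err)
}

func saveSnapshot(podKey string, res *ipResult) error {
	if err := os.MkdirAll(snapshotDir, 0o700); err != nil {
		return err
	}
	data, err := json.Marshal(res)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(snapshotDir, podKey+".json"), data, 0o600)
}

func loadSnapshot(podKey string) (*ipResult, error) {
	data, err := os.ReadFile(filepath.Join(snapshotDir, podKey+".json"))
	if err != nil {
		return nil, err
	}
	var res ipResult
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, err
	}
	return &res, nil
}

func main() {
	// The real plugin would call the delegate IPAM here; this stub simulates
	// an offline failure so the snapshot fallback path is exercised.
	res, err := allocateWithSnapshot("default_pod1", func() (*ipResult, error) {
		return nil, fmt.Errorf("apiserver unreachable")
	})
	fmt.Println(res, err)
}
```

Because the fallback only reads what was written during the last successful allocation, a pod restarted while the edge is offline simply gets back the address it already held.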

  1. Edge Proxy

When a large set of cloud capabilities is deployed at the edge, an edge proxy improves offline autonomy for system components and enables node-level offline autonomy.

How can system components remain autonomous when they are disconnected from the cloud?

An edge proxy is added as an HTTP proxy for system component requests. It intercepts the requests and caches the responses locally. When the edge is offline, the cached data is used to serve the system components.

This mode does not address the relist/rewatch issue; for that, you can use the edge list-watch function of KubeEdge 1.6.

Diagram showing before and after using EdgeProxy
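Conceptually, the edge proxy behaves like a caching HTTP reverse proxy sitting between the system components and the cloud endpoint. The sketch below is a simplified illustration rather than the actual component: the upstream address and listen port are placeholders, and a production version would persist the cache to local disk so it survives restarts.

```go
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

var (
	mu    sync.RWMutex
	cache = map[string][]byte{} // request URI -> last good response body
)

func main() {
	// Assumed cloud-side endpoint that the system components normally talk to.
	upstream, _ := url.Parse("https://cloudcore.example.com:6443")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Cache successful GET responses on the way back to the client.
	proxy.ModifyResponse = func(resp *http.Response) error {
		if resp.Request.Method == http.MethodGet && resp.StatusCode == http.StatusOK {
			body, err := io.ReadAll(resp.Body)
			if err != nil {
				return err
			}
			resp.Body.Close()
			mu.Lock()
			cache[resp.Request.URL.RequestURI()] = body
			mu.Unlock()
			resp.Body = io.NopCloser(bytes.NewReader(body))
		}
		return nil
	}

	// When the cloud is unreachable, answer from the local cache instead of failing.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		mu.RLock()
		body, ok := cache[r.URL.RequestURI()]
		mu.RUnlock()
		if r.Method == http.MethodGet && ok {
			w.WriteHeader(http.StatusOK)
			w.Write(body)
			return
		}
		http.Error(w, "upstream unreachable: "+err.Error(), http.StatusBadGateway)
	}

	// System components point their server address at this local proxy.
	log.Fatal(http.ListenAndServe("127.0.0.1:10550", proxy))
}
```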


  1. Edge Fault Isolation

The two-level isolation architecture enables automatic monitoring and isolation, which covers key cluster components and scenarios such as node failures. The architecture reduces service downtime and ensures service stability.

Edge Fault Isolation architecture

Contribution to the KubeEdge Community

Improvements to KubeEdge:

  1. keadm debug tool suite
  1. CloudCore memory usage optimization
  1. Default node labels and taints
  1. Host resource reservation

Summary & Outlook

Outlook

At present, the architecture allows pods to be deployed at the edge. However, some services can run only on VMs due to historical reasons or specific requirements, and it is not economical to maintain a separate VM resource pool just for them. We are exploring edge VM management using Kubernetes, KubeEdge, and KubeVirt, an open-source VM management project based on Kubernetes.

Kubernetes + KubeEdge + KubeVirt

In the aforementioned architecture, each edge branch deploys several servers and runs its own services, databases, and middleware. They can be managed from the cloud using Operators. However, in a weak network environment, list-watch is not completely reliable. How should middleware services at the edge be managed? A Kubernetes + KubeEdge + middleware service solution currently looks like a promising direction, and we'd like to share our progress in later articles.

Developers are always welcome to join the KubeEdge community and build the cloud native edge computing ecosystem together.

More information about KubeEdge:

Documentation: https://docs.kubeedge.io/en/latest/