Project post from Ruan Zhaoyin of KubeEdge

This article describes how e-Cloud uses KubeEdge to manage CDN edge nodes, automatically deploy and upgrade CDN edge services, and implement edge service disaster recovery (DR) when it migrates its CDN services to the cloud.

This article includes the following four parts:

Project Background

China Telecom e-Cloud CDN Services

China Telecom accelerates cloud-network synergy with the 2+4+31+X resource deployment. X is the access layer close to users. Content and storage services are deployed at this layer, allowing users to obtain desired content in a low latency. e-Cloud is a latecomer in CDN, but their CDN is developing rapidly. e-Cloud provides all basic CDN functions and abundant resources, supports precise scheduling, and delivers high-quality services.

Diagram shows CDN functions

Background

Unlike other cloud vendors and traditional CDN vendors, e-Cloud CDN is developed cloud native. They built a CDN PaaS platform running on containers and Kubernetes, but have not completed the cloud native reconstruction of CDN edge services.

Diagram shows China telecom CDN platform

Problems they once faced:

KubeEdge-based Edge Node Management

CDN Node Architecture

CDN provides cache acceleration. To achieve nearby access and quick response, most CDN nodes are deployed near users, so requests can be scheduled to the nearby nodes by the CDN global traffic scheduling system. Most CDN nodes are discretely distributed in regional IDCs. Multiple CDN service clusters are set up in each edge equipment room based on the egress bandwidth and server resources.

CDN Node Architecture

Containerization Technology Selection

In the process of containerization, they considered the following technologies in the early stage:

Standard Kubernetes: Edge nodes were added to the cluster as standard worker nodes and managed by master nodes. However, too many connections caused relist issues and heavy load on the Kubernetes master nodes. Pods were evicted due to network fluctuation, resulting in unnecessary rebuild.

Access by node: Kubernetes or K3s was deployed in clusters. But there would be too many control planes and clusters, and a unified scheduling platform could not be constructed. If each KPI cluster was deployed in HA mode, at least three servers were required, which occupied excessive machine resources.

Cloud-edge access: Edge nodes were connected to Kubernetes clusters using KubeEdge. Fewer edge node connections were generated and cloud-edge synergy, edge autonomy, and native Kubernetes capabilities could be realized.

Solution Design

CDN node optimized architecture

The preceding figure shows the optimized architecture.

Several Kubernetes clusters were created in each regional center and data center to avoid single-point access and heavy load on a single Kubernetes cluster. Edge nodes were connected to the regional cluster nearest to them. The earlier 1.3 version only provided the single-group, multiple-node HA solution. It could not satisfy large-scale management. So they adopted multi-group deployment.

This mode worked in the early stage. However, when the number of edge nodes and deployed clusters increased, the following problem arose:

Unbalanced Connections in CloudCore Multi-Replica Deployment

Diagram shows connection process from the hub to upstream and then to the API server

The preceding figure shows the connection process from the hub to upstream and then to the API server. The upstream module distributed messages using single threads. As a result, messages were submitted slowly and certain edge nodes failed to submit messages to the API server in time, causing deployment exceptions.

To solve this problem, e-Cloud deployed CloudCore in multi-replica mode, but connections were unbalanced during the upgrade or unexpected restart of CloudCore. e-Cloud then added layer-4 load balancing to the multiple copies, and configured load balancing policies such as listconnection. However, layer-4 load balancing had cache filtering mechanisms and did not ensure even distribution of connections.

After optimization, they used the following solution:

CloudCore Multi-Replica Balancing Solution

  1. After a CloudCore instance is started, it reports information such as the number of real-time connections through ConfigMaps.
  2. It calculates the expected number of connections of each node based on the number of connections of itself and other instances.
  3. It calculates the difference between the number of real-time and expected connections. If the difference is greater than the maximum tolerable difference, the instance starts to release the connections and starts a 30s observation period.
  4. After the observation period, the instance starts a new detection period. It will stop this process when the connections are balanced.
Diagram shows CloudCore Multi-Replica balancing solutions

Changes in the number of connections

Screenshot shows load balancing result

Load balancing after the restart

Edge Service Deployment

CDN Acceleration Process

CDN consists of two core systems: scheduling and cache. The scheduling system collects the status of CDN links, nodes, and node bandwidth usage on the entire network in real time, calculates the optimal scheduling path, and pushes the path data to the local DNS, 302 redirect, or HTTP dynamic streaming (HDS).

The local DNS server then resolves the data and sends the result to the client, so the client can access the nearest edge cluster. The edge cluster checks whether it has cached the requested content. If no cache hits, the edge cluster checks whether its upper two or three layers have cached the content. If not, the edge cluster retrieves the content from the cloud site.

Diagram flow shows CDN acceleration process using Local DNS

In the cache system, the services used by products, such as live streaming and static content acceleration, are different. This requires more costs for development and maintenance. The convergence of different acceleration services may be a trend.

Features of the CDN cache service:

1. Exclusive storage and bandwidth resources

2. Large-scale coverage: The cache service of the software or a machine may support even 100,000 domain names.

3. Tolerance of DR faults by region: The cache of a small number of nodes can be lost or expire. Too much cache content will cause breakdown. When so, the content will be back-cached to the upper layers. As a result, the access slows down and services become abnormal.

4. High availability: Load balancing provides real-time detection and traffic switching/diversion. Layer 4 load balancing ensures traffic balancing between hosts in a group. Layer 7 load balancing ensures that only one copy of each URL is stored in a group through consistent hashing.

The following problems are worthy of attention during CDN deployment:

The upgrade deployment solution includes:

Diagram flow shows CDN acceleration process using Local DNS

KubeEdge-based CDN Edge Container DR and Migration

Migration procedure

  1. Back up etcd and restore it in the new cluster.
  2. Switch to the DNS.
  3. Restart the CloudCore and disconnect the cloud and edge hubs.
Diagram shows Smart DNS chart

Advantages

CDN Large-Scale File Distribution

Scenarios

Diagram shows CDN large-scale file distribution scenario

Architecture Evolution Directions

Edge Computing Challenges

Basic Capabilities of CDN-based Edge Computing Platform

CDN Edge Computing Evolution

Edge Infrastructure Construction

Opportunities

More information for KubeEdge: