Alibaba Cloud turns to Cilium for improved network performance and scalability
Challenge
Alibaba Cloud, also known as Aliyun, is a subsidiary of the Alibaba Group, a leading global technology company headquartered in China. Alibaba Cloud is one of the world’s largest cloud computing services and solutions providers. They offer their customers a full suite of cloud infrastructure and services, including a managed Kubernetes service known as Container Service for Kubernetes (ACK).
Alibaba initially built the ACK network with Flannel, Calico, and their open source ENI-based network plugin, Terway. However, as their business scaled, they encountered networking issues including increased service and network latency.
Faced with these challenges and the need to support its growing customer base, Alibaba Cloud realized they needed to find a more scalable and high-performance solution for the ACK Kubernetes networking stack.
Solution
Alibaba Cloud turned to Cilium because it uses eBPF to provide faster networking and additional network policy functionalities. They recognized that chaining Cilium with Terway could solve their latency issues. Additionally, Cilium could provide network policy capabilities, eliminating the need for an additional CNI.
Impact
Integrating Cilium into the ACK network infrastructure has helped Alibaba Cloud resolve their service and network latency issues. According to the team, network performance improved by 32% compared to iptables mode and by 62% compared to IPVS mode.
By the numbers
Number one cloud service provider in APAC
89 availability zones
30 regions
Integrating Cilium For Better Network Scalability and Performance
Alibaba Cloud currently operates 30 regions and 89 availability zones worldwide, serving customers across the globe. Their managed Kubernetes service, ACK, runs tens of thousands of clusters, the largest of which have more than 10,000 nodes.
During the early days of building their managed Kubernetes service, Alibaba Cloud used Flannel for networking. However, as their platform grew, they recognized the need for more sophisticated solutions that could scale with their platform and customers.
To address their scaling challenges, they built their own CNI plugin, Terway. With Terway, however, pod network traffic still passed through the host network, which increased latency at the scale of their clusters. To tackle this latency issue, they adopted IPVLAN, which allows traffic to go directly from a pod port to their VPC. This created a new problem for Kubernetes Services: because container traffic no longer passed through the host, the rules created by kube-proxy could not handle Service traffic. They realized that they needed another solution to handle traffic to and from Services.
“We initially provided ACK with Flannel but modified it to use the VPC route, eliminating the overlay in our network. However, this setup had some disadvantages: the network was not all in the same layer and Flannel still provided a second CIDR in the cluster, which limited its effectiveness.
We decided to create our own CNI plugin, leading to the development of the open source Terway plugin. We found that network traffic from the pods still had to go through the host network, which would be a huge problem for our customers due to their large-scale clusters and number of services. There would also be an increased latency because IPVS and iptables are not powerful enough to handle such traffic.
To address this, we sought a solution that would bypass the host network for traffic routing. We built another data path called IPVLAN, allowing traffic to go directly from the pod port to the VPC. IPVLAN is a network virtualization feature in Linux where, if containers directly use IPVLAN sub-interfaces, the traffic within the containers can bypass the host network and directly reach the VPC, eliminating host network overhead. However, this approach introduced a new problem: container traffic no longer passes through the host, rendering the service functionality unusable. The service was provided on the host side by kube-proxy, but we needed it to be managed within the pod network for the namespace.”
BoKang Li, Senior Engineer, Alibaba Cloud
Alibaba realized that with traditional veth-based container networking models, there’s significant overhead when transitioning packets between namespaces. In the default iptables-based service mode, the growth of rules also incurs high costs. Therefore, they began to specifically search for solutions that could solve these two major issues.
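The rule-growth cost can be illustrated with a small toy model. The sketch below is not iptables or eBPF code; it only contrasts a sequential walk over a rule list with a single keyed lookup, which is the general difference between a rule chain and a hash-table-style lookup. The service addresses, the count of 5,000 services, and all function names are hypothetical.

```python
# Toy model only: not real iptables or eBPF code.
# It contrasts a sequential rule walk with a keyed (hash) lookup.
from typing import Optional


def build_services(n: int) -> list[tuple[str, int, str]]:
    """Return n hypothetical (cluster_ip, port, backend) service entries."""
    return [(f"10.0.{i // 256}.{i % 256}", 80, f"backend-{i}") for i in range(n)]


def linear_lookup(
    rules: list[tuple[str, int, str]], ip: str, port: int
) -> tuple[Optional[str], int]:
    """Walk the rules in order and count how many were compared."""
    comparisons = 0
    for rule_ip, rule_port, backend in rules:
        comparisons += 1
        if rule_ip == ip and rule_port == port:
            return backend, comparisons
    return None, comparisons


def hash_lookup(table: dict[tuple[str, int], str], ip: str, port: int) -> Optional[str]:
    """Resolve the backend with one keyed lookup, independent of table size."""
    return table.get((ip, port))


services = build_services(5000)
table = {(ip, port): backend for ip, port, backend in services}

# Worst case for the sequential walk: the service we want is the last rule.
target_ip, target_port, _ = services[-1]
backend, comparisons = linear_lookup(services, target_ip, target_port)

print(f"sequential walk: {comparisons} comparisons -> {backend}")
print(f"keyed lookup:    1 lookup -> {hash_lookup(table, target_ip, target_port)}")
```

In this model the sequential walk grows with the number of services, while the keyed lookup does not. Real data paths differ in many details, so treat the numbers as illustrative only.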
During their search, they discovered that Cilium, powered by eBPF, could implement Kubernetes Services in their setup and address both problems. They also liked that it provided network policy out of the box, which they had previously supported via Calico.
“We did some digging and found Cilium. We realized that eBPF and Cilium could handle the Kubernetes services function in our setup and found it to be a good solution for our use case. Further, in the kube-proxy model, whether using iptables or IPVS, the rules are namespace-based. Configuring a service for each container would consume a substantial amount of resources. With eBPF, eBPF maps can be shared across namespaces, presenting a solution. We only need to configure an eBPF program for each container, allowing programs in different namespaces to share the same eBPF map.
Finally, we also had network policies provided by Calico. With the new data path we built, we also needed network policy to be provided and Cilium had it out of the box for layer 3 to layer 7. With these discoveries, we made some modifications and integrated Cilium as our second CNI plugin. Cilium provided everything that we need.”
BoKang Li, Senior Engineer, Alibaba Cloud
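The resource argument in the quote above can be sketched as a rough model: if every container namespace needs its own copy of the service rules, state grows with pods times services, whereas a shared map holds one entry per service. The sketch below uses plain Python objects as stand-ins for eBPF maps and programs. The class names, pod count, and service count are hypothetical and do not reflect Cilium or Terway internals.

```python
# Toy model only: plain Python objects standing in for eBPF maps and programs.


class ServiceMap:
    """Stand-in for a shared map: one service table for the whole node."""

    def __init__(self) -> None:
        self.entries: dict[str, list[str]] = {}

    def upsert(self, vip: str, backends: list[str]) -> None:
        self.entries[vip] = list(backends)


class PodProgram:
    """Stand-in for a per-container program; it holds a reference, not a copy."""

    def __init__(self, pod_name: str, service_map: ServiceMap) -> None:
        self.pod_name = pod_name
        self.service_map = service_map

    def resolve(self, vip: str):
        backends = self.service_map.entries.get(vip)
        return backends[0] if backends else None


NUM_PODS = 100        # hypothetical pods on one node
NUM_SERVICES = 1000   # hypothetical services in the cluster

shared_map = ServiceMap()
for i in range(NUM_SERVICES):
    shared_map.upsert(f"svc-{i}", [f"backend-{i}"])

# One small program per pod, all pointing at the same map.
programs = [PodProgram(f"pod-{p}", shared_map) for p in range(NUM_PODS)]

entries_if_copied_per_namespace = NUM_PODS * NUM_SERVICES
entries_with_shared_map = len(shared_map.entries)

# A single update to the shared map is visible to every program.
shared_map.upsert("svc-0", ["backend-new"])
all_see_update = all(p.resolve("svc-0") == "backend-new" for p in programs)

print(entries_if_copied_per_namespace, entries_with_shared_map, all_see_update)
```

Under these assumed numbers, per-namespace copies would need 100,000 entries, while the shared map needs 1,000 and is updated in one place.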
After discovering Cilium, Alibaba Cloud decided to integrate it into their open source CNI, Terway, via the CNI chaining mode. Terway handles IPAM and network configuration, while Cilium is responsible for loading the eBPF programs and providing Service, NetworkPolicy, and quality of service features.
“Since Cilium already supports service functionality, we adapted it by chaining it with our CNI, Terway. Terway CNI handles IPAM and network card configuration, while Cilium CNI is responsible for loading the eBPF programs and providing Service, NetworkPolicy, and QoS features. In our solution, Cilium also attaches the eBPF program to the network card within the container namespace.”
BoKang Li, Senior Engineer, Alibaba Cloud
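As a rough illustration of CNI chaining, the sketch below builds a CNI network configuration list in which a Terway plugin entry is followed by a Cilium plugin entry. The `cniVersion`, `name`, and `plugins` keys come from the CNI specification, which runs chained plugins in list order when a pod is added and in reverse order when it is deleted. The plugin type strings, the network name, and the version value are assumptions for illustration and may not match the configuration ACK actually ships.

```python
# Illustrative sketch of a chained CNI configuration list.
# Plugin type names, network name, and version are assumptions, not ACK's real config.
import json


def build_chained_conflist(
    name: str, plugin_types: list[str], cni_version: str = "0.4.0"
) -> dict:
    """Build a CNI conflist whose plugins are invoked in the order given."""
    return {
        "cniVersion": cni_version,
        "name": name,
        "plugins": [{"type": plugin_type} for plugin_type in plugin_types],
    }


def add_order(conflist: dict) -> list[str]:
    """Order in which plugins run when a pod is added (list order)."""
    return [plugin["type"] for plugin in conflist["plugins"]]


def del_order(conflist: dict) -> list[str]:
    """Order in which plugins run when a pod is deleted (reverse list order)."""
    return list(reversed(add_order(conflist)))


# First plugin: IPAM and network card setup. Second plugin: eBPF programs,
# Service, NetworkPolicy, and QoS. Both type names are hypothetical here.
conflist = build_chained_conflist("terway-chainer", ["terway", "cilium-cni"])

print(json.dumps(conflist, indent=2))
print("ADD order:", add_order(conflist))
print("DEL order:", del_order(conflist))
```

Placing the IPAM plugin first matters in this model: the second plugin can only attach programs to a network interface that the first plugin has already created and addressed.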
With Cilium integrated into their network, Alibaba Cloud has been able to solve their scalability issues and now enjoys lower network latency for ACK customers.
“After Cilium and eBPF simplified our network, the performance improved significantly by 32% compared to when iptables was used and 62% compared to IPVS mode.”
BoKang Li, Senior Engineer, Alibaba Cloud
Solving Customer Network Performance Problems and Future Plans
Integrating Cilium into the network infrastructure of their managed Kubernetes service ACK has been a success for Alibaba Cloud. They’ve been able to solve their high latency issues, improve network security, and provide observability to their customers.
“With Cilium, I think our customers compare us differently with other managed Kubernetes provider now because our performance is much better. Latency is the biggest concern for our customers and Cilium gives us a good advantage there. The performance improvement is one of the biggest advantages that we’ve gotten from Cilium and it is great for our business. Cilium also provides many additional functionalities, like Hubble which gives our customers observability.”
BoKang Li, Senior Engineer, Alibaba Cloud
In the future, the Alibaba Cloud team plans to continue improving their solution and focus on evolving the ACK data plane.
“The IPVLAN + eBPF mode delivers notable performance improvements, but it differs from conventional solutions in data path handling. Recently, we introduced the DatapathV2 on top of this foundation, leveraging eBPF redirects (referred to as HostRouting in Cilium). This new approach achieves similar performance to traditional solutions while providing a comparable network path, enhancing compatibility. In the future, we will continue to focus on the evolution of the data plane.”
BoKang Li, Senior Engineer, Alibaba Cloud
To get more in-depth information about their solution, check out this blog post on Cilium.io.