Building a Performant Networking Infrastructure with Cilium
Trendyol is a leading e-commerce platform headquartered in Turkey. They provide an extensive selection of products to their customers spanning categories such as fashion, technology, and home furnishings. With a customer base of 20 million, Trendyol dispatches over 1.5 million items daily to 27 distinct countries.
Trendyol initially built their Kubernetes platform with Flannel and Calico for networking. However, as their infrastructure grew, they encountered challenges in scalability and performance. Anticipating further growth in both the number and the size of their Kubernetes clusters, they sought an alternative CNI solution. Their goals were to improve the network performance of their clusters, efficiently manage 3,000 to 5,000 nodes within a single cluster, and eliminate kube-proxy to enhance overall cluster performance.
After performance testing different solutions, Trendyol turned to Cilium as their preferred networking, observability, and security solution to take advantage of its performance, observability capabilities, enhanced security, policy management, and mesh technologies.
With Cilium, they now have improved performance, effective observability with Hubble, and enhanced scalability and security for their Kubernetes clusters.
Cilium is now the default CNI for Trendyol’s Kubernetes clusters running version 1.26 or later. It has significantly improved the performance of their network infrastructure, exceeding their expectations and outperforming their previous (and any other) CNI. Migrating their main clusters from Flannel to Cilium’s advanced networking has increased their network performance by over 40%, according to their internal benchmarks.
By the numbers
196k pods across multiple clusters + 20k nodes
1.16 PB of data
Improved performance of clusters by up to 40%
Building a Performant Networking Infrastructure with Cilium
Trendyol has engineers spread across Turkey, with a specialized seven-person team focused on deploying and managing their Kubernetes platform. The Kubernetes platform team runs over 450 clusters across a few bare metal data centers, running either OpenStack or VMware as a virtualization layer. Those clusters encompass a storage capacity of 1.16 PB hosting 196,000 pods, 8,000 microservices, 106 services, and 7,100 Kubernetes VMs distributed across nine regions. To provide a PaaS for their developers, they use Ansible, ArgoCD, and Terraform to automate cluster creation.
Trendyol’s team originally relied on a simple Container Network Interface (CNI) known as Flannel for networking. However, as their Kubernetes clusters grew in size and complexity, they began to run into issues including difficulty managing and scaling their large clusters. With plans to add even more clusters in the future and scale the clusters up to 5,000 nodes, they recognized the need to explore alternatives and upgrade their network infrastructure.
In their search for a solution, Trendyol had clear criteria: a CNI that would let them easily manage thousands of nodes in one cluster, deliver better performance for their Kubernetes cluster connectivity, support network policy, and use eBPF to get rid of iptables and IPVS. After researching, they narrowed their choices down to Calico and Cilium. To make an informed decision, they conducted performance tests comparing the two. For their specific needs, Cilium outperformed Calico.
Emin AKTAŞ, Platform Engineer, Trendyol
“We evaluated other CNIs, like Calico, but Cilium stood out primarily due to its performance. When we started performance testing Cilium, we found that its performance remained steady, even as cluster traffic surged. As our scale increased, Cilium didn’t start degrading while other CNIs did.”
After choosing Cilium, Trendyol switched the CNI from Flannel to Cilium for all newly provisioned clusters. They install it with Helm, which also makes it easy for them to upgrade their clusters.
Leveraging Cilium’s features such as NodeLocal DNSCache, an eBPF dataplane, and the Local Redirect Policy allowed them to improve the performance of their clusters by 40%, but that was just the beginning of the benefits Cilium brought to Trendyol.
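As a rough illustration of how NodeLocal DNSCache and the Local Redirect Policy fit together, the sketch below builds a CiliumLocalRedirectPolicy manifest that redirects traffic addressed to the kube-dns service to a DNS cache pod on the same node. It is modeled on the example in Cilium's documentation, not on Trendyol's actual configuration; the policy name, the k8s-app: node-local-dns label, and the port names are assumptions that must match how the cache is deployed.

```python
import json


def build_local_redirect_policy(name="nodelocaldns", namespace="kube-system"):
    """Return a CiliumLocalRedirectPolicy manifest as a plain dict.

    Assumptions (not from the case study): the DNS service is called
    "kube-dns" in "kube-system", and the node-local cache pods carry
    the label k8s-app=node-local-dns.
    """
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumLocalRedirectPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Traffic sent to this service's cluster IP is intercepted...
            "redirectFrontend": {
                "serviceMatcher": {
                    "serviceName": "kube-dns",
                    "namespace": "kube-system",
                }
            },
            # ...and delivered to a matching pod on the same node instead.
            "redirectBackend": {
                "localEndpointSelector": {
                    "matchLabels": {"k8s-app": "node-local-dns"}
                },
                "toPorts": [
                    {"port": "53", "name": "dns", "protocol": "UDP"},
                    {"port": "53", "name": "dns-tcp", "protocol": "TCP"},
                ],
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_local_redirect_policy(), indent=2))
```

The printed JSON could be piped to kubectl apply -f - once the Local Redirect Policy feature is enabled in the Cilium installation. The Helm value that enables it has changed between Cilium releases, so it is worth checking the documentation for the installed version.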
Achieving Better DNS and Network Visibility with Hubble
At the scale Trendyol runs, observability is crucial to be able to keep the infrastructure running smoothly. To easily visualize what was happening in their clusters, Trendyol enabled Hubble, Cilium’s observability layer which provides deep visibility into network traffic.
Enabling Hubble in their Kubernetes clusters allowed Trendyol to see their network’s performance. With its comprehensive monitoring capabilities, Hubble has played an important role in empowering the team to get deeper insights on what goes on in their network, quickly investigate issues, pinpoint problems, and undertake timely resolutions.
“With Hubble, if we need to debug something or investigate the source and destination of certain traffic that comes into our clusters we can quickly see that. It has enabled members of our team to easily monitor the network connectivity within our clusters,” said Emin AKTAŞ, Platform Engineer, Trendyol.
Besides just the Hubble UI, they also export the data to Grafana to easily visualize what is happening in their clusters. But Cilium’s observability doesn’t just stop at Hubble.
“DNS is a pain in Kubernetes and Cilium makes it easy. Because Cilium has an integrated DNS proxy, tracking a DNS request is straightforward. This gives us clear insight into our network, letting us see the exact number of queries executed within our nodes, pods, and clusters.”
Emin AKTAŞ, Platform Engineer, Trendyol
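The per-pod query counts described in the quote can be approximated offline from exported flow data. The sketch below tallies DNS requests per source pod from JSON lines. The sample records are invented, and the field layout (flow.source.namespace, flow.source.pod_name, flow.l7.dns.query) is an assumption modeled on Hubble's JSON flow output that may differ between versions.

```python
import json
from collections import Counter

# Hypothetical sample records; real exports carry many more fields.
SAMPLE_FLOWS = """\
{"flow": {"source": {"namespace": "shop", "pod_name": "cart-1"}, "l7": {"type": "REQUEST", "dns": {"query": "db.shop.svc.cluster.local."}}}}
{"flow": {"source": {"namespace": "shop", "pod_name": "cart-1"}, "l7": {"type": "REQUEST", "dns": {"query": "cache.shop.svc.cluster.local."}}}}
{"flow": {"source": {"namespace": "shop", "pod_name": "search-2"}, "l7": {"type": "REQUEST", "dns": {"query": "db.shop.svc.cluster.local."}}}}
{"flow": {"source": {"namespace": "shop", "pod_name": "search-2"}, "l4": {"TCP": {"destination_port": 443}}}}
"""


def count_dns_queries(lines):
    """Count DNS request flows per 'namespace/pod' from JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line).get("flow", {})
        l7 = flow.get("l7") or {}
        # Skip flows without DNS data, and skip responses so that each
        # query is counted once.
        if not l7.get("dns") or l7.get("type") != "REQUEST":
            continue
        src = flow.get("source", {})
        pod = f'{src.get("namespace", "?")}/{src.get("pod_name", "?")}'
        counts[pod] += 1
    return counts


if __name__ == "__main__":
    totals = count_dns_queries(SAMPLE_FLOWS.splitlines())
    for pod, n in totals.most_common():
        print(pod, n)
```

In a live cluster, the same function could read line by line from the JSON output of the Hubble CLI rather than from the sample string.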
Increasing Network Scalability, Performance, and Observability With Cilium
Enabling their new Kubernetes clusters to use Cilium has proven to be a major success for the Trendyol team. They’ve been able to address the network scalability issues in their Kubernetes clusters, meet their network performance targets, expand their capacity to include additional clusters, scale the clusters up to 5,000 nodes, and enable deep visibility and monitoring for their clusters with Hubble.
“We are still discovering what Cilium can do. It comes with all the tools you need and it seems like its features don’t end. Beyond just performance, it also provides observability tools so you don’t have to deploy any third-party applications to enable visibility within your Kubernetes clusters. Cilium is like a Swiss army knife,” said AKTAŞ.
Since Cilium has addressed many of their scalability and performance concerns, Trendyol has further plans for leveraging Cilium’s feature set. They are looking at kube-proxy replacement to further improve cluster scalability. Longer term, they are also considering Cilium’s service mesh and cluster mesh features as a replacement for HashiCorp Consul.
“Cilium has leveled up Trendyol’s Kubernetes clusters. With its advanced capabilities, Cilium enabled us to solve our networking issues and outperformed all other CNIs. We have successfully increased our scalability, performance, security and visibility with Cilium.”
Emin AKTAŞ, Platform Engineer, Trendyol