How Datadog uses Cilium & eBPF to power their data plane
Datadog runs tens of clusters with 10,000+ nodes and 100,000+ pods across multiple cloud providers. They reached a point where traditional networking tools were no longer up to the job: iptables, IPVS, and cloud-specific CNIs were not designed for the needs of Datadog’s data plane.
Datadog turned to Cilium as their CNI and kube-proxy replacement to take advantage of the power of eBPF and provide a consistent networking experience for their users across any cloud. They also implemented Cilium Network Policies to meet certifications and run in regulated environments.
With Cilium, Datadog is now able to scale up to 10,000,000,000,000+ data points per day across more than 18,500 customers. They are able to run their network at scale and keep their customers’ data secure.
By the numbers
10,000,000,000,000+ data points every day
18,500+ customers using SaaS product
10,000+ nodes and 100,000+ pods across multiple cloud providers
Problem: Limited scalability with iptables, IPVS, and kube-proxy
Datadog runs 10s of Kubernetes clusters, across multiple cloud providers. Some of them have up to 4,000 nodes, which makes for 10,000s of hosts in their infrastructure. Datadog’s SaaS product serves over 18,500 customers with millions of hosts reporting in, resulting in 10s of trillions of data points per day. Such a large and growing infrastructure comes with challenges.
For example, enabling cross-cluster communication in some environments isn’t simple, and ensuring that it is secure and encrypted complicates the process even further. There was also a need for routable pod IPs, which improves performance and enables direct cross-cluster communication often needed to reduce the management burden of data stores, such as Kafka or Cassandra. With many clusters and IPs routed between them, extra care must be taken to properly manage the IP address spaces and cross-cluster service discovery.
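One way to picture the address-space concern above: when pod IPs are routable between clusters, each cluster's pod range must not collide with any other. The following is a minimal sketch using Python's standard `ipaddress` module. The cluster names, the CIDR ranges, and the `find_overlaps` helper are hypothetical and chosen only for illustration; they are not Datadog's actual tooling or address plan.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cluster_cidrs):
    """Return pairs of cluster names whose pod CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cluster_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical cluster names and ranges, for illustration only.
clusters = {
    "cluster-a": "10.0.0.0/14",
    "cluster-b": "10.4.0.0/14",
    "cluster-c": "10.2.0.0/16",  # falls inside cluster-a's range
}

print(find_overlaps(clusters))
```

A check like this could run before a new cluster is provisioned, so that an overlapping range is caught before any routes are advertised.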
Datadog’s initial approach was to use cloud-specific solutions, like Lyft’s CNI plugin for AWS. However, that meant finding additional solutions for other cloud providers. Maintaining very different approaches and technologies across multiple clouds (many without operators to automate tasks) and managing the differences between providers and solutions was time-intensive. These solutions also could not provide streamlined network policies, nor did they meet the requirement for traffic encryption.
Another challenge was service load balancing. The usual approach is to use iptables, but as the environment grows, the rule count grows with it, which makes this difficult to scale. Rule updates take longer, sometimes more than 10 seconds, and matching a packet means walking through the rules in order until one applies (“matching time”). In the words of Laurent Bernaille, Staff Engineer at Datadog:
“When you scale, and you have a large number of services and endpoints, iptables becomes challenging. We can use iptables for load balancing, but it was not designed for it.” - Laurent Bernaille, Staff Engineer, Datadog
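To make the scaling concern concrete, here is a simplified model that contrasts an ordered rule list, where a lookup walks the rules until one matches, with a keyed table, where a lookup is a single hash access. It is a sketch under stated assumptions: the service addresses, the backend names, and the `linear_match` and `hash_match` helpers are hypothetical, and the code does not reproduce how kube-proxy or Cilium are actually implemented.

```python
def linear_match(rules, dst):
    """Walk an ordered rule list until one matches.

    Returns (backend, rules_checked) so the cost of the walk is visible.
    """
    for checked, (service_addr, backend) in enumerate(rules, start=1):
        if service_addr == dst:
            return backend, checked
    return None, len(rules)


def hash_match(table, dst):
    """Single keyed lookup; the cost does not grow with the number of services."""
    return table.get(dst)


# Hypothetical services: 5,000 (ip, port) keys, each mapped to one backend.
rules = [((f"10.96.{i // 256}.{i % 256}", 80), f"backend-{i}") for i in range(5000)]
table = dict(rules)

# Worst case for the ordered list: the service defined last.
dst = rules[-1][0]
backend, checked = linear_match(rules, dst)
print(backend, "found after checking", checked, "rules")
print(hash_match(table, dst), "found with one keyed lookup")
```

In this model, the linear walk gets more expensive with every service added, while the keyed lookup stays flat, which is the property that makes a map-based data plane attractive at this scale.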
As an alternative, they tried IPVS, which was more powerful but brought its own set of challenges, especially because IPVS was not designed for client-side load balancing. Connection tracking had to be done twice, once for IPVS and once for netfilter, and IPVS lacked feature parity with iptables.
The pain of using these solutions grew even more intense because they were difficult to debug, and it could take a lot of time between finding issues and getting the solution into production. As a result, the team asked itself: “What if we could dynamically program these features?”
Solution: Cilium and eBPF powering the Kubernetes networking data plane
Datadog went to KubeCon + CloudNativeCon and spoke with other teams in the hallway track to find out what solutions they were considering for their scaling problems. Many of them were considering Cilium. When Datadog spoke with the Cilium development team, all the features that Datadog found interesting were already in the works or almost done. Datadog started using some of the features in early beta, including putting v1.6 into production, and gave feedback to the Cilium team.
“Of course we saw some small issues and edge cases, but our interaction with upstream has always been very good and we have engineers who hadn’t contributed before that were quickly able to become contributors.” - Laurent Bernaille, Staff Engineer, Datadog
As part of the new architecture, Cilium was used to replace kube-proxy, addressing the growing pains with iptables and IPVS. Leveraging eBPF’s efficiency also allowed Datadog to enforce network policies, which increased their security and was required for a certification and in regulated environments. Cilium has now become the default CNI for Kubernetes at Datadog and runs on almost every node of their SaaS offering. This allows them to abstract cloud providers and have a consistent data plane network.
“As the universal default Kubernetes CNI, Cilium can be used on multiple clouds, allowing network policies to be enforced across clouds where they are needed.” - Laurent Bernaille, Staff Engineer, Datadog
Beyond the SaaS offering, Datadog also has an integration with Cilium where their customers can send Cilium’s metrics and logs into Datadog. The team is also always looking for ways to improve both their infrastructure and Cilium. They are currently working on a few improvements around supporting multiple CIDR ranges in Cilium and are testing Bandwidth Manager to guarantee better throughput.
“eBPF and Cilium helped us to push the boundaries both within operations and also with product development. To do things safer, faster and more easily than what we could have with traditional techniques.” - Laurent Bernaille, Staff Engineer, Datadog
Datadog has started using Envoy for more use cases and is very interested in the sidecarless approach developed in the context of Cilium Service Mesh. They are also looking outside the cluster and considering using eBPF to create a smarter and more efficient network edge in the future. Using Cilium L4LB XDP to create their own load balancers rather than relying upon a cloud provider would allow them to provide a consistent experience to end users.
“The overlay features in a single product, compatibility with multiple cloud providers, and ability to just run it. These three things are what made Cilium an obvious choice for us.” - Laurent Bernaille, Staff Engineer, Datadog
Hear more about Datadog’s journey in their videos from eBPF Summit and KubeCon + CloudNativeCon Europe 2022.