Guest post originally published on the Clockwork blog

TL;DR:

In the previous post of this series, we discussed how cloud providers’ VM placement algorithms end up placing multiple VMs from the same cluster on the same physical machine, and how the resulting VM colocation can impair network bandwidth.

In this second blog post, we investigate the impact of VM colocation on additional key network performance metrics, namely network latency and network packet drops.  

Network Latency

Using Clockwork’s highly accurate clock synchronization, we can identify packets that do not incur any queueing delay on their path from the sender VM to the receiver VM, and measure the intrinsic networking latency. We use the raw measurements to compute two-way delays, defined as the sum of the one-way delays from source to destination and from destination back to source, while ensuring that neither of the two terms is affected by queueing delays. This measurement is different from roundtrip time (ping time), which additionally includes turnaround processing time at the destination VM. 
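
To make this definition concrete, here is a minimal sketch in Python of how a queueing-free two-way delay could be computed, assuming both VMs’ clocks are already synchronized so that one-way delays are simple timestamp differences; the variable names and values are hypothetical, and taking the minimum over many probes is one simple way to filter out samples affected by queueing.

```python
# Minimal sketch: estimating the intrinsic two-way delay from synchronized
# timestamps. Assumes sender and receiver clocks are already synchronized
# (e.g., by Clockwork), so one-way delays are simple timestamp differences.
# All names and values are hypothetical.

def one_way_delays(send_ts, recv_ts):
    """One-way delay per probe: receive time minus send time (seconds)."""
    return [r - s for s, r in zip(send_ts, recv_ts)]

def two_way_delay(fwd_send, fwd_recv, rev_send, rev_recv):
    """Minimum forward one-way delay plus minimum reverse one-way delay.
    Taking minima discards probes that sat in a queue, approximating the
    queueing-free (intrinsic) latency in each direction."""
    return min(one_way_delays(fwd_send, fwd_recv)) + \
           min(one_way_delays(rev_send, rev_recv))

# Forward probes take 35-80 us, reverse probes 36-90 us in this example.
fwd_send = [0.000000, 0.001000, 0.002000]
fwd_recv = [0.000080, 0.001035, 0.002061]
rev_send = [0.010000, 0.011000, 0.012000]
rev_recv = [0.010090, 0.011036, 0.012055]

twd = two_way_delay(fwd_send, fwd_recv, rev_send, rev_recv)
print(f"two-way delay: {twd * 1e6:.0f} us")  # 71 us (35 us + 36 us)
```

Note that, unlike a ping, no turnaround time at the destination enters this sum, and the forward and reverse minima may even come from different probes.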

[Animation: receiver and sender packet pattern]

Since colocated VMs can communicate without using physical network links, it is reasonable to assume that their network latency is lower than in the non-colocated case. Let’s see if this assumption holds in reality. To this end, we measure the minimum two-way delay for small UDP packets while the network is at 50% load. 
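
The exact probe format Clockwork uses is not public; the following sketch, with a hypothetical port number and packet layout, shows the general shape of such a measurement: small timestamped UDP probes sent at a fixed rate, with the receiver recording one-way delay samples against its own synchronized clock.

```python
# Sketch of a UDP probing pair for sampling one-way delays. The port number
# and packet layout are hypothetical; time.time() stands in for a properly
# synchronized clock source.
import socket
import struct
import time

PROBE_PORT = 5005   # assumption: any free UDP port
PROBE_SIZE = 32     # "small" probe: 12-byte header plus zero padding

def send_probes(dst_ip, count=1000, interval=0.001):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(count):
        # 8-byte send timestamp + 4-byte sequence number, zero-padded.
        pkt = struct.pack("!dI", time.time(), seq).ljust(PROBE_SIZE, b"\0")
        sock.sendto(pkt, (dst_ip, PROBE_PORT))
        time.sleep(interval)

def recv_probes(count=1000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PROBE_PORT))
    samples = []
    for _ in range(count):
        data, _ = sock.recvfrom(PROBE_SIZE)
        t_recv = time.time()  # assumes this clock is synchronized with the sender's
        t_send, seq = struct.unpack("!dI", data[:12])
        samples.append((seq, t_recv - t_send))  # one-way delay sample
    return samples
```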

Amazon Web Services (EKS) 

Our data demonstrates that in Amazon Web Services, the latency does not significantly differ between the colocated case and the non-colocated case.

[Histogram: two-way delay (EKS/eu-west-2/m4.xlarge/50% load), colocated vs. non-colocated]
Data from 188 EKS clusters of 50 m4.xlarge instances each in eu-west-2 clearly show that two-way delays follow the same distribution regardless of colocation.
[Histogram: two-way delay (EKS/us-east-1/m4.xlarge/50% load), colocated vs. non-colocated]
The same observation holds across 178 EKS clusters in us-east-1. Interestingly, there are two distinct modes near 70μs and 210μs, which may be explained by two different generations of networking software or hardware being used in different parts of the region.

This conclusion holds for all other EKS regions and instance types we investigated: the distribution of two-way delays is unaffected by colocation. AWS’s virtual network implementation masks any potential latency benefit, yielding the same latency whether or not the VMs share a physical host.

Microsoft Azure (AKS) 

In Azure, contrary to expectations, the communication latency between colocated VMs turns out to be larger than between non-colocated VMs. Consider the following histogram:

[Histogram: two-way delay (AKS/southeastasia/Standard_D4s_v3/50% load), colocated vs. non-colocated]
Across 70 clusters in the southeastasia region on Microsoft Azure (AKS), each with 50 Standard_D4s_v3 VMs, two-way delays are consistently larger if the source and destination VMs are colocated. 

We observe the same counterintuitive behavior in all regions and all instance types that we investigated. The explanation lies in Azure’s accelerated networking, which is turned on by default for AKS. Packets destined for a VM on the same physical host take longer because they do not benefit from hardware acceleration. Instead, they are handled as exception packets in the software-based programmable virtual switch (VFP) on the physical host [Link]. 
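
One way to see whether a given Azure Linux VM is on this accelerated path at all is to look for the SR-IOV virtual function (on Azure, typically a Mellanox NIC) that accelerated networking exposes alongside the synthetic interface. The sketch below is a heuristic based on that observation, not an official API:

```python
# Heuristic sketch: detect whether an Azure Linux VM has an SR-IOV virtual
# function, which accelerated networking exposes next to the synthetic NIC.
# On Azure this VF is typically bound to a Mellanox driver (mlx4/mlx5).
# This is an assumption-based check, not an official Azure API.
import os

def has_sriov_vf():
    for nic in os.listdir("/sys/class/net"):
        driver_link = os.path.join("/sys/class/net", nic, "device", "driver")
        if os.path.exists(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
            if driver.startswith("mlx"):  # e.g., mlx4_en, mlx5_core
                return True
    return False

print("accelerated networking VF present:", has_sriov_vf())
```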

Google Cloud Platform (GKE)

In Google Cloud Platform, the latency impact of colocation depends on a combination of region and VM type.

In some cases, links between colocated VMs have higher latency than links between non-colocated VMs. This pattern closely resembles the latency behavior in Microsoft Azure, and may be caused by a similar hardware acceleration path that does not apply to colocated VMs.

[Histogram: two-way delay (GKE/europe-west2-b/n1-standard-4/50% load), colocated vs. non-colocated]
For n1-standard-4 instances in GCP’s europe-west2-b region, links between colocated VMs often have higher latency than links between non-colocated VMs.
[Histogram: two-way delay (GKE/asia-southeast1-b/n2-standard-4/50% load), colocated vs. non-colocated]
For n2-standard-4 instances in GCP’s asia-southeast1-b region, links between colocated VMs often have slightly lower latencies than links between non-colocated VMs.
[Histogram: two-way delay (GKE/us-east4-c/n2-standard-4/50% load), colocated vs. non-colocated]
For n2-standard-4 instances in GCP’s us-east4-c region, link latency is not affected by colocation.

Network Packet Drops

During a Clockwork Latency Sensei audit, we measure the fraction of packets that are dropped under idle conditions and under moderate (50%) network load. Under idle conditions, typically very few packets are lost. Under moderate load, intermittent congestion causes packet drops at rates of up to several hundred packets per million (ppm), degrading network performance through retransmissions and shrinking congestion windows.
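
For reference, converting raw probe counts into a drop rate in ppm is straightforward; here is a minimal sketch with illustrative numbers:

```python
# Sketch: packet drop rate in parts per million (ppm), computed from counts
# of probes sent and received. The numbers below are illustrative.

def drop_rate_ppm(sent: int, received: int) -> float:
    """Drops per million packets sent."""
    return (sent - received) / sent * 1_000_000

# 5,000,000 probes sent, 4,998,900 received -> 220 ppm, on the order of
# the EKS figure in the table below.
print(f"{drop_rate_ppm(5_000_000, 4_998_900):.0f} ppm")
```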

Does colocation affect the rate of packet drops? Packets between colocated VMs do not traverse any physical network hops and therefore encounter fewer points where drops can occur. If drops happen mostly within the network (rather than at its edge), then links between colocated VMs should exhibit a lower packet drop rate.

Across thousands of 50-node cluster instances, we observe that colocation does not have a significant effect on packet drop rate, as shown in the table below.

Packet drop rate

|     | Links between non-colocated VMs | Links between colocated VMs |
|-----|---------------------------------|-----------------------------|
| AKS | 68 ppm                          | 60 ppm                      |
| EKS | 220 ppm                         | 213 ppm                     |
| GKE | 60 ppm                          | 62 ppm                      |

This indicates that practically all packet drops are caused by overflowing queues that links between colocated and non-colocated VMs traverse alike. Since colocated traffic never enters the physical network, those shared queues must sit at the hosts themselves: most packet drops occur in the virtualization hypervisor, not within the network proper.

VM colocation is bad for business

The analysis in this article and the previous blog post clearly shows that VM colocation has a net negative effect on cloud networking performance: colocated VMs achieve lower bandwidth, incur the same or higher latency (except in certain special cases in Google Cloud), and see no reduction in packet drop rate.

For optimal cloud system performance, colocation should be avoided. 

Clockwork Latency Sensei provides visibility into VM colocation. The Latency Sensei audit report indicates which VMs are colocated, and quantifies the performance impact of colocation. Thankfully, in the cloud, once a colocation problem is identified, it is easy to shut down the underperforming VMs and replace them with new, hopefully non-colocated VMs.
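
As an illustration of that replacement step on Kubernetes, the sketch below uses the official Python client; the node name is hypothetical, it assumes a managed node group or cluster autoscaler will provision a replacement VM, and a production workflow would rather use kubectl drain (which respects PodDisruptionBudgets and DaemonSets) and then re-run the audit, since the replacement VM may itself turn out to be colocated.

```python
# Sketch: replacing a colocated Kubernetes node via the official Python
# client. Assumes a managed node group or cluster autoscaler provisions a
# replacement VM once the node is removed. NODE is a hypothetical name that
# would in practice come from a colocation report.
from kubernetes import client, config

NODE = "node-on-shared-host"  # hypothetical colocated node

config.load_kube_config()
v1 = client.CoreV1Api()

# Cordon the node so no new pods are scheduled onto it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict its pods (simplified; `kubectl drain` handles PodDisruptionBudgets,
# DaemonSets, and graceful eviction more carefully).
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

# Remove the node object; the node group then replaces the underlying VM.
# There is no guarantee the new VM is not itself colocated, so re-audit.
v1.delete_node(NODE)
```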