Case Study

Kuaishou Technology

Stable and efficient image distribution for 1 billion monthly users with Dragonfly

Challenge

Kuaishou Container Platform aims to provide hyperscale container-based infrastructure services for its growing, diverse and constantly changing business. To achieve this, Kuaishou engineers have tackled challenges including elasticity, stability, efficiency, and serverless architecture. Among those challenges, keeping image distribution stable and efficient is one of the most intractable.

Solution

To make image distribution on Kuaishou Container Platform more stable and efficient, engineers ran multiple experiments and analyses and concluded that Dragonfly, along with its sub-project Nydus, was the most suitable solution. The new system has proved compatible with existing capabilities and has also improved service delivery performance.

Impact

Once Dragonfly was online, bandwidth pressure on Harbor eased by more than 70% on average and 80% at peak. The image distribution system became more stable and reliable. Harbor was then able to support a greater number of concurrent image pull requests, especially during DaemonSet deployments and updates to critical business services with large numbers of instances.

Published: January 10, 2023

Projects used

containerd
Dragonfly
Harbor

By the numbers

80% reduction

On peak pressure

90% decrease

On image pull cost

50% saving

Pod instance pipeline cost

Stability and performance for 1 billion monthly users

“At Kuaishou, Dragonfly effectively solves the problem of large-scale file distribution.”

Hongbin Wu – Kuaishou Senior Staff Software Engineer

Kuaishou is China’s first short video platform. Launched in 2011, it serves 1 billion monthly users globally, of which 180 million are outside of China. Its global footprint has swiftly expanded to Latin America, the Middle East and Southeast Asia. At Kuaishou, any user can chronicle and share their life experiences through short videos and live streams to showcase their talents. Working closely with content creators and businesses, the company is mainly engaged in the operation of content communities and social platforms, providing live streaming services, online marketing services, e-commerce, entertainment, online knowledge-sharing and other value-added services. As Kuaishou’s business keeps growing rapidly, tens of thousands of critical services, as well as middleware, run on Kuaishou Container Platform, and the stability and efficiency of the image distribution system have become more and more important.

For Kuaishou’s image distribution system, the biggest challenge is not just easing peak pressure on the registry or accelerating image pulls, but making the distribution service seamless. Through research, engineers from Kuaishou Container Platform found that Dragonfly and Nydus supported the containerized platform and provided fast, stable, secure, and easy access to container images in a compatible and friendly way.

“Nydus is deeply integrated with the Dragonfly P2P system. This is the main reason why we chose Dragonfly and Nydus. The only thing we had to do was switch the container runtime from Docker to containerd, since containerd integrates better with Dragonfly. Thanks to the efforts of Kuaishou engineers, both containerd and Dragonfly have been adopted at large scale.”

Hui Zhou – Kuaishou Senior Software Engineer

Stable and efficient image distribution

Dragonfly provided the answer for stable and efficient image distribution. At Kuaishou, many important services scale to hundreds of thousands of instances in just a few minutes on sales days such as Kuaishou’s 818 Shopping Festival or Double 11 in China. Scaling at this rate would require thousands of GB of bandwidth if every instance downloaded directly from the registry mirror. In other scenarios, prediction model and search businesses must regularly update model parameter files and index files to maintain recommendation and retrieval quality, which technically means hundreds of GB of files must be distributed to every related instance almost immediately.
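To get a feel for the scale involved, the back-of-the-envelope calculation below estimates how much bandwidth a registry would have to sustain during a burst. Every input is an illustrative assumption (instance count, image size, time window, and the number of copies fetched from the registry under P2P); none of them are Kuaishou measurements, and the sketch assumes each instance triggers one full image pull.

```python
def source_bandwidth_gb_per_s(copies_from_source: int, image_gb: float, window_s: float) -> float:
    """Bandwidth the registry must sustain to serve `copies_from_source`
    full image copies within `window_s` seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return copies_from_source * image_gb / window_s


# Illustrative inputs only; these are assumptions, not Kuaishou measurements.
INSTANCES = 200_000   # instances started during the scale-out
IMAGE_GB = 2.0        # image size per instance, in GB
WINDOW_S = 300.0      # "a few minutes", taken here as 5 minutes
SEED_COPIES = 20      # assumed copies fetched from the registry when peers share the rest

# Every instance pulls directly from the registry.
direct = source_bandwidth_gb_per_s(INSTANCES, IMAGE_GB, WINDOW_S)
# Only the seed copies hit the registry; other instances download from peers.
with_p2p = source_bandwidth_gb_per_s(SEED_COPIES, IMAGE_GB, WINDOW_S)

print(f"direct pulls: {direct:,.0f} GB/s from the registry")
print(f"with P2P:     {with_p2p:.2f} GB/s from the registry")
```

Under these assumed inputs, direct pulls would need roughly 1,333 GB/s from the registry, while the load under P2P depends only on how many copies go back to the source rather than on the number of instances.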

Engineers deployed the Dragonfly components dfdaemon and dfget on all hosts to pull files with the P2P algorithm. Meanwhile, they deployed independent supernode clusters in each availability zone (AZ) and designed a scheduling server for dfget that chooses suitable supernodes to avoid cross-AZ or cross-region traffic. In addition, engineers implemented streaming P2P transmission based on Dragonfly’s unique piece management P2P algorithm, which reduces disk load.
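The idea of choosing a supernode that keeps traffic local can be sketched as follows. This is a minimal illustration, not Kuaishou’s scheduling server or Dragonfly’s own scheduler: the `Supernode` type, its `load` field, and the `pick_supernode` function are hypothetical names introduced here, and the tie-breaking rule (least loaded, then by name) is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Supernode:
    name: str
    region: str
    az: str
    load: int  # hypothetical metric, e.g. number of active download tasks


def pick_supernode(client_region: str, client_az: str,
                   supernodes: Sequence[Supernode]) -> Supernode:
    """Prefer a supernode in the client's AZ, then in its region, then anywhere.
    Within the best locality tier, pick the least loaded node."""
    if not supernodes:
        raise ValueError("no supernodes available")

    def locality_tier(node: Supernode) -> int:
        if node.region == client_region and node.az == client_az:
            return 0  # same AZ: no cross-AZ traffic
        if node.region == client_region:
            return 1  # same region: cross-AZ but not cross-region
        return 2      # last resort: cross-region

    return min(supernodes, key=lambda n: (locality_tier(n), n.load, n.name))


# Example topology (all names are made up for illustration).
nodes = [
    Supernode("sn-a1", region="r1", az="az-a", load=40),
    Supernode("sn-a2", region="r1", az="az-a", load=10),
    Supernode("sn-b1", region="r1", az="az-b", load=0),
    Supernode("sn-c1", region="r2", az="az-c", load=0),
]

print(pick_supernode("r1", "az-a", nodes).name)  # same-AZ node with the lowest load
print(pick_supernode("r1", "az-d", nodes).name)  # no same-AZ node, so stay in region r1
```

Ranking by locality first and load second means a busy same-AZ supernode is still preferred over an idle one in another AZ; a real scheduler might relax that when the local cluster is saturated.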

Dragonfly dfdaemon and dfget architecture map

“Thanks to Dragonfly, tens of thousands of instances can pull images or download files at the same time without increasing time cost and disk load.”

Lin Yang – Kuaishou Senior Engineer

“Advanced technology constitutes a primary productive force. With Dragonfly and Nydus, application delivery efficiency has improved greatly, which results in substantial business innovation.”

Yin Sun – Director of Kuaishou Container Platform

Since pulling an image is one of the most time-consuming steps in the container lifecycle, engineers put Dragonfly to work to further accelerate image distribution and service start. Kuaishou has many services with thousands of Pods, and some of them have images of 20 GB or more. When these services are upgrading or scaling, the time spent pulling such huge images can seriously slow down service startup. Kuaishou needed a solution that could dramatically improve service startup speed, especially since some services put their training models into images, which can be disastrous for service startup.

Engineers learned about the Nydus project early on because of the Dragonfly implementation at Kuaishou. Nydus is a powerful open source filesystem solution that forms a highly efficient image distribution system for cloud native workloads, such as container images.

With Nydus, each Pod can be started in seconds. This saves startup time during service deployment and allows the application to serve users as soon as possible. Typically, a service that previously took hours to deploy now takes only a few minutes or even seconds. This is due to the new image design of Nydus. Supporting Nydus on each cluster node is not complicated and, in Kuaishou’s experience, can take only a few minutes.

“Our thanks to the creators of Nydus! Kuaishou will participate actively in the Dragonfly community and contribute our enhancements back to Dragonfly and Nydus.”

Hui Zhou – Kuaishou Senior Software Engineer

In practice, Harbor still plays an important role as Kuaishou’s global image registry:

  1. Support building images in the Nydus format at the build stage.
  2. Support image distribution among cluster nodes with Dragonfly’s P2P technology.
  3. Make containerd pull images through the P2P proxy and start containers from Nydus images. All changes are compatible with the current OCI image format.

Harbor architecture map

In summary, Dragonfly, along with Nydus, provides Kuaishou Container Platform with the best solution for its image distribution problems. Tens of thousands of Kuaishou’s services have significantly reduced their deployment time, and engineers have a far easier time updating services.

“Both Dragonfly and Nydus are great open source projects from CNCF. Going forward, we will keep investing in both projects and collaborate more with the community to make them more powerful and sustainable. Cloud native technologies are a revolution for the infrastructure area, especially for elasticity and serverless, and we believe that Dragonfly will definitely play an important role in the cloud native ecosystem.”

Hui Zhou – Kuaishou Senior Software Engineer