


Efficient Model Training in the Cloud with Kubernetes, TensorFlow, and Alluxio

By | Blog

Member Post

Guest post originally published on the Alluxio Engineering Blog by Rong Gu, associate researcher at Nanjing University, and Yang Che, senior technical expert at Alibaba Cloud, featuring a case study from the Alibaba Cloud Container Service team

This article presents the collaboration of Alibaba, Alluxio, and Nanjing University in tackling the problem of Deep Learning model training in the cloud. Various performance bottlenecks are analyzed with detailed optimizations of each component in the architecture. Our goal was to reduce the cost and complexity of data access for Deep Learning training in a hybrid environment, which resulted in over 40% reduction in training time and cost.

1. New trends in AI: Kubernetes-Based Deep Learning in the Cloud


Artificial neural networks are trained with increasingly massive amounts of data, driving innovative solutions to improve data processing. Distributed Deep Learning (DL) model training can take advantage of multiple technologies, such as:

  • Cloud computing for elastic and scalable infrastructure
  • Docker for isolation and agile iteration via containers and Kubernetes for orchestrating the deployment of containers
  • Accelerated computing hardware, such as GPUs

The merger of these technologies as a combined solution is emerging as the industry trend for DL training.

Data access challenges of the conventional solutions

Data is often stored in private data centers rather than in the cloud for various reasons, such as security compliance, data sovereignty, or legacy infrastructure. A conventional solution for DL model training in the cloud typically involves synchronizing data between the private data storage and its cloud counterpart. As the size of data grows, the associated costs of maintaining this synchronization become overwhelming:

  • Copying large datasets: Datasets continually grow, so eventually it becomes infeasible to copy the entire dataset into the cloud, even when copying to a distributed high-performance storage system such as GlusterFS.
  • Transfer costs: New or updated data needs to be continuously sent to the cloud to be processed. Not only does this incur transfer costs, but it is also costly to maintain because it is often a manually scripted process.
  • Cloud storage costs: Keeping a copy of data in the cloud incurs the storage costs of the duplicated data.

A hybrid solution that connects private data centers to cloud platforms is needed to mitigate these unnecessary costs. Because the solution separates the storage and compute architectures, it introduces performance issues since remote data access will be limited by network bandwidth. In order for DL models to be efficiently trained in a hybrid architecture, the speed bump of inefficient data access must be addressed.

2. Container and data orchestration based architecture

A model training architecture was designed and implemented based on container and data orchestration technologies as shown below:

Core components of system architecture

  1. Kubernetes is a popular container orchestration platform, which provides the flexibility to use different machine learning frameworks through containers and the agility to scale as needed. Alibaba Cloud Kubernetes (ACK) is a Kubernetes service provided by Alibaba Cloud.
  2. Kubeflow is an open-source Kubernetes-based cloud-native AI platform used to develop, orchestrate, deploy, and run scalable, portable machine learning workloads. Kubeflow supports two TensorFlow frameworks for distributed training, namely the parameter server mode and AllReduce mode.
  3. Alluxio is an open-source data orchestration system for hybrid cloud environments. By adding a layer of data abstraction between the storage system and the compute framework, it provides a unified mounting namespace, hierarchical cache, and multiple data access interfaces. It is able to support efficient data access for large-scale data in various complex environments, including private, public, and hybrid cloud clusters.

Recently introduced features in Alluxio have further cemented its utility for machine learning frameworks. A POSIX filesystem interface based on FUSE provides an efficient data access method for existing AI training models. Helm charts, jointly developed by the Alluxio and Alibaba Cloud Container Service teams, greatly simplify the deployment of Alluxio in the Kubernetes ecosystem.

3. Training in the Cloud

Initial Performance

In the performance evaluation using the ResNet-50 model, we found that upgrading from Nvidia P100 to Nvidia V100 resulted in a 3x improvement in the training speed of a single card. This computing performance improvement puts additional pressure on data storage access, which also poses new performance challenges for Alluxio’s I/O.

This bottleneck can be quantified by comparing the performance of Alluxio with a synthetic data run of the same training computation. The synthetic run represents the theoretical upper limit of the training performance with no I/O overhead since the data utilized by the training program is self-generated. The following figure measures the image processing rate of the two systems against the number of GPUs utilized.

Initially, both systems perform similarly but as the number of GPUs increases, Alluxio noticeably lags behind. At 8 GPUs, Alluxio is processing at 30% of the synthetic data rate.

Analysis and Performance Optimization

To investigate what factors are affecting performance, we analyzed Alluxio’s technology stack as shown below, and identified several major areas of performance issues. For a complete list of optimizations we applied, please refer to the full-length whitepaper.

Filesystem RPC overhead

Alluxio file operations require multiple RPC interactions to fetch the requested data. As a virtual filesystem, its master processes manage filesystem metadata, keep track of where data blocks are located, and fetch data from underlying filesystems.


We reduced this overhead by caching metadata and not allowing the cache to expire for the duration of the workload, which is enabled by setting alluxio.user.metadata.cache.enabled to true. The cache size and expiration time, determined by alluxio.user.metadata.cache.max.size and alluxio.user.metadata.cache.expiration.time respectively, should be set to keep the cached information relevant throughout the entire workload. In addition, we extended the heartbeat interval to reduce the frequency of worker list updates by setting alluxio.user.worker.list.refresh.interval to 2 minutes or longer.
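Concretely, these client-side settings might live in a properties file like the following sketch. The property names come from the text above; the file location and the size, expiration, and interval values are illustrative assumptions to be tuned per workload:

```properties
# alluxio-site.properties (client side) -- values below are illustrative
alluxio.user.metadata.cache.enabled=true
# size the cache to hold roughly one entry per file in the training dataset
alluxio.user.metadata.cache.max.size=100000
# keep entries valid for longer than the full training run
alluxio.user.metadata.cache.expiration.time=2day
# refresh the worker list less frequently
alluxio.user.worker.list.refresh.interval=2min
```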

Data caching and eviction strategies

As Alluxio reads data from the underlying storage system, the worker will cache data in memory. As the allocated memory fills up, it will choose to evict the least recently accessed data. Both of these operations are asynchronous and can cause a noticeable overhead in data access speeds, especially as the amount of data cached on a node nears saturation.

We configured Alluxio to avoid evicting data and disabled features that would cache duplicate copies of the same data. By default, Alluxio caches data read from the underlying storage system, which is beneficial for access patterns where the same data is requested repeatedly. However, DL training typically reads the entire dataset, which can easily exceed Alluxio’s cache capacity as datasets are generally on the order of terabytes. We can tune how Alluxio caches data by setting alluxio.user.ufs.block.read.location.policy to alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy, which will avoid evicting data to reduce the load on each individual worker, and setting alluxio.user.file.passive.cache.enabled to false to avoid caching additional copies.
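In configuration form, the two settings above might look like the following sketch (the file location is an assumption; the property names and values are as described above):

```properties
# alluxio-site.properties (client side)
# prefer local reads and avoid triggering eviction on workers
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.LocalFirstAvoidEvictionPolicy
# do not cache an extra copy of data on the client-local worker
alluxio.user.file.passive.cache.enabled=false
```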

Alluxio and FUSE configuration for handling numerous concurrent read requests

Reading files through the FUSE interface is fairly inefficient using its default configuration. FUSE reads are handled by the libfuse non-blocking thread pool that does not necessarily recycle its threads. The thread pool is configured with max_idle_threads hardcoded to a value of 10. When this value is exceeded, threads will be deleted instead of recycled.


Because libfuse2 does not expose the value of max_idle_threads as a configuration value, we patched its code to be able to configure a much larger value to better support more concurrent requests.

Impact of running Alluxio in containers on its thread pool

Running Java 8 in a containerized environment may not fully utilize a machine’s CPU resources due to how the size of thread pools is calculated.


Before Java 8 update 191, thread pools would be initialized with a size of 1, which severely restricts concurrent operations. We can set the number of processors directly with the Java flag -XX:ActiveProcessorCount.
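As a sketch, the flag can be passed through Alluxio's JVM options in the environment file. The file path and variable name are assumptions, and the processor count should match the CPU resources actually granted to the container:

```shell
# conf/alluxio-env.sh -- tell the JVM how many CPUs the container really has
ALLUXIO_WORKER_JAVA_OPTS="${ALLUXIO_WORKER_JAVA_OPTS} -XX:ActiveProcessorCount=8"
```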


After optimizing Alluxio, the single-machine, eight-card training performance of ResNet-50 improved by 236.1%. The gains carry over to the four-machine, eight-card scenario, where performance is within 3.29% of the synthetic data measurement. Compared to saving data to an SSD cloud disk in the same four-machine, eight-card scenario, Alluxio performs 70.1% better.

The total training time of the workload takes 65 minutes when using Alluxio on four machines with eight cards each, which is very close to the synthetic data scenario that takes 63 minutes.

Compared with training via SSD on the cloud, Alluxio saves 45 minutes in time and 40.9% in costs.

4. Summary and future work

In this article, we presented the challenges of using Alluxio for high-performance distributed deep learning model training and dove into our work optimizing Alluxio. Various optimizations were made to improve AlluxioFUSE performance in high-concurrency read workloads. These improvements enabled us to achieve near-optimal performance when executing distributed model training of ResNet-50 in a four-machine, eight-card scenario.

For future work, Alluxio is working on enabling page cache support and FUSE layer stability. The Alibaba Cloud Container Service team is collaborating with both the Alluxio open source community and Nanjing University through Dr. Haipeng Dai and Dr. Rong Gu. We believe that through the joint effort of academia and the open-source community, we can gradually reduce the cost and complexity of data access for Deep Learning training when compute and storage are separated, and further advance AI model training in the cloud.

5. Special Thanks

Special thanks to Dr. Bin Fan, Lu Qiu, Calvin Jia, and Cheng Chang from the Alluxio team for their substantial contribution to the design and optimization of the entire system. Their work significantly improved the metadata cache system and advanced Alluxio’s potential in AI and Deep Learning.


Yang Che, Sr. Technical Expert at Alibaba Cloud. Yang is heavily involved in the development of Kubernetes and container related products, specifically in building machine learning platforms with cloud-native technology. He is the main author and core maintainer of GPU shared scheduling.

Rong Gu, Associate Researcher at Nanjing University and core maintainer of the Alluxio Open Source project. Rong received a Ph.D. degree from Nanjing University in 2016 with a focus on big data processing. Prior to that, Rong worked on R&D of big data systems at Microsoft Asia Research Institute, Intel, and Baidu.


Jenkins and Kubernetes: The Perfect Pair


Member Post

Guest post originally published on the Rookout blog by Liran Haimovitch, co-founder and CTO of Rookout

As the world adapts to new and unforeseen circumstances, many of the traditional ways of doing things no longer apply. One significant effect of this is that the world has gone almost completely virtual. Whether it’s Zoom happy hours and family catch-ups or virtual conferences, what used to be in-person has digitized. Before the world seemingly turned upside down a few months ago, I was meant to speak at a conference about our experience at Rookout running Jenkins on Kubernetes these past few years. Yet, alas, it was not meant to be. So, I figured I could impart whatever I have learned thus far, here (digitally! ;)), with you all. The world is going virtual, so what better way to connect over learned experiences, right?

Jenkins and Kubernetes

Why would you go about running Jenkins on top of Kubernetes?

The TL;DR of why we chose Jenkins is that we needed a high degree of control over build processes and the code reusability enabled by Jenkins Pipelines (since we made that choice, CircleCI and GitHub Actions have made great progress in meeting some of our requirements). You can find the full details of that journey in this blog post, but let’s focus on this one.

Running Jenkins on top of Kubernetes takes away most of the maintenance work, especially around managing the Jenkins Agents. The Jenkins Kubernetes Plugin is quite mature, and using it to spin up agents on demand reduces the maintenance costs of the agents themselves to virtually nothing.

The Ugly Parts

While we greatly enjoy the day-to-day benefits of this setup, such as fast build times, highly customizable CI/CD processes, and little to no maintenance, getting it up and running was far from a trivial task.

Along the way, we ran into various limitations of both Jenkins and Kubernetes, looked ‘under the hood’, and discovered little-known nuggets of knowledge. By sharing them here with you, I hope your own deployment experience will go much more smoothly.

The Deployment Process

The easiest way to get Jenkins deployed on a Kubernetes cluster (we use a dedicated cluster for Jenkins, but that’s not necessary) is to build your own Helm chart (if you are not familiar with Helm, check it out), relying on existing Helm charts as dependencies and adding any additional resources you might need.

The first chart dependency you’ll add is, quite obviously, Jenkins itself, and we chose this helm chart from the stable helm repository. The most important configuration options to define are:

  1. Make sure to pass a Persistent Volume Claim as ExistingClaim to the Jenkins Persistence configuration.
  2. Figure out the amount of memory your Jenkins Master requires based on the number of jobs you are running, and set the JVM arguments -Xmx, -Xms, and -XX:MaxPermSize in the hidden master.javaOpts argument (we use 8192m, 8192m, and 2048m respectively).
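A sketch of the corresponding chart values follows. The PVC name is a placeholder and the key names may differ between chart versions, so check your chart's values reference:

```yaml
# values.yaml sketch for the Jenkins chart dependency
master:
  javaOpts: "-Xms8192m -Xmx8192m -XX:MaxPermSize=2048m"
persistence:
  existingClaim: jenkins-home   # pre-created PVC; name is an assumption
```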

The most challenging part of running Jenkins on Kubernetes is setting up the environment for building container images. To do so, follow these three simple steps:

  1. Add a deployment and a service running the official Docker image for building containers docker:dind to your Helm chart.
  2. Mount a persistent volume to /var/lib/docker to make sure your layers are cached persistently for awesome build performance.
  3. Configure pod templates to use the remote Docker engine by adding the DOCKER_HOST environment variable pointing to the Docker-in-Docker service (e.g. tcp://dind-service:2375).
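A minimal sketch of the Docker-in-Docker deployment and service from the steps above. The resource names, the privileged setting, and the TLS-disabling environment variable are assumptions chosen to match the tcp://dind-service:2375 address; harden this for production use:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dind
spec:
  replicas: 1
  selector:
    matchLabels: {app: dind}
  template:
    metadata:
      labels: {app: dind}
    spec:
      containers:
      - name: dind
        image: docker:dind
        securityContext:
          privileged: true          # dind requires a privileged container
        env:
        - name: DOCKER_TLS_CERTDIR  # disable TLS so 2375 serves plain TCP
          value: ""
        volumeMounts:
        - name: docker-graph
          mountPath: /var/lib/docker  # persist layer cache across restarts
      volumes:
      - name: docker-graph
        persistentVolumeClaim:
          claimName: dind-cache     # pre-created PVC; name is an assumption
---
apiVersion: v1
kind: Service
metadata:
  name: dind-service
spec:
  selector: {app: dind}
  ports:
  - port: 2375
```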

Operational Considerations

The next step on your journey is to enable your team to access Jenkins while, at the same time, avoiding exposing it to the world. Jenkins has a multitude of plugins and configuration options, and keeping everything up to date and secure is nearly impossible.

We chose to handle that by having our ingress controller, HAProxy, perform OAuth2 authentication before passing any incoming requests to Jenkins. Follow this guide to configure the HAProxy OAuth2 plugin to use the OAuth2 Proxy container. If you configure Jenkins to use the same OAuth2 identity provider (for instance, using this plugin for Google authentication), your team will only have to log in once. Alternatively, you can always get a commercial, off-the-shelf solution such as Odo.

Once you have everything set up, you’ll want to make sure your Jenkins Master is being backed up regularly. The easiest way to achieve this is to use this neat little script.

Resources and Scaling

As I previously mentioned, we found that one of the biggest benefits of this approach is the ability to easily scale your resources on the fly. We use two separate node pools on our cluster: one for long-running pods such as the Jenkins Master, ingress, and Docker-in-Docker, and a second for the Jenkins Agents and the workloads they run.

We chose a single-master deployment for Jenkins, running on a single node with 16 CPUs and 64 GB of RAM. This means that master upgrades and other unexpected events can lead to short downtimes. If you need a multi-master deployment, you are on your own 🙂

The second node pool runs the Jenkins Agents and their workloads and has auto-scaling enabled. To allow Kubernetes to smartly manage resources for that node pool, you have to make sure that you properly define the Kubernetes resource requests and limits.

This has to be done in two separate configurations:

  1. Set the Jenkins Agents resources in the helm chart under agent.resources.
  2. Set the resources for the workloads themselves as part of the Pod Templates in the Jenkins Kubernetes Plugin.

Keep in mind that the second node pool is actually a great opportunity for cost savings and is the perfect candidate for Spot Instances, either directly or by leveraging Spot. As an additional benefit, when running on GKE we found that nodes’ performance deteriorated over time, probably due to the intensive scheduling of pods. When using Google’s Preemptible VMs that are automatically replaced every 24 hours (or less), we noticed significant improvements to the cluster reliability and performance.

It all boils down to…

In my work with both our customers and Rookout’s R&D team, I have found that deployments are often the bottleneck that is slowing down day-to-day operations and engineering velocity. I hope that by sharing with you a few of the lessons we learned running Jenkins on Kubernetes, you’ll now be able to improve your own CI/CD processes.

Having said that, it’s still important to note that adopting the right tools will enable you to do even more. So go forth and get started, I’m looking forward to hearing how your experience went!

Introducing Cloud Native Community Groups!


CNCF Staff Post

As the CNCF community has grown, so too has the desire to have cloud native gatherings all over the world. CNCF Meetup Groups cover topics around cloud native computing and CNCF-hosted projects in groups across the globe. Through the passion of our community and the help of the Cloud Native Ambassador (CNA) program, there are now nearly 200 Meetups across more than 50 countries with over 150,000 members.

Late last year we also launched Kubernetes Community Days, community-organized events that gather adopters and technologists from open source and cloud native communities to learn, collaborate, and network to further the adoption and improvement of Kubernetes. 

As physical events have shifted to virtual, we have made the decision to combine these programs to create Cloud Native Community Groups (CNCG) and take advantage of a new platform to make that happen. CNCGs provide an easy way to host a community meetup or cloud native event. These events provide a great way to connect with other community members who are interested in all things cloud native, all over the world.

All CNCG events will now be hosted on a new platform, Bevy. Bevy brings a number of benefits to the community. It will allow us to host events directly under the cncf.io domain and connect the marketing tools and technologies currently used by CNCF, making it easier to promote CNCG events. It will also allow us to better serve chapter leads through richer communication tools and event analytics.

For all CNCGs, CNCF will offer the following benefits:

  • One-time complimentary swag certificate to the CNCF Store
  • Boosting the visibility of community groups and events
  • Cost coverage for the hosted community platform

We can’t wait to hear what you think about the new platform!

Learn more about Cloud Native Community Groups, apply to create a local chapter, or find a group near you! If you have any questions please join us on the CNCF Slack in the #communitygroups channel.

Kubernetes Best Practices for Monitoring and Alerts


Member Post

Guest post originally published on the Fairwinds blog by Sarah Zelechoski, VP of engineering at Fairwinds

The truth is Kubernetes monitoring done right is a fantasy for most. It’s a problem magnified in a dynamic, ever-changing Kubernetes environment. And it is a serious problem.

While organizations commonly want availability insurance, few monitor their environments well for two main reasons: 

  • It’s hard for monitoring to keep up with changing environments.
  • Monitoring configuration is often an afterthought—it isn’t set up until a problem occurs, and monitoring updates are seldom made as workloads change. 

When the average organization finally recognizes its need for application/system monitoring, the team is too overwhelmed just trying to keep infrastructure and applications “up” to have the capacity to look out for issues. Even monitoring the right things to identify the problems the application or infrastructure is facing on a day-to-day basis is beyond the reach of many organizations.

Consequences of Inadequate Monitoring in Kubernetes

There are a number of consequences you’ll face without adequate monitoring (some universal, others amplified in Kubernetes).

  • Without the right monitoring, operations can be interrupted.
  • Your SRE team may be unable to respond to issues (or the right issues) as fast as needed. 
  • Monitoring management must reflect the state of clusters and workloads.
  • Manual configuration increases availability and performance risks because monitors may not be present, or accurate enough to trigger on changes in key performance indicators (KPIs). 
  • Undetected issues may cause SLA breaches.
  • Noisy pagers can result due to incorrect monitor settings.

Insufficient monitoring introduces a lot of heavy work because you need to constantly check systems to ensure they reflect the state that you want.

Kubernetes Best Practices for Monitoring and Alerting

What’s needed is monitoring and alerting that discovers unknown unknowns – otherwise referred to as observability. Kubernetes best practices involve recognition that monitoring is key and requires the use of the right tools to optimize your monitoring capabilities. What needs to be monitored and why? Here we suggest a few best practices.

Create your Monitoring Standards

With Kubernetes, you have to build monitoring systems and tooling that respond to the dynamic nature of the environment, so you will want to focus on availability and workload performance. One typical approach is to collect all of the metrics you can and then use those metrics to try to solve any problem that occurs. This makes the operators’ jobs more complex because they need to sift through an excess of information to find what they really need. Open source tools like Prometheus and OpenMetrics help standardize how to collect and display metrics. We suggest that Kubernetes monitoring best practices include watching for: 

  • Kubernetes deployment with no replicas
  • Horizontal Pod Autoscaler (HPA) scaling issues
  • Host disk usage
  • High IO wait times
  • Increased network errors
  • Increase in pods crashed
  • Unhealthy Kubelets
  • nginx config reload failures
  • Nodes that are not ready
  • Large number of pods that are not in a Running state
  • External-DNS errors registering records

Implement Monitoring as Code

A genius of Kubernetes is that you can implement infrastructure as code (IaC) – the process of managing your IT infrastructure using config files. At Fairwinds, we take this a step further by implementing monitoring as code. We use Astro, an open source software project built by our team, to help achieve better productivity and cluster performance. Astro was built to work with Datadog. Astro watches objects in your cluster for defined patterns and manages monitors based on this state. As a controller that runs in a Kubernetes cluster, it subscribes to updates within the cluster. If a new Kubernetes deployment or other object is created in a cluster, Astro knows about it and creates monitors based on that state in your cluster. 

Identify Ownership

Because a diverse set of stakeholders is involved in monitoring cluster workloads, you must determine who is responsible for what from both an infrastructure and a workload standpoint. For instance, you want to make sure the right people are alerted at the right time to limit the noise of being alerted about things that do not pertain to you.

Move Beyond Tier 1 to Tier 2 Monitoring

Monitoring tooling must be flexible enough to meet complex demands, yet easy enough to set up quickly so that we can move beyond tier 1 monitoring (e.g., “Is it even working?”). Tier 2 monitoring requires dashboards that reveal where security vulnerabilities are, whether or not compliance standards are being met, and targeted ways to improve.

Define Urgent

Impact and urgency are key criteria that must be identified and assessed on an ongoing basis. Regarding impact, it is critical to be able to determine if an alert is actionable, the severity based on impact, and the number of users or business services that are or will be affected. Urgency also comes into play. For example, does the problem need to be fixed right now, in the next hour, or in the next day?

It is difficult to always know what to monitor ahead of time, so you need at least enough context to figure out what’s going wrong when someone inevitably gets woken up in the middle of the night and needs to bring everything back online. Without this level of understanding, your team cannot parse what should be monitored or know when to grin and bear it and turn on an alert.

Read in-depth insights into how to optimize monitoring and alerting capabilities in a Kubernetes environment.

Learn more about Kubernetes Best Practices by visiting https://www.fairwinds.com/.


Infrastructure as code: A non-boring guide for the clueless


Member Post

Guest Post originally published on the Cycloid DevOps Framework Blog by Niamh Lynch, content manager at Cycloid

If you’ve recently moved into the world of DevOps, or are thinking of it, you’ll soon come into contact with infrastructure as code. It’s a mainstay of good DevOps practice, but it can be hard to get your head around if you’re not familiar with infrastructures in the first place.

We recently had reason to get back to TerraCognita basics with one of our collaborators. The result of that educational exercise was actually a really good intro to infrastructure as code, so we decided to republish it here today (with some edits) to help out any of you who might need some education or refreshment in the world of infra-as-code.


We’re explaining infrastructure as code and the concepts that surround it. It’s aimed at people with little or no experience in tech infrastructures.

What happened before infra-as-code?

Before the era of cloud computing, people managed infrastructures (the technical makeup of their business, app, or website) by making changes via the command line or other interface. If you wanted to add or change something, you pointed, clicked, and made the change (aka manually modifying an infrastructure).

In recent years, lots of computing has moved away from the physical server room to “the cloud”. That’s brought a few major changes – modern infrastructures tend to be much bigger and more complex, their makeup changes more often, and we stop and start them much more frequently. That means you’d have to log in and make changes by hand hundreds of times a day, which just isn’t practical – it doesn’t scale and it takes too long.

Solving the problem – infra-as-code

To try and overcome this problem, infrastructure-as-code was created. It’s not a programming language – it’s more of a mindset or technique. In basic terms, it’s a set of instructions that tells your cloud provider how to set up and maintain your infrastructures without your involvement. It ensures that infrastructures are created and modified exactly to your specifications, each and every time.

Life without infra-as-code

Ideally, you or your company has created IaC for all of your infrastructures from day one, but that often doesn’t happen. If you’ve been around for a while, there are probably some manually created infrastructures hanging around somewhere.

This is a problem, because without infra-as-code your infrastructures aren’t reproducible. If they’re not reproducible, you can’t scale or grow, because it’s just not practical to write instructions for thousands of servers individually.

Another problem – without infra-as-code, there’s no written statement of what your infrastructure looks like. You know how you first set it up (hopefully, unless the person who set it up forgets, or moves on), but you don’t know how it looks now or how it has changed.

This reduces visibility (a key component of DevOps) and makes it really hard to apply policies (because the things you’re writing the rules for might have changed). Policies are sets of rules designed to apply to servers at scale. They make your infrastructure more secure, which is also a major part of DevOps and development in general.

Life with infra as code

Although infra-as-code is by far the preferable approach, it’s not completely without issue. Writing infra-as-code manually is time-consuming and – although valuable – not as valuable as other jobs devs and ops might be doing. It also runs a high risk of error (and by nature, it reproduces, so if there’s a mistake in your code, the mistake will scale too!). Maintaining up-to-date infra-as-code and/or migrating to the latest version manually can also be awkward and time-consuming.

You’ll also have to take care that everyone in-house is comfortable and au fait with IaC. If it’s something you’ve introduced, but other employees aren’t on board, they can make changes without using it, and break everything as a result.

Basically, infra-as-code is best practice and one of those tasks that developers know they should do, but for various reasons, haven’t. Sometimes it’s managers who are unconvinced of the importance, or sometimes life just seems too busy.

So where does Terraform come into it?

Once you start to find out more about infrastructure as code, you’ll start hearing about Terraform. Terraform is an infrastructure-as-code tool created by a company called HashiCorp. It’s not the only one – all the major cloud providers have their own IaC tooling. Google offers Google Cloud Deployment Manager, AWS makes CloudFormation, and Microsoft offers Azure Resource Manager.

Terraform gets all the love, however, because it’s open source and platform agnostic, which means that once you learn it, you can use it in conjunction with any of the cloud providers, or even write a custom plugin to use it anywhere you like. If you learn one of the other proprietary tools, however, it only works on that platform and can’t be used on the others.
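For a feel of what this looks like in practice, here is a minimal, illustrative Terraform sketch. The same HCL syntax and workflow apply whichever provider you declare; the AWS provider is shown here, and the region, AMI ID, and names are placeholders:

```hcl
# Declare which cloud provider this configuration talks to.
provider "aws" {
  region = "eu-west-1"
}

# Describe the desired infrastructure; Terraform creates or updates it to match.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0"  # placeholder image ID
  instance_type = "t3.micro"

  tags = {
    Name = "example-web-server"
  }
}
```

Running `terraform plan` then shows the changes needed to reach this state, and `terraform apply` makes them, which is exactly the reproducibility that manual point-and-click management lacks.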

Summing up

So, there you have it. Infrastructure as code is a descriptive technique that allows developers and ops people to create one set of instructions that will be automatically set up and maintained by the cloud provider they relate to. Infra-as-code is pretty essential if you’re dealing with a sprawling, complex modern infrastructure – it’s too time-consuming to interact with the cloud provider manually. Infra-as-code is often written in Terraform, which is an open-source, platform-agnostic IaC tool, but there are others. Infra-as-code is a pillar of DevOps best practice and an all-around good idea, and now when the subject comes up in work, you can be sure you know exactly what you’re talking about!

See more about TerraCognita.


OpenTelemetry Best Practices (Overview Part 2/2)

By | Blog

Member Post

Guest post originally published on the Epsagon blog by Ran Ribenzaft, co-founder and CEO at Epsagon

In the second article of our OpenTelemetry series, we’ll focus on best practices for using OpenTelemetry, after covering the OpenTelemetry ecosystem and its components in part 1.

As a reminder, OpenTelemetry is an exciting new observability ecosystem with a number of leading monitoring companies behind it. It is a provider-agnostic observability solution supported by the CNCF and represents the third evolution of open observability after OpenCensus and OpenTracing. OpenTelemetry is a brand new ecosystem and is still in the early stages of development. Because of this, there are not yet many widespread best practices. This article outlines some of the best practices that are currently available and what to consider when using OpenTelemetry. Additionally, it explores the current state of OpenTelemetry and explains the best way to get in touch with the OpenTelemetry community.

OpenTelemetry Best Practices

Just because it’s new doesn’t mean there aren’t good guidelines to keep in mind when implementing OpenTelemetry. Below are our top picks.

Keep Initialization Separate from Instrumentation

One of the biggest benefits of OpenTelemetry is that it enables vendor-agnostic instrumentation through its API. This means that all telemetry calls made inside of an application come through the OpenTelemetry API, which is independent of any vendor being used. As you see in Figure 1, you can use the same code for any supported OpenTelemetry provider:

Separation between the vendor provider and instrumentation. (Source: GitHub)

This example shows how exporting is decoupled from instrumentation, so all that’s required of the instrumentation is a call to getTracer. The instrumentation code doesn’t have any knowledge of the providers (exporters) that were registered, only that it can get and use a tracer:

const tracer = opentelemetry.trace.getTracer('example-basic-tracer-node');

An additional best practice for OpenTelemetry here is to keep the provider configuration at the top level of your application or service – usually in the application’s entry point. This keeps the provider configuration separate from the instrumentation calls and allows you to choose the best tracing framework for your use case without having to change any instrumentation code. Separating the provider configuration from the instrumentation enables you to switch providers with nothing more than a flag or environment variable.

A CI environment running integration tests may not want to provision Jaeger, Zipkin, or another tracing/metrics provider, often to reduce cost or complexity by removing moving parts. For example, local development might use an in-memory exporter while production uses a hosted SaaS. Keeping the initialization of a provider decoupled from instrumentation makes it easy to switch providers depending on your environment – and, in turn, to easily switch implementations in case of a vendor change.
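The pattern can be sketched without the real SDK. The mock below is illustrative only (these are not the actual OpenTelemetry API surfaces): instrumentation code only ever asks a provider for a tracer, while the entry point alone decides which exporter backs that provider, selected here by an environment variable.

```javascript
// --- instrumentation code: knows nothing about exporters ---
function doWork(tracerProvider) {
  const tracer = tracerProvider.getTracer('example-basic-tracer-node');
  const span = tracer.startSpan('do-work');
  // ... business logic would go here ...
  span.end();
}

// --- entry point: the ONLY place that knows about providers ---
class InMemoryExporter {
  constructor() { this.spans = []; }
  export(span) { this.spans.push(span); }
}
class ConsoleExporter {
  export(span) { console.log('span finished:', span.name); }
}

function makeTracerProvider(exporter) {
  return {
    getTracer(instrumentationName) {
      return {
        startSpan(name) {
          const span = { name, instrumentationName, ended: false };
          return {
            end() {
              span.ended = true;
              exporter.export(span);
            },
          };
        },
      };
    },
  };
}

// Switch implementations with an environment variable, not a code change.
const exporter = process.env.OTEL_EXPORTER === 'console'
  ? new ConsoleExporter()
  : new InMemoryExporter();
doWork(makeTracerProvider(exporter));
```

Because `doWork` never mentions an exporter, swapping Jaeger for Zipkin (or an in-memory exporter in CI) touches only the entry point.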

Know the Configuration Knobs

OpenTelemetry tracing supports two strategies to get traces out of an application: a “SimpleSpanProcessor” and a “BatchSpanProcessor.” The SimpleSpanProcessor submits a span every time a span is finished, while the BatchSpanProcessor buffers spans until a flush event occurs. Flush events can occur when a buffer fills or when a timeout is reached.

The BatchSpanProcessor has a number of properties:

  • Max Queue Size is the maximum number of spans buffered in memory. Any span beyond this will be discarded.
  • Schedule Delay is the time between flushes. This ensures that you don’t get into a flush loop during times of heavy traffic.
  • Max per batch is the maximum number of spans that will be submitted during each flush.

BatchSpanProcessor configuration options. (Source: GitHub)

When the queue is full, the framework begins to drop new spans (load shedding), meaning that data loss can occur if these limits aren’t configured correctly. Without hard limits, the queue could grow indefinitely and affect the application’s performance and memory usage. This is especially important for “on-line” request/reply services but also matters for asynchronous services.
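A stripped-down sketch (illustrative only, not the actual SDK code) makes the queue semantics concrete: once the queue hits its limit, new spans are dropped rather than buffered, and each flush drains at most one batch.

```javascript
// Simplified sketch of BatchSpanProcessor-style buffering and load shedding.
class MiniBatchProcessor {
  constructor({ maxQueueSize, maxExportBatchSize, exporter }) {
    this.maxQueueSize = maxQueueSize;
    this.maxExportBatchSize = maxExportBatchSize;
    this.exporter = exporter;
    this.queue = [];
    this.dropped = 0;
  }

  // Called when a span finishes.
  onEnd(span) {
    if (this.queue.length >= this.maxQueueSize) {
      this.dropped += 1; // load shedding: drop rather than grow unbounded
      return;
    }
    this.queue.push(span);
  }

  // In a real SDK this runs on a timer (the schedule delay) or when the
  // buffer fills; here it is called explicitly for clarity.
  flush() {
    const batch = this.queue.splice(0, this.maxExportBatchSize);
    if (batch.length > 0) this.exporter.export(batch);
  }
}

const exported = [];
const processor = new MiniBatchProcessor({
  maxQueueSize: 3,
  maxExportBatchSize: 2,
  exporter: { export: (batch) => exported.push(batch) },
});

for (let i = 0; i < 5; i++) processor.onEnd({ name: `span-${i}` });
// The queue held 3 spans; 2 were dropped because no flush happened in between.
processor.flush(); // exports spans 0 and 1
processor.flush(); // exports span 2
```

Tuning the real knobs is a trade-off along the same lines: a bigger queue sheds less load but uses more memory; a shorter delay exports sooner but flushes more often.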

Look For Examples… and Sometimes Tests

OpenTelemetry is still a very young project. This means the documentation for most of the libraries is still sparse. To illustrate this, for the BatchSpanProcessor discussed above, configuration options are not even documented in the Go OpenTelemetry SDK! The only way to find examples is by searching the code:

BatchSpanProcessor configuration usage. (Source: GitHub)

Since the project is still in active development and the focus is on features and implementation, this makes sense. So, if you need answers, go to the source code!

Use Auto-Instrumentation by Default… but Be Careful of Performance Overhead

OpenTelemetry supports auto-tracing for Java, Ruby, Python, JavaScript, and PHP. Auto-tracing will automatically capture traces and metrics for built-in libraries such as:

  • HTTP Clients
  • HTTP Servers & Frameworks
  • Database Clients (Redis, MySQL, Postgres, etc.)

Auto-instrumentation significantly decreases the barrier to adopting observability, but you need to monitor it closely because it adds additional overhead to program execution.

During a normal execution, the program calls out directly to the HTTP client or database driver. Auto-instrumentation wraps these functions with additional functionality that costs time and resources in terms of memory and CPU. Because of this, it’s important to benchmark your application with auto-tracing enabled vs. auto-tracing disabled to verify that the performance is acceptable.
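A rough way to quantify that overhead is to time the same call with and without an instrumentation-style wrapper. This is a deliberately crude, illustrative micro-benchmark (the function names are hypothetical, and real benchmarking should use proper tooling and realistic workloads):

```javascript
// Crude micro-benchmark sketch: compare a direct call against the same
// call wrapped with span bookkeeping, as auto-instrumentation would do.
function fetchRecord(id) {
  return { id, ok: true }; // stand-in for an HTTP client or database call
}

const spans = [];
function instrument(fn) {
  // Wrap a function the way auto-instrumentation does: record a span
  // around every invocation.
  return (...args) => {
    const span = { name: fn.name, start: process.hrtime.bigint() };
    const result = fn(...args);
    span.end = process.hrtime.bigint();
    spans.push(span);
    return result;
  };
}

const tracedFetch = instrument(fetchRecord);

function timeIt(fn, iterations) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn(i);
  return Number(process.hrtime.bigint() - start);
}

const ITER = 100000;
const plainNs = timeIt(fetchRecord, ITER);
const tracedNs = timeIt(tracedFetch, ITER);
console.log(`instrumentation overhead: ${(((tracedNs - plainNs) / plainNs) * 100).toFixed(1)}%`);
```

The relative overhead shrinks as the wrapped call does more real work, which is why I/O-heavy calls usually tolerate auto-instrumentation well while hot in-memory loops may not.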

Unit Test Tracing Using Memory Span Exporters

Most of the time, unit testing focuses on program logic and ignores telemetry. But occasionally, it is useful to verify the correct metadata is present, including tags, metric counts, and trace metadata. A mock or stub implementation that records calls is necessary to achieve this.

Most languages don’t document usage of these constructs, so remember that the best place to find examples of usage is in the actual OpenTelemetry unit tests for each project!

Initializing meter in a unit test. (Source: GitHub)

Using a metric meter in a unit test. (Source: GitHub)

Tests are able to configure their own test exporter (“meter” in Figure 4 above). Remember that OpenTelemetry separates instrumentation from exporting, which allows the production code to use a separate exporter from testing. The example in Figure 5 shows a test configuring its own “meter,” making calls to it, and then making assertions on the resulting metric. This allows you to test that your code is setting metrics and metric tags correctly.
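In the same spirit, the sketch below shows the shape of such a test with illustrative names (this is not a specific SDK’s API): inject an in-memory meter, exercise the code under test, then assert on the recorded metrics and tags.

```javascript
// Minimal in-memory meter standing in for a test exporter.
class InMemoryMeter {
  constructor() { this.records = []; }
  createCounter(name) {
    return {
      add: (value, tags = {}) => this.records.push({ name, value, tags }),
    };
  }
}

// Code under test takes the meter as a dependency, so tests can
// inject their own in-memory implementation.
function handleRequest(meter, statusCode) {
  const counter = meter.createCounter('http.requests');
  counter.add(1, { status: String(statusCode) });
  return statusCode < 400;
}

// The "unit test": exercise the code, then inspect what was recorded.
const meter = new InMemoryMeter();
handleRequest(meter, 200);
handleRequest(meter, 500);
```

Because instrumentation and exporting are decoupled, the production entry point can wire in a real exporter while tests wire in this in-memory one, with no change to `handleRequest`.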

Customization & Beta Versions

OpenTelemetry officially reached its beta phase on March 30, 2020! Since the project is still young, each language has different levels of support, documentation, examples, and exporters available. So, make sure to check before assuming support for a specific provider.

The best place to find information is in the language-specific repo in GitHub under the OpenTelemetry organization or in the language-specific Gitter channel. The official OpenTelemetry project page is another good source of information.


OpenTelemetry has a very active community on Gitter, with a global channel available at open-telemetry/community. Each language has its own Gitter channel as well, and there are tons of opportunities to contribute, especially to documentation and exporters. Since OpenTelemetry is young, even the specification is still in active development, so this is a great time to give feedback and get involved in OpenTelemetry.


OpenTelemetry has recently reached beta and is available for use. It’s important to remember that the libraries and ecosystem here are still very young. Make sure to consider these OpenTelemetry best practices and dig for documentation in tests and examples:

  • Keep instrumentation separate from exporting.
  • Know what OpenTelemetry is doing under the hood.
  • Look for tests as documentation.

Also, if you ever run into issues, the OpenTelemetry community is very helpful and easily accessible through Gitter.

KubeCon + CloudNativeCon EU Virtual – NEW schedule is live!

By | Blog

CNCF Staff Blog

We are thrilled to announce that the new schedule for KubeCon + CloudNativeCon Europe 2020 – Virtual is live! 

We have been hard at work confirming speakers and making this an amazing virtual experience.

As our physical event has shifted to a virtual one, we have been taking thoughtful actions to create an immersive experience that provides you with interactive content and collaboration with peers. As an attendee, you will have the ability to network with other attendees, attend presentations with live Q&A, interact with sponsors real-time, and much more.

Below you can find all the exciting information about how to interact in our new digital experience. 

 #KeepCloudNativeConnected Worldwide

Live content will begin daily at 13:00 CEST with presentations over four consecutive days. Keynotes will be in the middle of each day’s schedule to make it easier for a variety of time zones to participate.




#KeepCloudNativeConnected at Any Hour of the Day

Did you miss a keynote, or did two concurrent sessions force you to choose? The event platform will be accessible 24 hours a day to all registered attendees. That means anyone in any time zone can view what has been released, including all presentations and sponsor content. 

You can also visit sponsor booths to check out the content they are sharing or meet up with people in the community for a chat whenever it’s convenient for you! 


 Keep Cloud Native Questioning

All tutorial, maintainer track, and breakout sessions will be presented in a scheduled time slot and the presenter(s) will be free to talk to you via live Q+A! Review the schedule ahead of time and get your questions ready to participate in lively community discussions.



 Keep Cloud Native Delighted

Don’t just lurk in the session shadows – get engaged and have some fun! Opportunities will abound for interactive learning and networking, including the ability to earn points and win prizes, collaborate with peers in networking lounges, calm your mind with a mini chair yoga session, or break up the day with a live musical performance or game.


#KeepCloudNativeConnected to Our Sponsors

Diamond, platinum, and gold sponsors will be presenting demos of their products throughout the event – don’t miss the latest and greatest they have to showcase to the community!

Connect with sponsors of all levels in the showcase, where you can visit their virtual booths to view videos, download resources, chat 1:1 with a company representative, and collect virtual swag.



 Keep Cloud Native Learning

Along with your event registration, you will receive a 50% off voucher for the CKA/CKAD exam + training bundle as well as 30% off all other Linux Foundation trainings. More details will be provided at the start of the event to those that attend. 




We are so excited to see you online on August 17-20!

Be sure to register now for $75 to access the full experience! If you have any questions, feel free to reach out!


How JD.com Saves 60% Maintenance Time Using Harbor for Its Private Image Central Repository

By | Blog

CNCF Staff Blog

JD.com is the world’s third largest Internet company by revenue, and at its heart it considers itself “a science and technology company with retail at its core,” says Vivian Zhang, Product Manager at JD.com and CNCF Ambassador.

JD’s Infrastructure Department is responsible for developing the hyperscale containerized and Kubernetes platform that powers all facets of JD’s business, including the website, which serves more than 380 million active customers, and its e-commerce logistics infrastructure, which delivers over 90% of orders same- or next-day.

In 2016, the team was in need of a cloud native registry to provide maintenance for its private image central repository. Specifically, the platform needed a solution for:

  1. User authorization and basic authentication
  2. Access control for authorized users
  3. UI management for the JD Image Center
  4. Image vulnerability scanning for image security

The team evaluated a number of different solutions, including the native registry. But there were two main problems with the native registry: the need to implement authorization certification, and having the meta information on images traverse the file system. As such, with the native registry, “the performance could not meet our requirements,” says Zhang. 

Harbor, a cloud native registry project that just graduated within CNCF, turned out to be “the best fit for us,” says Zhang. “Harbor introduces the database, which can accelerate the acquisition of image meta information. Most importantly, it delivers better performance.”

Harbor also integrates Helm’s chart repository function to support the storage and management of Helm charts, and provides user resource management, such as CPU, memory, and disk space – all of which were valuable to JD. 

“We found that Harbor is most suitable in the Kubernetes environment, and it can help us address our challenges,” says Zhang. “We have been a loyal user of Harbor since the very beginning.”

In order for it to work best at JD, “Harbor is used in concert with other products to ensure maximum efficiency and performance of our systems,” says Zhang. Those products include the company’s own open source project (and now a CNCF sandbox project), ChubaoFS, a highly available distributed file system for cloud native applications that provides scalability, resilient storage capacity, and high performance. With ChubaoFS serving as the backend storage for Harbor, multiple instances in a Harbor cluster can share container images. “Because we use ChubaoFS, we can achieve a highly available image center cluster,” says Zhang.

Indeed, Harbor has made an impact at JD. “We save approximately 60% of maintenance time for our private image central repository due to the simplicity and stability of Harbor,” says Zhang. Plus, Harbor has enabled authorization authentication and access control for images, which hadn’t been possible before.

For more about JD.com’s usage of Harbor, read the full case study.


Rust at CNCF

By | Blog

Staff Post

Rust is a systems language originally created by Mozilla to power parts of its experimental Servo browser engine. Once highly experimental and little used, Rust has become dramatically more stable and mature in recent years and is now used in a wide variety of settings, from databases to operating systems to web applications and far beyond. And developers seem to really love it.

You may be surprised to find out that Rust has established a substantial toehold here at CNCF as well. In fact, two of our incubating projects, TiKV and Linkerd, have essential components written in Rust, and both projects would be profoundly different—and potentially less successful—in another language.

In this post, I’d like to shed light on how TiKV and Linkerd are contributing to the Rust ecosystem.

TiKV
TiKV is a distributed, transactional key-value database originally created by the company PingCAP. Its core concepts are drawn from the venerable Google Spanner and Apache HBase and it’s primarily used to provide lower-level key/value—the “KV” in “TiKV”—storage for higher-level databases, such as TiDB.

In addition to the core repo, the TiKV project has contributed a number of libraries to the Rust ecosystem:

  • grpc-rs, a Rust wrapper for gRPC core.
  • raft-rs, a Rust implementation of the Raft consensus protocol. This is the consensus protocol used by TiKV as well as etcd, the distributed key-value store used by Kubernetes and a fellow CNCF project.
  • fail-rs, for injecting “fail points” at runtime.
  • async-speed-limit, a library for asynchronously speed-limiting multiple byte streams.
  • rust-prometheus, a Prometheus client for Rust that enables you to instrument your Rust services, i.e. to expose properly formatted metrics to be scraped by Prometheus.
  • pprof-rs, a CPU profiler that can be integrated into Rust programs. It enables you to create flame graphs of CPU activity and offers support for Protocol Buffers output.

PingCAP’s blog has also featured some highly regarded articles on Rust, including The Rust Compilation Model Calamity and Why did we choose Rust over Golang or C/C++ to develop TiKV? If you’re like me and excited about witnessing a new generation of databases written in Rust, you should really keep tabs on TiKV and its contributions to the Rust ecosystem.

Linkerd
Linkerd is a service mesh that’s relentlessly focused on simplicity and user-friendliness. If you’ve ever felt frustrated or overwhelmed by the complexity of other service mesh technologies, I cannot recommend the breath of fresh air that is the Linkerd Getting Started guide more highly. And in case you missed it, Linkerd had a huge 2019 and is continuing apace in 2020.

Arguably the most important component of Linkerd is its service proxy, which lives alongside your services in the same Kubernetes Pod and handles all network traffic to and from the service. Service proxies are hard to write because they need to be fast, they need to be safe, and they need to have the smallest memory footprint that’s commensurate with speed and safety.

The Linkerd creators opted for Rust for the Linkerd service proxy. Why did they make this choice? I reached out to Linkerd co-creator Oliver Gould to provide the breakdown:

When we started building Linkerd ~5 years ago, some of our first prototypes were actually in Rust (well before the language hit 1.0). Unfortunately, at the time, it wasn’t mature enough for our needs, so Linkerd’s first implementation grew out of Twitter’s Scala ecosystem. As we were working on Linkerd 1.x, Rust’s Tokio runtime started to take shape and was especially promising for building something like a proxy. So in early 2017 we set out to start rewriting Linkerd with a Go control plane and a Rust data plane. Tokio (with its sister projects, Tower & Hyper) made this all possible by extending Rust’s safe, correct memory model with asynchronous networking building blocks. These same building blocks are now being used in a variety of performance-sensitive use cases outside of Linkerd, and we’ve built a great community of contributors around both projects. If this is interesting to you, please come get involved!

In terms of contributions back to the Rust ecosystem, Linkerd has upstreamed core components to Tower and Tokio, such as Linkerd’s load balancer and Tokio’s tracing module.

In addition, the project also undertook a security audit of the rustls library (sponsored by CNCF). As the name suggests, rustls is a Transport Layer Security (TLS) library for Rust that’s used by the Linkerd proxy for its mutual TLS (mTLS) feature, which is crucial to the security guarantees that the Linkerd service mesh provides. You can see the result of the audit in this PDF. Cure53, the firm responsible for security audits of several other CNCF projects, was “unable to uncover any application-breaking security flaws.” A sterling result if I say so myself!

More to come?

I’m a huge fan of Rust myself, though I’ve really only dabbled in it. I have my fingers crossed that TiKV and Linkerd are just the beginning and that we’ll see a whole lot more Rust in the cloud native universe, be that in the form of new CNCF projects written in Rust, existing projects porting components into Rust, or new Rust client libraries for existing systems.

And if you’re curious about all of the programming languages in use amongst CNCF’s many projects, stay tuned for an upcoming blog post on precisely that topic.

TOC Approves SPIFFE and SPIRE to Incubation

By | Blog

Project Post

Today, the CNCF Technical Oversight Committee (TOC) voted to accept SPIFFE and SPIRE as incubation-level hosted projects.

The SPIFFE (Secure Production Identity Framework For Everyone) specification defines a standard to authenticate software services in cloud native environments through the use of platform-agnostic, cryptographic identities. SPIRE (the SPIFFE Runtime Environment) is the code that implements the SPIFFE specification on a wide variety of platforms and enforces multi-factor attestation for the issuance of identities. In practice, this reduces the reliance on hard-coded secrets when authenticating application services. 
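For context, a SPIFFE ID is a URI of the form spiffe://trust-domain/workload-path. The tiny validator below is an illustrative sketch only: production code should use the official SPIFFE libraries, and this ignores parts of the specification (such as trust-domain-only IDs).

```javascript
// Parse a SPIFFE ID of the form spiffe://<trust-domain>/<workload-path>.
// Returns null for strings that don't match this basic shape.
function parseSpiffeId(id) {
  const match = /^spiffe:\/\/([^/]+)(\/.+)$/.exec(id);
  if (!match) return null;
  return { trustDomain: match[1], path: match[2] };
}

const id = parseSpiffeId('spiffe://example.org/billing/payments');
// id.trustDomain === 'example.org'; id.path === '/billing/payments'
```

Because the identity is just a platform-agnostic URI backed by cryptographic documents, the same workload identity can be issued and verified across Kubernetes, VMs, and bare metal.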

“The underpinning of zero trust is authenticated identity,” said Andrew Harding, SPIRE maintainer and principal software engineer at Hewlett Packard Enterprise. “SPIFFE standardizes how cryptographic, immutable identity is conveyed to a workload. SPIRE leverages SPIFFE to help organizations automatically authenticate and deliver these identities to workloads spanning cloud and on-premise environments. CNCF has long understood the transformational value of these projects to the cloud native ecosystem, and continues to serve as a great home for our growing community.”

The projects are used by and integrated with multiple cloud native technologies, including Istio and the CNCF projects Envoy, gRPC, and Open Policy Agent (OPA). SPIFFE and SPIRE also provide the basis for cross-authentication between Kubernetes-hosted workloads and workloads hosted on any other platform.

“Most traditional network-based security tools were not designed for the complexity and sheer scale of microservices and cloud-based architectures,” said Justin Cormack, security lead at Docker and TOC member. “This makes a standard like SPIFFE, and the SPIRE runtime, essential for modern application development. The projects have shown impressive growth since entering the CNCF sandbox, adding integrations and support for new projects, and showing growing adoption.”

Since joining CNCF, the projects have grown in popularity and have been deployed by notable companies such as Bloomberg, Bytedance, Pinterest, Square, Uber, and Yahoo Japan. SPIRE has a thriving developer community, with an ongoing flow of commits and merged contributions from organizations such as Amazon, Bloomberg, Google, Hewlett-Packard Enterprise, Pinterest, Square, TransferWise, and Uber. 

Since admittance into CNCF as a sandbox level project, SPIRE has added the following key features:

  • Support for bare metal, AWS, GCP, Azure and more
  • Integrations with Kubernetes, Docker, Vault, MySQL, Envoy, and more
  • Support for both nested and federated SPIRE deployment topologies
  • Support for JWT-based SPIFFE identities, in addition to X.509 documents
  • Horizontal server scaling with client-side load balancing and discovery
  • Support for authenticating against OIDC-compatible validators
  • Support for non-SPIFFE-aware workloads

 “SPIFFE and SPIRE address a gap that has existed in security by enabling a modern standardized form of secure identity for cloud native workloads,” said Chris Aniszczyk, CTO/COO of Cloud Native Computing Foundation. “We are excited to work with the community to continue to evolve the specification and implementation to improve the overall security of our ecosystem.”

Earlier this year, the CNCF SIG Security conducted a security assessment of SPIFFE and SPIRE. They did not find any critical issues and commended the projects’ design with respect to security. SPIFFE and SPIRE have made a significant impact and play a pivotal role in enabling a more secure cloud native ecosystem.

“In addition to mitigating the risk of unauthorized access in the case of a compromise, a strong cryptographically-proven identity reduces the risk of bad configuration. It’s not uncommon for developers to try to test against production, which can be dubious,” said Tyler Julian, security engineer at Uber and SPIRE maintainer. “You have proof. You have cryptographic documents to prove who the service is. In reducing the amount of trust in the system, you reduced your assumption of behavior. Both good for the reliability of your system and the security of the data.”

“At Square, we have heterogeneous platforms that take advantage of cloud native technologies like Kubernetes and serverless offerings, as well as traditional server-based infrastructure,” said Matthew McPherrin, security engineer at Square and SPIRE maintainer. “SPIFFE and SPIRE are enabling us to build a shared service identity, underlying our Envoy service mesh that spans multiple datacenters and cloud providers in an interoperable way.”

Joining CNCF incubation-level projects like OpenTracing, gRPC, CNI, Notary, NATS, Linkerd, Rook, etcd, OPA, CRI-O, TiKV, CloudEvents, Falco, Argo, and Dragonfly, SPIFFE and SPIRE are part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. 

Every CNCF project has an associated maturity level: sandbox, incubating, or graduated. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria v.1.3. SPIFFE and SPIRE entered the CNCF sandbox in March 2018.

To learn more about SPIFFE and SPIRE, visit spiffe.io
