Guest post by Jef Spaleta, Sensu, originally published on the Sensu blog
The appeal of running workloads in containers is intuitive and there are numerous reasons to do so. Shipping a process with its dependencies in a package that’s able to just run reduces the friction of organizational communication and operation. Relative to virtual machines, the size, simplicity, and reduced overhead of containers make a compelling case.
In a world where Docker has become a household name in technology circles, using containers to serve production is an obvious need, but real-world systems require many containers working together. Managing the army of containers you need for production workloads can become overwhelming. This is the reason Kubernetes exists.
Kubernetes is a production-grade platform as a service for running workloads in containers. The way it works, from a high level, is relatively straightforward.
You decide what your application needs to do. Then you package your software into container images. Following that, you document how your containers need to work together, including networking, redundancy, fault tolerance, and health probing. Ultimately, Kubernetes makes your desired state a reality.
But you need a few more details to be able to put it to use. In this post, I’ll help lay the groundwork with a few Kubernetes basics.
Building systems is hard. In constructing something nontrivial, one must consider many competing priorities and moving pieces. Further, automation and repeatability are prerequisites in today’s cultures that demand rapid turnaround, low defect rates, and immediate response to problems.
We need all the help we can get.
Containers make deployment repeatable and create packages that solve the problem of “works on my machine.” However, while it’s helpful having a process in a container with everything it needs to run, teams need more from their platforms. They need to be able to create multiple containers from multiple images to compose an entire running system.
The public cloud offerings for platform as a service give options for deploying applications without having to worry about the machines on which they run and elastic scaling options that ease the burden. Kubernetes yields a similar option for containerized workloads. Teams spell out the scale, redundancy, reliability, durability, networking, and other requirements, as well as dependencies in manifest files that Kubernetes uses to bring the system to life.
This means technologists have an option that provides the repeatability, replaceability, and reliability of containers, combined with the convenience, automation, and cost-effectiveness of platform as a service.
What is Kubernetes?
When people describe Kubernetes, they typically do so by calling it a container orchestration service. This is both a good and incomplete way of describing what it is and what it does.
Kubernetes orchestrates containers, which means it runs multiple containers. Further, it manages where they operate and how to surface what they do — but this is only the beginning. It also actively monitors running containers to make sure they’re still healthy. When it finds containers not to be in good operating condition, it replaces them with new ones. Kubernetes also watches new containers to make sure not only that they’re running, but that they’re ready to start handling work.
Kubernetes is a full-scale, production-grade application execution and monitoring platform. It was born at Google and then later open-sourced. It’s now offered as a service by many cloud providers, in addition to being runnable in your datacenter.
How do you use it?
Setting up a Kubernetes cluster can be complex or very simple, depending on how you decide to do it. At the easy end of the spectrum are the public cloud providers, including Amazon’s AWS, Microsoft’s Azure, and Google’s Google Cloud Platform. They have offerings you can use to get up and running quickly.
With your cluster working, you can think about what to do with it. First, you’ll want to get familiar with the vocabulary introduced by Kubernetes. There are many terms you’ll want to be familiar with. This post contains only a subset of the Kubernetes vocabulary that you need to know; you can find additional terms defined more completely in our “How Kubernetes Works” post.
The most important concepts to know are pods, deployments, and services. I’ll define them below using monitoring examples from Sensu Go (for more on monitoring Kubernetes with Sensu, check out this post from CTO Sean Porter, as well as examples from the sensu-kube-demo repo).
Pods: As a starting point, you can think of a pod as a container. In reality, pods are one or more containers working together to service a part of your system. There are reasons a pod may have more than one container, like having a supporting Sensu Go agent process that monitors logs or application health metrics in a separate container. The pod abstraction takes care of the drudgery of making sure such supporting containers share network and storage resources with the main application container. Despite these cases, thinking of a pod as a housing for a single container isn’t harmful. Many pods have a single container.
Deployments: Deployments group pods of the same type together to achieve load balancing. A deployment has a desired number of identical pods and monitors to make certain that many pods remain running and healthy. Deployments work great to manage stateless workloads like web applications, where identical copies of the same application can run side-by-side to service requests without coordination.
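As a sketch, a minimal deployment manifest for such a stateless workload might look like the following (the names and image here are illustrative placeholders, not from any particular system):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app            # placeholder name
spec:
  replicas: 3              # desired number of identical pods
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: nginx:1.17  # placeholder image
```

If any of the three pods becomes unhealthy, the deployment replaces it to restore the desired count.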
StatefulSets: Similar to deployments, but used for applications where copies of the same applications must coordinate with each other to maintain state. StatefulSets manage the lifecycle of unique copies of pods. A Sensu Go backend cluster is a good candidate for a StatefulSet. Each Sensu Go backend holds its own state in a volume mount and must coordinate with its peers via reliable networking links. The StatefulSet manages the lifecycle of each requested copy of the Sensu Go backend pod as unique, making sure the networking and storage resources are reused if unhealthy pods need to be replaced.
Services: Services expose your deployments. This exposure can be to other deployments and/or to the outside world.
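A service manifest sketch for the same placeholder deployment could look like this (names are again illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  type: LoadBalancer   # expose to the outside world; ClusterIP keeps it internal
  selector:
    app: web-app       # route traffic to pods carrying this label
  ports:
  - port: 80
    targetPort: 80
```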
You interact with a cluster via the Kubernetes REST API. Rather than doing this by constructing HTTP requests yourself, you can use a handy command-line tool called kubectl.
Kubectl enables issuing commands against a cluster. These commands take the form below:
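```shell
kubectl [command] [TYPE] [NAME] [flags]
```

Here, command is a verb such as get, describe, create, or apply; TYPE is a resource type such as pod or deployment; NAME is the optional name of a specific resource; and flags modify the command's behavior.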
The kubectl tool can be easily installed with Homebrew on macOS, Chocolatey on Windows, or the appropriate package manager for your distribution on Linux. Better yet, recent versions of Docker Desktop on Mac or Windows (also easily installed with Homebrew or Chocolatey) include setup of a local single-node Kubernetes cluster and kubectl on your workstation.
With kubectl installed on your workstation, you’re almost ready to start issuing commands to a cluster. First you’ll need to configure and authenticate with any cluster with which you want to communicate.
You use the kubectl config command to set up access to your cluster or clusters and switch between the contexts you’ve configured.
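For example, to list the contexts you have configured and switch between them (the context name below is a placeholder):

```shell
kubectl config get-contexts            # list configured contexts
kubectl config use-context my-cluster  # switch to the context named my-cluster
```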
With access set up, you can start issuing commands. You’ll probably use the kubectl get and kubectl describe commands the most, as you’ll use them to see the states of your pods, deployments, services, secrets, etc.
The get verb will list resources of the type you specify:
kubectl get pods
The above will list the pods running in your cluster (more precisely, the pods running in a namespace on your cluster, but that adds more complexity than desired here).
This example gets the pod named fun-pod (if such a pod exists).
kubectl get pod fun-pod
Finally, the describe verb gives a lot more detail related to the pod named fun-pod.
kubectl describe pod fun-pod
You can also use the create verb to make resources in your cluster imperatively.
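For example, to create a deployment directly from the command line (the deployment name and image are placeholders):

```shell
kubectl create deployment fun-deployment --image=nginx
```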
Outside of learning, it’s generally preferable to create manifest files and use kubectl apply to put them into use. This is an especially good way to deploy applications from continuous deployment pipelines.
Teams write manifests in either JSON or YAML. A manifest can describe pods, services, deployments, and more. The specification of a deployment includes how many replicas of a pod type should run to constitute a healthy deployment.
Kubernetes creates or updates the resources in a file with the following command:
kubectl apply -f <filename>
You can easily start your active learning journey with Kubernetes with either a cluster in a public cloud or on your workstation. As mentioned earlier, Docker Desktop for Windows or Mac includes a Kubernetes installation. This makes it easy to run a cluster for learning, development, and testing purposes on your machine.
If you can’t or don’t want to use Docker Desktop, you can accomplish the same purpose (setting up a local cluster) by installing Minikube.
With either the Kubernetes installation with Docker Desktop or Minikube, you have a cluster on your machine with which you can interact. You can now use this setup for getting started and for trying deployments before you push them remotely.
Dive in and learn more
This is only the beginning. There’s a lot more to know before you become truly comfortable with Kubernetes. Such is the life of a technology professional!
Courses and resources exist that show more on how to gain confidence in using Kubernetes. The Kubernetes site itself has a wonderful “Learn Kubernetes Basics” section, with plenty of straightforward interactive tutorials. The best way to get up to speed is to get your hands dirty and start getting some experience. Install Docker Desktop or Minikube and start deploying!
Nominations are open for nine days, from December 12 through December 20.
We’ve extended the nomination deadline to January 4, 2020 at 12pm Pacific.
How do I get nominated?
If you are a maintainer of a graduated or incubating project, you can self-nominate, as explained below. Otherwise, you can ask a governing board member or end-user representative to nominate you.
The charter (section 6(e)(i)) says: “Each individual in a Selecting Group may nominate up to two (2) people, at most one (1) of whom may be from the same group of Related Companies. Each nominee must agree to participate prior to being added to the nomination list.”
How do maintainers get nominated?
Maintainers self-nominate by sending an email to email@example.com. They need to be endorsed (prior to the December 20th deadline) by two other maintainers from other projects and other companies (so, each nomination will, including the two endorsements, cover 3 projects and 3 companies). Details are in the new Maintainer Election Policy.
Why is the maintainers process different?
The Governing Board approved this new process to encourage broader representation.
What should go in a nomination?
Section 6(e)(i)(a) of the charter says: “A nomination requires a maximum one (1) page nomination pitch which should include the nominee’s name, contact information and supporting statement identifying the nominee’s experience in CNCF domains.”
Do the GB and End User nominations need endorsements?
Not this election, but if this new endorsements policy works for the maintainer seat, it may be expanded.
What is the evaluation and qualification process?
The Charter (section 6(e)(i)(c)) says: “A minimum of 14 calendar days shall be reserved for an Evaluation Period whereby TOC nominees may be contacted by members of the Governing Board and TOC.” Section 6(e)(ii) says: “After the Evaluation Period, the Governing Board and the TOC members shall vote on each nominee individually to validate that the nominee meets the qualification criteria. A valid vote shall require at least 50% participation. Nominees passing with over 50% shall be Qualified Nominees.”
Is the evaluation and qualification process the same for all nominees?
Guest post originally published on the Sonobuoy blog, by John Schnake
In Sonobuoy 0.15.4, we introduced the ability for plugins to report their progress to Sonobuoy by using a customizable webhook. Reporting status is incredibly important for long-running, opaque plugins like the e2e plugin, which runs the Kubernetes conformance tests.
We’re happy to announce that as of Kubernetes 1.17.0, the Kubernetes end-to-end (E2E) test framework will utilize this webhook to provide feedback about how many tests will be run, have been run, and which tests have failed.
This feedback helps you see if tests are failing (and which ones) before waiting for the entire run to finish. It also helps you identify whether tests are hanging or progressing.
How to Use It
There are two requirements for using this feature with the e2e plugin:
The conformance image used must correspond to Kubernetes 1.17 or later
Sonobuoy 0.16.5 or later must be used; we added this support prior to 0.17.0 to support Kubernetes prereleases.
First, start a run of the e2e plugin by running the following command, which kicks off a long-running set of tests:
$ sonobuoy run
Now, you can poll the status by using this command:
$ sonobuoy status --json | jq
After the tests start running, you will start to see output that includes a section like this:
"msg": "PASSED [sig-storage] HostPath should give a volume the correct mode [LinuxOnly] [NodeConformance] [Conformance]",
Voila! Anytime during a run, you can now check in and be more informed about how the run is going. As tests fail, the output will also return an array of strings with the test names in the failures field (and the “msg” field just reports the last test finished and its result). For example:
"msg": "FAILED [sig-network] [Feature:IPv6DualStackAlphaFeature] [LinuxOnly] should create service with ipv4 cluster ip [Feature:IPv6DualStackAlphaFeature:Phase2]",
"[sig-network] [Feature:IPv6DualStackAlphaFeature] [LinuxOnly] should create service with ipv4 cluster ip [Feature:IPv6DualStackAlphaFeature:Phase2]"
Q and A
Q: I’m using a new version of Kubernetes but am using an advanced test configuration I store as a YAML file. Can I still get progress updates?
A: Yes, there are just two environment variables for the e2e plugin that need to be set in order for this to work:
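In plugin YAML terms, the two variables would be set roughly as follows (the URL follows the localhost pattern Sonobuoy uses; treat the exact port and path as assumptions and check them against your Sonobuoy version):

```yaml
env:
- name: E2E_USE_GO_RUNNER
  value: "true"
- name: E2E_EXTRA_ARGS
  value: "--progress-report-url=http://localhost:8099/progress"  # assumed port/path
```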
The E2E_USE_GO_RUNNER value ensures that the conformance test image uses the Golang-based runner, which enables passing extra arguments when the tests are invoked. The E2E_EXTRA_ARGS value sets the flag to inform the framework about where to send the progress updates.
The status updates are just sent to localhost because the test container and the Sonobuoy sidecar are co-located in the same pod.
Q: I want to try out this feature but don’t have a Kubernetes 1.17.0 cluster available; how can I test it?
A: The important thing is that the conformance test image is 1.17 or later, so you can manually specify the image version if you just want to tinker. If the test image version and the API server version do not match, the results might not be reliable (it might, for instance, test features your cluster doesn’t support) and would not be valid for the Certified Kubernetes Conformance Program.
You can specify the version that you want to use when you run Sonobuoy; here’s an example:
sonobuoy run --kube-conformance-image-version=v1.17.0-beta.2
Q: I’d like to implement progress updates in my own custom plugin. How do I do that?
A: To see an example use of this feature, check out the readme file for the progress reporter. The Sonobuoy sidecar will always be listening for progress updates if your plugin wants to send them, so it is just a matter of posting some JSON data to the expected endpoint.
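As a minimal sketch, assuming the sidecar listens on localhost port 8099 at /progress (verify the exact endpoint against the progress reporter readme for your version), a plugin could post an update like this:

```shell
curl -s -X POST http://localhost:8099/progress \
  -d '{"msg": "processed item 5 of 10", "completed": 5, "total": 10}'
```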
Guest post by Zhimin Tang, Xiang Li and Fei Guo of Alibaba
Since 2015, the Alibaba Cloud Container Service for Kubernetes (ACK) has been one of the fastest growing cloud services on Alibaba Cloud. Today, ACK not only serves numerous Alibaba Cloud customers, but it also supports Alibaba’s internal infrastructure and many other Alibaba cloud services.
Like many other container services from world-class cloud vendors, reliability and availability are the top priorities for ACK. To achieve these goals, we built a cell-based and globally available platform to run tens of thousands of Kubernetes clusters.
In this blog post, we will share the experience of managing a large number of Kubernetes clusters on cloud infrastructure, as well as the design of the underlying platform.
Kubernetes has become the de facto cloud native platform to run various workloads. For instance, as illustrated in Figure 1, in Alibaba Cloud, more and more stateful and stateless applications, as well as application operators, now run in Kubernetes clusters. Managing Kubernetes has always been an interesting and serious topic for infrastructure engineers. When people talk about cloud providers like Alibaba Cloud, scale is always part of the conversation. But what does managing Kubernetes clusters at scale mean? In the past, we have presented our best practices for managing Kubernetes with 10,000 nodes. That is certainly an interesting scaling problem, but there is another dimension of scale: the number of clusters.
Figure 1. Kubernetes ecosystem in Alibaba Cloud
We have talked to many ACK users about cluster scale. Most of them prefer to run dozens, if not hundreds, of small or medium size Kubernetes clusters, for good reasons such as controlling the blast radius, separating clusters for different teams, and spinning up ephemeral clusters for testing. To globally support customers with this usage model, ACK needs to manage a large number of clusters across more than 20 regions reliably and efficiently.
Figure 2. The challenges of managing a massive number of Kubernetes clusters
What are the major challenges of managing clusters at scale? As summarized in Figure 2, there are four major issues that we need to tackle:
Different cluster types
ACK needs to support different types of clusters, including standard, serverless, Edge, Windows, and a few others. Different clusters require different parameters, components, and hosting models, and some of our customers need customizations to fit their use cases.
Different cluster sizes
Clusters vary in size, ranging from a few nodes to tens of thousands of nodes, and from several pods to thousands of pods. The resource requirements for the control planes of these clusters differ significantly, and a bad resource allocation might hurt cluster performance or even cause failures.
Different cluster versions
Kubernetes itself evolves very fast, with a new release cycle every few months. Customers are always willing to try new features, so they may run their test workloads against new versions while running production workloads on stable versions. To satisfy this requirement, ACK needs to continuously deliver new versions of Kubernetes to our customers while still supporting stable versions.
Different compliance requirements
The clusters are distributed across different regions, so they must comply with different compliance requirements. For example, a cluster in Europe needs to follow GDPR, and the financial cloud in China needs additional levels of protection. Failing to meet these requirements is not an option, since it introduces huge risks for our customers.
The ACK platform is designed to solve most of the above problems, and it currently manages more than 10,000 Kubernetes clusters globally in a reliable and stable fashion. Let us reveal how this is achieved by going through a few key ACK design principles.
Kube-on-kube and cell-based architecture
Cell-based architecture, compared to a centralized architecture, is common for scaling the capacity beyond a single data center or for expanding the disaster recovery domain.
Alibaba cloud has more than 20 regions across the world. Each region consists of multiple available zones (AZs), and typically maps to a data center. In a large region (such as the HangZhou region), it is quite common to have thousands of customer Kubernetes clusters that are managed by ACK.
ACK manages these Kubernetes clusters using Kubernetes itself, meaning that we run a meta Kubernetes cluster to manage the control planes of our customers’ Kubernetes clusters. This architecture is also known as the Kube-on-kube (KoK) architecture. KoK simplifies customer cluster management, since it makes cluster rollout easy and deterministic. More importantly, we can reuse the features that native Kubernetes provides. For example, we can use Deployments to manage API Servers, and use an etcd operator to operate multiple etcd clusters. Dogfooding always has its fun.
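For instance, treating a customer cluster's API server as an ordinary workload in the meta cluster could be sketched roughly like this (the names, image version, and flags below are illustrative placeholders, not ACK's actual configuration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-a-apiserver   # one control plane per customer cluster (placeholder name)
spec:
  replicas: 2                  # redundant API server copies
  selector:
    matchLabels:
      app: customer-a-apiserver
  template:
    metadata:
      labels:
        app: customer-a-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: k8s.gcr.io/kube-apiserver:v1.16.0           # placeholder version
        args: ["--etcd-servers=https://customer-a-etcd:2379"]  # placeholder etcd endpoint
```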
Within one region, we deploy multiple meta Kubernetes clusters to support the growth of the ACK customers. We call each meta cluster a cell. To tolerate AZ failures, ACK supports multi-active deployments in one region, which means the meta cluster spreads the master components of the customer Kubernetes clusters across multiple AZs, and runs them in active-active mode. To ensure the reliability and efficiency of the master components, ACK optimizes the placement of different components to ensure API Server and etcd are deployed close to each other.
This model enables us to manage Kubernetes effectively, flexibly and reliably.
Capacity planning for meta cluster
As we mentioned above, in each region, the number of meta clusters grows as the number of customers increases. But when shall we add a new meta cluster? This is a typical capacity planning problem. In general, a new meta cluster is instantiated when existing meta clusters run out of required resources.
Let’s take network resources as an example. In the KoK architecture, customer Kubernetes components are deployed as pods in a meta cluster. We use Terway (https://github.com/AliyunContainerService/terway; Figure 3), a high performance container networking plugin developed by Alibaba Cloud, to manage the container network. It provides a rich set of security policies and enables us to use the Alibaba Cloud Elastic Networking Interface (ENI) to connect to users’ VPCs. To provide network resources to nodes, pods, and services in the meta cluster efficiently, we need to do careful allocation based on the network resource capacity and utilization inside the meta cluster VPC. When we are about to run out of networking resources, a new cell is created.
We also consider cost factors, density requirements, resource quotas, reliability requirements, and statistical data to determine the number of customer clusters in each meta cluster, and then decide when to create a new meta cluster. Note that small clusters can grow into larger ones, and thus require more resources even if the number of clusters remains unchanged. We usually leave enough headroom to tolerate the growth of each cluster.
Figure 3. Cloud Native network architecture of Terway
Scaling the master components of customer clusters
The resource requirements of the Kubernetes master components are not fixed. The demand relates to the number of nodes and pods in the cluster, and to the number of custom controllers and operators that interact with the API Server.
In ACK, each customer Kubernetes cluster varies in size and runtime requirements. We cannot use the same configuration to host the master components for all user clusters. If we mistakenly set a low resource request for big customers, they may perform poorly. If we set a conservative high resource request for all clusters, resources are wasted for small clusters.
To carefully handle the tradeoff between reliability and cost, ACK uses a type-based approach. More specifically, we define different types of clusters: small, medium, and large. For each type of cluster, a separate resource allocation profile is used. Each customer cluster is associated with a cluster type, which is identified based on the load of the master components, the number of nodes, and other factors. The cluster type may change over time. ACK monitors the factors of interest constantly, and might promote or demote the cluster type accordingly. Once the cluster type changes, the underlying resource allocation is updated automatically with minimal user interruption.
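As a toy illustration of such threshold-based classification, the sketch below uses node count alone with invented thresholds; ACK's real classifier weighs many more factors, so treat every number here as an assumption for the example:

```shell
# Toy cluster-type classifier; thresholds are illustrative, not ACK's real values.
classify_cluster_type() {
  nodes=$1
  if [ "$nodes" -le 50 ]; then
    echo small
  elif [ "$nodes" -le 500 ]; then
    echo medium
  else
    echo large
  fi
}

classify_cluster_type 10     # small
classify_cluster_type 5000   # large
```

A real system would smooth such transitions (for example, with hysteresis) so that a cluster hovering near a threshold is not promoted and demoted repeatedly.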
We are working on both finer grained scaling and in-place type updates to make the transition smoother and more cost effective.
Figure 4. Multi gears and intelligent shifting
Evolving customer clusters at scale
Previous sections described some aspects of how to manage a large number of Kubernetes clusters. However, there is one more challenge to solve: the evolution of clusters.
Kubernetes is the “Linux” of the cloud native era. It keeps getting updated and becoming more modular. We need to continuously deliver new versions of Kubernetes in a timely fashion, fix CVEs and do upgrades for existing clusters, and manage a large number of related components (CSI, CNI, Device Plugin, Scheduler Plugin, and many more) for our customers.
Let’s take Kubernetes component management as an example. We first develop a centralized system to register and manage all these pluggable components.
Figure 5. Flexible and pluggable components
To ensure that an upgrade is successful before moving on, we developed a health checking system for the plugin components that performs both pre-upgrade and post-upgrade checks.
Figure 6. Precheck for cluster components
To upgrade these components fast and reliably, we support continuous deployment with grayscale, pausing and other functions. The default Kubernetes controllers do not serve us well enough for this use case. Thus, we developed a set of custom controllers for cluster components management, including both plugin and sidecar management.
For example, the BroadcastJob controller is designed for upgrading components on each worker machine, or for inspecting the nodes on each machine. A BroadcastJob runs a pod on each node in the cluster, similar to a DaemonSet; however, a DaemonSet always keeps a long-running pod on each node, while the pod in a BroadcastJob runs to completion. The BroadcastJob controller also launches pods on newly joined nodes to initialize them with the required plugin components. In June 2019, we open sourced OpenKruise, the cloud native application automation engine we use internally.
Figure 7. OpenKruise orchestrates a BroadcastJob to each node for flexible work
To help our customers with choosing the right cluster configurations, we also provide a set of predefined cluster profiles, including Serverless, Edge, Windows, and Bare Metal setups. As the landscape expands and the needs of our customers grow, we will add more profiles to simplify the tedious configuration process.
Figure 8. Rich and flexible cluster profiles for various scenarios
Global Observability Across Datacenters
As presented in Figure 9, the Alibaba Cloud Container service has been deployed in 20 regions around the world. Given this kind of scale, one key challenge to ACK is to easily observe status of running clusters so that once a customer cluster runs into trouble, we can promptly react to fix it. In other words, we need to come up with a solution to efficiently and safely collect the real-time statistics from the customer clusters in all regions and visually present the results.
Figure 9. Global deployment in 20 regions for Alibaba Cloud Container Service
Like many other Kubernetes monitoring solutions, we use Prometheus as the primary monitoring tool. For each meta cluster, the Prometheus agent collects the following metrics:
OS metrics, such as node resources (CPU, memory, disk, etc.) and network throughput;
Metrics for meta and guest K8s cluster’s control plane, such as kube-apiserver, kube-controller-manager, and kube-scheduler;
Metrics from kube-state-metrics and cAdvisor;
etcd metrics, such as etcd disk write time, DB size, peer throughput, etc.
The global stats collection is designed using a typical multi-layer aggregation model. The monitor data from each meta cluster is first aggregated in each region, and the data is then aggregated to a centralized server which provides the global view. To do this, we use Prometheus federation. Each data center has a Prometheus server to collect the data center’s metrics, and the central Prometheus is responsible for aggregating the monitoring data. An AlertManager connects to the central Prometheus and sends various alert notifications, by means of DingTalk, email, SMS, etc. The visualization is done by using Grafana.
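Pulling metrics from a lower layer uses Prometheus's standard /federate endpoint. A minimal scrape configuration sketch for a cascading or central server (the target addresses and the broad match expression are placeholders) might look like:

```yaml
scrape_configs:
- job_name: 'federate'
  honor_labels: true        # keep the original labels from the lower-layer servers
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job=~".+"}'         # pull all series; narrow this selector in practice
  static_configs:
  - targets:
    - 'edge-prometheus-1:9090'   # placeholder addresses
    - 'edge-prometheus-2:9090'
```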
In Figure 10, the monitoring system can be divided into three layers:
Edge Prometheus Layer
This layer is furthest away from the central Prometheus. An edge Prometheus server in each meta cluster collects metrics of the meta and customer clusters within the same network domain.
Cascading Prometheus Layer
The function of the cascading Prometheus layer is to collect monitoring data from multiple regions. Cascading Prometheus servers exist in larger regions such as China, Asia, Europe, and America. As the cluster count in a larger region grows, the region can be split into multiple new larger regions, each of which keeps its own cascading Prometheus. Using this strategy, we can flexibly expand and evolve the monitoring scale.
Central Prometheus Layer
Central Prometheus connects to all cascading Prometheus servers and performs the final data aggregation. To improve reliability, two central Prometheus instances are deployed in different AZs and connect to the same cascading Prometheus servers.
Figure 10. Global multi-layer monitoring architecture based on Prometheus federation
With the development of cloud computing, Kubernetes-based cloud-native technologies continue to promote the digital transformation of the industry. Alibaba Cloud ACK provides secure, stable, and high-performance Kubernetes hosting services, which have become one of the best ways to run Kubernetes in the cloud. The team behind Alibaba Cloud strongly believes in open source and its community. In the future, we will share our insights into operating and managing cloud native technologies.
The company behind the popular open source Grafana project, Grafana Labs offers customers a hosted metrics platform called Grafana Cloud, which incorporates Metrictank, a Graphite-compatible metrics service, and Cortex, the CNCF sandbox project for multitenant, horizontally scalable Prometheus-as-a-Service.
Grafana Labs engineers run Metrictank and Cortex to troubleshoot their own technical issues. But as the company started adding scale—Cortex and Metrictank each process tens of thousands of requests per second—query performance issues became noticeable. That latency negatively impacts Grafana Cloud customers’ user experience.
Without a way to visualize the path of requests end-to-end, the team attempted to solve the problem by guessing the cause of the slowness and rolling out a “fix”—“many times shooting in the dark, only to have our assumptions invalidated after a lot of experimentation,” says Software Engineer Goutham Veeramachaneni.
The Metrictank team had already been using Jaeger distributed tracing to understand requests better and to see all logs in one place. With that experience using Jaeger, “we doubled down on it with Cortex to improve the query performance,” says VP of Product Tom Wilkie. Jaeger allowed the team to drill down to specific requests and quickly find the queries that were causing latency. The results with Jaeger were stellar: Query performance was improved by as much as 10x.
As it turned out, Jaeger has also helped the Grafana Labs team with bug-hunting. “It’s easier to visualize where the problems are, and it just made me more confident at tackling things because I’m able to see exactly what’s going wrong,” says Veeramachaneni. With Jaeger in place, “the confidence in operating our system grew by an order of magnitude.”
TL;DR: Modern database management systems (DBMS) are notorious for being complicated and having too many configuration options, or “knobs,” which largely determine how the system performs. Performance tuning has traditionally been based on the experience and intuition of the database administrator (DBA), which makes database tuning seem like a “black art” mastered by only a few people. That’s why a good DBA is expensive and hard to find.
Fortunately, the fast advancement of AI and machine learning (ML) technologies is reshaping the way people manage and tune databases. Among the major players, Oracle’s Autonomous Database has been an important development. Microsoft applies AI in Azure SQL Database automatic tuning. On the academic side, we have Carnegie Mellon University’s OtterTune, which collects and analyzes configuration knobs and recommends possible settings by learning from previous tuning sessions. And Tencent implemented CDBTune, an end-to-end automatic tuning system for cloud databases based on deep reinforcement learning (RL). Given all these innovations, AI tuning for databases is really taking shape.
As the team behind TiDB, an open source distributed NewSQL database, we are always striving to make our databases easier to use. Our key-value (KV) storage engine, TiKV, uses RocksDB as the underlying store. Optimally configuring RocksDB can be difficult; even the most experienced RocksDB developers don’t fully understand the effect of each configuration change. Inspired by these pioneering breakthroughs in automatic tuning, we developed AutoTiKV, a machine-learning-based tuning tool that automatically recommends optimal knobs for TiKV. The goal is to decrease tuning costs and make life easier for DBAs. In the meantime, we would love to see where the ship of AI tuning takes us.
So far, our exploration of automatic tuning has been rewarding—machine learning technologies applied to a database can not only yield optimal and efficient tuning, but also help us understand the system better. In this post, I’ll discuss AutoTiKV’s design, its machine learning model, and the automatic tuning workflow. I’ll also share the results of experiments we ran to verify whether the tuning results are optimal and as expected. Finally, I’ll share some interesting and unexpected findings.
AutoTiKV is an open source tool developed for TiKV, a Cloud Native Computing Foundation (CNCF) incubating project. The project is available on GitHub.
Our story of exploration and exploitation
Automatically tuning a database works similarly to automated machine learning (AutoML), in which automated hyperparameter tuning plays an essential role. Generally, there are three types of tuning methods:
Random search. This method does not involve guided sampling based on current results. Its efficiency is relatively low.
Multi-armed bandit. This approach takes into account both the “exploration” and “exploitation” properties. Bandit combined with Bayesian optimization forms the core of traditional AutoML.
Deep reinforcement learning. The advantage of this method is the shift from “learning from data” to “learning from action.” However, data training may be difficult, and sometimes the results are hard to reproduce.
Currently, most database automatic tuning research, including OtterTune’s, adopts the latter two methods. We drew our inspiration from OtterTune, with the following adaptations:
AutoTiKV runs on the same machine as the database, instead of on a centralized training server like OtterTune. This avoids the problem of data inconsistency caused by different machine configurations.
We refactored OtterTune’s architecture to reduce its coupling to specific DBMSs, so that it’s more convenient to port the entire model and pipeline to other systems; AutoTiKV is more suitable for lightweight KV databases.
AutoTiKV can adjust knobs with the same names in different sessions, making it extremely flexible. In contrast, OtterTune can only adjust global knobs.
The machine learning model
AutoTiKV uses the same Gaussian process regression (GPR) as OtterTune does to recommend new knobs. This is a nonparametric model based on the Gaussian distribution. The benefits of GPR are:
Compared with other popular methods like neural networks, GPR is a nonparametric model which is less compute intensive. Also, in conditions with fewer training samples, GPR outperforms neural networks.
GPR estimates the distribution of the sample: the mean m(X) and the standard deviation s(X) of the prediction at point X. If there is not much data around X, s(X) is large, indicating a large deviation between sample X and the other data points. We can understand this intuitively: if there is insufficient data, the uncertainty is large, and this is reflected in a large standard deviation. Conversely, when the data is sufficient, the uncertainty is reduced, and the standard deviation decreases.
But GPR itself can only estimate the distribution of the sample. To get the final prediction, we need to apply the estimation to Bayesian optimization, which can be roughly divided into two steps:
Estimate the distribution of functions using GPR.
Use the acquisition function to guide the next sample; that is, give the recommended value.
When looking for new recommended values, the acquisition function balances the two properties of exploration and exploitation:
Exploration: The function explores new points in unknown areas where there is currently insufficient data.
Exploitation: The function uses the data for model training and estimation to find the optimal prediction in the known areas with sufficient data.
In the recommendation process, these two properties need to be balanced. Excessive exploitation can cause results to fall into local optima (repeatedly recommending known optimal points while better points remain to be found). Too much exploration can lead to low search efficiency (always exploring new areas without in-depth attempts in the current area). The core idea of balancing these two properties is:
When there is enough data, use the existing data for recommendation.
When there is not enough data, explore in the area with the least number of points so you can obtain the maximum amount of information in the least known area.
Applying GPR with Bayesian optimization can help us achieve this balance. As mentioned earlier, GPR can help us estimate m(X) and s(X), where m(X) can be used as the characterization value for exploitation, and s(X) can be used as the characterization value for exploration.
We use the Upper Confidence Bound (UCB) algorithm as the acquisition function. Suppose we need to find an X to make the Y value as large as possible, and U(X) is the definition of the acquisition function. Therefore, U(X) = m(X) + k*s(X), where k > 0 is an adjustable coefficient. We only need to find the X to make U(X) as large as possible.
If U(X) is large, either m(X) or s(X) may be large.
If s(X) is large, it means the difference among the data is large, and there is not much data around X. Therefore, the algorithm must explore new points in unknown areas.
If m(X) is large, it indicates that the mean value of the estimated Y is large, and the algorithm should find better points using the known data.
Coefficient k affects the proportion of exploration and exploitation. A larger k value means more exploration in new areas.
In the implementation, several candidate knobs are randomly generated at the beginning, and then their U(X) values are calculated based on the above model. The one with the largest U(X) value is identified as the recommendation result.
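As an illustrative sketch of this step (not AutoTiKV’s actual code), here is the candidate-scoring logic in Python using scikit-learn’s GPR; the knob vectors and metric values are synthetic:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: 10 knob vectors (3 knobs, normalized to [0, 1])
# and the metric observed for each, e.g. normalized throughput.
X_train = rng.uniform(size=(10, 3))
y_train = rng.uniform(size=10)

gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train, y_train)

# Randomly generate candidate knob settings, then score each with
# U(X) = m(X) + k * s(X) and recommend the highest-scoring one.
candidates = rng.uniform(size=(100, 3))
m, s = gpr.predict(candidates, return_std=True)   # mean and std per candidate

k = 2.0                       # larger k favors exploration
ucb = m + k * s
recommended = candidates[np.argmax(ucb)]
```

With k = 2.0, candidates in sparsely sampled regions (large s) can outrank candidates with a high predicted mean, which is exactly the exploration/exploitation trade-off described above.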
How AutoTiKV works
The following diagram and description show how AutoTiKV works:
You select the metric to be optimized; for example, throughput or latency.
All the training data (knob-metric pairs) are saved in the DataModel, where random knobs are generated as the initial data sets in the first 10 rounds. The DataModel sends the training data to the GPModel process.
GPR is implemented in the GPModel module. It trains the model and provides the recommended knobs.
The Controller includes functions that control TiKV directly. It changes TiKV knobs based on recommendations, runs benchmarking, and gets performance metrics.
The metrics and the corresponding knobs are sent back to the DataModel as a newly generated sample. The training data is updated incrementally.
By default, the entire process runs 200 rounds. You can also customize the number of rounds, or you can set the process to run until the results stabilize. AutoTiKV supports restarting TiKV after modifying parameters, or you can choose not to restart if it’s not required. You can declare the parameters to be adjusted and the metrics to be viewed in `controller.py`. The information of the DBMS is defined in `settings.py`.
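The loop can be sketched as follows; `random_knobs`, `recommend_knobs`, and `apply_knobs_and_benchmark` are hypothetical stand-ins for the DataModel, GPModel, and Controller roles described above, not AutoTiKV’s real API:

```python
import random

NUM_ROUNDS = 20              # AutoTiKV defaults to 200; shortened here
INITIAL_RANDOM_ROUNDS = 10   # bootstrap the training data with random knobs

knob_samples, metric_samples = [], []   # the DataModel's training data

def random_knobs():
    # Hypothetical: three knobs, each normalized to [0, 1]
    return [random.random() for _ in range(3)]

def recommend_knobs(knobs, metrics):
    # Stand-in for the GPModel: the real version fits GPR and maximizes
    # the acquisition function; here we just perturb the best-so-far sample.
    best = knobs[metrics.index(max(metrics))]
    return [min(1.0, max(0.0, v + random.uniform(-0.05, 0.05))) for v in best]

def apply_knobs_and_benchmark(knobs):
    # Stand-in for the Controller: the real version rewrites the TiKV config,
    # optionally restarts TiKV, runs go-ycsb, and parses the target metric.
    return 1.0 - sum((v - 0.6) ** 2 for v in knobs)   # toy objective

for rnd in range(NUM_ROUNDS):
    if rnd < INITIAL_RANDOM_ROUNDS:
        knobs = random_knobs()                          # random warm-up rounds
    else:
        knobs = recommend_knobs(knob_samples, metric_samples)
    metric = apply_knobs_and_benchmark(knobs)
    knob_samples.append(knobs)      # feed the new (knob, metric) sample back,
    metric_samples.append(metric)   # incrementally updating the training data
```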
To evaluate AutoTiKV, we simulated some typical TiKV workloads. For each type of workload, we selected the metric to optimize and a corresponding set of knobs for recommendation.
The experiment was conducted using the following environment:
CPU: AMD Ryzen5-2600
RAM: 32 GB
Storage: 512 GB NVME SSD
Operating system: Ubuntu 18.04
Installer: tidb-ansible v3.0.1
Database capacity: 80 GB
We simulated the following typical workloads using go-ycsb, a Go port of the Yahoo! Cloud Serving Benchmark (YCSB):
range-scan (both long and short)
We selected the following parameters as configuration knobs:

`block-size`: Sets the size of the data block where RocksDB saves the data. When RocksDB looks up a key, it needs to load the whole block where this key resides. For point-lookups, a larger block increases read amplification and decreases the performance. However, for range-scans, a larger block makes more efficient use of the disk bandwidth. Expected behavior: point-lookup, the smaller the better; range-scan, the larger the better.

`disable-auto-compactions`: Determines whether to disable auto compaction. Compaction takes up disk bandwidth and decreases the write speed. However, without compaction, Level0 files accumulate, which affects the read performance. Expected behavior: write-heavy, turning on is better; point-lookup and range-scan, turning off is better.

`bloom-filter-bits-per-key`: Sets the number of bits per key in the Bloom filter. For read performance, the bigger the better. Expected behavior: point-lookup and range-scan, the larger the better.

`optimize-filters-for-hits`: Disables (true) or enables (false) Bloom filters at the bottom layer of the log-structured merge (LSM) tree. Bloom filters at the bottom layer can be large and may take up block cache. If the key to query exists in the store, there is no need to enable Bloom filters for the bottom layer. Expected behavior: turning off is better.
We selected the following target metrics for optimization:
Throughput: Depending on the specific workload, it is divided into write-throughput, get-throughput, and scan-throughput.
Latency: Depending on the specific workload, it is divided into write-latency, get-latency, and scan-latency.
Throughput and latency are obtained from the output of go-ycsb.
The evaluation is based on the performance of the corresponding metrics and the comparison between the recommended configurations and the expected behaviors.
For all experiments, the first 10 rounds used randomly generated configurations, while the remaining rounds used configurations recommended by AutoTiKV.
The recommended result in this experiment is to enable compaction (`disable-auto-compactions==false`) and set the block size to 4 KB (`block-size==4k`). Theoretically, you need to disable compaction to improve the write performance. However, because TiKV uses Percolator for distributed transactions, a write process also involves read operations (for example, detecting write conflicts), which means turning off compaction can actually degrade write performance. Similarly, a smaller block size improves the performance for both point-lookup and write.
The recommendation for the `optimize-filters-for-hits` option was somewhat wavering—it alternated between two different values—but it didn’t affect the get-latency metric much in this case.
The overall recommended results are just as expected. Regarding the `optimize-filters-for-hits` option, it should be noted that the block cache was large enough in the experiment, so the size of the Bloom filter had little effect on the cache performance. On the other hand, the options we configured were for the defaultCF (column family; refer to How TiKV reads and writes for more details on column families in TiKV). For TiKV, before a query is performed in defaultCF, the corresponding key should already be known to exist. Therefore, whether the Bloom filter was enabled had little effect.
To further examine this, we adjusted the next experiment as follows:
Set `optimize-filters-for-hits` in writeCF (the default value for defaultCF is 0).
Set `bloom-filter-bits-per-key` in defaultCF and writeCF respectively, and use them as two knobs.
Adjust the workload by setting the `recordcount` of the run phase to twice that of the load phase, to measure the effect of the Bloom filter as much as possible. This way, half of the keys for the query won’t exist, so the recommended value for `optimize-filters-for-hits` should be `disable`.
The results are shown in the figure below, along with an interesting finding:
The only difference between their knobs is that the Bloom filter (`optimize-filters-for-hits==True`) is enabled for sample 20, while it’s disabled for sample 30. However, the throughput of sample 20 is actually a bit lower than that of sample 30, which is completely different from our expectations. To find out what happened, we took a closer look at the Grafana charts of the two samples. (In both cases, `block-cache-size` was 12.8 GB.)
In the figure, the left side of the pink vertical line is the load stage, and the right side is the run stage. It can be seen that the cache hits in these two cases are actually not much different, with sample 20 slightly lower. This is because the Bloom filter itself also consumes space. If the block cache size is sufficient, but the Bloom filter takes up a lot of space, the cache hit is affected. This finding, though not expected, is a positive demonstration that the ML model can help us in ways we would not otherwise be able to figure out from experience or intuition.
The recommended result in this experiment: `block-size==32KB` or `64KB`.
According to Intel’s solid-state drive (SSD) white paper, an SSD’s random read performance is similar for 32 KB and 64 KB block sizes. The number of bits in the Bloom filter has little effect on the scan operation. Therefore, the results of this experiment are also in line with expectations.
Beyond the recommendations
As indicated in the above experiments, although most of the recommended knobs are in line with our expectations, there were some deviations that helped us locate some important issues:
Some parameters have little effect on the results. For example, the scenario in which the parameter in question works is not triggered at all, or the hardware associated with it does not have a performance bottleneck.
Some parameters require the workload to run long enough to take effect. Therefore, dynamic adjustments may not show their effects immediately. For example, you must wait until the workload fills in the block cache to see the effect of increased block cache size on the overall cache hit.
The effect of some parameters is contrary to expectations. It was later found that the parameter in question actually has side effects in certain scenarios (such as the above example of the Bloom filter).
Some workloads are not entirely read or written, and some other operations may be involved. DBAs are likely to ignore this when they manually predict the expected effect (such as the write-heavy case). In an actual production environment, DBAs can’t know in advance what kind of workload they will encounter. This is probably where automatic tuning should kick in.
AutoTiKV has only been around for about four months. Due to constraints on time and effort, the current implementation still has some limitations compared to OtterTune:
AutoTiKV omits the workload mapping step by using only one default workload. OtterTune uses this step to pick from the repository the training sample most similar to the current workload.
AutoTiKV gives recommendations for user-specified knobs, while OtterTune uses lasso regression to automatically select important knobs to be tuned.
What does the future hold?
A complex system requires many trade-offs to achieve the best overall results, and making those trade-offs well requires a deep understanding of all aspects of the system. AutoTiKV, as our initial exploration of automatic database tuning technologies, not only yielded some positive signals in recommending knobs, but also helped us identify some issues.
What does the future hold for AutoTiKV? We will continuously improve it, focusing on the following:
Adapting to dynamically changing workloads
Avoiding getting trapped in local optima of the ML model by working with a larger search space
Drawing on different ML models to improve the efficacy of results and reduce the time required for recommendations
Of course, AutoTiKV means much more than just TiKV tuning. We would love to see this work be the start of a brand new world, and a new era of intelligent databases. We look forward to seeing AI and machine learning technologies applied in other projects such as TiDB and TiFlash. An AI-assisted SQL tuner/optimizer? Stay tuned!
The Cloud Native Computing Foundation (CNCF) has three main bodies: a Governing Board (GB) that is responsible for marketing, budget and other business oversight decisions for the CNCF, a Technical Oversight Committee (TOC) that is responsible for defining and maintaining the technical vision, and an End User Community (EUC) that is responsible for providing feedback from companies and startups to help improve the overall experience for the cloud native ecosystem. Each of these groups has its own defining role in the Foundation.
The TOC responsibilities and operations are defined in the CNCF Charter. The TOC is intended to be a resource of technical architects in the community that both guides the direction of the CNCF and acts as a resource to projects. Their role is also to help define what new projects are accepted to the CNCF.
The charter for the TOC was updated in September 2019. Below are the details of those changes.
The TOC is expanding from 9 members to 11 members; adding an additional End User seat and a new seat for non-sandbox project maintainers. The TOC will now consist of:
The next TOC Elections will take place in December 2019, with 3 GB-appointed TOC members up for election, as well as the new end user-appointed member and a non-sandbox project maintainer-appointed member, for a total of 5 seats. All TOC terms will be 2 years.
Nomination qualifications are defined in the charter. A nominee should be a senior engineer in the scope of the CNCF, have the available bandwidth to be invested in the TOC, and be able to operate neutrally in discussions regardless of company or project affiliation.
Each selecting group may nominate up to two people, and the nominee must agree to be nominated. A one-page nominating statement should include the nominee’s qualifying experience and contact information. Following the nomination, there is a two-week evaluation period for the Governing Board and TOC to review and contact nominees. After the evaluation period, the Governing Board and the TOC vote on each nominee to ensure qualification, with a 50% vote required to become a Qualified Nominee. We will publish the list of Qualified Nominees.
I’m happy to announce that CNCF will participate again in the Community Bridge Mentorship program. Like Google Summer of Code and Outreachy, Community Bridge is a platform that offers paid internships and mentorships.
The current round starts in December 2019 and will continue until March 2020. We are happy to offer five sponsored slots for students, who will work on CNCF projects. If you are a project maintainer and you’d like to have your project participate in Community Bridge, please submit a PR as described. Once the project list is finalized, mentees will be able to apply via the Community Bridge platform directly.
Please submit your project ideas to GitHub by December 16, 2019.
When it was founded in 2000, uSwitch helped consumers in the U.K. compare prices for their utilities. The company eventually expanded to include comparison and switching tools for a wide range of verticals, including broadband service, credit cards, and insurance.
Internally, those verticals translated into a decentralized technology infrastructure. Teams—which were organized around the markets their applications price-compared—ran all their own AWS infrastructure and were responsible for configuring load-balancers, EC2 instances, ECS cluster upgrades, and more.
Though many teams had converged on the idea of using containerization for web applications and deploying to ECS, “everyone was running different kinds of clusters with their own way of building and running them, with a whole bunch of different tools,” says Head of Engineering Paul Ingles. With the increasing cloud and organizational complexity, they had difficulty scaling teams. “It was just inefficient,” says Infrastructure Lead Tom Booth. “We wanted to bring some consistency to the infrastructure that we had.”
Booth and Ingles had both experimented with Kubernetes before, when the company first adopted containerization. Separately, the infrastructure team had done an evaluation of several orchestration systems, in which Kubernetes came out on top. In late 2016, the two began working on building a Kubernetes platform for uSwitch.
Today, all 36 teams at uSwitch are running at least some of their applications on the new platform, and as a result, the rate of deployments has increased almost 3x. The release rate per person per week has almost doubled, with releases now happening as often as 100 times a day. Moreover, those improvements have been sustained, even as more people were added to the platform.
In the security world, one of the most established methods to identify that a system has been compromised, abused, or misconfigured is to collect logs of all the activity performed by the system’s users and automated services, and to analyze these logs.
Audit Logs as a Security Best Practice
In general, audit logs are used in two ways:
Proactively identifying non-compliant behavior. Based on a configured set of rules that should faithfully capture any violation of the organization’s policies, an investigator finds audit log entries that prove non-compliant activity has taken place. With automated filters, a collection of such alerts is periodically reported to compliance investigators.
Reactively investigating a specific operational or security problem. A known problem is traced back to the responsible party, root causes, or contributing factors by a post-mortem investigation, following deduced associations from each state back to the action that caused it and the state that preceded it.
Kubernetes Audit Logs
Let us examine how audit logs are configured and used in the Kubernetes world, what valuable information they contain, and how they can be utilized to enhance the security of the Kubernetes-based data center.
The Kubernetes audit log is intended to enable the cluster administrator to forensically recover the state of the server and the series of client interactions that resulted in the current state of the data in the Kubernetes API.
In technical terms, Kubernetes audit logs are detailed descriptions of each call made to the Kubernetes API-Server. This Kubernetes component exposes the Kubernetes API to the world. It is the central touch point that is accessed by all users, automation, and components in the Kubernetes cluster. The API server implements a RESTful API over HTTP and performs all the API operations.
Upon receiving a request, the API server processes it through several steps:
Authentication: establishing the identity associated with the request (a.k.a. principal). There are several mechanisms for authentication.
RBAC/Authorization: The API server determines whether the identity associated with the request can access the combination of the verb and the HTTP path in the request. If the identity of the request has the appropriate role, it is allowed to proceed.
Admission Control: determines whether the request is well formed and potentially applies modifications to the request before it is processed.
Validation: ensures that a specific resource included in a request is valid.
Perform the requested operation. The types of supported operations include:
Create resource (e.g. pod, namespace, user-role)
Delete a resource or a collection of resources
List resources of a specific type (e.g. pods, namespaces), or get a detailed description of a specific resource
Open a long-running connection to the API server, and through it to a specific resource. Such a connection is then used to stream data between the user and the resource. For example, this enables the user to open a remote shell within a running pod, or to continuously view the logs of an application that runs inside a pod.
Watch a cluster resource for changes.
A request and its processing steps may be stored in the Kubernetes audit log. The API server may be configured to store all or some of these requests, with varying degrees of details. This audit policy configuration may also specify where the audit logs are stored. Analysis tools may request to receive these logs through this hook.
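For reference, here is what an individual audit event looks like in the `audit.k8s.io/v1` format (abridged, with illustrative field values); pulling out the principal, verb, and target resource is straightforward:

```python
import json

# An abridged audit event in the audit.k8s.io/v1 format (values are illustrative)
raw = '''
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "7e0cbccf-8d8a-4f1f-bf84-c2c0f5f0d1a1",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/pods/my-pod",
  "verb": "get",
  "user": {"username": "jane", "groups": ["system:authenticated"]},
  "sourceIPs": ["10.0.0.12"],
  "objectRef": {"resource": "pods", "namespace": "default", "name": "my-pod"},
  "responseStatus": {"code": 200}
}
'''

event = json.loads(raw)
principal = event["user"]["username"]                 # who made the request
action = (event["verb"],                              # what was done, to what
          event["objectRef"]["resource"],
          event["objectRef"]["namespace"])
```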
Challenges of Auditing a Kubernetes Cluster
While the principles of audit logs collection and analysis naturally apply to the cloud, and specifically to data centers built on Kubernetes, in practice the scale, the dynamic nature and the implied context of such environments make analyzing audit logs difficult, time consuming and expensive.
The scale of the activities in the cluster means any analysis that relies solely on manual inspection of many thousands of daily log entries is impractical. A “quiet” cluster with no human-initiated actions and no major shifts in application activity is still processing thousands of API calls per hour, generated by the internal Kubernetes mechanisms that ensure that the cluster is alive, its resources are utilized according to the specified deployments, and failures are automatically identified and recovered from. Even with log filtering tools, the auditor needs a lot of experience, intuition, and time to be able to zoom in on a few interesting entries.
The dynamic nature of a system like a Kubernetes cluster means that workloads are being added, removed or modified at a fast pace. It’s not a matter of an auditor focusing on access to a few specific workloads containing a database – it’s a matter of identifying which workloads contain a sensitive database at each particular instant in the audited time period, which users and roles had legitimate reason to access each of these database-workloads at what times, and so on.
Furthermore, while finding some interesting results is just a matter of finding the specific entries in the log that are known in advance to correlate with undesirable activity, finding suspicious but previously unknown activity in the logs requires a different set of tools and skills, especially if this suspicious behavior can only be understood from a wider context over a prolonged period, and not just from one or two related log entries. For example, it is rather simple to detect a user’s failure to authenticate to the system, as each login attempt shows up as a single log entry. However, a potential theft of user credentials may only be detected if the auditor connects seemingly unrelated entries into a whole pattern: for example, access to the system using a specific user’s credentials from a previously unknown Internet address outside the organization, while the same user’s credentials are used concurrently to access the system from within the organization’s network.
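A minimal sketch of such a correlation rule in Python, assuming an internal network of 10.0.0.0/8 and a five-minute window as the definition of “concurrent” (both thresholds, and the sample events, are illustrative):

```python
from ipaddress import ip_address, ip_network
from datetime import datetime, timedelta

INTERNAL = ip_network("10.0.0.0/8")     # assumed corporate network range
WINDOW = timedelta(minutes=5)           # "concurrent" means within 5 minutes

def is_internal(ip):
    return ip_address(ip) in INTERNAL

def concurrent_split_access(events):
    """Flag users whose credentials are used from both an internal and an
    external address within the same time window."""
    flagged = set()
    by_user = {}
    for ts, user, ip in events:          # events: (timestamp, user, source IP)
        by_user.setdefault(user, []).append((ts, ip))
    for user, uses in by_user.items():
        uses.sort()
        for i, (t1, ip1) in enumerate(uses):
            for t2, ip2 in uses[i + 1:]:
                if t2 - t1 > WINDOW:
                    break                # sorted, so later entries are farther
                if is_internal(ip1) != is_internal(ip2):
                    flagged.add(user)    # same credentials, inside + outside
    return flagged

events = [
    (datetime(2019, 12, 1, 9, 0), "jane", "10.0.0.12"),     # internal
    (datetime(2019, 12, 1, 9, 2), "jane", "203.0.113.50"),  # external, same window
    (datetime(2019, 12, 1, 9, 3), "bob",  "10.0.0.33"),
]
suspicious = concurrent_split_access(events)
```

A real detector would of course learn what “previously unknown address” means per user rather than rely on a fixed network range, which is exactly where the wider-context analysis discussed above comes in.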
Making Log Auditing a Viable Practice Again
In order to make the audit of a large, complex Kubernetes cluster a viable practice, we need to adapt the auditor’s tools to this environment. Such tools would automatically and proactively identify anomalous and problematic behavior, specifically in the context of Kubernetes control plane security, in order to:
Detect security-related abuse of Kubernetes cluster, especially behavior that can only be detected from observing extended context over multiple activities.
Focus compliance investigations on Kubernetes misuses that are beyond detection by simple log filtering rules.
Of course, in order to achieve these goals, such a tool must be able to:
Automatically analyze Kubernetes Audit logs, detecting anomalous behavior of users and automated service accounts and anomalous access to sensitive resources.
Summarize the detected anomalies as well as important trends and statistics of the audit information for user-friendly understanding. At the end of the day, the auditor should have enough information that she can understand, qualify or ignore the results of the automatic analysis.
Let us describe some of the more complex threat scenarios that we would like the envisioned audit log analyzer to detect automatically:
Adversaries may steal the credentials of a specific user or service account (outside the cluster) or capture credentials earlier in their reconnaissance process through social engineering as a means of gaining initial access to cluster resources.
Adversaries may use token theft (inside the cluster, from compromised or legitimately accessible resources) or token impersonation to operate under a different user or service account (i.e., security context) to perform actions for lateral movement, privilege escalation, data access and data manipulation, while evading detection.
Adversaries may use insufficient or misconfigured RBAC to gain access under their own security context to privileged and sensitive resources (cluster’s secrets, configuration maps, authorization policies etc.) for lateral movement, privilege escalation, data access and data manipulation.
Adversaries may exploit vulnerabilities in Kubernetes API Server (authentication, authorization, admission control or validation requests processing phases) to gain access to privileged and sensitive resources (cluster’s secrets, configuration maps, authorization policies etc.) for initial access, lateral movement, privilege escalation, data access and data manipulation.
Obviously, detecting such scenarios goes far beyond simple filtering of the audit log using predetermined rules. Sophisticated machine learning algorithms must be used to achieve this automatically, at scale, and within a reasonable time after the actual threatening activity occurs.
The analyzer tool should collect features of the operational and security activity of the cluster as they are presented in the log and feed them to the machine learning algorithm for measurement, weighting and linking together. At the end of this process, possible patterns of suspicious behavior can be presented to the auditor for validation and further investigation.
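As a minimal sketch of that pipeline, per-principal features aggregated from the log can be scored with an off-the-shelf anomaly detector; the feature choice and the numbers below are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-principal features aggregated from audit entries:
# [requests per hour, distinct resources touched, fraction of failed requests]
features = np.array([
    [120.0,  3.0, 0.01],   # typical service account
    [115.0,  4.0, 0.02],
    [130.0,  3.0, 0.01],
    [118.0,  2.0, 0.00],
    [125.0,  3.0, 0.02],
    [ 20.0, 45.0, 0.60],   # unusual: few requests, many resources, many failures
])

detector = IsolationForest(contamination=0.2, random_state=0).fit(features)
labels = detector.predict(features)            # -1 marks anomalies
anomalous_rows = np.where(labels == -1)[0]     # principals to surface
```

In practice the features would be computed from parsed audit events over a sliding time window, and each flagged principal would be presented to the auditor together with the supporting log entries.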
Auditing of system logs is a well-established practice for identifying security threats to a system, whether before these threats have a detrimental effect or after the threat is realized. In fact, some forms of periodic audit are mandated by laws and regulations.
Nevertheless, identifying suspicious patterns in the audit logs of complex, large-scale, dynamic systems like modern-day Kubernetes clusters is a formidable task. While using some sort of automation is mandatory for such an analysis, most existing audit tools are just mindless filters, hardly assisting the auditor in the deeper challenges of her task.
In this article we presented a vision for an automated Kubernetes audit log analyzer that goes far beyond that. Using machine learning, such a tool can autonomously detect potentially threatening patterns in the log that the auditor can focus on, even in real time. Additionally, summarizing the information in the audit log in a user-digestible way lets the auditor quickly validate the identified patterns and helps her investigate additional hidden suspicious activities.