

OpenTelemetry Best Practices (Overview Part 2/2)


Member Post

Guest post originally published on the Epsagon blog by Ran Ribenzaft, co-founder and CEO at Epsagon

In the second article of our OpenTelemetry series, we’ll focus on best practices for using OpenTelemetry, after covering the OpenTelemetry ecosystem and its components in part 1.

As a reminder, OpenTelemetry is an exciting new observability ecosystem with a number of leading monitoring companies behind it. It is a provider-agnostic observability solution supported by the CNCF and represents the third evolution of open observability after OpenCensus and OpenTracing. OpenTelemetry is a brand new ecosystem and is still in the early stages of development. Because of this, there are not yet many widespread best practices. This article outlines some of the best practices that are currently available and what to consider when using OpenTelemetry. Additionally, it explores the current state of OpenTelemetry and explains the best way to get in touch with the OpenTelemetry community.

OpenTelemetry Best Practices

Just because it’s new doesn’t mean there aren’t some good guidelines to keep in mind when implementing OpenTelemetry. Below are our top picks.

Keep Initialization Separate from Instrumentation

One of the biggest benefits of OpenTelemetry is that it enables vendor-agnostic instrumentation through its API. This means that all telemetry calls made inside of an application come through the OpenTelemetry API, which is independent of any vendor being used. As you see in Figure 1, you can use the same code for any supported OpenTelemetry provider:

Separation between the vendor provider and instrumentation. (Source: GitHub)

This example shows how exporting is decoupled from instrumentation, so all that’s required of the instrumentation is a call to getTracer. The instrumentation code doesn’t have any knowledge of the providers (exporters) that were registered, only that it can get and use a tracer:

const tracer = opentelemetry.trace.getTracer('example-basic-tracer-node');

An additional best practice for OpenTelemetry here is to keep the provider configuration at the top level of your application or service, usually in the application’s entry point. This keeps the provider initialization separate from the instrumentation calls and allows you to choose the best tracing framework for your use case without having to change any instrumentation code. Separating the provider configuration from the instrumentation enables you to switch a provider simply with a flag or environment variable.

A CI environment running integration tests may not want to provision Jaeger, Zipkin, or another tracing/metrics backend, often to reduce costs or complexity by removing moving parts. For example, in local development the tracer or metrics could use an in-memory exporter, while production uses a hosted SaaS. Keeping the initialization of a provider decoupled from instrumentation makes it easy to switch providers depending on your environment, and it also makes a later vendor switch painless.
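To make this concrete, here is a minimal sketch of such an entry point using the OpenTelemetry JavaScript SDK. It picks an exporter based on an environment variable while the instrumentation code keeps calling getTracer as before. The package and class names follow the JS SDK packages current around the time of writing and may differ in your version, and the TRACE_EXPORTER variable is purely illustrative:

// Entry point: decide once which exporter to register.
const opentelemetry = require('@opentelemetry/api');
const { BasicTracerProvider, SimpleSpanProcessor,
        ConsoleSpanExporter, InMemorySpanExporter } = require('@opentelemetry/tracing');

const provider = new BasicTracerProvider();

if (process.env.TRACE_EXPORTER === 'console') {
  // e.g. local development: print spans to stdout
  provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
} else {
  // e.g. CI: keep spans in memory, no external tracing backend needed
  provider.addSpanProcessor(new SimpleSpanProcessor(new InMemorySpanExporter()));
}
provider.register();

// Instrumentation anywhere else in the code base stays provider-agnostic:
const tracer = opentelemetry.trace.getTracer('example-basic-tracer-node');

In a production build, the same conditional could register a Jaeger, Zipkin, or vendor exporter instead, with no change to the instrumentation code.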

Know the Configuration Knobs

OpenTelemetry tracing supports two strategies for getting traces out of an application: a “SimpleSpanProcessor” and a “BatchSpanProcessor.” The SimpleSpanProcessor submits each span as soon as it is finished, while the BatchSpanProcessor buffers spans until a flush event occurs. Flush events can occur when the buffer fills up or when a timeout is reached.

The BatchSpanProcessor has a number of properties:

  • Max Queue Size is the maximum number of spans buffered in memory. Any span beyond this will be discarded.
  • Schedule Delay is the time between flushes. This ensures that you don’t get into a flush loop during times of heavy traffic.
  • Max per batch is the maximum number of spans that will be submitted during each flush.

BatchSpanProcessor configuration options. (Source: GitHub)

When the queue is full, the framework begins to drop new spans (load shedding), meaning that data loss can occur if these limits aren’t configured correctly. Without hard limits, the queue could grow indefinitely and affect the application’s performance and memory usage. This is especially important for “on-line” request/reply services but is also necessary for asynchronous services.
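As a sketch of how these knobs fit together, the snippet below registers a BatchSpanProcessor with explicit limits using the JavaScript SDK. The option names (maxQueueSize, scheduledDelayMillis, maxExportBatchSize) come from more recent versions of the JS SDK and have changed between releases, so treat them as an assumption and verify them against your SDK’s source, as the next section suggests:

const { BasicTracerProvider, BatchSpanProcessor,
        ConsoleSpanExporter } = require('@opentelemetry/tracing');

const provider = new BasicTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(new ConsoleSpanExporter(), {
  maxQueueSize: 2048,         // spans buffered in memory; anything beyond this is dropped
  scheduledDelayMillis: 5000, // time between flushes
  maxExportBatchSize: 512,    // spans submitted per flush
}));
provider.register();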

Look For Examples… and Sometimes Tests

OpenTelemetry is still a very young project, which means the documentation for most of the libraries is still sparse. To illustrate this, the configuration options for the BatchSpanProcessor discussed above are not even documented in the Go OpenTelemetry SDK! The only way to find examples is by searching the code:

BatchSpanProcessor configuration usage. (Source: GitHub)

Since the project is still in active development and the focus is on features and implementation, this makes sense. So, if you need answers, go to the source code!

Use Auto-Instrumentation by Default… but Be Careful of Performance Overhead

OpenTelemetry supports auto-tracing for Java, Ruby, Python, JavaScript, and PHP. Auto-tracing will automatically capture traces and metrics for built-in libraries such as:

  • HTTP Clients
  • HTTP Servers & Frameworks
  • Database Clients (Redis, MySQL, Postgres, etc.)

Auto-instrumentation significantly decreases the barrier to adopting observability, but you need to monitor it closely because it adds additional overhead to program execution.

During a normal execution, the program calls out directly to the HTTP client or database driver. Auto-instrumentation wraps these functions with additional functionality that costs time and resources in terms of memory and CPU. Because of this, it’s important to benchmark your application with auto-tracing enabled vs. auto-tracing disabled to verify that the performance is acceptable.

Unit Test Tracing Using Memory Span Exporters

Most of the time, unit testing focuses on program logic and ignores telemetry. But occasionally, it is useful to verify the correct metadata is present, including tags, metric counts, and trace metadata. A mock or stub implementation that records calls is necessary to achieve this.

Most languages don’t document usage of these constructs, so remember that the best place to find examples of usage is in the actual OpenTelemetry unit tests for each project!

Initializing meter in a unit test. (Source: GitHub)

Using a metric meter in a unit test. (Source: GitHub)

Tests are able to configure their own test exporter (“meter” in Figure 4 above). Remember that OpenTelemetry separates instrumentation from exporting, which allows the production code to use a separate exporter from testing. The example in Figure 5 shows a test configuring its own “meter,” making calls to it, and then making assertions on the resulting metric. This allows you to test that your code is setting metrics and metric tags correctly.
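The same pattern applies to tracing. Below is a hypothetical unit test built on the JS SDK’s InMemorySpanExporter: it wires a tracer to an in-memory exporter, exercises the code under test (represented here by a hand-created span), and then asserts on the recorded span and its attributes. Class and package names follow the JS SDK of the time and may differ in your version:

const assert = require('assert');
const { BasicTracerProvider, SimpleSpanProcessor,
        InMemorySpanExporter } = require('@opentelemetry/tracing');

// Test-only provider: spans are recorded in memory instead of being exported.
const exporter = new InMemorySpanExporter();
const provider = new BasicTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
const tracer = provider.getTracer('unit-test');

// In a real test this span would be created by the code under test.
const span = tracer.startSpan('handle-request');
span.setAttribute('http.status_code', 200);
span.end();

const spans = exporter.getFinishedSpans();
assert.strictEqual(spans.length, 1);
assert.strictEqual(spans[0].attributes['http.status_code'], 200);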

Customization & Beta Versions

OpenTelemetry officially reached its beta phase on March 30, 2020! Since the project is still young, each language has different levels of support, documentation, examples, and exporters available. So, make sure to check before assuming support for a specific provider.

The best place to find information is in the language-specific repo in GitHub under the OpenTelemetry organization or in the language-specific Gitter channel. The official OpenTelemetry project page is another good source of information.

Feedback!

OpenTelemetry has a very active community on Gitter, with a global channel available at open-telemetry/community. Each language has its own Gitter channel as well, and there are tons of opportunities to contribute, especially to documentation and exporters. Since OpenTelemetry is young, even the specification is still in active development, so this is a great time to give feedback and get involved in OpenTelemetry.

Conclusion

OpenTelemetry has recently reached beta and is available for use. It’s important to remember that the libraries and ecosystem here are still very young. Make sure to consider these OpenTelemetry best practices and dig for documentation in tests and examples:

  • Keep instrumentation separate from exporting.
  • Know what OpenTelemetry is doing under the hood.
  • Look for tests as documentation.

Also, if you ever run into issues, the OpenTelemetry community is very helpful and easily accessible through Gitter.

KubeCon + CloudNativeCon EU Virtual – NEW schedule is live!


CNCF Staff Blog

We are thrilled to announce that the new schedule for KubeCon + CloudNativeCon Europe 2020 – Virtual is live! 

We have been hard at work confirming speakers and making this an amazing virtual experience.

As our physical event has shifted to a virtual one, we have been taking thoughtful actions to create an immersive experience that provides you with interactive content and collaboration with peers. As an attendee, you will have the ability to network with other attendees, attend presentations with live Q&A, interact with sponsors in real time, and much more.

Below you can find all the exciting information about how to interact in our new digital experience. 

 #KeepCloudNativeConnected Worldwide

Live content will begin daily at 13:00 CEST with presentations over four consecutive days. Keynotes will be in the middle of each day’s schedule to make it easier for a variety of time zones to participate.

 

 

 

#KeepCloudNativeConnected at Any Hour of the Day

Did you miss a keynote, or did two concurrent sessions force you to choose? The event platform will be accessible 24 hours a day to all registered attendees. That means anyone in any time zone can view what has been released, including all presentations and sponsor content.

You can also visit sponsor booths to check out the content they are sharing or meet up with people in the community for a chat whenever it’s convenient for you! 

 

 Keep Cloud Native Questioning

All tutorial, maintainer track, and breakout sessions will be presented in a scheduled time slot and the presenter(s) will be free to talk to you via live Q+A! Review the schedule ahead of time and get your questions ready to participate in lively community discussions.

 

 

 Keep Cloud Native Delighted

Don’t just lurk in the session shadows – get engaged and have some fun! Opportunities will abound for interactive learning and networking, including the ability to earn points and win prizes, collaborate with peers in networking lounges, calm your mind with a mini chair yoga session, or break up the day with a live musical performance or game.

 

#KeepCloudNativeConnected to Our Sponsors

Diamond, platinum, and gold sponsors will be presenting demos of their products throughout the event – don’t miss the latest and greatest they have to showcase to the community!

Connect with sponsors of all levels in the showcase, where you can visit their virtual booths to view videos, download resources, chat 1:1 with a company representative, and collect virtual swag.

 

 

 Keep Cloud Native Learning

Along with your event registration, you will receive a 50% off voucher for the CKA/CKAD exam + training bundle as well as 30% off all other Linux Foundation training. More details will be provided at the start of the event to those who attend.

 

 

 

We are so excited to see you online on August 17-20!

Be sure to register now for $75 to access the full experience! If you have any questions, feel free to reach out!

 

How JD.com Saves 60% Maintenance Time Using Harbor for Its Private Image Central Repository


CNCF Staff Blog

JD.com is the world’s third largest Internet company by revenue, and at its heart it considers itself “a science and technology company with retail at its core,” says Vivian Zhang, Product Manager at JD.com and CNCF Ambassador.

JD’s Infrastructure Department is responsible for developing the hyperscale containerized and Kubernetes platform that powers all facets of JD’s business, including the website, which serves more than 380 million active customers, and its e-commerce logistics infrastructure, which delivers over 90% of orders same- or next-day.

In 2016, the team was in need of a cloud native registry to provide maintenance for its private image central repository. Specifically, the platform needed a solution for:

  1. User authorization and basic authentication
  2. Access control for authorized users
  3. UI management for the JD Image Center
  4. Image vulnerability scanning for image security

The team evaluated a number of different solutions, including the native registry. But there were two main problems with the native registry: the need to implement authorization certification, and having the meta information on images traverse the file system. As such, with the native registry, “the performance could not meet our requirements,” says Zhang. 

Harbor, a cloud native registry project that just graduated within CNCF, turned out to be “the best fit for us,” says Zhang. “Harbor introduces the database, which can accelerate the acquisition of image meta information. Most importantly, it delivers better performance.”

Harbor also integrates Helm’s chart repository function to support the storage and management of the Helm chart, and provides user resource management, such as CPU, memory, and disk space — which were valuable to JD. 

“We found that Harbor is most suitable in the Kubernetes environment, and it can help us address our challenges,” says Zhang. “We have been a loyal user of Harbor since the very beginning.”

In order for it to work best at JD, “Harbor is used in concert with other products to ensure maximum efficiency and performance of our systems,” says Zhang. Those products include the company’s own open source project (and now a CNCF sandbox project), ChubaoFS, a highly available distributed file system for cloud native applications that provides scalability, resilient storage capacity, and high performance. With ChubaoFS serving as the backend storage for Harbor, multiple instances in a Harbor cluster can share container images. “Because we use ChubaoFS, we can achieve a highly available image center cluster,” says Zhang.

Indeed, Harbor has made an impact at JD. “We save approximately 60% of maintenance time for our private image central repository due to the simplicity and stability of Harbor,” says Zhang. Plus, Harbor has enabled authorization authentication and access control for images, which hadn’t been possible before.

For more about JD.com’s usage of Harbor, read the full case study.

 

Rust at CNCF


Staff Post

Rust is a systems language originally created by Mozilla to power parts of its experimental Servo browser engine. Once highly experimental and little used, Rust has become dramatically more stable and mature in recent years and is now used in a wide variety of settings, from databases to operating systems to web applications and far beyond. And developers seem to really love it.

You may be surprised to find out that the venerable Rust has established a substantial toehold here at CNCF as well. In fact, two of our incubating projects, TiKV and Linkerd, and one of our sandbox projects, OpenEBS, have essential components written in Rust; each of these projects would be profoundly different, and potentially less successful, in another language.

In this post, I’d like to shed light on how TiKV and Linkerd are contributing to the Rust ecosystem.

TiKV

TiKV is a distributed, transactional key-value database originally created by the company PingCAP. Its core concepts are drawn from the venerable Google Spanner and Apache HBase and it’s primarily used to provide lower-level key/value—the “KV” in “TiKV”—storage for higher-level databases, such as TiDB.

In addition to the core repo, the TiKV project has contributed a number of libraries to the Rust ecosystem:

  • grpc-rs, a Rust wrapper for gRPC core.
  • raft-rs, a Rust implementation of the Raft consensus protocol. This is the consensus protocol used by TiKV as well as etcd, the distributed key-value store used by Kubernetes and a fellow CNCF project.
  • fail-rs, for injecting “fail points” at runtime.
  • async-speed-limit, a library for asynchronously speed-limiting multiple byte streams.
  • rust-prometheus, a Prometheus client for Rust that enables you to instrument your Rust services, i.e. to expose properly formatted metrics to be scraped by Prometheus.
  • pprof-rs, a CPU profiler that can be integrated into Rust programs. Enables you to create flame graphs of CPU activity and offers support for Protocol Buffers output.

PingCAP’s blog has also featured some highly regarded articles on Rust, including The Rust Compilation Model Calamity and Why did we choose Rust over Golang or C/C++ to develop TiKV? If you’re like me and excited about witnessing a new generation of databases written in Rust, you should really keep tabs on TiKV and its contributions to the Rust ecosystem.

Linkerd

Linkerd is a service mesh that’s relentlessly focused on simplicity and user-friendliness. If you’ve ever felt frustrated or overwhelmed by the complexity of other service mesh technologies, I cannot recommend the breath of fresh air that is the Linkerd Getting Started guide more highly. And in case you missed it, Linkerd had a huge 2019 and is continuing apace in 2020.

Arguably the most important component of Linkerd is its service proxy, which lives alongside your services in the same Kubernetes Pod and handles all network traffic to and from the service. Service proxies are hard to write because they need to be fast, they need to be safe, and they need to have the smallest memory footprint that’s commensurate with speed and safety.

The Linkerd creators opted for Rust for the Linkerd service proxy. Why did they make this choice? I reached out to Linkerd co-creator Oliver Gould to provide the breakdown:

When we started building Linkerd ~5 years ago, some of our first prototypes were actually in Rust (well before the language hit 1.0). Unfortunately, at the time, it wasn’t mature enough for our needs, so Linkerd’s first implementation grew out of Twitter’s Scala ecosystem. As we were working on Linkerd 1.x, Rust’s Tokio runtime started to take shape and was especially promising for building something like a proxy. So in early 2017 we set out to start rewriting Linkerd with a Go control plane and a Rust data plane. Tokio (with its sister projects, Tower & Hyper) made this all possible by extending Rust’s safe, correct memory model with asynchronous networking building blocks. These same building blocks are now being used in a variety of performance-sensitive use cases outside of Linkerd, and we’ve built a great community of contributors around both projects. If this is interesting to you, please come get involved!

In terms of contributions back to the Rust ecosystem, Linkerd has upstreamed core components to Tower and Tokio, such as Linkerd’s load balancer and Tokio’s tracing module.

In addition, the project also undertook a security audit of the rustls library (sponsored by CNCF). As the name suggests, rustls is a transport layer security (TLS) library for Rust that’s used by the Linkerd proxy for its mutual TLS (mTLS) feature, which is crucial to the security guarantees that the Linkerd service mesh provides. You can see the result of the audit in this PDF. Cure53, the firm responsible for security audits of several other CNCF projects, was “unable to uncover any application-breaking security flaws.” A sterling result, if I do say so myself!

OpenEBS

OpenEBS is a container-attached storage (CAS) and container-native storage system that enables you to easily manage Persistent Volumes in Kubernetes. The data plane for OpenEBS, MayaStor, is one of OpenEBS’ core components and is written in Rust. Here’s what Evan Powell, CEO of MayaData, the company that originally created OpenEBS, had to say about the choice of Rust:

Due to its (almost) shared-nothing design in the data path, Rust is an ideal fit if you want to avoid accidentally moving or sending data between cores. To integrate with the Poll Mode Drivers (PMDs), Data Plane Development Kit (DPDK), or Storage Performance Development Kit (SPDK), we’ve implemented a simple reactor that enables us to integrate with other Rust projects, such as tonic-rs (a gRPC implementation that’s Rust native, unlike grpc-rs from PingCAP) while also adhering to strict rules for working with PMDs. With Rust, those rules are actually enforced by the compiler.

More to come?

I’m a huge fan of Rust myself, though I’ve really only dabbled in it. I have my fingers crossed that TiKV and Linkerd are just the beginning and that we’ll see a whole lot more Rust in the cloud native universe, be that in the form of new CNCF projects written in Rust, existing projects porting components into Rust, or new Rust client libraries for existing systems.

And if you’re curious about all of the programming languages in use amongst CNCF’s many projects, stay tuned for an upcoming blog post on precisely that topic.

TOC Approves SPIFFE and SPIRE to Incubation


Project Post

Today, the CNCF Technical Oversight Committee (TOC) voted to accept SPIFFE and SPIRE as incubation-level hosted projects.

The SPIFFE (Secure Production Identity Framework For Everyone) specification defines a standard to authenticate software services in cloud native environments through the use of platform-agnostic, cryptographic identities. SPIRE (the SPIFFE Runtime Environment) is the code that implements the SPIFFE specification on a wide variety of platforms and enforces multi-factor attestation for the issuance of identities. In practice, this reduces the reliance on hard-coded secrets when authenticating application services. 

“The underpinning of zero trust is authenticated identity,” said Andrew Harding, SPIRE maintainer and principal software engineer at Hewlett Packard Enterprise. “SPIFFE standardizes how cryptographic, immutable identity is conveyed to a workload. SPIRE leverages SPIFFE to help organizations automatically authenticate and deliver these identities to workloads spanning cloud and on-premise environments. CNCF has long understood the transformational value of these projects to the cloud native ecosystem, and continues to serve as a great home for our growing community.”

The projects are used by and integrated with multiple cloud native technologies, including Istio and the CNCF projects Envoy, gRPC, and Open Policy Agent (OPA). SPIFFE and SPIRE also provide the basis for cross-authentication between Kubernetes-hosted workloads and workloads hosted on any other platform.

“Most traditional network-based security tools were not designed for the complexity and sheer scale of microservices and cloud-based architectures,” said Justin Cormack, security lead at Docker and TOC member. “This makes a standard like SPIFFE, and the SPIRE runtime, essential for modern application development. The projects have shown impressive growth since entering the CNCF sandbox, adding integrations and support for new projects, and showing growing adoption.”

Since joining CNCF, the projects have grown in popularity and have been deployed by notable companies such as Bloomberg, Bytedance, Pinterest, Square, Uber, and Yahoo Japan. SPIRE has a thriving developer community, with an ongoing flow of commits and merged contributions from organizations such as Amazon, Bloomberg, Google, Hewlett-Packard Enterprise, Pinterest, Square, TransferWise, and Uber. 

Since admittance into CNCF as a sandbox level project, SPIRE has added the following key features:

  • Support for bare metal, AWS, GCP, Azure and more
  • Integrations with Kubernetes, Docker, Vault, MySQL, Envoy, and more
  • Support for both nested and federated SPIRE deployment topologies
  • Support for JWT-based SPIFFE identities, in addition to x.509 documents
  • Horizontal server scaling with client-side load balancing and discovery
  • Support for authenticating against OIDC-compatible validators
  • Support for non-SPIFFE-aware workloads

 “SPIFFE and SPIRE address a gap that has existed in security by enabling a modern standardized form of secure identity for cloud native workloads,” said Chris Aniszczyk, CTO/COO of Cloud Native Computing Foundation. “We are excited to work with the community to continue to evolve the specification and implementation to improve the overall security of our ecosystem.”

Earlier this year, the CNCF SIG Security conducted a security assessment of SPIFFE and SPIRE. They did not find any critical issues and commended its design with respect to security. SPIFFE and SPIRE have made a significant impact and play a pivotal role in enabling a more secure cloud native ecosystem.

“In addition to mitigating the risk of unauthorized access in the case of a compromise, a strong cryptographically-proven identity reduces the risk of bad configuration. It’s not uncommon for developers to try to test against production, which can be dubious,” said Tyler Julian, security engineer at Uber and SPIRE maintainer. “You have proof. You have cryptographic documents to prove who the service is. In reducing the amount of trust in the system, you reduced your assumption of behavior. Both good for the reliability of your system and the security of the data.”

“At Square, we have heterogeneous platforms that take advantage of cloud native technologies like Kubernetes and serverless offerings, as well as traditional server-based infrastructure,” said Matthew McPherrin, security engineer at Square and SPIRE maintainer. “SPIFFE and SPIRE are enabling us to build a shared service identity, underlying our Envoy service mesh that spans multiple datacenters and cloud providers in an interoperable way.”

Joining CNCF incubation-level projects like OpenTracing, gRPC, CNI, Notary, NATS, Linkerd, Rook, etcd, OPA, CRI-O, TiKV, CloudEvents, Falco, Argo, and Dragonfly, SPIFFE and SPIRE are part of a neutral foundation aligned with its technical interests, as well as the larger Linux Foundation, which provides governance, marketing support, and community outreach. 

Every CNCF project has an associated maturity level: sandbox, incubating, or graduated. For more information on maturity requirements for each level, please visit the CNCF Graduation Criteria v.1.3. SPIFFE and SPIRE entered the CNCF sandbox in March 2018.

To learn more about SPIFFE and SPIRE, visit spiffe.io

Kubernetes RBAC 101: Overview


Member Post

Guest post originally published on the Kublr blog by Oleg Chunikhin

Cloud native and open source technologies have modernized how we develop software, and although they have led to unprecedented developer productivity and flexibility, they were not built with enterprise needs in mind.

A primary challenge is bridging the gap between cloud native and enterprise reality. Enterprises need a centralized Kubernetes management control plane with logging and monitoring that supports security and governance requirements extended through essential Kubernetes frameworks.

But the job doesn’t end with reliable, enterprise-grade Kubernetes clusters. Organizations are also struggling to define new practices around this new stack. They find they must adjust established practices and processes and learn how to manage these new modern applications. Managing roles and permissions is part of that learning process.

Role-based access control (RBAC) is critical, but it can cause quite a bit of confusion. Organizations seek guidance on where to start, what can be done, and what a real-life implementation looks like. In this first blog in a three-part series on Kubernetes RBAC, we’ll provide an overview of the terminology and the available authentication and authorization methods. Parts two and three will take a deeper dive into authentication and authorization.

RBAC is a broad topic. Keeping practicality in mind, we’ll focus on those methods that are most useful for enterprise users.

What is Kubernetes RBAC?

RBAC or access control is a way to define which users can do what within a Kubernetes cluster. These roles and permissions are defined either through various extensions or declaratively.

If you are familiar with Kubernetes, you already know that there are different resources and subjects. But if this is new to you, here is a quick summary.

Kubernetes provides a single API endpoint through which users manage containers across multiple distributed physical and virtual nodes. Following standard REST conventions, everything managed within Kubernetes is handled as a resource. Objects inside the Kubernetes master server are available through API objects like pods, nodes, config maps, secrets, deployments, etc.

In addition to resources, you also need to consider subjects and operations which are all connected through access control.

Operations and Subjects

Operations on resources are expressed through HTTP verbs sent to the API. Based on the REST URL called, Kubernetes will translate HTTP verbs from incoming HTTP requests into a wider set of operations. For example, while the GET verb applied to a specific Kubernetes object is interpreted as a “get” operation for that object, a GET verb applied to a class of objects in the Kubernetes API is interpreted as a “list” operation. This distinction is important when writing Kubernetes RBAC rules, as we’ll explain in detail in part three (RBAC 101: Authorization).

Subjects represent actors in the Kubernetes API and RBAC, namely processes, users, or clients that call the API and perform operations on Kubernetes objects. There are three subject categories: users, groups, and service accounts. Technically, only service accounts exist as objects within the Kubernetes cluster API; users and groups are virtual: they don’t exist in the Kubernetes database, but Kubernetes identifies them by a string ID.

When sending an API request to the Kubernetes API, Kubernetes will first authenticate the request by identifying the user and the groups the sender belongs to. Depending on the authentication method used, Kubernetes may also extract additional information and represent it as a map of key-value pairs associated with the subject.

Resource versus Non-Resource Requests

The Kubernetes API server adds an additional attribute to the request by tagging it as a resource or a non-resource request. This is necessary because, in addition to operating on resources and objects via the Kubernetes API, users can also send requests to non-resource API endpoints, such as the “/version” URL, a list of available APIs, and other metadata.

For an API resource request, Kubernetes decodes the API request verb, namespace (in case of a namespaced resource), API group, resource name, and, if available, sub-resource. The set of attributes for API non-resource requests is smaller: an HTTP request verb and request path. The access control framework uses these attributes to analyze and decide whether a request should be authorized or not.

Kubernetes API Request Attributes

With non-resource requests, the HTTP request verb is obvious. In the case of resource requests, the verb gets mapped to an API resource action. The most common actions are get, list, create, delete. But there are also some less evident actions, such as watch, patch, bind, escalate, and use.

Authentication Methods for Kubernetes 

There are a number of authentication mechanisms, from client certificates to bearer tokens to HTTP basic authentication to authentication proxy.

Client certificates. There are two ways to sign client certificates so clients can use them to authenticate their Kubernetes API server requests. One is to create a CSR manually and have it signed by an administrator or by an enterprise certificate authority (PKI) infrastructure, in which case the external infrastructure signs the client certificates.

Another way that doesn’t require an external infrastructure — although not suitable for large scale deployments — is leveraging Kubernetes, which can also sign client certificates.

Bearer token. There are a number of ways to get a bearer token. There are bootstrap and node authentication tokens, which we won’t cover as they are mostly used internally in Kubernetes for initialization and bootstrapping. Static token files are another option that we won’t discuss because they are considered bad practice and are insecure. The most practical and useful methods are service account tokens and OIDC, and we’ll cover those in detail in our next blog.

HTTP basic auth. HTTP basic auth is considered insecure as it can only be done through static configuration files in Kubernetes.

Authentication proxy. Mainly used by vendors, authentication proxies are often applied to set up different Kubernetes architectures. A proxy server processes requests to the Kubernetes API and establishes a trusted connection between the Kubernetes API and proxy. That proxy will authenticate users and clients any way it likes and add user identification into the request headers for requests sent through to the Kubernetes API. This allows the Kubernetes API to know who calls it.  Kublr, for example, uses this method to proxy dashboard requests, general web console requests, or provide a proxy Kubernetes API endpoint. Again, if you aren’t a vendor, you don’t really need to worry about this.

Impersonation. If you already have credentials providing access to the Kubernetes API, those credentials can be used to “impersonate” users by sending additional headers in the request with the impersonated user’s identity information. The Kubernetes API will switch your authentication context to that impersonated user based on the headers. Clearly this capability is only available if the “main” user account has permissions to impersonate.
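For illustration, kubectl exposes impersonation through its --as and --as-group flags, which set the Impersonate-User and Impersonate-Group request headers under the hood (the user and group names here are hypothetical):

kubectl get pods --as=jane --as-group=developers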

Authorization Methods for Kubernetes

There are a few ways to manage authorization requests in Kubernetes.

First, we will quickly scan through the methods that you will not see or use in the everyday Kubernetes administrator’s life. Node authorization is used internally to authorize the kubelet’s API calls and should never be used by other clients. Authorization methods such as ABAC and AlwaysDeny / AlwaysAllow are rarely used in real-life clusters: ABAC is based on a static config file and is considered insecure, and AlwaysDeny / AlwaysAllow are generally used for testing and are not approaches you’d use for production deployments.

WebHook is an external service the Kubernetes API can call when it needs to decide whether a request should be allowed or not. The API for this service is well documented in the Kubernetes documentation. In fact, the Kubernetes API itself provides this API. The most common use case for this mechanism is extensions. Extension servers provide authorization webhook endpoints to the API server to authorize access to extension objects.

From a practical standpoint, the most useful authorization method is RBAC. RBAC is based on declarative definitions of permissions that are stored and managed as cluster API objects. The main objects are roles and cluster roles, both representing a set of permissions on certain objects in the API. These are identified by API groups, resource names, and actions performed on those objects. You can have a number of rules within a role or cluster role object.
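As a minimal illustration of these objects (all names here are hypothetical), the Role below grants read access to pods in a single namespace, and the RoleBinding grants that role to one user:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: demo
  name: pod-reader
rules:
- apiGroups: [""]   # "" is the core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: demo
  name: read-pods
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

You can check the effect of such a binding with kubectl auth can-i list pods -n demo --as jane.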

To properly authorize users in a production grade deployment, it’s important to use RBAC. In the next blogs in this series, we’ll discuss how you can set up and use RBAC.

Conclusion

As we’ve seen, everything managed by Kubernetes is referred to as a resource. Operations on those resources are expressed as HTTP verbs, and subjects are the actors who’ll need to be authenticated and authorized.

There are a few ways to authenticate subjects: client certificates, bearer tokens, HTTP basic auth, an authentication proxy, or impersonation (which requires a prior authentication). But only client certificates are a viable option for production deployments. We’ll explore them in more detail in our next blog (Kubernetes RBAC 101: Authentication).

For authorization, you also have a few options. There is ABAC, AlwaysDeny / AlwaysAllow, WebHook, and RBAC. Here too, only one option is viable for external clients in production deployments and that is RBAC. We’ll cover it in detail in part three of this series (Kubernetes RBAC 101: Authorization).

If you’d like to experiment with RBAC, download Kublr and play around with its RBAC feature. The intuitive UI helps flatten the steep learning curve of dealing with RBAC YAML files.

Testing Kubernetes Deployments within CI Pipelines


Member Post

Guest post originally published on the Eficode Praqma blog by Michael Vittrup Larsen, Cloud Infrastructure and DevOps Consultant at Eficode-Praqma

Low-overhead, on-demand Kubernetes clusters deployed on CI worker nodes with KIND

How to test Kubernetes artifacts like Helm charts and YAML manifests in your CI pipelines with a low-overhead, on-demand Kubernetes cluster deployed with KIND – Kubernetes in Docker.

Containers have become very popular for packaging applications because they solve the dependency management problem. An application packaged in a container includes all necessary run-time dependencies so it becomes portable across execution platforms. In other words, if it works on my machine it will very likely also work on yours.

Automated testing is ubiquitous in DevOps and we should containerize our tests for exactly the same reasons as we containerize our applications: if a certain test validates reliably on my machine it should work equally well on yours, irrespective of which libraries and tools you have installed natively.

Testing with Containers

The following figure illustrates a pipeline (or maybe two, depending on how you organize your pipelines) where the upper part builds and packages the application in a container and the lower part does the same with the tests that will be used to validate the application. The application is only promoted if the container-based tests pass.

Test operations through network

If we assume that the application is a network-attached service where black-box testing can be executed through network connectivity, a setup like the one above is easily implemented by:

  1. Build application and test containers, e.g. using ‘docker build …’
  2. Start an instance of the application container attached to a network, e.g. with ‘docker run …’
  3. Start an instance of the test container attached to the same network as the application, e.g. with ‘docker run …’
  4. The exit code of the test container determines the application test result

This is illustrated in the figure below.

Test operations through network

Steps 2 through 4 outlined above can also be described in a docker-compose definition with two services, e.g. (the test container is configured with the application network location through an environment variable):

version: '3.7'
services:
  test:
    image: test-container:latest
    environment:
      APPLICATION_URL: http://application:8080
    depends_on:
      - application
  application:
    image: application:latest
    ports:
      - 8080:8080

A test using the two containers can now be executed with:

docker-compose up --exit-code-from test

Testing Kubernetes Artifacts in the CI Pipeline

The process described above works well for tests at ‘container level’. But what if the output artifacts of the CI pipelines include Kubernetes artifacts, e.g. YAML manifests or Helm charts, or need to be deployed to a Kubernetes cluster to be validated? How do we test in those situations?

One option is to have a Kubernetes cluster deployed which the CI pipelines can deploy to. However, this gives us some issues to consider:

  • A shared cluster which all CI pipelines can deploy to basically becomes a multi-tenant cluster which might need careful isolation, security, and robustness considerations.
  • How do we size the CI Kubernetes cluster? Most likely the cluster capacity will be disconnected from the CI worker capacity i.e. they cannot share compute resources. This will result in low utilization. Also, we cannot size the CI cluster too small because we do not want tests to fail due to other pipelines temporarily consuming resources.
  • We might want to test our Kubernetes artifacts against many versions and configurations of Kubernetes, i.e. we basically need N CI clusters available.

We could also create a Kubernetes cluster on demand for each CI job. This requires:

  • Access to a cloud-like platform where we can dynamically provision Kubernetes clusters.
  • Giving our CI pipelines the necessary privileges to create infrastructure, which might be undesirable from a security point of view.

For some test scenarios we need a production-like cluster and we will have to consider one of the above solutions, e.g. characteristics tests or scalability tests. However, in many situations, the tests we want our CI pipelines to perform can be managed within the capacity of a single CI worker node. The following section describes how to create on-demand clusters on a container-capable CI worker node.

On-Demand Private Kubernetes Cluster with KIND

Kubernetes-in-Docker (KIND) is an implementation of a Kubernetes cluster using Docker-in-Docker (DIND) technology. Docker-in-docker means that we can run containers inside containers and those inner containers are only visible inside the outer container. KIND uses this to implement a cluster by using the outer container to implement Kubernetes cluster nodes. When a Kubernetes POD is started on a node it is implemented with containers inside the outer node container.

With KIND we can create on-demand and multi-node Kubernetes clusters on top of the container capabilities of our CI worker node.

A KIND Kubernetes cluster

The cluster capacity will obviously be limited by CI worker node capacity, but otherwise the Kubernetes cluster will have many of the capabilities of a production cluster, including HA capabilities.

Let’s demonstrate how to test an application deployed with Helm to a KIND cluster. The application is the k8s-sentences-age application which can be found on GitHub, including a GitHub action that implements the CI pipeline described in this blog. The application is a simple service that can return a random number (an ‘age’) between 0 and 100 and also provides appropriate Prometheus compatible metrics.

Installing KIND

KIND is a single executable, named kind, which basically talks to the container runtime on the CI worker. It will create an (outer) container for each node in the cluster using container images containing the Kubernetes control-plane. An example of installing kind as part of a Github action can be found here.

Creating a Cluster

With the kind tool our CI pipelines can create a single node Kubernetes cluster with the following command:

kind create cluster --wait 5m

We can also create multi-node clusters if we need them for our tests. Multi-node clusters require a configuration file that lists node roles:

# config.yaml
  kind: Cluster
  apiVersion: kind.x-k8s.io/v1alpha4
  nodes:
  - role: control-plane
  - role: worker
  - role: worker

With the above configuration file we can create a three-node cluster with the following command:

kind create cluster --config config.yaml

We can specify which container image the KIND Kubernetes nodes should use and thereby control the version of Kubernetes:

kind create cluster --image "kindest/node:v1.16.4"

With this we can easily test compatibility against multiple versions of Kubernetes as part of our CI pipeline.

Building Application Images and Making Them Available to KIND

The example k8s-sentences-age application is packaged in a container named ‘age’ and the tests for the application are packaged in a container named ‘age-test’. These containers are built in the usual way as follows:

docker build -t age:latest ../app
docker build -t age-test:latest .

We can make the new version of these images available to our KIND Kubernetes nodes with the following command:

kind load docker-image age:latest
kind load docker-image age-test:latest

Loading the images onto KIND cluster nodes copies the image to each node in the cluster.

Running a Test

Our pipeline will deploy the application using its Helm chart and run the tests against this deployed application instance.

Deploying the application with the application Helm chart means that we not only test the application container when deployed to Kubernetes, but we also validate the Helm chart itself. The Helm chart contains the YAML manifests defining the application Kubernetes blueprint and this is particularly important to validate – not only against different versions of Kubernetes, but also in various configurations, e.g. permutations of values given to the Helm chart.

We install the application with the following Helm command. Note that we override the Helm chart default settings for image repository, tag, and pullPolicy such that the local image is used.

helm install --wait age ../helm/age \
--set image.repository=age \
--set image.tag=latest \
--set image.pullPolicy=Never

The test container is deployed using a Kubernetes Job resource. Kubernetes Job resources define workloads that run to completion and report completion status. The job will use the local ‘age-test‘ container image we built previously and will connect to the application POD(s) using the URLs provided in environment variables. The URLs reference the Kubernetes service created by the Helm chart.

apiVersion: batch/v1
kind: Job
metadata:
  name: component-test
spec:
  template:
    metadata:
      labels:
        type: component-test
    spec:
      containers:
      - name: component-test
        image: age-test
        imagePullPolicy: Never
        env:
        - name: SERVICE_URL
          value: http://age:8080
        - name: METRICS_URL
          value: http://age:8080/metrics
      restartPolicy: Never

The job is deployed with this command:

kubectl apply -f k8s-component-test-job.yaml

Checking the Test Result

We need to wait for the component test job to finish before we can check the result. The kubectl tool allows waiting for various conditions on different resources, including job completion, so our pipeline will wait for the test to complete with the following command:

kubectl wait --for=condition=complete \
--timeout=1m job/component-test

The component test job will have the test results as part of its logs. To include these results in the pipeline output, we print the logs of the job with kubectl, using a label selector to select the job pod.

kubectl logs -l type=component-test

The overall status of the component test is read from the job POD field .status.succeeded and stored in a SUCCESS variable as shown below. If the status indicates failure the pipeline terminates with an error:

SUCCESS=$(kubectl get job component-test \
-o jsonpath='{.status.succeeded}')
if [ $SUCCESS != '1' ]; then exit 1; fi
echo "Component test successful"

The full pipeline can be found in the k8s-sentences-age repository on Github.

It is worth noting here that starting a test job and validating the result is what a helm test does. Helm test is a way of formally integrating tests into Helm charts such that users of the chart can run these tests after installing the chart. It therefore makes good sense to include the tests in your Helm charts and make the test container available to users of the Helm chart. To include the test job above into the Helm chart we simply need to add the annotation shown below and include the YAML file as part of the chart.

...
metadata:
  name: component-test
  annotations:
    "helm.sh/hook": test

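Once the chart ships with that hook, users of the chart can run the same validation against their own installation. For the release installed earlier in this pipeline, that would look like:

helm test age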
When a KIND Cluster Isn’t Sufficient

In some situations a local Kubernetes cluster on a CI worker might not be ideal for your testing purposes. This could be when:

  • Unit tests that call functions or use classes from the application directly. In this case the application and tests are most likely a single container which could be executed without Kubernetes.
  • Component tests that involve no Kubernetes-related artifacts. If the example shown above did not have a Helm chart to test, the docker-compose solution would have been sufficient.
  • Characteristics tests, e.g. measuring the performance and scalability of your application. In such situations you would need infrastructure which is more stable with respect to capacity.
  • Integration tests that depend on other artifacts that cannot easily be deployed in the local KIND cluster, like a large database with customer data.
  • Functional, integration or acceptance tests require the whole ‘application’ to be deployed. Some applications might not fit within the limited size of the KIND cluster.
  • Tests which have external dependencies, e.g. cloud provider specific ingress/load balancing, storage solutions, key management services etc. In some cases these can be simulated by deploying e.g. a database on the KIND cluster and in other cases they cannot.

However, there are still many cases where testing with a KIND Kubernetes cluster is ideal, e.g. when you have Kubernetes-related artifacts to test like a Helm chart or YAML manifests, and when an external CI/staging Kubernetes cluster involves too much maintenance overhead or is too resource-inefficient.

Identifying Kubernetes Config Security Threats: Pods Running as Root


Member Post

Guest post by Joe Pelletier, VP of Strategy at Fairwinds

With different teams – development, security, and operations – and the prioritization of speedy delivery over perfect configuration, mistakes are inevitable. As teams work on building and shipping new applications, mistakes are bound to happen if the only safeguard is an application developer remembering to adjust Kubernetes’ default configurations. Security, efficiency, and reliability end up suffering.

Having individual contributors design their own Kubernetes security configuration all but ensures inconsistency and mistakes. It rarely happens intentionally; more often it’s because engineers are focused on getting containers to run in Kubernetes. Unfortunately, many neglect to revisit configurations along the way, causing gaps in security and efficiency.

A prime example is overpermissioning a deployment with root access to just get something working. Malicious attackers are constantly looking for holes to exploit and root access is ideal for them.
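As an illustration of the kind of setting a configuration validation tool checks for, the pod spec below (names and values are hypothetical) explicitly refuses to run as root rather than relying on the image defaults:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
  - name: demo-app
    image: demo-app:1.0.0
    securityContext:
      runAsNonRoot: true           # kubelet refuses to start the container as UID 0
      runAsUser: 10001
      allowPrivilegeEscalation: false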

Platform teams responsible for security can attempt to manually go through each pod to check for misconfigured deployments. But many DevOps teams are under-staffed and don’t have the bandwidth to manually inspect every change introduced by a variety of engineering teams. They need a way to proactively audit workloads and validate configurations to identify weaknesses, container vulnerabilities, and misconfigured deployments. Configuration validation provides a tool to proactively identify holes in security instead of waiting for a breach to happen.

Kubernetes configuration validation ensures consistent security: 

  • Built-in centralized control | Often security teams require DevOps to implement a variety of infrastructure controls to meet internal standards. When it comes to Kubernetes, most security teams lack visibility beyond basic metrics, straining the relationship with DevOps. Consolidating this data in a single location bridges the gap between these two stakeholders.
  • Reduce the risk of mistakes | A configuration validation platform dramatically reduces the risk of errors from either Kubernetes inexperience or oversight by adding an expert configuration review into the development process.
  • Ensure security | Configuration validation is key to security for Kubernetes — getting it right dramatically reduces the risk of security incidents in production. Configuration validation ensures that security best practices are being followed organization-wide.

Platform teams can opt to build their own tool for absolute control, but few companies gain a competitive edge from having their own tool. There are open source options available, but teams must evaluate, manage, and maintain them, and building a holistic platform can be time consuming.

As the fastest way to identify Kubernetes misconfiguration, a purpose-built solution offers baked-in guidance curated by Kubernetes experts with dedicated support when needed. It allows teams to focus time on developing and deploying applications while simplifying operations.

Check out an example of a configuration validation solution. 

Learn more about configuration validation by visiting https://www.fairwinds.com/.

Interested in the Future of Cloud Native Observability? Join SIG-Observability


CNCF Staff Post

The Special Interest Group for observability was formed recently under the umbrella of CNCF, with the goal of fostering the ecosystem around observation of cloud native workloads. Chairs Matt Young and Richard Hartmann are spearheading the SIG’s activities, which include producing supporting material and best practices for end users as well as providing guidance for CNCF observability-related projects.

Among its stated scope:

  • Identify and report gaps in the CNCF’s project portfolio on topics of observability to the TOC and the wider CNCF community.
  • Collect, curate, champion, and disseminate patterns and current best practices related to the observation of cloud-native systems that are effective and actionable. Educate and inform users with unbiased, accurate, and pertinent information. Educate and help other CNCF projects regarding observability techniques and best current practices available within the CNCF.
  • Provide and maintain a vendor-neutral venue for relevant thought validation, discussion, and project feedback.
  • Provide a ladder for community members to become involved with the technical oversight of projects within the SIG’s scope in an open, transparent, and inclusive way.

Hartmann says that while “it’s always the right time” for an observability SIG, the organizers got together now because of the recent growth in the area of observability within CNCF. “Up to now, you mainly had Prometheus, but there are more and more efforts around [observability],” he says. “Cortex and Thanos are up for review to move from the sandbox to incubating. OpenMetrics is finally moving. OpenTelemetry is progressing. We needed a space to talk about the cooperation that will come from all of this.”

The short-term goals include “working through both the review and project progression backlog first, and we are making great strides here,” Hartmann says. “We also want to talk about BCPs, best current practices, so we’re able to actually make suggestions for how to operate observability in a cloud native manner.”

Looking ahead, Hartmann says the SIG is starting to talk about data analysis, “which will most likely lay some groundwork for machine learning and AI, i.e., doing data science on your monitoring data.”

As to why he and Young decided to take on this work, he says, “Personally, as silly as this might sound, I want to make the world a better place. Make things cleaner. You know the phrase from The Dark Knight, ‘Some people just want to see the world burn’? My tagline is ‘People just want to see the world turn.’”

To that end, he says he’s focused on “leading calls and conversations, sniffing out and making explicit the agreements between people which they don’t see, and bringing this to consensus.”

If you’re interested in getting involved with SIG-Observability, you are invited to attend the SIG call on the 2nd and 4th Tuesdays of every month at 1600 UTC. (See details on the CNCF Community Calendar.) Or join the conversation in the #sig-observability channel on the CNCF Slack. 

Introducing the CNCF Technology Radar


Today, we are publishing our first CNCF Technology Radar, a new initiative from the CNCF End User Community. This is a group of over 140 top companies and startups who meet regularly to discuss challenges and best practices when adopting cloud native technologies. The goal of the CNCF Technology Radar is to share what tools are actively being used by end users, the tools they would recommend, and their patterns of usage.

Slides: github.com/cncf/enduser-public/blob/master/CNCFTechnologyRadar.pdf

How it works

A technology radar is an opinionated guide to a set of emerging technologies. The popular format originated at Thoughtworks and has been adopted by dozens of companies including Zalando, AOE, Porsche, Spotify, and Intuit.

The key idea is to place solutions at one of four levels, reflecting advice you would give to someone who is choosing a solution:

  • Adopt: We can clearly recommend this technology. We have used it for long periods of time in many teams, and it has proven to be stable and useful.
  • Trial: We have used it with success and recommend you take a closer look at the technology.
  • Assess: We have tried it out, and we find it promising. We recommend having a look at these items when you face a specific need for the technology in your project.
  • Hold: This category is a bit special. Unlike the other categories, we recommend you hold off on using something. That does not mean that these technologies are bad, and it often might be OK to use them in existing projects. But technologies are moved to this category if we think we shouldn’t use them because we see better options or alternatives now.

The CNCF Technology Radar is inspired by the format but with a few differences:

  • Community-driven: The data is contributed by the CNCF End User Community and curated by community representatives.
  • Focuses on future adoption, so there are only three rings: Assess, Trial, and Adopt.
  • Instead of covering several hundred items, one radar will display 10-20 items on a specific use case. This removes the need to organize into quadrants.
  • Instead of publishing annually, the cadence will be on a shorter time frame, targeting quarterly.

Our first technology radar focuses on Continuous Delivery.

CNCF Technology Radar: Continuous Delivery, June 2020

During May 2020, the members of the End User Community were asked which CD solutions they had assessed, trialed, and subsequently adopted. 177 data points were sorted and reviewed to determine the final positions.

This may be read as:

  • Flux and Helm are widely adopted, and few or none of the respondents recommended against them.
  • Multiple companies recommend CircleCI, Kustomize, and GitLab, but the results were less conclusive; for example, there were not enough responses, or a few respondents recommended against them.
  • Projects in Assess lacked clear consensus. For example, Jenkins has wide awareness, but the placement in Assess reflects comments from companies that are moving away from Jenkins for new applications. Spinnaker also showed broad awareness, but while many had tried it, none in this cohort positively recommended adoption. Those who are looking for a new CD solution should consider those in Assess given their own requirements.

The Themes

The themes describe interesting patterns and editor observations:

  1. Publicly available solutions are combined with in-house tools: Many end users had tried up to 10 options and settled on adopting 2-4. Several large enterprise companies have built their own continuous delivery tools and open sourced components, including LunarWay’s release-manager, Box’s kube-applier, and stackset-controller from Zalando. The public cloud managed solutions on the CNCF landscape were not suggested by any of the end users, which may reflect the options available a few years ago.
  2. Helm is more than packaging applications: While Helm has not positioned itself as a Continuous Delivery tool (it’s the Kubernetes package manager first), it’s widely used and adopted as a component in different CD scenarios. 
  3. Jenkins is still broadly deployed, while cloud native-first options emerge. Jenkins and its ecosystem tools (Jenkins X, Jenkins Blue Ocean) are widely evaluated and used. However, several end users stated Jenkins is primarily used for existing deployments, while new applications have migrated to other solutions. Hence end users who are choosing a new CD solution should assess Jenkins alongside tools that support modern concepts such as GitOps (for example, Flux).

The Editor

Cheryl Hung is the Director of Ecosystem at CNCF. Her mission is to make end users successful and productive with cloud native technologies such as Kubernetes and Prometheus. Twitter: @oicheryl

Read more

CNCF Projects for Continuous Delivery: 

  • Argo is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. A CNCF incubating project, it is composed of Argo CD, Argo Workflows, and Argo Rollouts.
  • Flux is the open source GitOps operator for Kubernetes. It is a CNCF sandbox project.
  • Helm is the open source package manager for Kubernetes. It recently graduated within CNCF.

Case studies: Read how Babylon and Intuit are handling continuous delivery.

What’s next

The next CNCF Technology Radar is targeted for September 2020, focusing on a different topic in cloud native such as security or storage. Vote to help decide the topic for the next CNCF Technology Radar.

Join the CNCF End User Community to: 

  • Find out who exactly is using each project and read their comments
  • Contribute to and edit future CNCF Technology Radars. Subsequent radars will be edited by people selected from the End User Community.

We are excited to provide this report to the community, and we’d love to hear what you think. Email feedback to info@cncf.io.

About the methodology

In May 2020, the 140 companies in the CNCF End User Community were asked to describe what their companies recommended for different solutions: Hold, Assess, Trial, or Adopt. They could also give more detailed comments. As the answers were submitted via a Google Spreadsheet, they were neither private nor anonymized within the group.

33 companies submitted 177 data points on 21 solutions. These were sorted in order to determine the final positions. Finally, the themes were written to reflect broader patterns, in the opinion of the editors.
