TOC Approves CNCF SIGs and Creates Security and Storage SIGs

Earlier this year, the Technical Oversight Committee (TOC) voted to create CNCF Special Interest Groups (SIGs). CNCF SIGs are currently being bootstrapped in various focus areas; each is led primarily by recognized experts and supported by contributors. They report directly to the TOC, and we encourage developers and end users to get involved in their formation:

Name (to be finalised) | Area | Current CNCF Projects
Traffic | networking, service discovery, load balancing, service mesh, RPC, pubsub, etc. | Envoy, Linkerd, NATS, gRPC, CoreDNS, CNI
Observability | monitoring, logging, tracing, profiling, etc. | Prometheus, OpenTracing, Fluentd, Jaeger, Cortex, OpenMetrics
Governance | authentication, authorization, auditing, policy enforcement, compliance, GDPR, cost management, etc. | SPIFFE, SPIRE, Open Policy Agent, Notary, TUF, Falco
App Delivery | PaaS, Serverless, Operators, … CI/CD, Conformance, Chaos Eng, Scalability and Reliability measurement, etc. | Helm, CloudEvents, Telepresence, Buildpacks, (CNCF CI)
Core and Applied Architectures | orchestration, scheduling, container runtimes, sandboxing technologies, packaging and distribution, specialized architectures thereof (e.g. Edge, IoT, Big Data, AI/ML, etc.) | Kubernetes, containerd, rkt, Harbor, Dragonfly, Virtual Kubelet

The TOC and CNCF Staff will start drafting an initial set of charters for the above SIGs, and solicit suitable chairs. Visit the CNCF SIG page for more information.

Security SIG

Approved by the TOC earlier this month, the Security SIG’s mission is to reduce the risk that cloud native applications expose end user data or allow other unauthorized access.

While there are many open source security projects, security has generally received less attention than other areas of the cloud native landscape. These projects have had limited visibility, and their integration into cloud native tooling has been limited as well. There is also a shortage of security experts focused on the ecosystem. All of this has contributed to uncertainty about how to securely set up and operate cloud native architectures.

It is essential to design common architectural patterns to improve overall security in cloud native systems.

The TOC has defined three objectives for this SIG. These complement the work already being done by CNCF’s security-related projects:

  • Protection of heterogeneous, distributed and fast changing systems, while providing needed access
  • Common understanding and common tooling to help developers meet security requirements
  • Common tooling for audit and reasoning about system properties.

Security must be addressed at all levels of the stack and across the entire ecosystem. As a result, the Security SIG is looking for participation and membership from a diverse range of roles, industries, companies and organizations. See the Security SIG Charter for more information.

TOC Liaisons: Liz Rice and Joe Beda

Co-Chairs: Sarah Allen, Dan Shaw, Jeyappragash JJ

Storage SIG

The Storage SIG was approved in late May, and aims to enable widespread and successful storage of persistent state in cloud native environments. The group focuses on storage systems and approaches suitable for and commonly used in modern cloud native environments, including:

  • Storage systems that differ significantly from systems and approaches previously commonly used in traditional enterprise data center environments
  • Those that are not already adequately covered by other groups within the CNCF
  • Block stores, file systems, object stores, databases, key-value stores, and related caching mechanisms.

The Storage SIG strives to understand the fundamental characteristics of different storage approaches with respect to availability, scalability, performance, durability, consistency, ease-of-use, cost and operational complexity. The goal then is to clarify suitability for various cloud native use cases.  

If you are interested in participating in the Storage SIG, check out the Charter for more information.

TOC Liaison: Xiang Li

Co-Chairs: Alex Chircop, Quinton Hoole


Virtual Cluster – Extending Namespace Based Multi-tenancy with a Cluster View

Guest post by Fei Guo and Lei Zhang of Alibaba

Abstract:

In this guest post, the Kubernetes team from Alibaba shares how they are building hard multi-tenancy on top of upstream Kubernetes by leveraging a group of plugins named “Virtual Cluster” and extending the tenant design in the community. The team has decided to open source these K8s plugins and contribute them to the Kubernetes community at the upcoming KubeCon.

Introduction

At Alibaba, the internal Kubernetes team uses one web-scale cluster to serve a large number of business units as end users. In this case, every end user effectively becomes a “tenant” of this K8s cluster, which makes hard multi-tenancy a strong need.

However, instead of hacking the Kubernetes APIServer and resource model, the team at Alibaba tried to build a “Virtual Cluster” multi-tenancy layer without changing any Kubernetes code. With this architecture, every tenant is assigned a dedicated K8s control plane (kube-apiserver + kube-controller-manager) and several “Virtual Nodes” (pure Node API objects with no corresponding kubelet), so there are no naming or node conflicts at all, while tenant workloads still run mixed in the same underlying “Super Cluster”, so resource utilization is preserved. This design is detailed in the [virtual cluster proposal], which has received lots of feedback.

Although a new concept of “tenant master” is introduced in this design, virtual cluster is simply an extension built on top of the existing namespace-based multi-tenancy in the K8s community, which is referred to as “namespace group” in the rest of this document. Virtual cluster fully relies on the resource isolation mechanisms proposed by namespace group, and we eagerly expect and are pushing for them to be addressed in the ongoing efforts of Kubernetes WG-multitenancy.

If you want to know more details about the Virtual Cluster design, please do not hesitate to read the [virtual cluster proposal]. In this document, we will focus on the high-level idea behind virtual cluster, elaborate on how we extend the namespace group with a “tenant cluster” view, and explain why this extension is valuable to Kubernetes multi-tenancy use cases.

Background

This section briefly reviews the architecture of namespace group multi-tenancy proposal.

We borrow a diagram from the K8s Multi-tenancy WG Deep Dive presentation, shown in Figure 1, to explain the high-level idea of using namespaces to organize tenant resources.

                         Figure 1. Namespace group multi-tenancy architecture

In namespace group, all tenant users share the same access point, the K8s apiserver, to utilize tenant resources. Their accounts, assigned namespaces and resource isolation policies are all specified in tenant CRD objects, which are managed by the tenant admin. The tenant user view is limited to the per-tenant namespaces. The tenant resource isolation policies are defined to disable direct communication between tenants and to protect tenant Pods from security attacks. They are realized by native Kubernetes resource isolation mechanisms including RBAC, Pod security policy, network policy, admission control and sandbox runtime. Multiple security profiles can be configured and applied for different levels of isolation requirements. In addition, resource quotas, chargeback and billing happen at the tenant level.
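
To make this concrete, here is a minimal Go sketch of what a tenant CRD along these lines might contain. The type and field names are assumptions made for illustration, not the actual Tenant CRD from the Kubernetes multi-tenancy working group.

```go
// Illustrative sketch only: the type and field names below are assumptions
// made for this post, not the actual Tenant CRD from the Kubernetes
// multi-tenancy working group.
package tenancy

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Tenant groups everything the tenant admin manages for one tenant.
type Tenant struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TenantSpec `json:"spec"`
}

// TenantSpec captures the three things the post says the CRD specifies:
// user accounts, assigned namespaces, and resource isolation policies.
type TenantSpec struct {
	// Users allowed to act within this tenant's namespaces.
	Users []string `json:"users,omitempty"`
	// Namespaces assigned to the tenant in the shared cluster.
	Namespaces []string `json:"namespaces,omitempty"`
	// IsolationProfiles names the security profiles to apply, e.g. RBAC
	// roles, pod security policies, network policies, sandbox runtime.
	IsolationProfiles []string `json:"isolationProfiles,omitempty"`
	// ResourceQuota applied at the tenant level for chargeback and billing,
	// e.g. {"cpu": "100", "memory": "200Gi"}.
	ResourceQuota map[string]string `json:"resourceQuota,omitempty"`
}
```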

How Virtual Cluster Extends the View Layer

Conceptually, virtual cluster provides a view layer extension on top of the namespace group solution. Its technical details can be found in [virtual cluster]. In virtual cluster, the tenant admin still uses the same tenant CRD as in namespace group to specify the tenant user accounts, namespaces and resource isolation policies in the tenant resource provider, i.e., the super master.

                         Figure 2. View Layer Extension By Virtual Cluster

As illustrated in Figure 2, thanks to the new virtual cluster view layer, tenant users now have different access points and tenant resource views. Instead of accessing the super master and viewing the tenant namespaces directly, tenant users interact with dedicated tenant masters to utilize tenant resources and are offered complete K8s master views. All tenant requests are synchronized to the super master by the sync-manager, which creates corresponding custom resources on behalf of the tenant users in the super master, following the resource isolation policy specified in the tenant CRD. In other words, virtual cluster primarily changes the tenant user view from namespaces to an APIserver. From the super master perspective, the same workflow is triggered by the tenant controller with respect to the tenant CRD.

Benefits of Virtual Cluster View Extension

There are quite a few benefits of having a Virtual Cluster view on top of the existing namespace view for tenant users:

  • It provides flexible and convenient tenant resource management for tenant users. For example, a nested namespace hierarchy, as illustrated in Figure 3(a), can easily resolve some hard problems in the namespace group solution, like naming conflicts, namespace visibility and sub-partitioning of tenant resources [Tenant Concept]. However, it is almost impractical to change the native K8s master to support nested namespaces. By having a virtual cluster view, the namespaces created in the tenant master, along with the corresponding namespace group in the super master, can achieve a similar user experience as if nested namespaces were used.

As shown in Figure 3(b), tenant users can do self-service namespace creation in the tenant master without worrying about naming conflicts with other tenants. The conflict is resolved by the sync-manager when it adds the tenant namespaces to the super master namespace group (see the namespace-mapping sketch after this list). Tenant A users can never view tenant B users’ namespaces since they access different masters. It is also convenient for a tenant to customize policies for different tenant users, which only take effect locally in the tenant master.

  • It provides stronger tenant isolation and security, since it avoids certain problems caused by sharing the same K8s master among multiple tenant users. For example, DoS attacks, API access rate control among tenants and tenant controller isolation are no longer concerns.
  • It allows tenant users to create cluster-scope objects in their tenant masters without affecting other tenants. For instance, a tenant user can now freely create CRDs, ClusterRoles/ClusterRoleBindings, PersistentVolumes, ResourceQuotas, ServiceAccounts and NetworkPolicies in the tenant master without worrying about conflicts with other tenants.
  • It alleviates the scalability stress on the super master. First, RBAC rules, policies and user accounts managed in the super master can be offloaded to the tenant masters, which can be scaled independently. Second, tenant controllers and operators access multiple tenant masters instead of a single super master, which again can be scaled independently.
  • It is much easier to create users for tenant users. Nowadays, if a tenant user wants to expose their tenant resources to other users (for example, a team leader wants to add team members to use the resources assigned to the team), the tenant admin has to create all the users. If a tenant admin needs to serve hundreds of such teams in a big organization, creating users on behalf of tenant users can be a big burden. Virtual cluster completely offloads this burden from the tenant admin to the tenant users.
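
As a rough illustration of the conflict resolution mentioned above, the sketch below shows one simple way a sync-manager could map tenant-master namespaces onto the shared super master. The prefixing scheme is an assumption for explanation, not the actual virtual cluster implementation.

```go
// Illustrative sketch: one simple way a sync-manager could map namespaces
// created in a tenant master onto the shared super master without naming
// conflicts. The prefixing scheme is an assumption for explanation, not the
// actual virtual cluster implementation.
package main

import "fmt"

// superNamespace derives the super master namespace name from a tenant ID
// and the namespace a tenant user created in their own tenant master.
// Because the tenant ID is part of the name, two tenants can both create
// "dev" without colliding in the shared cluster.
func superNamespace(tenantID, tenantNamespace string) string {
	return fmt.Sprintf("%s-%s", tenantID, tenantNamespace)
}

func main() {
	// Tenant A and tenant B both self-service a "dev" namespace.
	fmt.Println(superNamespace("tenant-a", "dev")) // tenant-a-dev
	fmt.Println(superNamespace("tenant-b", "dev")) // tenant-b-dev
}
```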

Limitations

Since virtual cluster mainly extends the multi-tenancy view option and prevents problems caused by sharing an apiserver, it inherits the same limitations/challenges faced by the namespace group solution in making Kubernetes node components tenant-aware. The node components that need to be enhanced include, but are not limited to:

  • Kubelet and CNI plugin. They need to be tenant-aware to support strong network isolation scenarios like VPC.
    • For example, how do readiness/liveness probes work if a pod is isolated in a different VPC from the node? This is one of the issues on which we have already started to cooperate with SIG-Node upstream.
  • Kube-proxy/Kube-dns. They need to be tenant-aware to make cluster-IP type of tenant services work.
  • Tools: For example, monitoring tools should be tenant-aware to avoid leaking tenant information. Performance tuning tools should be tenant-aware to avoid unexpected performance interference between tenants.

Of course, virtual cluster needs extra resources to run a tenant master for each tenant, which may not be affordable in some cases.

Conclusions

Virtual cluster extends the namespace group multi-tenancy solution with a user-friendly cluster view. It leverages the underlying K8s resource isolation mechanisms and the existing Tenant CRD & controller in the community, but provides users with the experience of a dedicated tenant cluster. Overall, we believe virtual cluster, together with namespace-based multi-tenancy, can offer comprehensive solutions for various Kubernetes multi-tenancy use cases in production clusters, and we are actively working on contributing this plugin to the upstream community.

See ya at KubeCon!

Linkerd Benchmarks

Originally published on linkerd.io by William Morgan

Update 5/30/2019: Based on feedback from the Istio team, Kinvolk has re-run some of the Istio benchmarks. The results are largely similar to before, with Linkerd maintaining a significant advantage over Istio in latency, memory footprint, and possibly CPU. Below, we’ve noted the newer numbers for Istio when applicable.

Linkerd’s goal is to be the fastest, lightest, simplest service mesh in the world. To that end, several weeks ago we asked the kind folks at Kinvolk to perform an independent benchmark. We wanted an unbiased evaluation by a third party with strong systems expertise and a history of benchmarking. Kinvolk fit this description to a T, and they agreed to take on the challenge.

We asked Kinvolk for several things:

  • A benchmark measuring tail latency, CPU usage, and memory consumption—the three metrics we believe are most indicative of the cost of operating a service mesh.
  • A comparison to the baseline of not using a service mesh at all.
  • A comparison to Istio, another service mesh. (We’re frequently asked how the two compare.)
  • A realistic test setup for an application “at load” and “at scale”, including an apples-to-apples comparison between features, and controls for variance and measurement error.
  • A downloadable framework for reproducing these tests, so that anyone can validate their work.

Today, Kinvolk published their results. You can see the full report here: Kubernetes Service Mesh Benchmarking. Kinvolk measured Linkerd 2.3-edge-19.5.2 and Istio 1.1.6, the latest releases that were available at the time of testing. They measured performance under two conditions: “500rps” and “600rps”, representing effectively “high” and “very high” load for the test harness.

Here’s a summary of their results. (Note that for Istio, Kinvolk tested two configurations, “stock” and “tuned”. We’re looking purely at the “tuned” configuration below.)

Latency

[Charts: latency at 500rps and 600rps]

Latency is arguably the most important number for a service mesh, since it measures the user-facing (as opposed to operator-facing) impact of a service mesh. Latency is also the most difficult to reason about, since it is best measured as a distribution.
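
For readers less used to percentile notation, here is a small, self-contained sketch (not Kinvolk's harness) of how p50, p99 and p999 are read off a latency sample, and why a single slow request barely moves the median but dominates the tail.

```go
// Not Kinvolk's harness: a minimal illustration of how tail percentiles are
// read off a latency sample, and why one slow request barely moves the
// median but dominates p99/p999.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the smallest sample value such that at least a fraction
// p of the sample is less than or equal to it (nearest-rank method).
func percentile(sorted []time.Duration, p float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Pretend these are per-request latencies recorded by a load generator.
	sample := []time.Duration{
		2 * time.Millisecond, 3 * time.Millisecond, 3 * time.Millisecond,
		4 * time.Millisecond, 5 * time.Millisecond, 7 * time.Millisecond,
		9 * time.Millisecond, 12 * time.Millisecond, 40 * time.Millisecond,
		800 * time.Millisecond, // a single slow outlier
	}
	sort.Slice(sample, func(i, j int) bool { return sample[i] < sample[j] })

	fmt.Println("p50 :", percentile(sample, 0.50))  // 5ms
	fmt.Println("p99 :", percentile(sample, 0.99))  // 800ms
	fmt.Println("p999:", percentile(sample, 0.999)) // 800ms
}
```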

Kinvolk measured latency from the perspective of the load generator, which means that these latency numbers are a function of the application they tested: if the call graph were deeper, we’d see additional latency, and if it were shallower, these numbers would be lower. Thus, the raw numbers are not as important as the comparisons: how did Linkerd do versus the baseline, and versus Istio?

In the 500rps condition, Linkerd’s p99 latency was 6.7ms, 3.6ms over the no-service-mesh baseline p99 of 3.1ms. (In other words, 99% of the time, a request without a service mesh took less than 3.1ms, and 99% of the time, a request with Linkerd took less than 6.7ms.) At the p999 level (the 99.9th percentile), Linkerd’s latency was significantly worse, at 679ms, 675ms above the baseline’s p999 of 4ms. The worst response time seen over the whole test with Linkerd was a full 1.8s of latency, compared to the baseline’s worst case of 972ms.

By comparison, Istio’s p99 latency in the 500rps case was 643ms, almost 100x worse than Linkerd’s p99. Its p999 was well over a second, compared to Linkerd’s 679ms, and its worst case was a full 5s of latency, 2.5x what was measured with Linkerd.

(Update: Kinvolk’s re-tuned Istio benchmarks dropped Istio’s p99 from 100x that of Linkerd’s to 26x and 59x that of Linkerd’s across two runs. It also dropped Istio’s p999 to just under a second, though still double Linkerd’s p999.)

In the 600rps condition, the difference between the two service meshes is exacerbated. While Linkerd’s p99 rose from 6.7ms to 7ms, 4ms over the “no service mesh” baseline, Istio’s p99 was a full 4.4 minutes (!). While Linkerd’s p999 climbed to 850ms, compared to the baseline of 3.8ms, Istio’s p999 was almost 6 minutes. Even Istio’s p50 (median) latency was an unacceptable 17.6 seconds. In short, Istio was not able to perform effectively in Kinvolk’s 600rps condition.

(Update: Kinvolk’s re-tuned Istio benchmark showed similar performance in the 600rps condition, with p99 latency for Istio remaining in the minutes and median latency between 10 and 20 seconds.)

Summary: Linkerd had a latency advantage over Istio. In the 500rps condition, Istio’s p99 was 100x Linkerd’s. In the 600rps condition, Istio’s latency was unacceptable throughout. However, both meshes introduced significant latency at the 99.9th percentile compared to the baseline in the 500rps condition.

Memory consumption

[Chart: memory consumption at 600rps]

At 500rps, Linkerd’s memory usage was 517mb across all data plane proxies (averaging 5.7mb per proxy), and a little under 500mb for the control plane itself, for a total of ~1gb of memory. By comparison, Istio’s memory usage was 4307mb across all data plane proxies (averaging 47mb per proxy), and 1305mb for the control plane, for a total of almost 5.5gb.

The situation was almost identical in the 600rps condition. Of course, both meshes suffer greatly when compared to the baseline usage of 0mb!

(As a side note, 25% of Linkerd’s control plane memory usage in these runs was its Prometheus instance, which temporarily stores aggregated metrics results to power Linkerd’s dashboard and CLI. Arguably, this should have been excluded, since Prometheus was disabled in Istio.)

Summary: Linkerd had a clear memory advantage. Istio consumed 5.5x as much memory as Linkerd. Linkerd’s data plane, in particular, consumed less than 1/8th of the RAM that Istio’s did.

CPU consumption

[Charts: CPU consumption at 500rps and 600rps]

When measuring CPU consumption, the two meshes produced comparable results. In the 500rps run, Linkerd’s data plane proxies took 1618mc (millicores) cumulatively, and its control plane consumed 82mc, for a total of 1700mc. Istio’s data plane proxies consumed 1723mc and its control plane consumed 379mc, for a total of 2100mc, a 23% increase over Linkerd. However, in the 600rps run, the results were flipped, with Linkerd taking 1951mc vs Istio’s 1985mc. In this run, Linkerd’s data plane consumed 15% more CPU than Istio’s. (Though it should be noted that, since Istio was not able to actually return 600rps, it’s not entirely a fair comparison to Linkerd.)

Summary: CPU usage was similar between Linkerd and Istio. Linkerd’s data plane CPU utilization was higher than Istio’s in the 600rps condition, though it’s hard to know how meaningful this result is because Istio was not able to fully perform in this condition.

(Update: Kinvolk’s re-tuned Istio benchmarks showed a “massive increase in CPU usage of Istio’s proxy sidecar”. While no numbers were reported, from this description it is clear that Linkerd’s CPU usage was less than Istio’s in this configuration.)

Conclusion

Overall we are happy with Linkerd’s performance in this test, and we’re very happy to have a thorough quantification of the relative cost of introducing a service mesh, and a publicly available, reproducible harness for running these benchmarks.

We’re pleased to see that Linkerd has a very significant advantage in latency and memory consumption over Istio. While CPU usage was comparable, we feel that Linkerd can do better. We suspect there is low-hanging fruit in linkerd-proxy, and we’re eager to see if a little profiling and tuning can reduce CPU consumption over the next few releases. (And we’d love your help – hop into the Linkerd Slack.)

In the future, we hope that projects like Meshery can provide an industry-standard approach for these sorts of benchmarks. They’re good for users and good for projects too.

We’re impressed by the thoroughness of Kinvolk’s tests, and we’d like to thank them for taking on this sizeable effort. The full report details the incredible amount of effort they put into building accurate test conditions, reducing variability, and generating statistically meaningful results. It’s definitely worth a read!

Finally, we’d also like to extend a huge THANK YOU to the kind folks at Packet, who allowed Kinvolk to use the CNCF community cluster to perform these experiments. The Linkerd community owes you a debt of gratitude.


Linkerd is a community project and is hosted by the Cloud Native Computing Foundation. If you have feature requests, questions, or comments, we’d love to have you join our rapidly-growing community! Linkerd is hosted on GitHub, and we have a thriving community on Slack, Twitter, and the mailing lists. Come and join the fun!

Image credit: Bernard Spragg. NZ

Diversity Scholarship Series: My KubeCon & CloudNativeCon Europe Experience 2019

Guest post by Ines Cheikhrouhou, DevOps and Cloud consultant, Agyla originally published on Medium

Hi, my name is Ines, and I’m one of the lucky people who received an invite as a diversity scholar to KubeCon + CloudNativeCon in Barcelona.

First of all, I want to thank CNCF for this great and life-changing opportunity for me.

I was very happy to receive the email, and I was very excited to meet the kind of people who love staying in front of computers. It motivated me a lot to learn more about the technical aspects of Kubernetes, mainly the core project. However, little did I know that this event was much more than that. It gave me all of the motivation, and also all of the technical information and knowledge, that I wanted to have.

My first day at KubeCon was the AWS Container Day co-located event, which was a perfect choice for me since I work with AWS and I wanted to dig deeper into it and other cloud native projects.

Throughout the day, I learned a lot about what’s new in AWS with relation to Kubernetes or simply other cloud-native tools.

One of the best discoveries for me was AWS App Mesh, which is based on Envoy proxy, as well as some advanced services for observability such as CloudWatch Container Insights.

In addition to that, I learned about the huge benefit of running machine learning development workflows on Kubernetes, as well as the advantages of Kubeflow.
And of course, the famous eksctl CLI, which helps provision your cluster in an easy way.

And here are my favorite pictures for the first day.

 

 

The end of the first day was very successful and it made me feel like I belonged with these smart and motivating people. I also got the chance to tour the beautiful Barcelona city (Thanks CNCF for the great choice).

The second day was in a different place, the main event venue at the Fira Gran Via, which was a HUGE building. It was very organized, and you could find the day’s schedule posted in every corner.

The day started with a perfect keynote where we learned about almost all of the CNCF projects, with details from people who work daily on these projects. I got to listen to some of the best speakers, such as Dan Kohn, Bryan Liles and especially Cheryl Hung, a woman who inspired me a lot. Seeing all of these women participate as speakers made me want to work hard and be on that stage one day.

And last but not least, there was the famous presentation from Lucas and Nikhita that marked a starting point in my life: contributing.

I always thought it was just some smart people coding in a shared GitHub account on a project that I would probably not understand. But it’s more than that; it’s not even just about coding. It’s the family that it creates, a family composed of people who encourage you, help you and inspire you to show your best. It’s about sharing and improving.

And here is the famous picture that presented the CNCF projects.

After the keynote, there were so many sessions that I didn’t want to miss, and there was a huge sponsor showcase, which was the perfect place if you had a question in mind or wanted a demo, a sticker or a t-shirt. I got to meet so many people, see a lot of demos, and learn about new CNCF projects that I hadn’t heard about before.

And it was also such a pleasure meeting talented people such as Joe Beda, Ali Saad, Arun Gupta, Janet Kuo and all of the speakers who taught us so many things in a short period of time.

I also participated in the Networking and Mentoring session the next day, which was the best part of the whole event for me. I was lucky to share a table with three of the greatest people I met at the event: Nikhita, Hippie Hacker and carolynvs. They introduced me to the world of contribution and the steps to follow, they taught me where to start, and I even made my first PR that day.

These kinds of people, and all of the community’s love of open source, are what make it fun and interesting to be a part of. And I hope everyone who, like me, was at a certain point scared or doubtful knows that it’s really a safe place to be.

In the end, the event was a SUCCESS for me and I enjoyed the 5th-anniversary party of Kubernetes and the big Poble Espanyol party with free food and drinks.

Overall, Kubecon was a dream experience for me. Before this conference, I wouldn’t have been able to talk about Kubernetes or other Cloud Native Projects with confidence. But after this event, I gained a lot of knowledge and met so many people who offered me help. The whole experience offered me great opportunities to improve my personal and professional development. I’m excited to share this experience with my friends and I’m inspired to start being an active member of the community.

With this, I look forward to KubeCon + CloudNativeCon Europe 2020. Thank you KubeCon + CloudNativeCon Europe 2019, and thank you CNCF for the amazing opportunity.

Here is a quick summary of what you probably missed during this event. Let’s start with OpenTelemetry, the next major version of the OpenTracing and OpenCensus projects; CNAB, which allows one to package up multiple formats and their toolchains into a single artifact; different Kubernetes operators; how Bazel helps with streamlining Kubernetes application CI/CD; containerd; CRI-O; autoscaling multi-cluster observability using Prometheus and Linkerd; building Docker images on Kubernetes using BuildKit; and the great cloud native storage orchestrator Rook.

In addition, there was Jaeger, its agents and its scaling; enhancing security with Envoy SDS; the new Fluent Bit, after Fluentd, for extending your logging pipeline with Go; the amazing Grafana Loki for logs and its integration with existing observability tools; multiple types of load balancing, such as gRPC load balancing and its benefits with a service mesh; and KubeVirt VMs that provide network functions to Kubernetes objects.

And lastly, the superstar Helm; Prometheus and custom metrics for K8s; Calico and SPIRE and their integration with Envoy; the trending new GitOps strategies; and the serverless future of cloud computing.

CNCF Openness Guidelines

CNCF is an open source technical community where technical project collaboration, discussions, and decision-making should be open and transparent. Please see our charter, particularly section 3(b), for more background on CNCF values.

Design, discussions, and decision-making around technical topics of CNCF-hosted projects should occur in public view such as via GitHub issues and pull requests, public Google Docs, public mailing lists, conference calls at which anyone may participate (and which are normally published afterward on YouTube), and in-person meetings at KubeCon + CloudNativeCon and similar events. This includes all SIGs, working groups, and other forums where portions of the community meet.

This is particularly important in light of the Linux Foundation’s (revised) Statement on the Huawei Entity List Ruling. (Note that CNCF is part of the Linux Foundation.) Our technical community operates openly and in public, which affords us exceptions to regulations that other, closed organizations may have to address differently. This open, public technical collaboration is also critical to our community’s success as we navigate competitive and shifting industry dynamics. Openness is particularly important in any discussions involving encryption, since encryption technologies can be subject to Export Administration Regulations.

If you have questions or concerns about these guidelines, I encourage you to discuss them with your company’s legal counsel and/or to email me and Chris Aniszczyk at openness@cncf.io. Thank you.

Apple Joins Cloud Native Computing Foundation as Platinum End User Member

The Cloud Native Computing Foundation (CNCF), which sustains and integrates open source technologies like Kubernetes® and Prometheus™, today announced that Apple has joined the CNCF as a Platinum End User Member.

Apple has completely revolutionized personal and enterprise technology, has long been a pioneer in cloud native computing, and was one of the earlier adopters of container technology. Apple has also contributed to several CNCF projects, including Kubernetes, gRPC, Prometheus, Envoy Proxy and Vitess, and hosted the FoundationDB Summit at KubeCon + CloudNativeCon last year.

“Having a company with the experience and scale of Apple as an end user member is a huge testament to the vitality of cloud native computing for the future of infrastructure and application development,” said Chris Aniszczyk, CTO of the Cloud Native Computing Foundation. “We’re thrilled to have the support of Apple, and look forward to the future contributions to the broader cloud native project community.”

As part of Apple’s Platinum membership, Tom Doron, Senior Engineering Manager at Apple, has joined CNCF’s Governing Board.

Apple will join 87 other end user companies including Adidas, Akatsuki, Amadeus, Atlassian, AuditBoard, Bloomberg, Box, Cambia Health Solutions, Capital One, Concur, Cookpad, Cruise, Curve, DENSO Corporation, DiDi, Die Mobiliar, DoorDash, eBay, Form3, GE Transportation, GitHub, Globo, Goldman Sachs, Granular, i3 Systems, Indeed, Intuit, JD.com, JP Morgan, Kuelap, Mastercard, Mathworks, Mattermost, Morgan Stanley, MUFG Union Bank, NAIC, Nasdaq, NCSOFT, New York Times, Nielsen, NIPR, Pinterest, PostFinance, Pusher, Reddit, Ricardo.ch, Salesforce, Shopify, Showmax, SimpleNexus, Spotify, Spredfast, Squarespace, State Farm, State Street, Steelhouse, Stix Utvikling AS, Testfire Labs, Textkernel, thredUP, TicketMaster, Tradeshift, Twitter, Two Sigma, University of Michigan – ARC, Upsider, Walmart, Werkspot, WeWork, WikiMedia, WooRank, Workday, WPEngine, Yahoo Japan Corporation, Zalando SE, and Zendesk in CNCF’s End User Community. This group meets monthly and advises the CNCF Governing Board and Technical Oversight Committee on key challenges, emerging use cases and areas of opportunity and new growth for cloud native technologies.


Square: How Vitess Enables ‘Near Unlimited Scale’ for Cash App

Four years ago, Square branched out into peer-to-peer transactions via its Cash App. After doing so, users started increasing by the minute, and the team needed a long-term solution for scalability. Vitess was the answer to the scalability issue. With Vitess, Cash App didn’t have to completely change how developers built applications; the team changed only 5% of their system, rather than 95%, to respond to increased customer demand. Additionally, Cash App developers can do multiple shard splits per week with less than a second of downtime. Read the full case study here.

 

Reflections on the Fifth Anniversary of Kubernetes

Guest post from the Kubernetes Project

Five years ago, Kubernetes was released into the world. Like all newborns, it was small, limited in functionality, and had only a few people involved in its creation. Unlike most newborns, it also involved a great deal of code written in Bash. Today, at the five year mark, Kubernetes is full grown, and while a human would be just entering kindergarten, Kubernetes is at the core of production workloads from startups to global financial institutions.

They say that success has a thousand parents and failure is an orphan, but in the case of Kubernetes the truth is that its success is due to its thousands (and thousands) of parents. Kubernetes came from humble beginnings, with just a handful of developers and in record time grew into its current state with literally thousands of contributors – and even more people involved in meetups, docs, education, release management, and supporting the broader community. At many points in the project, when it seemed that it might be moving too fast or becoming too big, the community has responded and stepped up with new ways of organizing and new ways of supporting the project so that it could have continued success. It is an amazing achievement to see a project reach this scale and continue to operate successfully, and it is a tribute to each and every member of our amazing community that we’ve been able to do this while maintaining an open, neutral and respectful community.

Five years in, it’s worth reflecting on the things that Kubernetes has achieved. It is one of the largest, if not the single largest open source project on the planet. It has managed to sustain a fast pace of development across a team of thousands of distributed engineers working in a myriad of different companies. It has merged tens of thousands of commits while sustaining a regular release cadence of high-quality software that has become mission-critical for countless organizations and companies. This would be no small achievement within a single company, but to do this while being driven by dozens of different companies and thousands of individuals (many of whom have other jobs or even school!) is truly amazing. It is a credit to the selflessness of all of the folks in the community who chop wood and carry water every single day to ensure that our tests are green (ish), our releases get patched, our security is maintained, and our community conducts itself within the bounds of our code of conduct. To all of the people who do this often tedious, and sometimes emotionally draining work, you deserve our deepest thanks. We could never have gotten here without you.

Of course, the story of Kubernetes isn’t just a story of community, it is also a story of technology. It is breath-taking to see the speed with which the ideas of cloud-native development have shaped the narrative of how reliable and scalable applications are built. Kubernetes has become a catalyst for the digital transformation of organizations toward cloud-native technologies and techniques. It has become the rallying point and supporting platform for the development of an entire ecosystem of projects and products that add powerful cloud-native capabilities for developers and operators. By providing a ubiquitous and extensible control-plane for application development, Kubernetes has successfully uplifted a whole class of higher-level abstractions and tools.

One of the most important facets of the Kubernetes project was knowing where it should stop. This has been a core tenet of the project since the beginning and though the surface area of Kubernetes continues to grow, it has an asymptotic limit. Because of this, there is a flourishing ecosystem on top of and alongside the core APIs. From package managers to automated operators, from workflow systems to AI and deep learning, the Kubernetes API has become the substrate on which a vibrant cloud-native biome is growing.

As Kubernetes turns five, we naturally look to the future and contemplate how we can ensure that it continues to grow and flourish. In the celebration of everything that has been achieved, it must also be noted that there is always room for improvement. Though our community is broad and amazing, ensuring a diverse and inclusive community is a journey, not a destination, and requires constant attention and energy. Likewise, despite the promise of cloud-native technologies, it is still too hard to build reliable, scalable services. As Kubernetes looks to its future, these are core areas where investment must occur to ensure continued success. It’s been an amazing five years, and with your help the next five will be even more amazing. Thank you!

A Look Back At KubeCon + CloudNativeCon Barcelona 2019

Hot off an amazing three days in Barcelona, here is a snapshot into some of the key highlights and news from KubeCon + CloudNativeCon Europe 2019! This year we welcomed more than 7,700 attendees from around the world to hear compelling talks from CNCF project maintainers, end users and community members.

The annual European event grew by more than 3,000 attendees compared to last year in Copenhagen. At the conference, CNCF announced that its ever-growing ecosystem has hit over 400 member companies, of which more than 88 are now end user members. We also learned that Kubernetes has had more than 2.66 million contributions from 26,214 contributors.

This year we welcomed Bryan Liles as a KubeCon + CloudNativeCon co-chair! He took the stage to announce all the great project news that has come out in the last couple months.

During the opening keynotes we also heard from CNCF Executive Director Dan Kohn, who spoke about the key factors that contributed to the massive growth of the Kubernetes ecosystem, and CNCF Director of Ecosystem Cheryl Hung, who shared CNCF’s growth and plans to continue building a positive community. Lucas Käldström (CNCF Ambassador, Independent) and Nikhita Raghunath (Software Engineer, Loodse) shared insights on the what, why and how of contributing to Kubernetes.

Kubernetes Boothday Party!

While we celebrated the cloud native community, we also got to celebrate Kubernetes’ fifth birthday with a “Boothday Party” and donut wall!  

Continuing to Embrace Diversity in the Ecosystem

At KubeCon + CloudNativeCon EU, CNCF’s diversity program offered scholarships to 56 recipients, from traditionally underrepresented and/or marginalized groups, to attend the conference! The $100K investment for Barcelona was donated by CNCF, Aspen Mesh, Google Cloud, Red Hat, Twistlock and VMware.

CNCF has offered more than 300 diversity scholarships to attend KubeCons since November 2016.

We also had a wonderful time at the Diversity lunch and EmpowerUs events!

Take Good Care: Open Sourcing Mental Illness

This year at KubeCon + CloudNativeCon EU, we made sure that self care and mental wellness was top of mind for everyone. As a result, we got to hear inspiring talks throughout the conference on these topics, plus, we had a booth 100% dedicated to relaxing and mental health. We felt so much community support!

All Attendee Party at Poble Espanyol!

Our events team organized a fantastic party at Poble Espanyol, celebrating the many achievements of the cloud native ecosystem in the beautiful Spanish courtyard!

Keynote and Session Highlights

All presentations and videos are available to watch. Here is how to find all the great content from the show:

  • Keynotes, sessions and lightning talks can be found on the CNCF YouTube
  • Photos can be found on the CNCF Flickr
  • Presentations can be found on the Conference Event Schedule; click on a session and scroll to the bottom of the page to download the PDF of the presentation

“From the people Computer Weekly spoke to at Kubecon-CloudNativeCon, there is a sense that Kubernetes is breaking out of the open source developer space into the enterprise.” Cliff Saran, ComputerWeekly

“Five years after Google released the toolkit for managing workloads to the open source community, Kubernetes became the celebrated boy. Nothing seems to stop its advance, especially because the developers have embraced this system.” Alfred Montie, Computable.nl

“The Kubernetes container-orchestration system is one platform that is both surviving and thriving.” Nick Marinoff, SiliconANGLE

“It’s apparent that whether an application lives on the Google Cloud Platform or in an on-premises data center, it can be done with containers.” Kristen Nicole, SiliconANGLE

“As stated by Dan Kohn, Kubernetes has emerged on the shoulders of giants: on Linux, the Internet and various cluster manager implementations of cloud-native businesses from Spotify to Facebook to Google. The exciting question for the conference is which innovations will come on the shoulders of the giant Kubernetes in the next few days.” Josef Adersberger, Alex Krause, Heise Online

“The obvious conclusion: If you’re interested in enterprise IT infrastructure, Kubernetes should be your technology of choice, and KubeCon is the place to be.” Jason Bloomberg, SiliconANGLE

It’s a Wrap!

Save the Dates!

Register now for KubeCon + CloudNativeCon + Open Source Summit China 2019, scheduled for June 24-26, 2019 at the Shanghai Expo Centre, Shanghai, China.

Register now for KubeCon + CloudNativeCon North America 2019, scheduled for November 18-21, 2019 at the San Diego Convention Center, San Diego, California. The CFP closes July 12.

Save the date for KubeCon + CloudNativeCon Europe 2020, scheduled for March 30-April 2, 2020 in Amsterdam, The Netherlands.

Observability should not slow you down

Originally published on Medium by Travis Jeppson, Sr. Director of Engineering, Nav Inc

In any application, the lack of observability is the same as riding a bike with a blindfold over your eyes. The only inevitable outcome is crashing, and crashing always comes with a cost. This cost tends to be the only one we focus on when we look at observability, but it isn’t the only cost. The other cost of observability usually isn’t addressed until it becomes more painful than the cost of crashing: the cost of maintenance and adaptability.

I’ve listened to, and watched, many conference talks about this subject, and had my fair share of conversations with vendors as well. Maintenance and adaptability aren’t generally mentioned. These topics have only come up when I’m talking to other companies about their adopted platform and how they were actually able to integrate observability into real-life situations, and from my own experiences doing the same. The reason these topics come up after some practical application is that we’ve all hit the proverbial wall.

We’ve all run into problems, or incompatibilities, or vendor lock-in that feels almost impossible to get rid of. Our observability begins to dwindle, the blindfold starts falling down over our eyes, and we’re again heading toward an inevitable fate. What can be done? Revisit the entire scenario? Wait for a major crash and create an ROI statement to show we have to re-invest in major parts of our applications? This can’t possibly be the only way to deal with this problem. It is an anti-pattern to the way we build software. Observability is supposed to empower speed and agility, not hold them back.

There is another way, and it starts by determining the key elements on which you won’t make concessions. During the last iteration of trying to get this right at Nav, we had a lot of discussions about our previous attempts. The first attempt was a solution we initially thought had unlimited integrations; it turned out it didn’t have the one we needed: Kubernetes. We also couldn’t produce custom metrics from our applications, so that solution had to go. We weren’t about to wait for them to tell us an integration was ready; we were ready to move. We then went with a solution that was end-to-end customizable: we could spend time developing our telemetry data and deciding how to interpret it. This, unfortunately, forced us into a maintenance nightmare. On the third iteration, however, we decided to settle somewhere in the middle. We sat down, defined our “no compromise” priorities, and started finding solutions that fit. Here’s how we saw the priorities for Nav.

1. Customization! We needed adaptability, no waiting for integrations

First and foremost, the solution needed to allow for custom metrics and handle them like first-class citizens. This needed to be true for our infrastructure metrics as well as anything coming from our applications. Adaptability was key in our decision: if the solution we chose was adaptable, then we would be free to adjust any component of our infrastructure without having to check whether our observability would be affected.

2. No vendor-specific code in our applications, not even libraries

This may seem a little harsh at first, but the fact of the matter is that we didn’t want to have a dependency on a vendor. We use a wide variety of languages at Nav: Ruby, Elixir, Go, Python, JavaScript, Java, the list goes on. It was almost impossible to find a vendor solution that would work with all of those languages. We decided the solution needed to be language-agnostic, which means we couldn’t have any vendor code or libraries in our applications. The other side of this is that we didn’t want to be locked into the solution, since we had previously run into issues with that problem.

3. HELP! The maintenance cannot be overwhelming

This meant that at some point we would probably need a vendor to help us out. We didn’t want maintaining a ridiculous uptime for our observability platform to be our concern; we wanted to worry about the uptime of our application instead. We also didn’t want to worry about the infrastructure of the observability platform; we wanted to worry about our own. Catch my drift? We also wanted some guidance about what to pay attention to. We wanted a simple way to build dashboards, and the ability to let pretty much every engineer build their own dashboards around their own metrics.

Now the Rest: Our Second Tier of Priorities

Now we get into the “like to have” priorities. The following were more of a wish list; the top three above were the dealbreakers for any solution we considered. Fortunately, as will be illustrated later, we didn’t need to compromise on any of our priorities.

4. Alerting needed to be easy to do, and integrate with our on-call solution

With our end-to-end customized solution (attempt #2 at observability), alerting was ridiculously tedious. Each alert was a JSON document with so many defining parts that we never really had any good alerts set up. We also caused a lot of on-call burnout due to a large number of false positives. We didn’t want to repeat this.

5. We didn’t want to pay the same price for our non-production environments as we do for production

It is a giant pet peeve of mine that anyone is required to pay the same price for observability just because the size of the environments is the same. Why must this be? I don’t actually care nearly as much if a development environment goes down for 5 minutes, but I definitely care if production is down for 5 minutes.

The Final Decision: Nav’s Tri-Product Solution

With these priorities in hand, we set out to create a solution that worked. To cut a long story short, there was no single perfect solution; nothing could give us the top three priorities on its own. It turned out we needed multiple pieces working seamlessly together.

Prometheus

Prometheus is an open source metric aggregation service. The fantastic thing about Prometheus is that it is built around a standard, which they also created. This standard is called the exposition format. You can provide a text-based endpoint and Prometheus will come by and “scrape” the data off of this endpoint and feed it into a time series database. This … is … simply … amazing! Our developers were able to write this endpoint in their own code bases, and publish any kind of custom metric they desired.
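
As a rough sketch of what this looks like in practice, the example below exposes a custom counter on a /metrics endpoint using the official Prometheus Go client. The metric name, label and port are made up for illustration; they are not Nav’s actual telemetry.

```go
// A minimal sketch of exposing a custom metric in the Prometheus exposition
// format using the official Go client. The metric name, label and port are
// made-up examples, not Nav's actual telemetry.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// ordersProcessed is a counter the application increments from its own code.
var ordersProcessed = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "orders_processed_total",
		Help: "Number of orders processed, by outcome.",
	},
	[]string{"outcome"},
)

func handleOrder(w http.ResponseWriter, r *http.Request) {
	// ... real work would happen here ...
	ordersProcessed.WithLabelValues("success").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/orders", handleOrder)
	// Prometheus (or any scraper that speaks the exposition format) reads
	// this plain-text endpoint on its own schedule.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```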

StatsD

StatsD is a solution originally written by Etsy. It provided a way for us to push metrics from software that isn’t associated with a web server, such as short-lived jobs or event-driven computations.
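
Here is a similarly minimal sketch of pushing a counter from a short-lived job over StatsD’s plain-text UDP protocol; the metric name and address are illustrative.

```go
// A minimal sketch of pushing a StatsD metric from a short-lived job. The
// StatsD wire format is a plain-text datagram ("name:value|type"), so a bare
// UDP write is enough; the metric name and address are illustrative.
package main

import (
	"fmt"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// "|c" marks a counter; gauges use "|g" and timers "|ms".
	fmt.Fprintf(conn, "batch_job.records_processed:%d|c", 42)
}
```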

Between StatsD and Prometheus, we were able to publish custom metrics from virtually anywhere. The other great thing is that, with both of these solutions being open source, there was already a thriving community building assistive components around these two libraries.

The final piece of the puzzle for us was where the vendor came into play. With our priorities set, we found a vendor that seamlessly integrated with Prometheus metrics; they would even scrape the metrics for us, so we didn’t need to run Prometheus ourselves, just use its standards. They also ingested our StatsD metrics without a hitch.

SignalFx

SignalFx was the vendor we ended up selecting; this is what ended up working for us and our priorities. The key component of vendor selection is that the solution fulfills your needs from a managed and ease-of-use point of view. That being said, I’ll illustrate how SignalFx fulfilled this for us.

The final part of our third priority was that we wanted some guidance on what to pay attention to. SignalFx had some very useful dashboards out of the gate that used our Prometheus metrics to pinpoint some of our key infrastructure components, like Kubernetes and AWS.

They also have a very robust alerting system, which was as simple as identifying the “signal” we wanted to pay attention to and adding a particular constraint to it. These constraints could be anything from a static threshold, to outliers, to historical anomalies. This was significantly simpler than our second attempt, and it was built around custom metrics! Win, win!
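
To illustrate the signal-plus-constraint idea in generic terms (this is not SignalFx’s API, just a sketch of the concept), a static-threshold alert boils down to something like this:

```go
// A generic sketch of a static-threshold alert: pick a signal, attach a
// constraint, fire when it is violated. This illustrates the idea only; it
// is not SignalFx's API or configuration format.
package main

import "fmt"

// Alert fires when the latest value of a named signal crosses a threshold.
type Alert struct {
	Signal    string  // e.g. "http_error_rate"
	Threshold float64 // a static threshold; outlier or anomaly detection would replace this
}

// Evaluate reports whether the alert should fire for the given value.
func (a Alert) Evaluate(value float64) bool {
	return value > a.Threshold
}

func main() {
	alert := Alert{Signal: "http_error_rate", Threshold: 0.05}

	for _, v := range []float64{0.01, 0.02, 0.09} {
		if alert.Evaluate(v) {
			fmt.Printf("ALERT: %s = %.2f exceeds %.2f\n", alert.Signal, v, alert.Threshold)
		}
	}
}
```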

Finally, SignalFx charges per metric you send them. The great thing about this is that our non-prod environments are pretty quiet; we dialed their resolution down to a minute or two, so the metrics that are constantly being generated, like CPU or memory, didn’t cost an arm and a leg. This fulfilled our final priority and allowed us to save a significant amount of money compared to other vendor solutions.

The takeaway from all of this is that the observability platform we use, if built around standardized systems, doesn’t have to be painful. In fact, it can be just the opposite. We have been able to accelerate our development, and we have never had surprises, thanks to the maintainability and adaptability of our observability platform.

For more on Nav’s cloud native journey, check out the case study and video.