

Diversity Scholarship Series: Experiencing KubeCon + CloudNativeCon + Open Source Summit in Shanghai


Guest post by Irvi Aini, originally published on Medium

As someone who is still a newbie in the open source world, I often feel intimidated. At first, I wasn’t even sure whether I should contribute at all, even though I’ve been involved in several communities in my home country, Indonesia, including the Docker and Kubernetes communities. That kind of feeling doesn’t dissipate easily.

Whenever I wanted to start contributing, I felt like everyone else must be really smart while I was still a newbie. Then several things happened, and someone told me that I would be disappointed if I kept getting stuck inside my own mind, forever undecided. So I gathered my courage and started looking for something I might be able to achieve. I began reading the design docs, following various SIGs, and I even checked out what the biweekly meetings look like.

One day a friend asked me to help him initiate the localization of the Kubernetes documentation into Bahasa Indonesia. I had been watching this project as well, and with the help of a fellow contributor from 🇫🇷, I started the project with him. I made several mistakes, but the response from the reviewers was beyond my expectations: they were really nice, and I got a lot of help as well. To be honest, I was really struggling, because when I started this project my Bahasa Indonesia was not as good as I had thought. Through my involvement in this project I was promoted to membership in the Kubernetes organization, with sponsorship from the code OWNERS and the fellow contributor from 🇫🇷. Around that time, my friend also dared me to write a proposal and submit it for KubeCon. However, I still don’t have the courage to do public speaking in front of a worldwide audience; even the thought of it is still too much for me.

After the localization project, I began working in the kubernetes-client organization, which involved coding. It was fun, since I got the chance to learn more about Haskell. At this point I dared myself to apply for a KubeCon contributor discount. Since it was too late for Barcelona, I tried my luck with KubeCon Shanghai. I emailed the Linux Foundation with my GitHub handle and, voilà, I got the discount: the ticket was “free”. However, I still needed accommodations to be able to attend the event. I remembered that the Linux Foundation had already spent about $770,000 USD on scholarships. The Diversity Scholarship program provides support to those from traditionally underrepresented and/or marginalized groups in technology and/or open source communities; a recipient receives up to $1,500 USD to reimburse actual travel expenses (airfare, hotel, and ground transport). I checked whether KubeCon Shanghai was also covered, found the link about the diversity scholarship, and applied. The application process is actually pretty straightforward: you fill in details about your experience, your motivation, and what you will gain from attending the event. A few weeks later, I got a reply from the Linux Foundation saying they would give me travel funds for my accommodation. I felt so happy and blessed to be selected as one of 309 recipients (accumulated over 7 KubeCon events) around the world. I was also excited to meet all the folks I already knew through Slack or the mailing lists, and I had the chance to discuss CNCF-related projects in more depth.

My intention in writing this is actually simple. I don’t know how many people feel the same as I do, but I hope that what I share shows that sometimes it’s better to try something new and then say “I’m glad I did that”, instead of letting the possibility pass you by and regretting the decision later. Don’t hesitate to contribute just because you feel intimidated, especially if you’re planning to contribute to CNCF projects. The CNCF and Kubernetes communities have very nice people who are eager to help you on your journey as a fellow contributor. Cheers!

For the localization project

We will of course be really happy if you want to help as well 😊. We have listed everything that needs to be done for this first milestone on the tracking page in this GitHub issue. Not all contributions to an open source project involve coding. Happy contributing!

A Look Back At KubeCon + CloudNativeCon + Open Source Summit China 2019


We had an amazing time in Shanghai, and wanted to share an overview of the key highlights and news from KubeCon + CloudNativeCon + Open Source Summit China 2019! This year we welcomed more than 3,500 attendees from around the world to talk about exciting cloud native and open source topics and hear from CNCF project maintainers, end users, and community members.

Our second annual China event grew by 1,000 to more than 3,500 attendees this year! At the conference, CNCF announced the creation of Special Interest Groups (SIGs) as well as two new groups: Storage and Security.

CNCF shared the news that the largest mobile payments company in the world, Ant Financial, has joined as a Gold End User member.

CNCF also awarded DiDi, the world’s leading multi-modal transportation platform, the End User Award in recognition of its contributions to the cloud native ecosystem.

We watched an exciting line-up of speakers discuss cloud native technology, open source, and their experiences building and using the technology. We also learned how Chinese developers are engaging more and more with CNCF and open source; collectively, they have made China the second largest Kubernetes contributor country in the world!

Diversity Scholarships in Asia! 

For KubeCon + CloudNativeCon + Open Source Summit China this year, CNCF enabled 15 diversity scholars to attend! We are thrilled that sponsors Aspen Mesh, CNCF, Google Cloud, Red Hat, Twistlock, and VMware came together to offer these recipients the opportunity to attend KubeCon. Don’t forget to keep an eye on our blog for our Diversity Scholarship Series, where recipients share their experiences.

Community Party! 

The community came together on Day 1 at the Welcome Reception in the Sponsor Showcase. Sponsors, attendees and fellow community members enjoyed an evening of networking, food, drink and entertainment, including local calligraphy, sugar and candy floss artists. 

As always, we had a killer job board at the event showing continued growth in the ecosystem. 

That’s a wrap!


Save the Dates!

The call for proposals for KubeCon + CloudNativeCon North America 2019, which takes place in San Diego from November 18-21, closes July 12th. Registration is open. 

We’ll be back in Europe for KubeCon + CloudNativeCon Europe in Amsterdam from March 30-April 2, 2020. Registration will go live at the end of the year. We’ll soon be announcing the location for our China event in the summer of 2020.

We hope to see you at one or all of these upcoming events!


Diversity Scholarship Series: KubeCon + CloudNativeCon EU 2019


Guest post by Semah Mhamdi, originally published on Medium

At the beginning of May in Barcelona, the sun shines, the birds sing … but the experts keep on working! Since early morning, a little more than 8,000 people had gathered at Fira Gran Via for KubeCon + CloudNativeCon EU 2019, hosted by the Cloud Native Computing Foundation.

I’m one of the lucky people who were there in Barcelona thanks to the CNCF Diversity scholarship.

My introduction to the world of containers happened one year ago, when I stumbled upon one of my now great friends at a technical workshop in Tunisia.

I am very grateful for the introduction to this world that happened on that fateful day. The day I found Kubernetes (amongst many other things).

I have been working with Kubernetes and containers as a user and a developer, building tools around it. I really wanted to dig deeper into the world of cloud-native architecture and apps. The scholarship gave me an opportunity to interact with amazing communities who work on cloud technology. I’ve learned a lot from them. It also helped me get started on one of my biggest goals of contributing back to these amazing communities.

From day one, I found myself talking to experts as well as beginners. Everyone I approached was willing to talk about their ideas and work. They also listened to my ideas and offered me advice. I even got some really cool swag.

I was present for the new contributor workshop which helped me learn a lot about how to contribute to Kubernetes. It has now led me to join the Kubernetes community.

The next three days’ worth of keynotes were really amazing, especially when we met people we had been following for a long time, like Kris Nova, Joe Beda, and Paris Pittman. The list of projects under the CNCF is growing, and it was amazing to see the various stages at which the projects were featured.

Overall, Kubecon was a dream for me to attend.

I have also started on the path to contributing more to amazing upstream projects.

Thanks again, to the Linux Foundation and to the Cloud Native Computing Foundation, for giving me this opportunity to expand my horizons, to meet great people, to make cool, amazing friends. To learn and to grow … and to look forward to KubeCon + CloudNativeCon North America 2019.

KubeCon + CloudNativeCon Europe 2019 Conference Transparency Report: Another Record-Breaking CNCF Event


KubeCon + CloudNativeCon Europe 2019 was a great success with record-breaking registrations, attendance, sponsorships, and co-located events for our annual European conference. With 7,700 registrations, attendance for this year’s event in Barcelona grew by 84% from last year’s event in Copenhagen. 74% of these attendees were at KubeCon + CloudNativeCon for the first time.  

The KubeCon + CloudNativeCon Europe 2019 conference transparency report

Key Takeaways

  • Attendance grew by 84% from last year’s KubeCon event in Copenhagen.
  • Feedback from attendees was very positive, with all of those surveyed saying they would highly recommend the event to a colleague or friend.
  • 95 media and analysts attended the event, generating more than 5,300 clips of compelling event news.
  • Out of 1,535 CFP submissions – a new record for our European event – 353 speakers (23%) were accepted to speak.
  • Over 50% of attendees participated in one or more of the 27 workshops, mini-summits and training sessions hosted by community members the day prior to the conference.
  • The event drew attendees from 93 countries across 6 continents. 
  • Leveraging the $100,000 in diversity scholarship funds available from Aspen Mesh, CNCF, Google Cloud, Red Hat, Twistlock, and VMware, CNCF provided travel and/or conference registration to 56 applicants.
  • 40% of keynote sessions and 14% of track sessions were led by women.
  • Over 200 people attended the Diversity Lunch + Hack, sponsored by Google Cloud and over 60 people attended the EmpowerUs reception, sponsored by Red Hat.
  • Kubernetes, Prometheus and Helm were the top three projects in terms of attendee interest.
  • The two main reasons respondents cited for attending KubeCon + CloudNativeCon were to learn (72.4%) and to network (18.6%).

Save the Dates!

Speaker submissions are open and due on July 12 for KubeCon + CloudNativeCon North America 2019 which takes place in San Diego from November 18-21. Registration is open. 

We’ll be back in Europe for KubeCon + CloudNativeCon Europe in Amsterdam from March 30-April 2, 2020. Registration will go live at the end of the year.

We hope to see you at one or all of these upcoming events!

With Kubernetes, Spotify’s Capacity Planning Went From Almost an Hour to Seconds or Minutes


Spotify’s audio-streaming platform has grown to over 200 million monthly active users worldwide since launching in 2008. Its homegrown container orchestration system, Helios, was no longer efficient enough for its needs, and Spotify wanted to benefit from the added velocity and reduced cost that Kubernetes provides. Today it has over 150 microservices running on Kubernetes. Find out more by reading the full case study.

Save the Dates for 2019! KubeCon + CloudNativeCon + Open Source Summit China will be happening from June 24-26. Registration is now open.

And finally, we will be in sunny San Diego for KubeCon + CloudNativeCon North America from November 18-21.

Demystifying Containers – Part I: Kernel Space


This blog post was first published on suse.com by Sascha Grunert.

This series of blog posts and corresponding talks aims to provide you with a pragmatic view of containers from a historical perspective. Together we will discover modern cloud architectures layer by layer, which means we will start at the Linux kernel level and end up writing our own secure cloud native applications.

Simple examples paired with historical background will guide you from a minimal Linux environment all the way to crafting secure containers that fit perfectly into today’s and tomorrow’s orchestration world. In the end it should be much easier to understand how features within the Linux kernel, container tools, runtimes, software-defined networks, and orchestration software like Kubernetes are designed and how they work under the hood.


Part I: Kernel Space

This first blog post (and talk) is scoped to Linux kernel related topics, which will provide you with the necessary foundation for building a deep understanding of containers. We will gain insight into the history of UNIX and Linux, and talk about solutions like chroot, namespaces, and cgroups, combined with hacking on our own examples. Besides this, we will peel apart some containers to get a feeling for the topics we will cover in the future.

Introduction

If we are talking about containers nowadays, most people tend to think of the big blue whale or the white steering wheel on the blue background.

Let’s put these thoughts aside and ask ourselves: What are containers, in detail? If we look at the corresponding Kubernetes documentation, we only find explanations of “Why use containers?” and lots of references to Docker. Docker itself explains containers as “a standard unit of software”. These explanations provide a general overview but do not reveal much of the underlying “magic”. Eventually, people tend to imagine containers as cheap virtual machines (VMs), which technically does not come close to reality. This may be because the word “container” does not mean anything precise at all. The same applies to the word “pod” in the container orchestration ecosystem.

If we strip it down, containers are just isolated groups of processes running on a single host that fulfill a set of “common” features. Some of these fancy features are built directly into the Linux kernel, and almost all of them have different historical origins.

So containers have to fulfill four major requirements to be acceptable as such:

  1. Not negotiable: They have to run on a single host. Okay, so two computers cannot run a single container.
  2. Clearly: They are groups of processes. You might know that Linux processes live inside a tree structure, so we can say containers must have a root process.
  3. Okay: They need to be isolated, whatever this means in detail.
  4. Not so clear: They have to fulfill common features. Features in general seem to change over time, so we have to point out what the most common features are.

These requirements alone can lead to confusion, and the picture is not yet clear. So let’s start at the historical beginning to keep things simple.

chroot

Almost every UNIX operating system has the ability to change the root directory of the currently running process (and its children). This originates from the first appearance of chroot in UNIX Version 7 (released 1979), from where it continued its journey into the awesome Berkeley Software Distribution (BSD). In Linux you can nowadays use chroot(2) as a system call (a kernel API function) or via the corresponding standalone wrapper program. Chroot is also referred to as a “jail”, because it was used as a honeypot to monitor an attacker back in 1991. So chroot is much older than Linux, and it was (mis)used in the early 2000s for the first approaches to running applications as what we would today call “microservices”. Chroot is currently used by a wide range of applications, for example within build services for different distributions. Nowadays the BSD implementation differs a lot from the Linux one; we will focus on the latter from here on.

What is needed to run our own chroot environment? Not that much, since something like this already works:

> mkdir -p new-root/{bin,lib64}
> cp /bin/bash new-root/bin
> cp /lib64/{ld-linux-x86-64.so*,libc.so*,libdl.so.2,libreadline.so*,libtinfo.so*} new-root/lib64
> sudo chroot new-root

We create a new root directory, copy a bash shell and its dependencies in and run chroot. This jail is pretty useless: All we have at hand is bash and its builtin functions like cd and pwd.

One might think that running a statically linked binary in a jail would be the same as running a container image. It’s absolutely not; a jail is not really a standalone security feature, but rather a good addition to our container world.

The current working directory is left unchanged when calling chroot via a syscall, so relative paths can still refer to files outside of the new root. The call changes only the root path and nothing else. Besides this, further calls to chroot do not stack; they simply override the current jail. Only privileged processes with the capability CAP_SYS_CHROOT are able to call chroot. At the end of the day, the root user can easily escape from a jail by running a program like this:


#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Create a new directory and chroot into it, overwriting the old jail. */
    mkdir(".out", 0755);
    chroot(".out");
    /* The working directory is unchanged, so walk up past the old root. */
    chdir("../../../../../");
    chroot(".");
    return execl("/bin/bash", "bash", "-i", (char *)NULL);
}

We create a new jail by overwriting the current one and change the working directory to a relative path outside of the chroot environment. The final call to chroot then brings us outside of the jail, which can be verified by spawning a new interactive bash shell.

Nowadays chroot is not used by container runtimes anymore; it was replaced by pivot_root(2), which has the benefit of putting the old mounts into a separate directory during the call. These old mounts can be unmounted afterwards to make the filesystem completely invisible to processes that break out.

To continue with a more useful jail we need an appropriate root filesystem (rootfs), which contains all binaries, libraries, and the necessary file structure. But where do we get one? What about peeling it from an already existing Open Container Initiative (OCI) container image? This can easily be done with two tools, skopeo and umoci:

> skopeo copy docker://opensuse/tumbleweed:latest oci:tumbleweed:latest
[output removed]
> sudo umoci unpack --image tumbleweed:latest bundle
[output removed]

Now with our freshly downloaded and extracted rootfs we can chroot into the jail via:

> sudo chroot bundle/rootfs
#

It looks like we’re running inside a fully working environment, right? But what did we actually achieve? We can see that we can sneak a peek outside the jail from a process perspective:

> mkdir /proc
> mount -t proc proc /proc
> ps aux
[output removed]

There is no process isolation available at all. We can even kill programs running outside of the jail, what a metaphor! Let’s peek into the network devices:

> mkdir /sys
> mount -t sysfs sys /sys
> ls /sys/class/net
eth0 lo

There is no network isolation either. This missing isolation, paired with the ability to leave the jail, leads to lots of security related concerns, because jails are sometimes used for the wrong (security related) purposes. How do we solve this? This is where Linux namespaces join the party.

Linux Namespaces

Namespaces are a Linux kernel feature introduced back in 2002 with Linux 2.4.19. The idea behind a namespace is to wrap certain global system resources in an abstraction layer. This makes it appear as if the processes within a namespace have their own isolated instance of the resource. The kernel’s namespace abstraction allows different groups of processes to have different views of the system.

Not all of the available namespaces were implemented from the beginning. Full support for what we now understand as “container ready” arrived in kernel version 3.8 back in 2013 with the introduction of the user namespace. We currently have seven distinct namespaces implemented: mnt, pid, net, ipc, uts, user, and cgroup. No worries, we will discuss them in detail. In September 2016 two additional namespaces were proposed (time and syslog), which are not fully implemented yet. Let’s have a look at the namespace API before digging into the individual namespaces.

API

The namespace API of the Linux kernel consists of three main system calls:

clone

The clone(2) API function creates a new child process, in a manner similar to fork(2). Unlike fork(2), clone(2) allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers. You can pass different namespace flags to clone(2) to create new namespaces for the child process.

unshare

The function unshare(2) allows a process to disassociate parts of the execution context which are currently being shared with others.

setns

The function setns(2) reassociates the calling thread with the provided namespace file descriptor. This function can be used to join an existing namespace.

proc

Besides the available syscalls, the proc filesystem populates additional namespace related files. Since Linux 3.8, each file in /proc/$PID/ns is a “magic“ link which can be used as a handle for performing operations (like setns(2)) to the referenced namespace.

> ls -Gg /proc/self/ns/
total 0
lrwxrwxrwx 1 0 Feb  6 18:32 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 0 Feb  6 18:32 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 0 Feb  6 18:32 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 0 Feb  6 18:32 net -> 'net:[4026532008]'
lrwxrwxrwx 1 0 Feb  6 18:32 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb  6 18:32 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb  6 18:32 user -> 'user:[4026531837]'
lrwxrwxrwx 1 0 Feb  6 18:32 uts -> 'uts:[4026531838]'

This allows us, for example, to track which namespaces certain processes reside in. Another way to play around with namespaces, apart from the programmatic approach, is to use the tools from the util-linux package, which contains dedicated wrapper programs for the mentioned syscalls. One handy namespace-related tool in this package is lsns, which lists useful information about all currently accessible namespaces or about a single given one. But now let’s finally get our hands dirty.
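As a small illustration of how these magic links can be compared, here is a hypothetical Python helper (not from the original post): two processes share a namespace exactly when the inode numbers in their /proc/$PID/ns links match. On a live Linux system you would obtain the link targets via os.readlink; here we parse sample strings of the same shape.

```python
import re

def parse_ns_link(target):
    """Split a target like 'mnt:[4026531840]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group(1), int(match.group(2))

def same_namespace(target_a, target_b):
    """True if both links refer to the same namespace instance."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)

# Sample values in the format shown by `ls /proc/self/ns/` above:
print(parse_ns_link("mnt:[4026531840]"))                       # ('mnt', 4026531840)
print(same_namespace("net:[4026532008]", "net:[4026532008]"))  # True
print(same_namespace("net:[4026532008]", "net:[4026531840]"))  # False
```

This is essentially what lsns does under the hood when grouping processes by namespace.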

Available Namespaces

Mount (mnt)

The first namespace we want to try out is the mnt namespace, which was the first one implemented, back in 2002. At that time (mostly) no one thought that multiple namespaces would ever be needed, so the namespace clone flag was simply named CLONE_NEWNS. This leads to a small inconsistency with the other namespace clone flags (I see you suffering!). With the mnt namespace, Linux is able to isolate a set of mount points for a group of processes.

A great use case of the mnt namespace is to create environments similar to jails, but in a more secure fashion. How do we create such a namespace? It can easily be done via an API function call or via the unshare command line tool:

> sudo unshare -m
# mkdir mount-dir
# mount -n -o size=10m -t tmpfs tmpfs mount-dir
# df mount-dir
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs              10240     0     10240   0% <PATH>/mount-dir
# touch mount-dir/{0,1,2}

Looks like we have a successfully mounted tmpfs, which is not available on the host system level:

> ls mount-dir
> grep mount-dir /proc/mounts
>

The actual memory used for the mount point lies in an abstraction layer called the Virtual File System (VFS), which is part of the kernel and on which every other filesystem is based. If the namespace gets destroyed, the mount memory is unrecoverably lost. The mount namespace abstraction gives us the possibility to create entire virtual environments in which we are the root user, even without root permissions.

On the host system we are able to see the mount point via the mountinfo file inside of the proc filesystem:

> grep mount-dir /proc/$(pgrep -u root bash)/mountinfo
349 399 0:84 / /mount-dir rw,relatime - tmpfs tmpfs rw,size=10240k

How do we work with these mount points at the source code level? Well, programs tend to keep a file handle on the corresponding /proc/$PID/ns/mnt file, which refers to the namespace in use. In the end, mount namespace related implementation scenarios can be really complex, but they give us the power to create flexible container filesystem trees. The last thing I want to mention is that mounts can have different propagation types (shared, slave, private, unbindable), which are best explained in the shared subtree documentation of the Linux kernel.

UNIX Time-sharing System (uts)

The UTS namespace was introduced in Linux 2.6.19 (2006) and allows us to unshare the domain name and hostname from the current host system. Let’s give it a try:

> sudo unshare -u
# hostname
nb
# hostname new-hostname
# hostname
new-hostname

And if we look at the system level nothing has changed, hooray:

> hostname
nb

The UTS namespace is yet another nice addition to containerization, especially when it comes to container networking related topics.

Interprocess Communication (ipc)

IPC namespaces also came with Linux 2.6.19 (2006) and isolate interprocess communication (IPC) resources, specifically System V IPC objects and POSIX message queues. One use case of this namespace is to separate the shared memory (SHM) of two processes to avoid misuse: each process can use the same identifier for a shared memory segment and still end up with two distinct regions. When an IPC namespace is destroyed, all IPC objects in the namespace are automatically destroyed, too.

Process ID (pid)

The PID namespace was introduced in Linux 2.6.24 (2008) and gives processes an independent set of process identifiers (PIDs). This means that processes residing in different namespaces can own the same PID. In the end, a process has two PIDs: the PID inside the namespace and the PID outside the namespace on the host system. PID namespaces can be nested, so a newly created process will have a PID for each namespace from its current namespace up to the initial PID namespace.

The first process created in a PID namespace gets PID 1 and receives the same special treatment as the usual init process. For example, orphaned processes within the namespace are re-parented to the namespace’s PID 1 rather than to the host’s PID 1. In addition, the termination of this process immediately terminates all processes in its PID namespace and any of its descendants. Let’s create a new PID namespace:

> sudo unshare -fp --mount-proc
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.4  0.6  18688  6608 pts/0    S    23:15   0:00 -bash
root        39  0.0  0.1  35480  1768 pts/0    R+   23:15   0:00 ps aux

Looks isolated, doesn’t it? The --mount-proc flag is needed to re-mount the proc filesystem from the new namespace; otherwise we would not see the PID subtree corresponding to the namespace. Another option is to mount the proc filesystem manually via mount -t proc proc /proc, but this also overrides the mount from the host, which then has to be remounted afterwards.
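The nesting described above can also be observed from the host: since Linux 4.1, /proc/$PID/status contains an NSpid line listing the process’s PID at every nesting level, from the initial PID namespace down to its own. Here is a hypothetical Python sketch (not from the original post) that extracts it, shown on sample file content:

```python
def parse_nspid(status_text):
    """Return the list of nested PIDs from /proc/$PID/status content."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            # "NSpid:\t17871\t1" -> one PID per namespace level, outermost first
            return [int(pid) for pid in line.split()[1:]]
    return []  # kernel too old to report NSpid

# A bash process seen as PID 17871 on the host but as PID 1 in its namespace:
sample = "Name:\tbash\nNSpid:\t17871\t1\nState:\tS (sleeping)\n"
print(parse_nspid(sample))  # [17871, 1]
```

On a real system you would read the file with open("/proc/%d/status" % pid) instead of using a sample string.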

Network (net)

Network namespaces were completed in Linux 2.6.29 (2009) and can be used to virtualize the network stack. Each network namespace contains its own resource properties within /proc/net. Furthermore, a network namespace contains only a loopback interface on initial creation. Let’s create one:

> sudo unshare -n
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Every network interface (physical or virtual) is present exactly once per namespace, and an interface can be moved between namespaces. Each namespace contains a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall, and other network-related resources.

Destroying a network namespace destroys any virtual interfaces within it and moves any physical interfaces back to the initial network namespace.

A possible use case for network namespaces is creating software-defined networks (SDNs) via virtual Ethernet (veth) interface pairs. One end of the pair is plugged into a bridge interface while the other end is assigned to the target container. This is how pod networks like Flannel work in general.

Let’s see how it works. First, we need to create a new network namespace, which can be done via ip, too:

> sudo ip netns add mynet
> sudo ip netns list
mynet

So we created a new network namespace called mynet. When ip creates a network namespace, it also creates a bind mount for it under /var/run/netns. This allows the namespace to persist even when no processes are running within it.

With ip netns exec we can inspect and manipulate our network namespace even further:

> sudo ip netns exec mynet ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> sudo ip netns exec mynet ping 127.0.0.1
connect: Network is unreachable

The network seems down, let’s bring it up:

> sudo ip netns exec mynet ip link set dev lo up
> sudo ip netns exec mynet ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.016 ms

Hooray! Now let’s create a veth pair which should allow communication later on:

> sudo ip link add veth0 type veth peer name veth1
> sudo ip link show type veth
11: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b2:d1:fc:31:9c:d3 brd ff:ff:ff:ff:ff:ff
12: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0f:37:18:76:52 brd ff:ff:ff:ff:ff:ff

Both interfaces are automatically connected, which means that packets sent to veth0 will be received by veth1 and vice versa. Now we associate one end of the veth pair to our network namespace:

> sudo ip link set veth1 netns mynet
> ip link show type veth
12: veth0@if11: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0f:37:18:76:52 brd ff:ff:ff:ff:ff:ff link-netns mynet

Our network interfaces need some addresses for sure:

> sudo ip netns exec mynet ip addr add 172.2.0.1/24 dev veth1
> sudo ip netns exec mynet ip link set dev veth1 up
> sudo ip addr add 172.2.0.2/24 dev veth0
> sudo ip link set dev veth0 up

Communicating in both directions should now be possible:

> ping -c1 172.2.0.1
PING 172.2.0.1 (172.2.0.1) 56(84) bytes of data.
64 bytes from 172.2.0.1: icmp_seq=1 ttl=64 time=0.036 ms

--- 172.2.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.036/0.036/0.036/0.000 ms
> sudo ip netns exec mynet ping -c1 172.2.0.2
PING 172.2.0.2 (172.2.0.2) 56(84) bytes of data.
64 bytes from 172.2.0.2: icmp_seq=1 ttl=64 time=0.020 ms

--- 172.2.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.020/0.020/0.020/0.000 ms

It works, but we don’t yet have internet access from inside the network namespace. For that we would need a network bridge (or something similar) on the host and a default route inside the namespace. I’ll leave that task up to you; for now, let’s move on to the next namespace.

User ID (user)

With Linux 3.5 (2012), the isolation of user and group IDs via namespaces became possible, and Linux 3.8 (2013) made it possible to create user namespaces without actually being privileged. The user namespace enables the user and group IDs of a process to be different inside and outside of the namespace. An interesting use case is that a process can have a normal unprivileged user ID outside a user namespace while being fully privileged inside it.

Let’s give it a try:

> id -u
1000
> unshare -U
> whoami
nobody

After the namespace creation, the files /proc/$PID/{u,g}id_map expose the mappings for user and group IDs for the PID. These files can be written only once to define the mappings.

In general, each line within these files contains a one-to-one mapping of a range of contiguous user IDs between two user namespaces, and could look like this:

> cat /proc/$PID/uid_map
0 1000 1

The example above reads: starting at user ID 0 inside the namespace, IDs map to a range starting at ID 1000 outside of it. Since the defined length is 1, the mapping covers exactly one ID: user 0 inside the namespace corresponds to user 1000 outside.

If a process now tries to access a file, its user and group IDs are mapped into the initial user namespace for the purpose of permission checking. When a process retrieves a file’s user and group IDs (via stat(2)), the IDs are mapped in the opposite direction.

In the unshare example above, we implicitly called getuid(2) before writing an appropriate user mapping, which results in an unmapped ID. Unmapped IDs are automatically converted to the overflow user ID (65534, or the value in /proc/sys/kernel/overflow{g,u}id).

The file /proc/$PID/setgroups contains either allow or deny, to enable or disable the permission to call the setgroups(2) syscall within the user namespace. The file was added to address a security issue introduced with user namespaces: it would otherwise be possible for an unprivileged process to create a new namespace in which the user had all privileges, and then drop groups via setgroups(2) to gain access to files to which they previously had no access.

In the end the user namespace enables great security additions to the container world, which are essential for running rootless containers.

Control Group (cgroup)

Cgroups started their journey in 2008 with Linux 2.6.24 as a dedicated Linux kernel feature. The main goal of cgroups is to support resource limiting, prioritization, accounting and control. Work on a major redesign, cgroups v2, started in 2013, whereas the cgroup namespace was added with Linux 4.6 (2016) to prevent leaking host information into a namespace. The second version of cgroups was released around the same time, and major features have been added since then. One recent example is the cgroup-aware Out-of-Memory (OOM) killer, which adds the ability to kill a cgroup as a single unit to guarantee the overall integrity of the workload.

Let’s play around with cgroups and create a new one. By default, the kernel exposes cgroups in /sys/fs/cgroup. To create a new cgroup, we simply create a new sub-directory on that location:

> sudo mkdir /sys/fs/cgroup/memory/demo
> ls /sys/fs/cgroup/memory/demo
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks

You can see that some default values are already exposed there. Now we are able to set the memory limit for that cgroup. We also set the cgroup’s swappiness to zero, so that swapping does not interfere with our example.

> sudo su
# echo 100000000 > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# echo 0 > /sys/fs/cgroup/memory/demo/memory.swappiness

To assign a process to a cgroup we can write the corresponding PID to the cgroup.procs file:

# echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs

Now we can execute a sample application to consume more than the allowed 100 megabytes of memory. The application I used is written in Rust and looks like this:

pub fn main() {
    let mut vec = vec![];
    loop {
        // Append another 10 MB of data on each iteration
        vec.extend_from_slice(&[1u8; 10_000_000]);
        // Print the amount of memory consumed so far
        println!("{}0 MB", vec.len() / 10_000_000);
    }
}

If we run the program, we see that the process gets killed because of the set memory constraints, so our host system stays usable.

# rustc memory.rs
# ./memory
10 MB
20 MB
30 MB
40 MB
50 MB
60 MB
70 MB
80 MB
90 MB
Killed

Composing Namespaces

Namespaces are composable, too! This reveals their true power and makes it possible to have, for example, isolated PID namespaces that share the same network interface, as is done in Kubernetes Pods.

To demonstrate this, let’s create a new namespace with an isolated PID:

> sudo unshare -fp --mount-proc
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.6  18688  6904 pts/0    S    23:36   0:00 -bash
root        39  0.0  0.1  35480  1836 pts/0    R+   23:36   0:00 ps aux

The setns(2) syscall, with its appropriate wrapper program nsenter, can now be used to join the namespace. First, we have to find out which namespace we want to join:

> export PID=$(pgrep -u root bash)
> sudo ls -l /proc/$PID/ns

Now, it is easily possible to join the namespace via nsenter:

> sudo nsenter --pid=/proc/$PID/ns/pid unshare --mount-proc
# ps aux
root         1  0.1  0.0  10804  8840 pts/1    S+   14:25   0:00 -bash
root        48  3.9  0.0  10804  8796 pts/3    S    14:26   0:00 -bash
root        88  0.0  0.0   7700  3760 pts/3    R+   14:26   0:00 ps aux

We can now see that we are a member of the same PID namespace! It is also possible to enter already-running containers via nsenter, but this topic will be covered later on.

Demo Application

A small demo application can be used to create a simple isolated environment via the namespace API:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/msg.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACKSIZE (1024 * 1024)
static char stack[STACKSIZE];

void print_err(char const * const reason)
{
    fprintf(stderr, "Error %s: %s\n", reason, strerror(errno));
}

int exec(void * args)
{
    // Remount proc
    if (mount("proc", "/proc", "proc", 0, "") != 0) {
        print_err("mounting proc");
        return 1;
    }

    // Set a new hostname
    char const * const hostname = "new-hostname";
    if (sethostname(hostname, strlen(hostname)) != 0) {
        print_err("setting hostname");
        return 1;
    }

    // Create a message queue
    key_t key = {0};
    if (msgget(key, IPC_CREAT) == -1) {
        print_err("creating message queue");
        return 1;
    }

    // Execute the given command
    char ** const argv = args;
    if (execvp(argv[0], argv) != 0) {
        print_err("executing command");
        return 1;
    }

    return 0;
}

int main(int argc, char ** argv)
{
    // Provide some feedback about the usage
    if (argc < 2) {
        fprintf(stderr, "No command specified\n");
        return 1;
    }

    // Namespace flags
    const int flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWIPC |
                      CLONE_NEWPID | CLONE_NEWUSER | SIGCHLD;

    // Create a new child process
    pid_t pid = clone(exec, stack + STACKSIZE, flags, &argv[1]);

    if (pid < 0) {
        print_err("calling clone");
        return 1;
    }

    // Wait for the process to finish
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        print_err("waiting for pid");
        return 1;
    }

    // Return the exit code
    return WEXITSTATUS(status);
}

The purpose of the application is to spawn a new child process in different namespaces. Every argument provided to the executable is forwarded to the new child process. The application terminates when the command execution is done. You can test and verify the implementation via:

> gcc -o namespaces namespaces.c
> ./namespaces ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> ./namespaces ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.0  0.1  36524  1828 pts/0    R+   23:46   0:00 ps aux
> ./namespaces whoami
nobody

This is certainly not a fully working container, but it should give you a feeling for how container runtimes leverage namespaces to manage containers. Feel free to use this example as a starting point for your own experiments with the namespace API.

Putting it all Together

Do you remember the rootfs we extracted from the image in the chroot section? We can use a low-level container runtime like runc to easily run a container from that rootfs:

> sudo runc run -b bundle container

If we now inspect the system namespaces, we see that runc has already created mnt, uts, ipc, pid and net namespaces for us:

> sudo lsns | grep bash
4026532499 mnt         1  6409 root   /bin/bash
4026532500 uts         1  6409 root   /bin/bash
4026532504 ipc         1  6409 root   /bin/bash
4026532505 pid         1  6409 root   /bin/bash
4026532511 net         1  6409 root   /bin/bash

I will stop here; we will learn more about container runtimes, and what they do, in upcoming blog posts and talks.

Conclusion

I really hope you enjoyed the read, and that the mysteries surrounding containers are now a little more fathomable. If you run Linux, it is easy to play around with the different isolation techniques from scratch. In the end, a container runtime combines all of these isolation features on different abstraction levels to provide a stable and robust development and production platform for containers.

There are lots of topics which were not covered here because I wanted to stay at a consistent level of detail. A great resource for digging deeper into the topic of Linux namespaces is the Linux programmer’s manual: NAMESPACES(7).

Feel free to drop me a line or get in contact with me for any questions or feedback. The next blog posts will cover container runtimes, security and the overall ecosystem around latest container technologies. Stay tuned!

You can find all necessary resources about this series on GitHub.

TOC Approves CNCF SIGs and Creates Security and Storage SIGs

By | Blog

Earlier this year, the Technical Oversight Committee (TOC) voted to create CNCF Special Interest Groups (SIGs). CNCF SIGs are currently being bootstrapped in various focus areas, primarily led by recognized experts and supported by contributors. They report directly to the TOC, and we encourage developers and end users to get involved in their formation:

Name (to be finalised), area, and current CNCF projects:

  • Traffic: networking, service discovery, load balancing, service mesh, RPC, pubsub, etc. Projects: Envoy, Linkerd, NATS, gRPC, CoreDNS, CNI
  • Observability: monitoring, logging, tracing, profiling, etc. Projects: Prometheus, OpenTracing, Fluentd, Jaeger, Cortex, OpenMetrics
  • Governance: authentication, authorization, auditing, policy enforcement, compliance, GDPR, cost management, etc. Projects: SPIFFE, SPIRE, Open Policy Agent, Notary, TUF, Falco
  • App Delivery: PaaS, serverless, operators, CI/CD, conformance, chaos engineering, scalability and reliability measurement, etc. Projects: Helm, CloudEvents, Telepresence, Buildpacks, (CNCF CI)
  • Core and Applied Architectures: orchestration, scheduling, container runtimes, sandboxing technologies, packaging and distribution, and specialized architectures thereof (e.g. edge, IoT, big data, AI/ML). Projects: Kubernetes, containerd, rkt, Harbor, Dragonfly, Virtual Kubelet

The TOC and CNCF Staff will start drafting an initial set of charters for the above SIGs, and solicit suitable chairs. Visit the CNCF SIG page for more information.

Security SIG

Approved by the TOC earlier this month, the Security SIG’s mission is to reduce the risk that cloud native applications expose end user data or allow other unauthorized access.

While there are many open source security projects, security has generally received less attention than other areas of the cloud native landscape. The visibility into these projects’ internals has been limited, as has their integration into cloud native tooling. There has also been a lack of security experts focused on the ecosystem. All of this has contributed to uncertainty about how to securely set up and operate cloud native architectures.

It is essential to design common architectural patterns to improve overall security in cloud native systems.

The TOC has defined three objectives for this SIG, which will complement what is currently being done by CNCF’s security-related projects:

  • Protection of heterogeneous, distributed and fast changing systems, while providing needed access
  • Common understanding and common tooling to help developers meet security requirements
  • Common tooling for audit and reasoning about system properties.

Security must be addressed at all levels of the stack and across the entire ecosystem. As a result, the Security SIG is looking for participation and membership from a diverse range of roles, industries, companies and organizations. See the Security SIG Charter for more information.

TOC Liaisons: Liz Rice and Joe Beda

Co-Chairs: Sarah Allen, Dan Shaw, Jeyappragash JJ

Storage SIG

The Storage SIG was approved in late May, and aims to enable widespread and successful storage of persistent state in cloud native environments. The group focuses on storage systems and approaches suitable for and commonly used in modern cloud native environments, including:

  • Storage systems that differ significantly from systems and approaches previously commonly used in traditional enterprise data center environments,
  • Those that are not already adequately covered by other groups within the CNCF
  • Block stores, file systems, object stores, databases, key-value stores, and related caching mechanisms.

The Storage SIG strives to understand the fundamental characteristics of different storage approaches with respect to availability, scalability, performance, durability, consistency, ease-of-use, cost and operational complexity. The goal then is to clarify suitability for various cloud native use cases.  

If you are interested in participating in the Storage SIG, check out the Charter for more information.

TOC Liaisons: Xiang Li

Co-Chairs: Alex Chircop, Quinton Hoole



Virtual Cluster – Extending Namespace Based Multi-tenancy with a Cluster View

By | Blog

Guest post by Fei Guo and Lei Zhang of Alibaba

Abstract:

In this guest post, the Kubernetes team from Alibaba shares how they are building hard multi-tenancy on top of upstream Kubernetes by leveraging a group of plugins named “Virtual Cluster” and extending the tenant design in the community. The team has decided to open source these K8s plugins and contribute them to the Kubernetes community at the upcoming KubeCon.

Introduction

In Alibaba, the internal Kubernetes team uses one web-scale cluster to serve a large number of business units as end users. In this case, every end user effectively becomes a “tenant” of this K8s cluster, which makes hard multi-tenancy a strong need.

However, instead of hacking the Kubernetes API server and resource model, the team at Alibaba tried to build a “Virtual Cluster” multi-tenancy layer without changing any code of Kubernetes. With this architecture, every tenant is assigned a dedicated K8s control plane (kube-apiserver + kube-controller-manager) and several “Virtual Nodes” (pure Node API objects with no corresponding kubelet), so there are no worries about naming or node conflicts at all, while the tenant workloads still run mixed on the same underlying “Super Cluster”, so resource utilization is guaranteed. This design is detailed in the [virtual cluster proposal], which has received lots of feedback.

Although a new concept of a “tenant master” is introduced in this design, virtual cluster is simply an extension built on top of the existing namespace-based multi-tenancy in the K8s community, which is referred to as “namespace group” in the rest of this document. Virtual cluster fully relies on the resource isolation mechanisms proposed by namespace group, and we eagerly expect and push for them to be addressed in the ongoing efforts of the Kubernetes WG-multitenancy.

If you want to know more details about the Virtual Cluster design, please do not hesitate to read the [virtual cluster proposal]. In this document, we will focus on the high level idea behind virtual cluster, elaborate on how we extend the namespace group with a “tenant cluster” view, and explain why this extension is valuable for Kubernetes multi-tenancy use cases.

Background

This section briefly reviews the architecture of namespace group multi-tenancy proposal.

We borrow a diagram from the K8s Multi-tenancy WG Deep Dive presentation, shown in Figure 1, to explain the high level idea of using namespaces to organize tenant resources.

                         Figure 1. Namespace group multi-tenancy architecture

In namespace group, all tenant users share the same access point, the K8s apiserver, to utilize tenant resources. Their accounts, assigned namespaces and resource isolation policies are all specified in tenant CRD objects, which are managed by the tenant admin. The tenant user view is limited to the per-tenant namespaces. The tenant resource isolation policies are defined to disable direct communication between tenants and to protect tenant Pods from security attacks. They are realized by native Kubernetes resource isolation mechanisms, including RBAC, Pod security policy, network policy, admission control and sandbox runtime. Multiple security profiles can be configured and applied for different levels of isolation requirements. In addition, resource quotas, chargeback and billing happen at the tenant level.

How Virtual Cluster Extends the View Layer

Conceptually, virtual cluster provides a view layer extension on top of the namespace group solution. Its technical details can be found in [virtual cluster]. In virtual cluster, the tenant admin still uses the same tenant CRD as in namespace group to specify the tenant user accounts, namespaces and resource isolation policy in the tenant resource provider, i.e., the super master.

                         Figure 2. View Layer Extension By Virtual Cluster

As illustrated in Figure 2, thanks to the new virtual cluster view layer, tenant users now have different access points and tenant resource views. Instead of accessing the super master and viewing the tenant namespaces directly, tenant users interact with dedicated tenant masters to utilize tenant resources, and are offered complete K8s master views. All tenant requests are synchronized to the super master by the sync-manager, which creates corresponding custom resources on behalf of the tenant users in the super master, following the resource isolation policy specified in the tenant CRD. That being said, virtual cluster primarily changes the tenant user view from namespaces to an apiserver. From the super master perspective, the same workflow is triggered by the tenant controller with respect to the tenant CRD.

Benefits of Virtual Cluster View Extension

There are quite a few benefits to having a virtual cluster view on top of the existing namespace view for tenant users:

  • It provides flexible and convenient tenant resource management for tenant users. For example, a nested namespace hierarchy, as illustrated in Figure 3(a), can easily resolve some hard problems in the namespace group solution, like naming conflicts, namespace visibility, and sub-partitioning of tenant resources [Tenant Concept]. However, it is almost impractical to change the native K8s master to support nested namespaces. By having a virtual cluster view, the namespaces created in the tenant master, along with the corresponding namespace group in the super master, can achieve a similar user experience as if nested namespaces were used.

As shown in Figure 3(b), tenant users can do self-service namespace creation in the tenant master without worrying about naming conflicts with other tenants. The conflicts are resolved by the sync-manager when it adds the tenant namespaces to the super master namespace group. Tenant A users can never view tenant B users’ namespaces, since they access different masters. It is also convenient for a tenant to customize policies for different tenant users, which only take effect locally in the tenant master.

  • It provides stronger tenant isolation and security, since it avoids certain problems caused by sharing the same K8s master among multiple tenant users. For example, DoS attacks, API access rate control among tenants and tenant controller isolation are no longer concerns.
  • It allows tenant users to create cluster-scope objects in tenant masters without affecting other tenants. For instance, a tenant user can now create CRDs, ClusterRoles/ClusterRoleBindings, PersistentVolumes, ResourceQuotas, ServiceAccounts and NetworkPolicies freely in the tenant master without worrying about conflicts with other tenants.
  • It alleviates the scalability stress on the super master. First, RBAC rules, policies and user accounts managed in the super master can be offloaded to the tenant masters, which can be scaled independently. Second, tenant controllers and operators access multiple tenant masters instead of a single super master, which again can be scaled independently.
  • It makes it much easier to create users for tenant users. Nowadays, if a tenant user wants to expose tenant resources to other users (for example, a team leader wants to add team members to use the resources assigned to the team), the tenant admin has to create all the users. In case a tenant admin needs to serve hundreds of such teams in a big organization, creating users on behalf of tenant users can be a big burden. Virtual cluster completely offloads this burden from the tenant admin to the tenant users.

Limitations

Since virtual cluster mainly extends the multi-tenancy view option and prevents problems caused by sharing the apiserver, it inherits the same limitations/challenges faced by the namespace group solution in making Kubernetes node components tenant-aware. The node components that need to be enhanced include, but are not limited to:

  • Kubelet and CNI-plugin. They need to be tenant-aware to support strong network isolation scenarios like VPC.
    • For example, how do readiness/liveness probes work if a pod is isolated in a different VPC from the node? This is one of the issues on which we have already started to cooperate with SIG-Node upstream.
  • Kube-proxy/Kube-dns. They need to be tenant-aware to make cluster-IP type of tenant services work.
  • Tools: for example, monitoring tools should be tenant-aware to avoid leaking tenant information, and performance tuning tools should be tenant-aware to avoid unexpected performance interference between tenants.

Of course, virtual cluster needs extra resources to run a tenant master for each tenant, which may not be affordable in some cases.

Conclusions

Virtual cluster extends the namespace group multi-tenancy solution with a user-friendly cluster view. It leverages the underlying K8s resource isolation mechanisms and the existing tenant CRD & controller in the community, but provides users with the experience of a dedicated tenant cluster. Overall, we believe that virtual cluster, together with namespace-based multi-tenancy, can offer comprehensive solutions for various Kubernetes multi-tenancy use cases in production clusters, and we are actively working on contributing this plugin to the upstream community.

See ya at KubeCon!

Linkerd Benchmarks

By | Blog

Originally published on linkerd.io by William Morgan


Update 5/30/2019: Based on feedback from the Istio team, Kinvolk has re-run some of the Istio benchmarks. The results are largely similar to before, with Linkerd maintaining a significant advantage over Istio in latency, memory footprint, and possibly CPU. Below, we’ve noted the newer numbers for Istio when applicable.

Linkerd’s goal is to be the fastest, lightest, simplest service mesh in the world. To that end, several weeks ago we asked the kind folks at Kinvolk to perform an independent benchmark. We wanted an unbiased evaluation by a third party with strong systems expertise and a history of benchmarking. Kinvolk fit this description to a T, and they agreed to take on the challenge.

We asked Kinvolk for several things:

  • A benchmark measuring tail latency, CPU usage, and memory consumption—the three metrics we believe are most indicative of the cost of operating a service mesh.
  • A comparison to the baseline of not using a service mesh at all.
  • A comparison to Istio, another service mesh. (We’re frequently asked how the two compare.)
  • A realistic test setup for an application “at load” and “at scale”, including an apples-to-apples comparison between features, and controls for variance and measurement error.
  • A downloadable framework for reproducing these tests, so that anyone can validate their work.

Today, Kinvolk published their results. You can see the full report here: Kubernetes Service Mesh Benchmarking. Kinvolk measured Linkerd 2.3-edge-19.5.2 and Istio 1.1.6, the latest releases that were available at the time of testing. They measured performance under two conditions: “500rps” and “600rps”, representing effectively “high” and “very high” load for the test harness.

Here’s a summary of their results. (Note that for Istio, Kinvolk tested two configurations, “stock” and “tuned”. We’re looking purely at the “tuned” configuration below.)

Latency

[Charts: latency at 500rps and 600rps]

Latency is arguably the most important number for a service mesh, since it measures the user-facing (as opposed to operator-facing) impact of a service mesh. Latency is also the most difficult to reason about, since it is best measured as a distribution.

Kinvolk measured latency from the perspective of the load generator, which means that these latency numbers are a function of the application they tested: if the call graph were deeper, we would see additional latency, and if it were shallower, these numbers would be lower. Thus, the raw numbers are not as important as the comparisons: how did Linkerd do versus the baseline, and versus Istio?

In the 500rps condition, Linkerd’s p99 latency was 6.7ms, 3.6ms over the no-service-mesh baseline p99 of 3.1ms. (In other words, 99% of the time, a request without a service mesh took less than 3.1ms, and 99% of the time, a request with Linkerd took less than 6.7ms.) At the p999 level (the 99.9th percentile), Linkerd’s latency was significantly worse, at 679ms, 675ms above the baseline’s p999 of 4ms. The worst response time seen over the whole test with Linkerd was a full 1.8s of latency, compared to the baseline’s worst case of 972ms.

By comparison, Istio’s p99 latency in the 500rps case was 643ms, almost 100x worse than Linkerd’s p99. Its p999 was well over a second, compared to Linkerd’s 679ms, and its worst case was a full 5s of latency, 2.5x what was measured with Linkerd.

(Update: Kinvolk’s re-tuned Istio benchmarks dropped Istio’s p99 from 100x that of Linkerd’s to 26x and 59x that of Linkerd’s across two runs. It also dropped Istio’s p999 to just under a second, though still double Linkerd’s p999.)

In the 600rps condition, the difference between the two service meshes is exaggerated. While Linkerd’s p99 elevates from 6.7ms to 7ms, 4ms over the “no service mesh” baseline, Istio’s p99 was a full 4.4 minutes (!). While Linkerd’s p999 climbed to 850ms, compared to the baseline of 3.8ms, Istio’s p999 is almost 6 minutes. Even Istio’s p50 (median) latency was an unacceptable 17.6 seconds. In short, Istio was not able to perform effectively in Kinvolk’s 600rps condition.

(Update: Kinvolk’s re-tuned Istio benchmark showed similar performance in the 600rps condition, with p99 latency for Istio remaining in the minutes and median latency between 10 and 20 seconds.)

Summary: Linkerd had a latency advantage over Istio. In the 500rps condition, Istio’s p99 was 100x of Linkerd’s. In the 600rps condition, Istio’s latency was unacceptable throughout. However, both meshes introduced significant latency at the 99.9th percentile compared to the baseline in the 500rps condition.

Memory consumption

[Chart: memory consumption at 600rps]

At 500rps, Linkerd’s memory usage was 517mb across all data plane proxies (averaging 5.7mb per proxy), and a little under 500mb for the control plane itself, for a total of ~1gb of memory. By comparison, Istio’s memory usage was 4307mb across all data plane proxies (averaging 47mb per proxy), and 1305mb for the control plane, for a total of almost 5.5gb.

The situation was almost identical in the 600rps condition. Of course, both meshes suffer greatly when compared to the baseline usage of 0mb!

(As a side note, 25% of Linkerd’s control plane memory usage in these runs was its Prometheus instance, which temporarily stores aggregated metrics results to power Linkerd’s dashboard and CLI. Arguably, this should have been excluded, since Prometheus was disabled in Istio.)

Summary: Linkerd had a clear memory advantage. Istio consumed 5.5x as much memory as Linkerd. Linkerd's data plane, in particular, consumed less than 1/8th of the RAM that Istio's did.

CPU consumption

500rps cpu chart
600rps cpu chart

When measuring CPU consumption, the two meshes produced comparable results. In the 500rps run, Linkerd's data plane proxies consumed 1618mc (millicores) cumulatively and its control plane consumed 82mc, for a total of 1700mc. Istio's data plane proxies consumed 1723mc and its control plane consumed 379mc, for a total of roughly 2100mc, a 23% increase over Linkerd. However, in the 600rps run, the results flipped, with Linkerd at 1951mc vs Istio's 1985mc; in this run, Linkerd's data plane in the 600rps condition consumed 15% more CPU than Istio's. (Though it should be noted that, since Istio was not able to actually return 600rps, it is not an entirely fair comparison to Linkerd.)
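The CPU comparison reduces to the same kind of arithmetic over millicores (1000mc = one CPU core); a sketch of the 500rps figures:

```python
# 500rps CPU figures, in millicores.
linkerd_cpu = 1618 + 82  # data plane + control plane = 1700mc
istio_cpu = 1723 + 379   # = 2102mc, reported as ~2100mc

# Istio's relative increase over Linkerd: ~23.6%
increase_pct = (istio_cpu - linkerd_cpu) / linkerd_cpu * 100

# 600rps totals, where the two meshes nearly converge.
linkerd_cpu_600 = 1951
istio_cpu_600 = 1985
```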

Summary: CPU usage was similar between Linkerd and Istio. Linkerd's data plane CPU utilization was higher than Istio's in the 600rps condition, though it's hard to know how meaningful this result is, since Istio was not able to fully perform in that condition.

(Update: Kinvolk’s re-tuned Istio benchmarks showed a “massive increase in CPU usage of Istio’s proxy sidecar”. While no numbers were reported, from this description it is clear that Linkerd’s CPU usage was less than Istio’s in this configuration.)

Conclusion

Overall we are happy with Linkerd’s performance in this test, and we’re very happy to have a thorough quantification of the relative cost of introducing a service mesh, and a publicly available, reproducible harness for running these benchmarks.

We’re pleased to see that Linkerd has a very significant advantage in latency and memory consumption over Istio. While CPU usage was comparable, we feel that Linkerd can do better. We suspect there is low-hanging fruit in linkerd-proxy, and we’re eager to see if a little profiling and tuning can reduce CPU consumption over the next few releases. (And we’d love your help – hop into the Linkerd Slack.)

We hope that projects like Meshery can eventually provide an industry-standard approach to these sorts of benchmarks. They're good for users and good for projects alike.

We’re impressed by the thoroughness of Kinvolk’s tests, and we’d like to thank them for taking on this sizeable effort. The full report details the incredible amount of effort they put into building accurate test conditions, reducing variability, and generating statistically meaningful results. It’s definitely worth a read!

Finally, we’d also like to extend a huge THANK YOU to the kind folks at Packet, who allowed Kinvolk to use the CNCF community cluster to perform these experiments. The Linkerd community owes you a debt of gratitude.


Linkerd is a community project and is hosted by the Cloud Native Computing Foundation. If you have feature requests, questions, or comments, we’d love to have you join our rapidly-growing community! Linkerd is hosted on GitHub, and we have a thriving community on Slack, Twitter, and the mailing lists. Come and join the fun!

Image credit: Bernard Spragg. NZ

Diversity Scholarship Series: My KubeCon & CloudNativeCon Europe Experience 2019

By | Blog

Guest post by Ines Cheikhrouhou, DevOps and Cloud consultant, Agyla originally published on Medium

Hi, my name is Ines and I'm one of the lucky people who received an invitation as a diversity scholar to KubeCon + CloudNativeCon in Barcelona.

First of all, I want to thank CNCF for this great and life-changing opportunity for me.

I was very happy to receive the email and very excited to meet the kind of people who love staying in front of computers. It motivated me a lot to learn more about the technical aspects of Kubernetes, mainly the core project. However, little did I know that this event was much more than that: it gave me all of the motivation, technical information, and knowledge that I wanted.

My first day at KubeCon was the AWS Container Day co-located event, which was a perfect choice for me since I work with AWS and wanted to dig deeper into it and other cloud-native projects.

Throughout the day, I learned a lot about what's new in AWS in relation to Kubernetes and other cloud-native tools.

One of the best discoveries for me was AWS App Mesh, which is based on Envoy Proxy, as well as some advanced observability services such as CloudWatch Container Insights.

In addition to that, I learned about the huge benefits of running machine learning development workflows on Kubernetes, as well as the advantages of Kubeflow.
And, most of all, the famous eksctl CLI, which makes it easy to provision a cluster.

And here are my favorite pictures from the first day.

The end of the first day was very successful, and it made me feel like I belonged with these smart and motivating people. I also got the chance to tour the beautiful city of Barcelona (thanks, CNCF, for the great choice).

The second day was in a different venue, the main event space at the Fira Gran Via, which was a HUGE building. It was very organized, and you could find the day's schedule in every corner.

The day started with a perfect keynote where we learned about almost all of the CNCF projects with details from people who work daily on these projects and I got to listen to some of the best speakers such as Dan Kohn, Bryan Liles and especially Cheryl Hung, a woman who inspired me a lot. Seeing all of these women who participated as speakers made me want to work hard and be there on stage one day.

And last but not least, the famous presentation from Lucas and Nikhita that marked a starting point in my life: contributing to open source.

I always thought it was just some smart people coding in a shared GitHub repository on a project that I would probably not understand. But it's more than that; it's not even about coding. It's about the family it creates, a family composed of people who encourage you, help you, and inspire you to show your best. It's about sharing and improving.

And here is the famous picture presenting the CNCF projects.

After the keynote, there were so many sessions that I didn't want to miss, and there was a huge sponsor showcase area, which was the perfect place to go if you had a question, wanted a demo, or were after a sticker and a t-shirt. I got to meet so many people, see a lot of demos, and learn about CNCF projects I hadn't heard of before.

And it was also such a pleasure meeting talented people such as Joe Beda, Ali Saad, Arun Gupta, Janet Kuo, and all of the speakers who taught us so many things in a short period of time.

I also participated in the Networking and Mentoring session the next day, which was the best part of the whole event for me. I was lucky to share a table with three of the greatest people I met at the event: Nikhita, Hippie Hacker, and carolynvs. They introduced me to the world of contribution and the steps to follow, they taught me where to start, and I even made my first PR that day.

These kinds of people, and the whole community's love of open source, are what make it fun and interesting to be a part of. And I hope everyone who, like me, was at some point scared or doubtful knows that it's really a safe place to be.

In the end, the event was a SUCCESS for me and I enjoyed the 5th-anniversary party of Kubernetes and the big Poble Espanyol party with free food and drinks.

Overall, KubeCon was a dream experience for me. Before this conference, I wouldn't have been able to talk about Kubernetes or other cloud-native projects with confidence. But after this event, I gained a lot of knowledge and met so many people who offered me help. The whole experience gave me great opportunities for personal and professional development. I'm excited to share this experience with my friends, and I'm inspired to become an active member of the community.

With this, I look forward to KubeCon + CloudNativeCon Europe 2020. Thank you KubeCon + CloudNativeCon Europe 2019, and thank you CNCF for the amazing opportunity.

Here is a quick summary of what you probably missed during this event. Let's start with OpenTelemetry, the next major version of the OpenTracing and OpenCensus projects; CNAB, which lets you package multiple formats and their toolchains into a single artifact; various Kubernetes operators; how Bazel helps streamline Kubernetes application CI/CD; containerd; CRI-O; autoscaling multi-cluster observability using Prometheus and Linkerd; building Docker images on Kubernetes using BuildKit; and the great cloud-native storage orchestrator Rook.

In addition: Jaeger, its agents, and how to scale it; enhanced security via Envoy SDS; the new Fluent Bit, following Fluentd, for extending your logging pipeline with Go; the amazing Grafana Loki for logs and its integration with existing observability tools; multiple types of load balancing, such as gRPC load balancing and its benefits with a service mesh; and KubeVirt VMs, which provide networking functions as Kubernetes objects.

And lastly: the superstar Helm; Prometheus and Kubernetes custom metrics; Calico, SPIRE, and their integration with Envoy; the trending new GitOps strategies; and the serverless future of cloud computing.
