How to Overcome the Day 2 Kubernetes Skills Gap

Posted on September 2, 2020 by Emily Omier

Guest post originally published on the Nirmata blog by Emily Omier

In the ‘old’ days of enterprise IT, most engineers were hyper-specialized. Each individual would be a specialist in networking or storage, for example, but most didn’t have generalized knowledge about the parts of the software stack outside their domain of expertise. Building system-wide and cross-functional skills was reserved for the senior architect role. This was true for developers and operations engineers as well as for storage, networking, and security specialists.

One of the main goals of the DevOps movement has been to break down silos and encourage individuals to develop a broader skill set, one that might include very deep, specialized knowledge in one domain but also a basic understanding of everything else that goes into making an enterprise application work. This is important at every stage of the application life cycle, but especially so for Day 2 operations. Once an application is in production, avoiding downtime is essential, and that requires the knowledge to troubleshoot effectively and quickly.

This broad skill set is critical for Kubernetes platform engineers (or cluster admins or platform operators, whatever you call the Kubernetes experts in your organization). Because Kubernetes is a platform that manages networking, security, storage, and compute, the person or people responsible for configuring and managing Kubernetes need at least a working understanding of how each of those things works, and how it works specifically in a cloud native environment.

The reality, though, is that most organizations have trouble finding ‘DevOps engineers’ or anyone with a wide enough skill set to successfully manage Kubernetes. This is part of what leads to the persistent Kubernetes skills gap: It’s not just that individuals need to learn more about Kubernetes, but that organizations have to build deep knowledge about the ways that Kubernetes interacts with and manages other aspects of the infrastructure.

What do you need?

Not only does Kubernetes force engineers to build skills outside of their usual area of expertise, it also completely changes the paradigm for many domains, so even experts have to re-learn how to work within Kubernetes.

Let’s talk about storage first. A Kubernetes admin who is debugging or troubleshooting a storage issue needs to understand not just how storage works in a legacy environment, but how Kubernetes connects to and orchestrates storage through persistent volumes and persistent volume claims. Those concepts are specific to Kubernetes, so even an experienced storage specialist would have to re-learn this to successfully manage storage issues on Kubernetes.
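To make the claim concept concrete, here is a minimal Python sketch (not from the original post) that builds a PersistentVolumeClaim manifest and the pod volume entry that references it. The names `app-data` and `data`, the `standard` storage class, and the `10Gi` size are illustrative assumptions; the storage classes available depend on your cluster.

```python
import json


def build_pvc(name, size, storage_class):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume can be mounted read-write by one node.
            "accessModes": ["ReadWriteOnce"],
            # Assumption: a storage class with this name exists in the cluster.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pod_volume_for(claim_name):
    """Return the entry a pod spec lists under `volumes` to use the claim."""
    return {"name": "data", "persistentVolumeClaim": {"claimName": claim_name}}


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be saved
    # to a file and applied as a manifest.
    print(json.dumps(build_pvc("app-data", "10Gi", "standard"), indent=2))
```

The pod never names a disk or storage array directly. It names a claim, and Kubernetes binds that claim to a persistent volume that satisfies it, which is the indirection a storage specialist coming from a legacy environment has to get used to.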

Networking in Kubernetes is different too, and it is another area a Kubernetes administrator needs to understand. Administrators need to know how DNS works within the Kubernetes cluster as well as how to connect the cluster to the organization’s wider network through a CNI (Container Network Interface) plugin. It’s also important to understand how network policies work, what their ramifications are for security as well as resiliency, and what types of policies the organization should enforce.
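As a rough illustration of two of these ideas, the Python sketch below (again, not from the original post) shows how an in-cluster DNS name for a service is typically formed and what a "default deny" ingress network policy looks like as a manifest. The service name `orders` and namespace `shop` are made up, and `cluster.local` is the default cluster domain, which your cluster may override.

```python
import json


def service_dns_name(service, namespace, cluster_domain="cluster.local"):
    """Return the fully qualified in-cluster DNS name of a service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def default_deny_ingress(namespace):
    """Return a NetworkPolicy that blocks all ingress to pods in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules means nothing is allowed in.
            "policyTypes": ["Ingress"],
        },
    }


if __name__ == "__main__":
    print(service_dns_name("orders", "shop"))
    print(json.dumps(default_deny_ingress("shop"), indent=2))
```

A policy like this only has an effect if the cluster’s CNI plugin enforces network policies, which is one reason the administrator needs to understand both pieces together.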

Security for Kubernetes and containers is very different from security in legacy environments. The focus can’t be on maintaining a secure perimeter around the application; instead, it has to be on ensuring container images are free of vulnerabilities, ensuring configurations are as secure as possible, and preventing applications from running with root privileges.
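The last point, keeping applications from running as root, can be checked mechanically. The Python sketch below is one simplified way to do that: it inspects a pod manifest (as a dict) and reports containers where `runAsNonRoot` is not set to true at either the container or the pod level. The helper name and the example pod are hypothetical, and a real admission check would also cover init containers, `runAsUser`, and privileged mode.

```python
def containers_allowing_root(pod):
    """Return names of containers that are not required to run as non-root."""
    spec = pod.get("spec", {})
    # The pod-level securityContext acts as the default for every container.
    pod_default = spec.get("securityContext", {}).get("runAsNonRoot")
    flagged = []
    for container in spec.get("containers", []):
        # A container-level setting takes precedence over the pod-level one.
        setting = container.get("securityContext", {}).get(
            "runAsNonRoot", pod_default
        )
        if setting is not True:
            flagged.append(container["name"])
    return flagged


# Hypothetical pod: "web" opts in to non-root, "sidecar" sets nothing.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "demo"},
    "spec": {
        "containers": [
            {"name": "web", "image": "example/web:1.0",
             "securityContext": {"runAsNonRoot": True}},
            {"name": "sidecar", "image": "example/sidecar:1.0"},
        ]
    },
}

if __name__ == "__main__":
    print(containers_allowing_root(example_pod))  # prints ['sidecar']
```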

The ability to build and operate clusters effectively depends on teams being able to understand not just storage, networking and security in general, but how they relate to Kubernetes specifically. That requires a huge amount of expertise that most individuals and organizations lack.

Exploding complexity

In addition to forcing engineers to develop expertise in a wider set of domains, microservices, containers and Kubernetes also dramatically increase the complexity of the system. Not only do engineers have to become familiar with networking, storage, and security, but they have to handle them for ephemeral containers that are continually spinning up and spinning down. They have to manage monitoring, logging, troubleshooting, and updates for hundreds or thousands of these containers, often on multiple cloud environments as well as on-premises.

Many companies think that because their proof of concept was successful, they’ve figured out how to run Kubernetes in production. Organizations often underestimate the complexity of Kubernetes and containers at scale, and underestimate the amount of both expertise and tooling needed to operate Kubernetes.

Closing the skills gap

Reducing the operational skills gap requires organizations to do two things.

Centralize expertise

Organizations can build small, central teams of Kubernetes experts who are responsible for configuring and operating Kubernetes as well as supporting developers who need assistance, acting as both platform engineers and internal consultants. This reduces the number of people who need to become Kubernetes experts but still gives the organization access to Kubernetes expertise.

Centralize security and infrastructure management

Creating a small team of experts only works if those Kubernetes admins are able to control the entire organization’s Kubernetes infrastructure, ideally through a single platform. This lets the central team handle everything related to Kubernetes for the organization while application development and operations can be safely decentralized to engineers with less Kubernetes expertise. The central platform allows the Kubernetes team to create and enforce governance policies, so that developers don’t need to know the details of how Kubernetes should be configured.

A central and open platform like Nirmata helps central teams automate as much as possible, enforce guardrails on the rest of the engineering organization and overcome the skills gap when it comes to Day 2 operations. To learn more, check out our features video for an overview.
