Guest post originally published on the Elastisys blog by the Elastisys team

The NSA/CISA guidelines summarized, with Elastisys hands-on advice and real-world recommendations.

Kubernetes is now the most popular container orchestration platform. Practically gone are the Mesoses and Docker Swarms of the world, and honestly, I’m not going to miss them. But the downside to its dominant market position is that Kubernetes is also heavily targeted by bad actors that want to compromise its security.

The sad facts: Kubernetes clusters are heavily targeted, and successful compromises are a regular occurrence.

We, as a community, need to do something about this!

The 2022 NSA/CISA Kubernetes Hardening Guide Summarized

The technical report “Kubernetes Hardening Guide”, originally published on August 3, 2021, and updated on March 15, 2022, by the NSA and CISA, is here to help! It is a very useful document for organizations that rely on Kubernetes as a container platform, providing both detailed information and hands-on examples of how to secure the platform. The problem I see with it is the significant gap between the executive summary and the dense, highly detailed information in its 66 pages.

To address this, I will here (a) summarize the tech report’s main takeaway messages, and (b) provide additional insights, based on my personal experience in cloud security. I’ve been working in cloud computing since 2008 in both industry and academia, and have seen the entire evolution of this field. It is my intention that this article is an informative read for decision makers, and that it provides actionable recommendations to be implemented by a team of experts.

Scan containers and Pods for vulnerabilities or misconfigurations

// difficulty: 1
// impact: 4

We love containers because their images are immutable packages of a piece of software and all its dependencies. Immutability is an asset in the sense that the very same container can be subjected to quality assurance processes and get promoted from development to production without any change at all. But it is also a liability, because container images are software time capsules: they do not automatically get updates as new vulnerabilities are discovered.

Scanning container images for known vulnerabilities is a security best practice (although one followed by only 44% of developers). But most just scan images when they are initially pushed to the registry, and this creates a problem: the more stable the application, the less frequently it gets updated, and thus pushed to the registry.

Ironically, stability makes container images more likely to become vulnerable between updates.

As a mitigation, the NSA/CISA report recommends using a Kubernetes Admission Controller, which will request a scan upon Pod deployment. But when you think about it, this suffers from the same problem: if an infrequently updated application is deployed for a long time, this additional deploy-time check will not suitably protect the long-running application.

That is why I strongly recommend putting a process in place to regularly determine which container images are deployed to your cluster, and scanning those images regularly. Just schedule a job to loop over the Pods once a day and have the registry scan the container images in them. This way, your scan results are up to date and accurate. A sketch of such a daily job follows.
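As a minimal sketch, assuming a ServiceAccount named image-scanner with permission to list Pods cluster-wide, a hypothetical job image that bundles kubectl and the Trivy scanner, and that scanning in-cluster with Trivy is an acceptable substitute for triggering a registry-side scan:

```yaml
# Daily job: list every container image currently deployed, then scan
# each one. The ServiceAccount and job image below are illustrative.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-image-scan
spec:
  schedule: "0 3 * * *"                 # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: image-scanner
          restartPolicy: OnFailure
          containers:
            - name: scan
              image: registry.example.com/tools/kubectl-trivy:latest  # hypothetical
              command:
                - /bin/sh
                - -c
                - |
                  kubectl get pods --all-namespaces \
                    -o jsonpath='{.items[*].spec.containers[*].image}' \
                    | tr -s '[[:space:]]' '\n' | sort -u \
                    | while read -r img; do
                        trivy image --severity HIGH,CRITICAL "$img"
                      done
```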

Run containers and Pods with the least privileges possible

// difficulty: 3
// impact: 4

Kubernetes and container runtimes have been very lax with their default security posture since day one. And in a world where two-thirds of insider threats are caused by negligence, letting software or users have overly broad permissions is by definition negligent!

The default user in containers is the system administrator “root” user; you have to manually opt out of that. Kubernetes likewise imposes little to no additional restriction on what the containerized application can do. Thus, if a cyber attack against an application in a Kubernetes container platform is successful, the set of privileges and permissions granted to the actor is very broad.

To mitigate the risks, policies should be put in place to ensure and enforce that:

- containers run as a non-root user,
- container root filesystems are mounted read-only (immutable), and
- privileged mode and privilege escalation are disallowed.

A minimal Pod spec expressing these settings is sketched below.
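As a minimal sketch of these baseline settings (the names and image are illustrative):

```yaml
# A Pod whose securityContext encodes the baseline policies above.
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app                         # illustrative
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0  # illustrative
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                     # any unprivileged UID
        privileged: false
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                      # drop all Linux capabilities
```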

I think it’s a no-brainer that these baseline policies should always be in place. But you probably have more policies, too, ones that are particular to your organization. For this purpose, I whole-heartedly recommend combining the security configuration built into Kubernetes with Open Policy Agent (OPA). With configuration inspired by the official library or third-party policies, it can enforce all manner of policies, upon each request.

Use network separation to control the amount of damage a compromise can cause

// difficulty: 1
// impact: 5

The default networking settings in Kubernetes allow Pods to freely connect to each other, regardless of the namespace they are deployed in. This free-for-all approach means the entire platform is only as secure as its least secure component: a bad actor only needs to get into a single Pod, via whatever weakness, to have unfettered access to all the others. And then, the rest is history.

Kubernetes Network Policies impose configurable limitations on networking. How these are implemented differs depending on the Container Networking Interface (CNI) provider used, but they essentially amount to Kubernetes-resource-aware firewall rules. This makes it easy to specify that only “the backend component” gets to call “the database”, and nothing else. A weakness in your API gateway then no longer means an attack can easily be launched against every component in your platform.
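As a minimal sketch of the backend-to-database example (namespace, labels, and port are illustrative):

```yaml
# Only Pods labeled app=backend may reach the database Pods, and only
# on the database port. All other ingress to the database is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-ingress
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: backend
      ports:
        - protocol: TCP
          port: 5432              # e.g., PostgreSQL
```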

Use firewalls to limit unneeded network connectivity and encryption to protect confidentiality

// difficulty: 1
// impact: 3

Kubernetes container platforms consist of a control plane and a set of worker nodes. The control plane nodes host components that control the entire cluster. A bad actor that manages to take over the control plane can therefore make arbitrary follow-up attacks and fully command the cluster to do their bidding.

A network perimeter defence via firewalls can help mitigate against this type of attack from (external) malicious threat actors. No component of the control plane (Kubernetes API, etcd, controller managers, …) should be more exposed than absolutely necessary to meet the organization’s needs.

Also note that network traffic within Kubernetes clusters is typically not encrypted. This means that sensitive information could be picked up and exploited by software that a bad actor has managed to place inside the container platform. To prevent this class of attacks, all traffic in the cluster can be encrypted. This is a rather trivial and fully transparent change if the cluster uses a CNI provider that offers encryption as a configuration option. For instance, Calico can do this by leveraging WireGuard, as sketched below. I definitely recommend doing that if you cannot trust the underlying network sufficiently for the information security demands you have.
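As a sketch, assuming Calico is the CNI provider and the node kernels support WireGuard, enabling encryption is a one-field change to the default FelixConfiguration (applied with calicoctl):

```yaml
# Enables WireGuard encryption for pod-to-pod traffic in Calico.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: true
```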

Use strong authentication and authorization to limit user and administrator access as well as to limit the attack surface

// difficulty: 2
// impact: 4

The Kubernetes container platform has role-based access control features in its API server. However, for some reason, these must be explicitly activated. Further, typical Kubernetes installations provide a never-expiring system administrator “token” to whoever installed the cluster. Use of this token gives full and perpetual access to the cluster. Guess how I feel about that? 🤯

Although not enabled by default, Kubernetes supports authentication via various methods. Of these, I strongly recommend using OpenID Connect tokens. You can integrate with many identity provider services, and most support emitting such tokens. They can also contain information about which group a user is in, and therefore make it possible to set role-based access control rules on a group level. For identity providers that don’t emit such tokens, Keycloak or Dex can probably bridge the gap.
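As a sketch of what enabling OIDC looks like on a kubeadm-managed cluster (the issuer URL, client ID, and claim names are illustrative and depend on your identity provider):

```yaml
# kube-apiserver flags that turn on OIDC authentication, expressed as
# kubeadm ClusterConfiguration. All values below are illustrative.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    oidc-issuer-url: "https://idp.example.com/realms/main"
    oidc-client-id: "kubernetes"
    oidc-username-claim: "email"
    oidc-groups-claim: "groups"   # enables group-level RBAC rules
```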

And in what can hopefully (and generously) be seen as a misguided attempt to be easy to use, Kubernetes also supports anonymous requests by default. This should, without question, be turned off (set --anonymous-auth=false on the API server).

Role-based access control should be both enabled and configured to adhere to the principle of least privilege. As in, only the smallest set of privileges should be granted to both software and users, and any additional privileges should be granted only after review.
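As a minimal sketch of least-privilege RBAC (the namespace and the “developers” group, which would come from the OIDC groups claim, are illustrative):

```yaml
# A Role that only allows reading Pods in one namespace, bound to a group.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-read-pods
  namespace: production
subjects:
  - kind: Group
    name: developers              # group claim from the OIDC token
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```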

As you can tell, I truly recommend that (a) the administrator token is disabled, (b) OpenID Connect is enabled, (c) anonymous access is disabled, (d) role-based access control is enabled, and (e) you actually restrict permissions as much as possible.

Use log auditing so that administrators can monitor activity and be alerted to potential malicious activity

// difficulty: 3
// impact: 3

The Kubernetes control plane has audit logging capabilities built in. But, again (notice the theme here?), they must be explicitly enabled via configuration. Like the Kubernetes Hardening Guidance technical report, I of course also recommend enabling these, so operators can gain insight into what is happening in their cluster.
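Enabling them means pointing the API server at an audit policy file via the --audit-policy-file flag (and at somewhere to write logs, e.g. --audit-log-path). As a minimal sketch of such a policy:

```yaml
# A small audit policy: record who touched Secrets (but never their
# contents), full request/response for RBAC changes, and request
# metadata for everything else.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  - level: RequestResponse
    resources:
      - group: "rbac.authorization.k8s.io"
  - level: Metadata
```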

However, enabling a stream of very frequent logs (all automated requests against the Kubernetes API also leave an audit trail) merely provides the haystack. Finding the needles in the haystack requires actually parsing and using these logs. This can either be done via filtering expressions in your log storage solution (e.g. OpenSearch, Splunk, or Datadog) or via automated and policy-aware parsing by a system such as the CNCF project Falco. Together with a log handling service, it can act as an automated Security Information and Event Management (SIEM) system. Please do that.

Periodically review all Kubernetes settings and use vulnerability scans to help ensure risks are appropriately accounted for and security patches are applied

// difficulty: 2
// impact: 4

Kubernetes releases new versions of the container platform about three times per year. Security updates are only provided for the current version and the two before it. So, to stay up to date with security, operators must install a new version at least once per year. Preferably, they would follow every new version, as what I would charitably call the rather naive security features of the past are gradually being improved upon.

As I hope I’ve made clear by now: Kubernetes is not secure by default. The amount of disabled security features shows that security is intentionally not a consideration by default. New security threats are constantly being developed. Therefore, the use of automated vulnerability scanning of the entire platform, including the control plane and the worker nodes themselves, is highly recommended. Both by the report and by me.

There’s a catch, though. Does automated testing catch everything? No. Not by a long shot. But it does catch some of the more glaring errors, which, if found by bad actors, indicate that the platform is likely poorly configured in other ways, too. In my experience, failing to take care of even the low-hanging fruit serves as a huge “welcome” sign.

Are automated vulnerability tools sufficient?

Many tools promise automated vulnerability scanning, both of container images, and of the configuration of the Kubernetes cluster or the resources managed within it. These provide an appealing offering, in that they will highlight misconfigurations. But they are limited in scope and functionality. They do not (and technically can not) cover everything that the NSA/CISA Kubernetes Hardening Guidance recommends.

ARMO, a security company, has released Kubescape. It claims to be the first tool for verifying a cluster against the best practices from the NSA/CISA tech report. And indeed, at the time of writing, it does contain a nice set of automated tests for parts of it.

It uses the Kubernetes API and now, as of 2022, also a host-level inspection feature to perform its checks. This gives it access to a considerable amount of configuration that it can inspect. However, limitations exist: it cannot, in the general case, verify whether container image vulnerability scanning policies in the container registry are in force, whether audit logs are automatically vetted, whether firewalls are in place between nodes, or whether the strictest privilege limits for your organization are in place on the virtual machine or cloud level. All of these require more of an understanding of organizational policies and processes than it would be reasonable to expect a tool to have.

Recent updates in 2022 have extended the scope of Kubescape considerably since its inception, e.g. with its own image scanning features, an RBAC visualizer and investigator, assisted remediation, and much more. It also integrates with Google Cloud, so it can glean information from there. Given that ARMO successfully raised $30 million in Series A funding in April 2022, it seems very likely that the breadth and depth of what Kubescape can do will only increase.

(Huge thank you to the team at ARMO for helping me in getting a more up to date understanding of the current state of Kubescape!)

Aqua Security has similarly released kube-bench. It can check how the control plane is configured by inspecting the running processes on a control plane host. Unfortunately, it is unable to check security features that are not part of the Kubernetes configuration.

Therefore the answer is “no”. One cannot merely run automated checks and claim to have perfect (or even a good) security posture. Actual understanding of security policies, and a broader view than merely the cluster itself, is also needed.

Going Beyond the NSA/CISA Kubernetes Hardening Guide

Prevent Misconfiguration, Don’t Just Check for It

// difficulty: 2
// impact: 4

Role-based access control (RBAC) can determine who gets to do what and in what context. But just because a rule says that Lars gets to “update configuration” in the “production environment”, doesn’t mean he can have unfettered access—Lars must also be prevented from making mistakes. After all, two-thirds of all insider threats are due to negligence.

Kubernetes, like most cloud systems, only ships with a system to enforce RBAC, but nothing to enforce reasonable policies that limit what the user can actually do.

Checking after the fact for misconfiguration is a capability offered by some systems; AWS Config is seeing some traction now for this purpose.

But I’d much rather have a system that prevents misconfiguration altogether. Policies should be encoded in an automatically enforceable form. The CNCF project Open Policy Agent (OPA) can do just that. It can act as a Kubernetes Admission Controller and can therefore ensure that policies cannot be violated. OPA is very versatile; you can learn from the official library or pick and choose from other ready-made policies and base your own off of them.
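As a sketch of what this looks like in practice with OPA Gatekeeper, assuming the K8sPSPPrivilegedContainer ConstraintTemplate from the official gatekeeper-library is installed, a constraint like this rejects any privileged Pod at admission time:

```yaml
# Rejects Pods that request privileged mode, cluster-wide.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: deny-privileged-containers
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
```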

Beware: Any Permission Given to an Application is Also Given to Bad Actors

// difficulty: 2
// impact: 4

If a bad actor manages to compromise your application, they will have the exact same permissions as the application. Perhaps this seems obvious. But my experience tells me that this is not taken seriously in practice. The bad actor will have all the capabilities within your Kubernetes container platform, within your network, within your cloud, within your third-party SaaS integrations, within your VPN-connected back-office location—all of the capabilities.

This is how bad stuff like ransomware manages to spread, after all: it infects just one point within a networked application and then keeps going. The chain really is only as strong as its weakest link.

If we truly think this through, it seems ridiculous that your REST API component should have any permissions at all except to process requests and send back responses. And you know what? That’s because it definitely is and always was.

Keep Cloud Resources in Mind, too

// difficulty: 1
// impact: 3

We have all seen the headlines. Whether it’s S3 buckets that have been misconfigured to allow anonymous access or a master key for Microsoft Azure CosmosDB making it possible to access any customer’s database, the message is clear. Whenever we use cloud resources, we must always keep them and their configuration in mind.

There are various controllers in the Kubernetes ecosystem that make cloud integrations simple, and simplicity is great! But simplicity must never be allowed to compromise security. So you have to make sure that these controllers don’t put naive security settings on the resources they manage. You should flat-out reject tools that don’t clearly advertise how they manage security. Non-starters include tools that do not specify what IAM permissions they need and those that do not expose a way to configure which permissions they will put in place.

Does your Application Unintentionally have Permissions in your Cloud?

// difficulty: 2
// impact: 3

Have you given the cloud servers access and permissions to modify your cloud resources? What AWS calls instance profiles (other cloud providers have the same concept under different names) grant permissions to a virtual machine to modify cloud resources. By virtue of running inside that virtual machine, a containerized application can have that, too. All it needs to do is a series of network calls to the cloud’s metadata service and it has the same level of access as you gave the server. Because it is running on the server, the cloud sees it as “the server.”
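As a sketch of how trivial this is on AWS (IMDSv1 shown for brevity; IMDSv2 additionally requires a session token, and the Pod name is illustrative):

```yaml
# Any Pod on a node with an instance profile can ask the metadata
# service which IAM role the node has; appending the role name to the
# URL then returns live, temporary credentials.
apiVersion: v1
kind: Pod
metadata:
  name: imds-demo
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:latest
      command:
        - sh
        - -c
        - curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
```

Mitigations include enforcing IMDSv2 with a metadata hop limit of 1, which keeps Pods in their own network namespace from reaching the endpoint, or blocking the metadata address outright with network policies.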

What I’ve seen over and over is that people will add a few permissions here and there to the server’s instance profile to make whatever they wanted to do work. Create a load balancer, update some DNS records, modify an autoscaling group; that sort of thing. But they only did it to support their intended use case, not with the realization that all unintended use cases would have the same permissions.

Regularly Scan all Deployed Container Images

// difficulty: 3
// impact: 2

Many container image registries support scanning images when they are pushed. This is great! And the NSA/CISA Kubernetes Hardening Guidance recommends that an admission controller request a scan upon deployment.

But what if the image stays deployed for weeks or months because the software is stable? The initial scan weeks ago may have been clean, but a new one today would show vulnerabilities. Oops.

Instead, I am a firm proponent of regular scans of all container images that are actively deployed to your Kubernetes container platforms. Automate this by determining all deployed container image versions and scanning them daily, as sketched earlier in this article.

This means you gain the benefits of up-to-date vulnerability databases against old container images of stable software that gets infrequent updates. If the scan finds an issue in dependencies, you can rebuild the image with fresher dependencies and deploy it with (hopefully) few or no other changes, since the code itself didn’t change.

Regularly Security Test your Entire System

// difficulty: 1
// impact: 4

Your software engineers have something external threats don’t: access to source code. If you also arm them with the time and blessing to security test your system, magic can happen.

During a past project, I fondly remember discovering that creating a certain type of resource in an automated way brought the entire system to a screeching halt. A single laptop could successfully launch a denial-of-service attack on the entire system while doing nothing obviously malicious.

Even if your engineers are not trained in security per se, the main idea is to instill a security-first mindset. And that can be the difference between smooth sailing and a security disaster.

Have a Disaster Recovery (DR) Plan and Practice It

// difficulty: 5
// impact: 5

The number of companies I talk to that think disaster recovery only means “backups” is mind-blowing. Hint: It’s really not. Backups are necessary but not sufficient.

Recovering from a disaster means the ability to stand up your entire tech stack elsewhere within a certain time frame. And while disasters are usually thought to mean the outage of entire cloud regions, I think that a security incident definitely counts as a disaster! Since you can no longer trust your deployed applications, you need to answer the question: how quickly can you destroy your entire infrastructure and get back to where it was before the incident?

Companies that still think that DR just equals “backups” will, when asked the uncomfortable question, often admit that they don’t even regularly try to restore from those. If information technology is at the core of what you do, please take this aspect seriously.

Use an Intrusion Detection System (IDS) and a Security Information and Event Management (SIEM) System

// difficulty: 3
// impact: 3

The Kubernetes Hardening Guidance mentions these, but it doesn’t actually tell you what to do with them or how to use them.

An IDS records and monitors the normal behavior of applications and constantly checks activity against these baselines. If an application starts to behave in new ways, it’s a possible sign that it has been exploited by a bad actor. For instance, if it starts to attempt reading or writing files when it usually doesn’t, that’s a pretty good sign—it’s not like it started doing that on its own!

The CNCF project Falco is also here to help you follow this guidance. Specifying rules is, of course, cumbersome, but it is essential to provide the guardrails your application needs. There are community-provided rules you can start off with, and a sketch of one follows.
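As a small sketch, building on macros such as container, spawned_process, and shell_procs that ship with Falco’s default ruleset:

```yaml
# Fires whenever a shell is started inside a container, a common
# sign that a bad actor has gained a foothold.
- rule: Shell spawned in container
  desc: Detect a shell process started inside a container
  condition: container and shell_procs and spawned_process
  output: "Shell in container (user=%user.name container=%container.name cmdline=%proc.cmdline)"
  priority: WARNING
```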

Falco can, in combination with for example Elasticsearch, inspect your (audit) logs and in that way act as a SIEM. I recommend using it in this way, too. If you already have a different system in place, then by all means use that. But use something: many regulations these days require that you inform users about data breaches, so you really want a system that helps manage security information and events. The amount of security log data is too great to process manually, especially if you are currently under attack.

Information security is not a one-time box to tick—it’s an ongoing process. The threats are constantly evolving, so the responses must, as well. By putting guardrails into our platforms and constantly striving for giving only the least amount of privileges to our applications and servers, we can reduce our attack surface. And that is especially important in the cloud.

The inherent complexities and dynamic nature of the cloud offer many places for a bad actor to both carry out attacks and to hide. It’s up to us to limit those opportunities.

Closing thoughts

Kubernetes is neither secure by default, nor by itself. You absolutely can, and must, harden its configuration. The so-called “managed Kubernetes services” offered by cloud providers put very little, if any, of the advice contained in this article in place for you. And they certainly don’t add any of the other tools needed for defense in depth. Nor are they interested in doing so, as per their “shared responsibility models” (here’s AWS’, without the intention of singling them out), in which they are responsible for security of the cloud, but you are responsible for security in the cloud.

Please keep this in mind as you embark on your cloud-native journey.