Originally published on Alcide Blog by Nitzan Niv

In the security world, one of the most established methods of identifying that a system has been compromised, abused or misconfigured is to collect logs of all the activity performed by the system’s users and automated services, and to analyze these logs.

Audit logs as a security best practice

In general, audit logs are used in two ways:

  1. Proactively identifying non-compliant behavior. Based on a configured set of rules that faithfully capture violations of the organization’s policies, an investigator finds audit log entries proving that non-compliant activity has taken place. With automated filters, a collection of such alerts is periodically reported to compliance investigators (a minimal example of such a filter is sketched after this list).
  2. Reactively investigating a specific operational or security problem. A known problem is traced back to the responsible party, root causes or contributing factors through a post-mortem investigation, following the chain of associations from the observed state back to the action that caused it and the state that preceded it.
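To make the proactive use case concrete, here is a minimal sketch of rule-based filtering over audit entries. The entry fields and the two rules are illustrative assumptions, not any particular product’s log format or policy.

```python
# Minimal sketch of proactive, rule-based audit filtering.
# The entry fields ("user", "action", "resource", "hour") and the rules
# themselves are illustrative assumptions, not a specific log format.
from typing import Callable, Dict, List

AuditEntry = Dict[str, str]

# Each rule maps an entry to True when it violates a policy.
RULES: Dict[str, Callable[[AuditEntry], bool]] = {
    "secret-read-by-non-admin": lambda e: (
        e["action"] == "read" and e["resource"] == "secret" and e["user"] != "admin"
    ),
    "after-hours-delete": lambda e: (
        e["action"] == "delete" and not (9 <= int(e["hour"]) < 18)
    ),
}

def find_violations(entries: List[AuditEntry]) -> List[str]:
    """Return a human-readable alert for every rule an entry violates."""
    alerts = []
    for entry in entries:
        for name, violates in RULES.items():
            if violates(entry):
                alerts.append(f"{name}: {entry}")
    return alerts

if __name__ == "__main__":
    sample = [
        {"user": "alice", "action": "read", "resource": "secret", "hour": "11"},
        {"user": "bob", "action": "delete", "resource": "pod", "hour": "23"},
    ]
    for alert in find_violations(sample):
        print(alert)
```

The hard part in practice is not running such a filter but writing rules that actually cover the organization’s policies, which is where the limitations discussed below come in.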

Kubernetes audit logs

Let us examine how audit logs are configured and used in the Kubernetes world, what valuable information they contain, and how they can be utilized to enhance the security of the Kubernetes-based data center.

The Kubernetes audit log is intended to enable the cluster administrator to forensically recover the state of the server and the series of client interactions that resulted in the current state of the data in the Kubernetes API.

In technical terms, Kubernetes audit logs are detailed descriptions of each call made to the Kubernetes API-Server. This Kubernetes component exposes the Kubernetes API to the world. It is the central touch point that is accessed by all users, automation, and components in the Kubernetes cluster. The API server implements a RESTful API over HTTP and performs all the API operations.
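To make this concrete, the snippet below parses a single, abridged audit event of the kind the API server emits in JSON form and extracts the fields an auditor typically looks at first: who did what, to which resource, from where, and with what result. The sample values are invented, and the exact set of fields recorded depends on the configured audit level.

```python
# Parse one (abridged, invented) Kubernetes audit event and print the fields an
# auditor typically looks at first: who did what, to which resource, from where.
import json

SAMPLE_EVENT = """
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "7c3f2a9e-0000-0000-0000-000000000000",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/pods",
  "verb": "list",
  "user": {"username": "jane", "groups": ["system:authenticated"]},
  "sourceIPs": ["10.0.0.12"],
  "objectRef": {"resource": "pods", "namespace": "default"},
  "responseStatus": {"code": 200},
  "requestReceivedTimestamp": "2019-03-01T10:15:00.000000Z"
}
"""

event = json.loads(SAMPLE_EVENT)
print(
    f'{event["requestReceivedTimestamp"]} '
    f'{event["user"]["username"]} {event["verb"]} '
    f'{event["objectRef"]["resource"]} in {event["objectRef"].get("namespace", "-")} '
    f'from {",".join(event["sourceIPs"])} -> {event["responseStatus"]["code"]}'
)
```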

Upon receiving a request, the API server processes it through several steps:

  1. Authentication: establishing the identity associated with the request (a.k.a. principal). There are several mechanisms for authentication.
  2. RBAC/Authorization: the API server determines whether the identity associated with the request is allowed to perform the requested verb on the requested HTTP path; in most clusters this is governed by RBAC roles and bindings. If the identity has an appropriate role, the request is allowed to proceed.
  3. Admission Control: determines whether the request is well formed and potentially applies modifications to the request before it is processed.
  4. Validation: ensures that a specific resource included in a request is valid.
  5. Perform the requested operation. The types of supported operations include (two of them, list and watch, are illustrated in the client sketch after this list):
    • Create a resource (e.g. pod, namespace, user-role)
    • Delete a resource or a collection of resources
    • List resources of a specific type (e.g. pods, namespaces), or get a detailed description of a specific resource
    • Open a long-running connection to the API server, and through it to a specific resource. Such a connection is then used to stream data between the user and the resource. For example, this enables the user to open a remote shell within a running pod, or to continuously view the logs of an application that runs inside a pod.
    • Watch a cluster resource for changes.
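As a small illustration, the sketch below performs two of these operation types, list and watch, using the official Kubernetes Python client. It assumes a local kubeconfig with access to a cluster that has pods in the default namespace; every one of these calls would appear in the audit log as a request made by the client’s authenticated identity.

```python
# Minimal sketch of two API operation types, list and watch, using the official
# Kubernetes Python client (pip install kubernetes). Assumes a local kubeconfig
# with access to an existing cluster; every call below is recorded by the
# API server's audit machinery as a request made by this client's identity.
from kubernetes import client, config, watch

config.load_kube_config()          # authenticate using the local kubeconfig
v1 = client.CoreV1Api()

# "list" operation: enumerate pods in the default namespace.
for pod in v1.list_namespaced_pod(namespace="default").items:
    print("pod:", pod.metadata.name)

# "watch" operation: stream pod changes for a short while.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="default", timeout_seconds=10):
    print(event["type"], event["object"].metadata.name)
```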

A request and its processing steps may be stored in the Kubernetes audit log. The API server can be configured to record all or only some of these requests, with varying degrees of detail. The audit policy configuration also determines where the audit log is delivered, typically to a log file on the API server’s host or to an external webhook backend, and analysis tools can consume the logs through either of these backends.
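For illustration, here is a small audit policy of the kind the API server consumes through its --audit-policy-file flag, with the log destination set by --audit-log-path or an audit webhook configuration. The specific rules are an example rather than a recommended policy, and the policy is embedded as a YAML string in Python only to keep this article’s examples in a single language.

```python
# An illustrative audit policy: full request/response bodies for RBAC changes,
# metadata only for secrets (so their contents never reach the log), and
# metadata for everything else. The API server would read this file via
# --audit-policy-file; the destination is set with --audit-log-path or an
# audit webhook. Parsed with PyYAML (pip install pyyaml) just for illustration.
import yaml

POLICY_YAML = """
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived
rules:
  - level: RequestResponse
    resources:
      - group: rbac.authorization.k8s.io
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  - level: Metadata
"""

policy = yaml.safe_load(POLICY_YAML)
print(f'{policy["kind"]} with {len(policy["rules"])} rules')
```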

Challenges of auditing a Kubernetes cluster

While the principles of audit log collection and analysis naturally apply to the cloud, and specifically to data centers built on Kubernetes, in practice the scale, the dynamic nature and the implied context of such environments make analyzing audit logs difficult, time-consuming and expensive.

The scale of the activities in the cluster means that any analysis relying solely on manual inspection of many thousands of daily log entries is impractical. A “quiet” cluster, with no human-initiated actions and no major shifts in application activity, still processes thousands of API calls per hour, generated by the internal Kubernetes mechanisms that keep the cluster alive, ensure its resources are utilized according to the specified deployments, and automatically detect and recover from failures. Even with log filtering tools, the auditor needs a lot of experience, intuition and time to be able to zoom in on a few interesting entries.
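A first-pass summary like the sketch below, which counts events per identity and verb over a JSON-lines audit log (the file path is a placeholder), typically shows system service accounts and controllers dwarfing human activity, which is exactly the signal-to-noise problem the auditor faces.

```python
# First-pass summary of a JSON-lines audit log: count events per (user, verb).
# In a typical cluster, system identities (system:serviceaccount:..., nodes,
# controllers) dominate these counts, illustrating why manual review does not scale.
# "audit.log" is a placeholder path for a log file written by the API server.
import json
from collections import Counter

counts = Counter()
with open("audit.log") as f:
    for line in f:
        if not line.strip():
            continue
        event = json.loads(line)
        user = event.get("user", {}).get("username", "<unknown>")
        counts[(user, event.get("verb", "<none>"))] += 1

for (user, verb), n in counts.most_common(20):
    print(f"{n:8d}  {user}  {verb}")
```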

The dynamic nature of a system like a Kubernetes cluster means that workloads are being added, removed or modified at a fast pace. It’s not a matter of an auditor focusing on access to a few specific workloads containing a database – it’s a matter of identifying which workloads contain a sensitive database at each particular instant in the audited time period, which users and roles had legitimate reason to access each of these database-workloads at what times, and so on.

Furthermore, while finding some kinds of interesting results is just a matter of locating the specific entries in the log that are known in advance to correlate with undesirable activity, finding suspicious but previously unknown activity in the logs requires a different set of tools and skills, especially if the suspicious behavior can only be understood from a wider context over a prolonged period, and not from just one or two related log entries. For example, it is rather simple to detect a user’s failure to authenticate to the system, since each login attempt shows up as a single log entry. A potential theft of user credentials, however, may only be detected if the auditor connects seemingly unrelated entries into a pattern: for example, access to the system using a specific user’s credentials from a previously unknown Internet address outside the organization, while the same credentials are concurrently used to access the system from within the organization’s network.
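A sketch of that kind of correlation: group audit events by username and flag cases where, within a short window, the same credentials appear both from inside the corporate network and from an external address. The corporate address range, the ten-minute window and the event shape are assumptions chosen for illustration.

```python
# Sketch of correlating audit entries into a credential-theft pattern: the same
# username seen, within a short time window, both from inside the corporate
# network and from an outside address. The corporate CIDR, the 10-minute window
# and the event shape are illustrative assumptions.
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network

CORPORATE_NET = ip_network("10.0.0.0/8")      # assumed internal address range
WINDOW = timedelta(minutes=10)

def is_internal(ip: str) -> bool:
    return ip_address(ip) in CORPORATE_NET

def suspicious_concurrent_use(events):
    """events: iterable of dicts with 'user', 'sourceIP' and an ISO 'timestamp'."""
    by_user = {}
    alerts = []
    for e in sorted(events, key=lambda e: e["timestamp"]):
        ts = datetime.fromisoformat(e["timestamp"])
        history = by_user.setdefault(e["user"], [])
        for prev_ts, prev_ip in history:
            if ts - prev_ts <= WINDOW and is_internal(prev_ip) != is_internal(e["sourceIP"]):
                alerts.append((e["user"], prev_ip, e["sourceIP"], e["timestamp"]))
        history.append((ts, e["sourceIP"]))
    return alerts

if __name__ == "__main__":
    sample = [
        {"user": "jane", "sourceIP": "10.0.0.12", "timestamp": "2019-03-01T10:00:00"},
        {"user": "jane", "sourceIP": "203.0.113.7", "timestamp": "2019-03-01T10:04:00"},
    ]
    for alert in suspicious_concurrent_use(sample):
        print("possible credential theft:", alert)
```

The check itself is cheap; the expensive part is deciding which entries belong together in the first place, which is exactly where automation has to help the auditor.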

Making log auditing a viable practice again

In order to make the audit of a large, complex Kubernetes cluster a viable practice, we need to adapt the auditor’s tools to this environment. Such tools would automatically and proactively identify anomalous and problematic behavior, specifically in the context of Kubernetes control-plane security, in order to:

  1. Detect security-related abuse of the Kubernetes cluster, especially behavior that can only be detected by observing extended context over multiple activities.
  2. Focus compliance investigations on Kubernetes misuses that are beyond detection by simple log filtering rules.

Of course, in order to achieve these goals, such a tool must be able to:

  1. Automatically analyze Kubernetes audit logs, detecting anomalous behavior of users and automated service accounts as well as anomalous access to sensitive resources (a simple baseline sketch of this idea follows this list).
  2. Summarize the detected anomalies, as well as important trends and statistics of the audit information, in a user-friendly way. At the end of the day, the auditor should have enough information that she can understand, qualify or ignore the results of the automatic analysis.
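As a simple illustration of the first requirement, the sketch below builds a per-identity baseline of (verb, resource) pairs from historical events and flags combinations never seen before for that identity. A real analyzer would weigh far more signals than this.

```python
# Simple per-identity baseline: remember which (verb, resource) pairs each user
# has performed during a learning window, then flag first-seen combinations in
# new audit events. A real analyzer would weigh many more signals than this.
from collections import defaultdict

def build_baseline(history_events):
    baseline = defaultdict(set)
    for e in history_events:
        baseline[e["user"]].add((e["verb"], e["resource"]))
    return baseline

def flag_new_behavior(baseline, new_events):
    for e in new_events:
        if (e["verb"], e["resource"]) not in baseline[e["user"]]:
            yield e

if __name__ == "__main__":
    history = [{"user": "jane", "verb": "list", "resource": "pods"}]
    new = [{"user": "jane", "verb": "delete", "resource": "secrets"}]
    for e in flag_new_behavior(build_baseline(history), new):
        print("anomalous for", e["user"], ":", e["verb"], e["resource"])
```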

The more complex threat scenarios that we would like the envisioned audit log analyzer to detect automatically are of the kind described above: abuse that only becomes visible when multiple, seemingly unrelated log entries are linked together over an extended period, such as the concurrent use of one user’s credentials from inside and outside the organization.

Obviously, detection of such scenarios is well beyond simple filtering of the audit log using predetermined rules. Sophisticated machine learning algorithms must be used to achieve this automatically, at scale, and within a reasonable time after the threatening activity actually occurs.

The analyzer tool should collect features of the operational and security activity of the cluster as they appear in the log and feed them to the machine learning algorithm, which measures, weights and links them together. At the end of this process, possible patterns of suspicious behavior can be presented to the auditor for validation and further investigation.
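One possible shape for that pipeline is sketched below: each identity’s recent activity is reduced to a numeric feature vector, and an off-the-shelf anomaly detector (scikit-learn’s IsolationForest, standing in for whatever algorithm a real analyzer would use) scores how unusual each identity looks. The chosen features, including the assumed response-code field, are illustrative.

```python
# Sketch of the analysis pipeline: per-identity feature vectors extracted from
# audit events, scored by an off-the-shelf anomaly detector. IsolationForest
# (scikit-learn) stands in here for whatever algorithm a real analyzer would
# use; the features (and the assumed "code" field) are illustrative.
from collections import defaultdict
from sklearn.ensemble import IsolationForest

def extract_features(events):
    """Aggregate events into one feature vector per user:
    [total requests, distinct verbs, distinct resources, failed requests]."""
    per_user = defaultdict(lambda: {"total": 0, "verbs": set(), "resources": set(), "failed": 0})
    for e in events:
        stats = per_user[e["user"]]
        stats["total"] += 1
        stats["verbs"].add(e["verb"])
        stats["resources"].add(e["resource"])
        if e.get("code", 200) >= 400:
            stats["failed"] += 1
    users = sorted(per_user)
    matrix = [
        [per_user[u]["total"], len(per_user[u]["verbs"]),
         len(per_user[u]["resources"]), per_user[u]["failed"]]
        for u in users
    ]
    return users, matrix

def score_users(events):
    users, matrix = extract_features(events)
    model = IsolationForest(contamination="auto", random_state=0).fit(matrix)
    # Lower scores mean more anomalous; surface those identities to the auditor.
    return sorted(zip(model.score_samples(matrix), users))

if __name__ == "__main__":
    sample = [
        {"user": "jane", "verb": "list", "resource": "pods", "code": 200},
        {"user": "jane", "verb": "get", "resource": "pods", "code": 200},
        {"user": "svc-backup", "verb": "get", "resource": "secrets", "code": 403},
    ]
    for score, user in score_users(sample):
        print(f"{score:7.3f}  {user}")
```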

Summary

Auditing of system logs is a well-established practice for identifying security threats to a system, whether before these threats have a detrimental effect or after the threat is realized. In fact, some forms of periodic audit are mandated by laws and regulations.

Nevertheless, identifying suspicious patterns in the audit logs of complex, large-scale, dynamic systems like modern-day Kubernetes clusters is a formidable task. While some sort of automation is mandatory for such an analysis, most existing audit tools are just mindless filters, hardly assisting the auditor in the deeper challenges of her task.

In this article we presented a vision for an automated Kubernetes audit log analyzer that goes far beyond that. Using machine learning, such a tool can autonomously detect potentially threatening patterns in the log for the auditor to focus on, even in real time. Additionally, summarizing the information in the audit log in a user-digestible way lets the auditor quickly validate the identified patterns and helps her investigate additional hidden suspicious activities.