An application, composed of one or more containers as dictated by system architecture, that operates either independently or as part of a distributed collaboration, interacting with at least one other entity (container) or achieving quorum-based consensus. It leverages AI or Machine Learning capabilities to reason and execute actions within event-driven systems, where behavior is triggered or modulated by signals. Its defining attributes encompass differing levels of autonomy in executing system or user tasks, coupled with the ability to plan, orchestrate, and govern the continuation or completion of its own execution. In cloud native environments, these components are commonly packaged and deployed as containerized microservices.

Overview

Within the cloud native ecosystem, there has been an explosion in agentic AI. Rapid prototyping and adoption suggest that organizations across all sectors and technology verticals can accelerate time to value for their products and services. While this interest is very promising for a burgeoning field, the space currently lacks standardization and interoperability.

Agentic systems provide the means to perform multi-hop reasoning and to invoke actions in response to signals, augmenting conventional programming sequences with dynamism.

This paper explores four key areas where standardization is needed to ensure interoperability, security, and observability from the outset. The focus of this document is not on how specific agentic protocols are implemented, which programming languages are used, or their execution efficiency. Instead, it provides an agnostic view of best practices that enable deployments in this space to scale securely while remaining observable and explainable through a common foundational framework.

The recommendations described are exclusively focused on cloud native environments built on Kubernetes. This extends to scenarios where Kubernetes is deployed in public, private, hybrid, or edge compute settings, as each of these environments carries its own security nuances.

This document provides a foundational checklist for agentic standards; it is not intended to be exhaustive and will continue to evolve as practices and tooling improve.

General

This section outlines foundational container and observability best practices for cloud native workloads, including agentic AI systems. Evolving challenges include the rapid advancement of agent environments and capabilities, which require governance frameworks to adapt continuously. Future research and standardization efforts should focus on nuanced reward functions, layered reasoning architectures with built-in controls, and robust safety and alignment techniques to manage increasingly capable and autonomous systems.

General best practices for containers:

The containerization principles outlined below accommodate emerging definitions of agentic services (e.g., autonomous, signal-driven, reasoning-capable container systems deployed within microservice-oriented architectures). However, the recommendations are not specific to agentic use cases and apply to any containerized or serverless environment.

General best practices include: Security, which covers minimizing attack surface and safeguarding container integrity; Observability, which focuses on collecting actionable metrics, logs, and traces to understand system behavior; and Availability and Fault Tolerance, which outlines strategies for maintaining service continuity and resilience under failure conditions.
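As an illustrative companion to the practices above, the sketch below expresses a hardened, probe-equipped container definition as plain Python data. The field names follow the Kubernetes container API, but the image, port, resource values, and probe paths are placeholder assumptions chosen for illustration, not recommendations for any specific workload.

```python
# Illustrative sketch: a container spec as a plain dict, mirroring common
# security practices (non-root user, read-only root filesystem, dropped
# capabilities) and availability practices (resource limits, liveness and
# readiness probes). Values here are placeholders, not prescriptions.

def hardened_container_spec(image: str) -> dict:
    return {
        "name": "agent",
        "image": image,
        "securityContext": {
            "runAsNonRoot": True,
            "readOnlyRootFilesystem": True,
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
        "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "1", "memory": "512Mi"},
        },
        "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
        "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
    }

spec = hardened_container_spec("registry.example.com/agent:1.0")
```

The security context fields minimize attack surface, the probes let the platform detect and replace unhealthy instances, and the resource limits keep a misbehaving agent from starving its neighbours.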

Security

Observability

Availability and fault tolerance (general)

NOTE: The above items are general in nature; while they apply to smart load balancing for inference models, they do not cover more comprehensive MCP, Agent-to-Agent, or LLM tooling.

Availability and fault tolerance (inference specific)

Sample request flow with Kubernetes Gateway API Inference Extensions InferencePool endpoints running a model server framework

Source: https://gateway-api-inference-extension.sigs.k8s.io/

PLEASE NOTE: The General section is not an exhaustive list of every best practice; rather, it is included as a primer on adjacent foundational topics relevant to the main body of this document, which focuses on agents. Links to more exhaustive overviews of these topics and practices can be found in the footnotes. Additional current literature, white papers, and documentation from the CNCF and Linux Foundation should also be consulted to ensure the best decisions are made in this constantly moving technology space.

Footnotes:

Control and communication

Microservice architectures have long followed the principles of information hiding, minimal endpoint exposure limited to requisite services, and clear contract-based communication. The same principles should be followed in agentic architectures. As multi-agent systems grow, the intricacy of ensuring effective coordination and communication rises significantly. While Kubernetes provides the platform, the complexities of inter-agent communication protocols (such as MCP and A2A) and the management of "tool sprawl" require specific attention within the agent application layer itself. Furthermore, pre-empting the unpredictable nature of agent behaviour requires increased operational rigor and a focus on secure communications.
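The contract-based communication principle above can be made concrete with a minimal sketch: every inbound inter-agent message is validated against an explicit, versioned schema before the agent acts on it, and unknown fields are rejected to keep the exposed surface minimal. The field names and schema below are assumptions for illustration, not part of any standardized protocol.

```python
# Illustrative sketch of contract-based inter-agent messaging: validate a
# message against an explicit schema before acting on it. The required
# fields below are invented for illustration.

REQUIRED_FIELDS = {"schema_version": str, "sender": str,
                   "intent": str, "payload": dict}

def validate_message(msg: dict) -> list:
    """Return a list of contract violations; an empty list means accepted."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in msg:
            errors.append(f"missing field: {name}")
        elif not isinstance(msg[name], expected_type):
            errors.append(f"wrong type for {name}")
    # Reject unknown fields to keep the endpoint surface minimal.
    for name in msg:
        if name not in REQUIRED_FIELDS:
            errors.append(f"unexpected field: {name}")
    return errors

ok = validate_message({"schema_version": "1.0", "sender": "planner",
                       "intent": "summarize", "payload": {"doc_id": "42"}})
bad = validate_message({"sender": "planner", "intent": 7, "debug": True})
```

Rejecting rather than silently ignoring malformed or over-broad messages is what makes the contract enforceable as the number of agents grows.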

Communication related attributes

The list below provides a non-exhaustive overview of communication-related attributes that should be considered when creating Agent-to-X deployments (where “X” may refer to tools, services, models, or other agents).

It should be noted that this section is subject to a high rate of revision, largely because numerous protocol specifications have yet to be formally adopted and standardized in the field.

Orchestration flow, safety, and fault tolerance

Tools and services

Agent connectivity to AI Models

Agents to other Agents

Filtering and input/output schema validation

Protocols today (MCP, A2A, etc.)

Authentication and Authorization Focused Protocols

To address the problem space of workload and task-based security, a number of further mechanisms are needed to establish the required trust boundaries. Several options exist in the cloud native space today, focused on both workload security and task-based authorization.
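One widely used workload-security mechanism in this space is SPIFFE-style workload identity. The sketch below illustrates only the shape of a trust-boundary check on such an identity (`spiffe://<trust-domain>/<path>`); the trust domain and allow-list are invented for illustration, and a real deployment would obtain and verify identities through an attested workload API (e.g., SPIRE) rather than string parsing alone.

```python
# Illustrative sketch: checking a SPIFFE-style workload identity against a
# trust domain and path allow-list. The domain and prefixes are assumptions.
from urllib.parse import urlparse

ALLOWED_TRUST_DOMAIN = "agents.example.org"
ALLOWED_PATH_PREFIXES = ("/ns/agents/", "/ns/tools/")

def is_trusted(spiffe_id: str) -> bool:
    parts = urlparse(spiffe_id)
    if parts.scheme != "spiffe" or parts.netloc != ALLOWED_TRUST_DOMAIN:
        return False
    return parts.path.startswith(ALLOWED_PATH_PREFIXES)

inside = is_trusted("spiffe://agents.example.org/ns/agents/sa/planner")
outside = is_trusted("spiffe://evil.example.org/ns/agents/sa/planner")
```

Scoping trust to both a domain and a namespace path keeps a compromised workload in one tenant from presenting itself as a peer in another.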

Message and communication design considerations with REST, gRPC & Kafka

Discovery/agent registries

Footnotes:

Observability

Observability in the context of agentic microservices manifests in a number of ways. At the base level, general container health (as described in the earlier General section) ensures that CPU, memory, and GPU resources are sufficient to perform the requisite functions of the service.

Observability metrics for agentic services extend beyond basic container health in a number of ways. Metrics can be used to identify the precision of requests handled, the time taken to complete a particular task, the dwell time per function in a multi-agent architecture, and even to compare whether a given tool is served more efficiently by Service A or Service B.
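The Service A versus Service B comparison above can be sketched with a minimal latency aggregator. Metric, service, and tool names here are assumptions for illustration; in practice these samples would come from a metrics pipeline such as Prometheus rather than in-process lists.

```python
# Illustrative sketch: aggregating per-tool latency samples so the same tool
# exposed by two services can be compared. Names are invented placeholders.
from collections import defaultdict
from statistics import mean

samples = defaultdict(list)  # (service, tool) -> list of seconds

def record(service: str, tool: str, seconds: float) -> None:
    samples[(service, tool)].append(seconds)

def mean_latency(service: str, tool: str) -> float:
    return mean(samples[(service, tool)])

record("service-a", "search", 0.42)
record("service-a", "search", 0.38)
record("service-b", "search", 0.91)

faster = min(("service-a", "service-b"),
             key=lambda s: mean_latency(s, "search"))
```

The same shape generalizes to task-completion time or per-function dwell time: label each sample with the agent, tool, and step, then aggregate along whichever dimension the comparison needs.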

Metrics

The use of observability traces allows for an end-to-end waterfall view of the communications that take place between microservices, including agents. In the context of agentic microservice deployments, deploying the right levels of traceability supports a clear and concise view of the communication flows between agents, databases, and other ancillary components that make up the end-to-end application flow. Traceability is becoming an important consideration for agentic architectures as a means to support regional explainability mandates (e.g., the EU AI Act).
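The waterfall view described above rests on one simple invariant: every hop creates a child span that shares the trace identifier of the whole request while recording its own parent. In practice OpenTelemetry provides this; the dataclass below is only an illustrative sketch of the data shape, with field names chosen to mirror common tracing conventions.

```python
# Illustrative sketch of trace context propagation between agents: a child
# span keeps the trace_id and records its parent's span_id, which is what
# lets a backend reassemble the end-to-end waterfall.
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4

@dataclass
class Span:
    name: str
    trace_id: str = field(default_factory=lambda: uuid4().hex)
    span_id: str = field(default_factory=lambda: uuid4().hex)
    parent_id: Optional[str] = None

    def child(self, name: str) -> "Span":
        # Same trace, new span, linked back to this span as parent.
        return Span(name=name, trace_id=self.trace_id,
                    parent_id=self.span_id)

root = Span("user-request")
planner = root.child("planner-agent")
tool_call = planner.child("tool-call")
```

For explainability purposes, attaching the agent's decision context (which tool was chosen and why) as span attributes turns the same trace into an audit trail.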

Traces and spans

Logs

Continuous monitoring and adaptive control

Footnotes:

Governance

This section defines the critical governance mechanisms necessary to ensure the responsible, reliable, and secure operation of LLM-based multi-agent systems within a Kubernetes ecosystem. Effective governance spans the entire lifecycle of an agent, from initial development and pre-deployment validation to continuous monitoring and adaptation in production.

Agentic governance foundations

Before an agent is deployed to production, critical steps and methodologies should be applied to ensure its fitness for purpose, including adherence to defined policies and robustness against known failure modes.

Evaluation should look beyond siloed metrics such as task completion accuracy or job completion time, and consider multifaceted attributes as part of a comprehensive assessment.

Evaluation approach

The instrumentation of testing structures that allow for the correct levels of evaluation is a key requirement, both for benchmarking what a viable target evaluation state should be and for detecting where performance has improved or regressed. Setting clear structures and evaluation criteria through a flexible framework ensures that the system is tailored to the needs of the target application and use case being evaluated.
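A minimal sketch of such a multifaceted evaluation is a weighted composite score compared against a benchmarked baseline, so that a regression in one attribute (e.g., safety) is visible even when another (e.g., latency) improves. The attribute names, weights, and tolerance below are assumptions chosen for illustration.

```python
# Illustrative sketch: combining several normalized evaluation attributes
# into one composite score and flagging regressions against a baseline.
# Attribute names, weights, and tolerance are invented placeholders.

WEIGHTS = {"task_accuracy": 0.4, "latency_score": 0.2,
           "safety_score": 0.3, "cost_score": 0.1}

def composite_score(metrics: dict) -> float:
    """Each metric is assumed to be normalized to [0, 1] before scoring."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

def regressed(current: dict, baseline: dict, tolerance: float = 0.05) -> bool:
    return composite_score(current) < composite_score(baseline) - tolerance

baseline = {"task_accuracy": 0.90, "latency_score": 0.80,
            "safety_score": 0.95, "cost_score": 0.70}
current = {"task_accuracy": 0.85, "latency_score": 0.82,
           "safety_score": 0.90, "cost_score": 0.75}
```

Keeping the weights explicit and versioned is part of the "clear structures" requirement: a change to the evaluation criteria is then itself reviewable, rather than implicit in the test code.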

Synthetic data generation for test execution

Granular and trajectory based assessment

Veracity of testing (precision of execution)

This sub-section addresses ongoing governance requirements for agents once they are deployed, ensuring continuous safety, performance, and compliance.

Data privacy and minimization

Explainability and auditability of agent decisions

Beyond traditional Kubernetes self-healing, design agentic applications to inherently handle failures without cascading impact. This involves implementing agent-level retry logic, circuit breakers, and graceful degradation strategies specific to agent communication and tool interactions, ensuring the overall system remains resilient even when individual agents or external dependencies experience transient failures.
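A minimal sketch of the circuit breaker and graceful degradation pattern described above: after a threshold of consecutive failures the call is short-circuited and a fallback answer is returned instead of cascading the failure to downstream agents. The threshold and fallback value are assumptions for illustration.

```python
# Illustrative sketch of an agent-level circuit breaker: repeated failures
# open the circuit, after which calls are short-circuited to a fallback
# rather than hammering a failing tool or agent dependency.
from typing import Callable

class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, fn: Callable[[], str], fallback: str) -> str:
        if self.failures >= self.max_failures:
            return fallback          # circuit open: degrade gracefully
        try:
            result = fn()
        except Exception:
            self.failures += 1       # count the failure toward opening
            return fallback
        self.failures = 0            # success closes the circuit again
        return result

breaker = CircuitBreaker(max_failures=2)

def flaky_tool() -> str:
    raise TimeoutError("tool unavailable")

answers = [breaker.call(flaky_tool, fallback="cached-answer")
           for _ in range(3)]
```

A production implementation would also add a half-open state with a retry timer; the point here is only that the failure is absorbed at the agent boundary instead of propagating.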

Footnotes:

Security

This section defines the security considerations involved in building agentic systems. Three primary goals should guide the design: authentication, authorization, and trust. Agents and their components must be able to securely authenticate and should only be granted the minimum permissions necessary to function. Trust boundaries must be clearly established to prevent privilege escalation, data leakage, or unauthorized behavior within the system.

Keeping these goals in mind, user access, agent identity, tenancy, and data access must be deliberately designed and enforced. Each agent should have a unique, verifiable identity to support traceability, accountability, and secure communication across system boundaries. Strong tenancy isolation is critical, especially in multi-tenant environments, to prevent cross-agent interference and ensure that agents operate within their own scoped contexts. Finally, access to data must be controlled by explicit policies that define which data an agent can access, under what conditions, and for how long.

Agent identity

Identity management for AI agents must go beyond simply extending the user’s identity. In a zero-trust architecture, both user identity and agent (workload) identity must be authenticated, authorized, and isolated by clear trust boundaries.

When building agents, it’s critical to evaluate whether user identity propagation is sufficient for use cases such as short-lived, user-initiated tasks or whether the agent needs a dedicated identity. Agents that act autonomously, operate outside the user’s permission scope, or persist beyond the user session require a distinct identity to ensure secure, auditable, and least-privilege access to data and tools.

When to use user identity alone?

Source: Diagram created by the authors using Excalidraw.

If an agent’s existence and capabilities are strictly tied to the user being actively logged in or connected, the user identity alone is sufficient as the agent identity. Once the user logs out or their session expires, the agent should also cease functioning or lose access. In this scenario, the agent’s identity and permissions mirror exactly those of the user.

When is agent identity required?

Source: Diagram created by the authors using Excalidraw.

Use agent-specific identities when the agent performs actions beyond the user’s permissions (for example, accessing cross-department data or sensitive information). This is also needed if the agent can make autonomous decisions (initiating workflows, API calls, placing orders, etc.) or if it can interact with other agents and trigger downstream processes.

By clearly distinguishing between user and agent identities and enforcing authentication, authorization, and trust boundaries, systems can minimize the risk of overprivileged agents, prevent lateral movement, and better support auditing and policy enforcement.
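The identity decision described above can be sketched as a simple predicate: an agent needs a dedicated identity when any of the listed conditions holds. The attribute names below are assumptions invented to mirror the prose, not fields of any real identity system.

```python
# Illustrative decision helper: an agent warrants its own identity when it
# acts beyond the user's permissions, decides autonomously, outlives the
# user session, or can trigger other agents. Attribute names are invented.
from dataclasses import dataclass

@dataclass
class AgentProfile:
    exceeds_user_permissions: bool = False
    acts_autonomously: bool = False
    outlives_user_session: bool = False
    triggers_other_agents: bool = False

def needs_dedicated_identity(p: AgentProfile) -> bool:
    return any([p.exceeds_user_permissions, p.acts_autonomously,
                p.outlives_user_session, p.triggers_other_agents])

session_bound = AgentProfile()  # mirrors the user's identity exactly
workflow_bot = AgentProfile(acts_autonomously=True,
                            triggers_other_agents=True)
```

Encoding the decision this way also makes it auditable: the profile records why a given agent was or was not granted a distinct identity.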

The following practices for agent identity should be followed:

Agent tenancy

Agent tenancy spans service-to-service exposure, access to hardware resources (e.g., GPUs), permission scopes, and agent-to-agent interaction. To maintain secure, predictable, and fair behavior in multi-agent environments, tenancy controls must be enforced at both identity and execution layers.

Permissions tied to agent identity should enforce the principle of least privilege, using mechanisms such as Just-in-Time (JIT) access to request access only when needed, and Attribute-Based Access Control (ABAC) or Policy-Based Access Control (PBAC) to define flexible and secure access policies. As agents introduce probabilistic behavior, adopting these controls based on agent identity is essential to maintaining trust, traceability, and security at scale.
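The ABAC-with-JIT combination above can be sketched as follows: access is granted only when the agent's attributes satisfy every condition of a matching policy, and the grant is time-boxed so it expires on its own. The policy contents, attribute names, and TTL are assumptions for illustration, not a real policy language.

```python
# Illustrative ABAC-style check with a JIT-flavoured, time-boxed grant.
# Policies, attributes, and the TTL are invented placeholders.
import time

POLICIES = [
    {
        "resource": "orders-db",
        "action": "read",
        "conditions": {"team": "fulfilment", "environment": "prod"},
        "ttl_seconds": 900,  # grant expires; access must be re-requested
    },
]

def authorize(resource: str, action: str, attrs: dict,
              now: float, granted_at: float) -> bool:
    for policy in POLICIES:
        if policy["resource"] != resource or policy["action"] != action:
            continue
        if all(attrs.get(k) == v for k, v in policy["conditions"].items()):
            return now - granted_at <= policy["ttl_seconds"]
    return False

t0 = time.time()
attrs = {"team": "fulfilment", "environment": "prod"}
allowed = authorize("orders-db", "read", attrs, now=t0, granted_at=t0)
expired = authorize("orders-db", "read", attrs, now=t0 + 1000, granted_at=t0)
denied = authorize("orders-db", "write", attrs, now=t0, granted_at=t0)
```

Because the decision depends on attributes rather than a fixed role, the same policy set adapts as agents are rescheduled across environments, which suits the probabilistic behaviour noted above.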

The following practices for agent tenancy should be adhered to:

Agent data access

Agents often interact with diverse data stores, including those shared across multiple agents or tenants. This requires careful design to enforce strong authentication, fine-grained authorization, and clear trust boundaries to prevent data leakage, tampering, and privilege escalation. Proper access control must be implemented to uphold least-privilege principles, especially when agents operate autonomously. This section outlines key security concerns related to agent data access, emphasizing unique threats such as prompt injection, tool hijacking, and runtime memory vulnerabilities, and provides recommendations to mitigate these risks.
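One concrete mitigation for the prompt-injection and tool-hijacking threats named above is an explicit tool allow-list at the dispatch boundary, so that injected instructions cannot cause the agent to invoke arbitrary actions. The tool names and registry below are invented for illustration; a real system would also validate tool arguments against a schema.

```python
# Illustrative sketch: constrain which tools an agent may invoke via an
# explicit allow-list, so hijacked instructions cannot reach dangerous
# actions. Tool names and the registry are invented placeholders.

ALLOWED_TOOLS = {"search_docs", "summarize"}

def dispatch(tool: str, args: dict, registry: dict):
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {tool}")
    return registry[tool](**args)

registry = {
    "search_docs": lambda query: f"results for {query}",
    "delete_records": lambda table: "deleted",  # present but never allowed
}

result = dispatch("search_docs", {"query": "quarterly report"}, registry)
try:
    dispatch("delete_records", {"table": "users"}, registry)
    blocked = False
except PermissionError:
    blocked = True
```

The key property is that the allow-list lives outside the model's context: no matter what text reaches the agent, the dispatch layer, not the prompt, decides which tools are reachable.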

Source: Diagram created by the authors using Excalidraw.

Footnotes/Links:

Contributing

This document is subject to a high-change revision cycle due to the rapid pace of evolution in the agentic AI space. Community contributions are welcome and encouraged.

For details on how to propose changes, draft updates, request reviews, and follow the versioning and governance process, please see the Contributing Guide.