AI workloads are increasingly running on Kubernetes in production, but for many teams, the path from a working model to a reliable system remains unclear. The cloud native ecosystem – its projects, patterns, and community – offers a growing set of building blocks that help teams bridge these two worlds.
From models to systems
AI Engineering is the discipline of building reliable, production-grade systems that use AI models as components. It goes beyond model training and prompt design into the operational challenges that teams running inference at scale will recognize: serving models with low latency and high availability, efficiently scheduling GPU and accelerator resources, observing token throughput and cost alongside traditional infrastructure metrics, managing model versions and rollouts safely, and enforcing governance and access policies across multi-tenant environments.
These are infrastructure problems, and they map closely to capabilities the cloud native ecosystem has been developing for years.
The cloud native stack for (Gen) AI
If you’re a platform engineer or SRE being asked to support AI workloads, the good news is that much of what you need already exists in the CNCF landscape.
Orchestration and scheduling: Kubernetes is the orchestration layer for AI inference and training. The 2025 CNCF Annual Survey found that 82% of container users run Kubernetes in production, and the platform has evolved well beyond stateless web services. A key development is Dynamic Resource Allocation (DRA), which reached GA in Kubernetes 1.34. DRA replaces the limitations of device plugins with fine-grained, topology-aware GPU scheduling using CEL-based filtering and declarative ResourceClaims. For teams managing GPU clusters, DRA is a significant step forward.
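As a minimal sketch of the DRA model, assuming the `resource.k8s.io/v1` API that shipped in Kubernetes 1.34 and a hypothetical `gpu.example.com` driver name, a DeviceClass selects devices with a CEL expression and pods request them through a ResourceClaimTemplate:

```yaml
# Hypothetical driver name; real clusters use the name published by
# their GPU vendor's DRA driver.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: example-gpu
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: server
    image: registry.example.com/model-server:latest  # placeholder image
    resources:
      claims:
      - name: gpu
```

The scheduler allocates a matching device per claim, replacing the opaque counting of the device-plugin model with declarative, topology-aware selection.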
Inference routing and load balancing: The Gateway API Inference Extension (Inference Gateway), which has reached GA, provides Kubernetes-native APIs for routing inference traffic based on model names, LoRA adapters, and endpoint health. This enables platform teams to serve multiple GenAI workloads on shared model server pools for higher utilization and fewer required accelerators. Building on this work, the newly formed WG AI Gateway is developing standards for AI-specific networking capabilities: token-based rate limiting, semantic routing, payload processing for prompt filtering, and integration patterns for retrieval-augmented generation.
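As an illustration, hedged on the extension's CRD versions (the `v1alpha2` group shown here predates GA, and field names may have changed since), an InferencePool groups model-server pods and a standard HTTPRoute sends inference traffic to it:

```yaml
# Pool of model-server pods the gateway load-balances across.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-pool
spec:
  selector:
    app: vllm
  targetPortNumber: 8000
  extensionRef:
    name: vllm-endpoint-picker   # endpoint-picker extension service (assumed name)
---
# Route chat-completions traffic to the pool.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway      # assumed Gateway name
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-pool
```

The endpoint picker, rather than round-robin, decides which pod serves each request based on model name, adapter, and server health.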
Observability: OpenTelemetry and Prometheus remain essential. AI workloads introduce new metrics – tokens per second, time to first token, queue depth, cache hit rates – and all of them need to live alongside traditional infrastructure telemetry. The inference-perf benchmarking tool, part of a broader community effort to standardize inference metrics, reports key LLM performance metrics and integrates with Prometheus to provide a consistent measurement framework across model servers.
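As one hedged example of folding these metrics into an existing Prometheus setup, the recording rules below derive per-model throughput and p95 time to first token; the `vllm:*` metric names follow vLLM's exporter and should be checked against whatever model server you actually run:

```yaml
groups:
- name: llm-serving
  rules:
  # Aggregate generation throughput per model (tokens per second).
  - record: llm:generation_tokens:rate1m
    expr: sum by (model_name) (rate(vllm:generation_tokens_total[1m]))
  # 95th-percentile time to first token, per model.
  - record: llm:time_to_first_token:p95
    expr: |
      histogram_quantile(0.95,
        sum by (model_name, le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
```

Rules like these let token-level SLOs sit in the same dashboards and alerting pipelines as CPU, memory, and request-latency metrics.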
ML workflows: Kubeflow has grown into a top-30 CNCF project with hundreds of active contributors, providing the pipeline orchestration, experiment tracking, and model serving components that ML teams need. Kueue handles job queuing and fair scheduling for batch and training workloads.
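A minimal Kueue setup, assuming a ResourceFlavor named `default-flavor` already exists, gives a team a quota-bounded queue; training Jobs opt in by carrying the `kueue.x-k8s.io/queue-name: training` label:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
  namespaceSelector: {}        # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 64
      - name: memory
        nominalQuota: 256Gi
      - name: nvidia.com/gpu
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training
  namespace: ml-team           # placeholder namespace
spec:
  clusterQueue: team-research
```

Jobs queue up until quota is available, so GPU clusters stay busy without teams overrunning each other.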
Policy and security: Open Policy Agent (OPA) and SPIFFE/SPIRE provide the governance primitives that production AI deployments require, from controlling which teams can access which models to establishing workload identity across inference services.
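For instance, assuming OPA Gatekeeper with the community library's `K8sAllowedRepos` template installed, a constraint can restrict an inference namespace to model-server images from an approved registry (the namespace and registry path are placeholders):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: models-from-approved-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaces: ["ai-inference"]      # placeholder namespace
  parameters:
    repos:
    - "registry.example.com/approved-models/"  # placeholder registry prefix
```

The same admission-control machinery teams already use for application workloads then decides which model images are allowed to run at all.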
GitOps and deployment: Argo and Flux bring the same declarative, version-controlled deployment patterns to model serving that platform teams already use for application delivery. Safe rollouts matter even more when a bad model version can produce incorrect or harmful outputs.
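Sketched with Argo Rollouts (names and image are placeholders), a canary strategy shifts a small slice of traffic to a new model version and pauses for evaluation before promoting it:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        image: registry.example.com/model-server:v2   # new model version
  strategy:
    canary:
      steps:
      - setWeight: 10      # send 10% of traffic to the new version
      - pause: {}          # hold for manual evaluation of output quality
      - setWeight: 50
      - pause: {duration: 10m}
```

The indefinite pause at 10% is deliberate: unlike a latency regression, a quality regression in model output usually needs human or eval-suite sign-off before the rollout proceeds.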
Bridging the gap
There’s a real gap between AI practitioners and cloud native practitioners today. Despite their infrastructure-heavy workloads, only 41% of professional AI developers currently identify as cloud native, according to the CNCF and SlashData State of Cloud Native Development report. Many come from data science backgrounds where managed notebook environments abstracted away operational concerns. Meanwhile, cloud native practitioners sometimes view AI workloads as architecturally foreign: stateful, GPU-hungry, and different from the services Kubernetes was originally designed for.
Both perspectives contain some truth, and both communities benefit from closing this gap.
If you’re an AI engineer moving to Kubernetes, start with the inference serving stack. Deploy a model server (vLLM or similar) behind the Inference Gateway, use DRA to manage your GPU resources declaratively, and instrument with OpenTelemetry from the start. The patterns will feel familiar if you’ve worked with any request-response service at scale.
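A starting point might look like the Deployment below, a hedged sketch using vLLM's OpenAI-compatible server image with a classic device-plugin GPU request for brevity (DRA ResourceClaims are the declarative alternative); the model name is only an example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm       # label the InferencePool selects on
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args: ["--model", "Qwen/Qwen2.5-7B-Instruct"]  # example model
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
```

From here it is an ordinary request-response service: put it behind a route, scrape its metrics endpoint, and roll it out declaratively.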
If you’re a platform engineer supporting AI teams, understand the new workload patterns. Inference services need autoscaling based on token throughput, not just CPU. Training jobs are long-running and may span multiple nodes with specialized interconnects. Model artifacts are large and benefit from caching strategies. The CNCF Platform Engineering Maturity Model provides a useful framework for building self-service golden paths that include AI capabilities.
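To make the token-throughput point concrete, here is a hedged HPA sketch; `llm_tokens_per_second` is an assumed custom metric that would need to be exposed through a metrics adapter such as prometheus-adapter, and the target value is illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm            # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: llm_tokens_per_second   # assumed custom metric name
      target:
        type: AverageValue
        averageValue: "4000"          # scale out above ~4k tokens/s per pod
```

Scaling on tokens rather than CPU matters because a GPU-bound model server can be saturated while its CPU utilization still looks idle.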
I know this is easier said than done, and I’m not claiming it is a simple shift from A to B. AI workloads resemble the systems we already know in many ways, yet they come with their own complexity, scale, and behaviour. As an engineer, I find that exciting: something genuinely new alongside the ever-same application demands.
Why open source matters here
AI systems are becoming critical infrastructure. The 2025 CNCF Annual Survey found that the top challenge organizations face in cloud native adoption is now cultural, not technical. Team dynamics and leadership alignment outpace tool complexity and training as blockers. This signals maturity: the technology works, and the hard problems are organizational.
For AI infrastructure specifically, open source and vendor-neutral governance provide three things that proprietary stacks cannot easily replicate:
Composability. No single project solves the full AI production stack. It requires a container runtime, a scheduler, a policy engine, an observability pipeline, a workflow orchestrator, an inference gateway, and model serving frameworks, all composed together. The CNCF landscape enables this composition through interoperability and shared standards.
Portability. Organizations are running AI workloads across hyperscalers, GPU-focused cloud providers, and on-premises infrastructure. Kubernetes and the cloud native ecosystem provide the abstraction layer that prevents lock-in to any single provider’s AI stack.
Community-driven evolution. The speed at which the Kubernetes community has responded to AI workload requirements demonstrates how open governance enables rapid adaptation. DRA, the AI-focused working groups, the Inference Gateway, the AI Gateway, the AI Conformance Program: these are not top-down product decisions; they’re community responses to real practitioner needs, shaped through KEPs, working groups, and public design discussions.
Getting started
If you want to explore the intersection of cloud native and AI Engineering, here are some entry points:
- Try the Inference Gateway: The getting started guide walks you through deploying an inference-aware load balancer on your cluster.
- Explore DRA for GPU scheduling: The Kubernetes documentation on DRA covers the concepts and API objects. Start by understanding ResourceClaims and DeviceClasses.
- Join the community: Multiple active working groups are developing proposals for new AI standards. The #wg-ai-gateway channel on Kubernetes Slack is a good starting point.
- Attend KubeCon + CloudNativeCon: The AI tracks and co-located events provide hands-on exposure to these patterns and a chance to connect with practitioners solving the same problems.
Looking ahead
The cloud native ecosystem didn’t emerge specifically for AI, but it is increasingly well-suited to supporting AI systems in production. The model may drive innovation, but the platform determines how reliably that innovation reaches users.
Much of the important work ahead sits at the intersection of these communities. Whether you’re building models or platforms, there is an opportunity to shape how AI systems are operated in practice, through open collaboration, shared standards, and real-world experience.