When Kubernetes launched a decade ago, its promise was clear: make deploying microservices as simple as running a container. Fast forward to 2026, and Kubernetes is no longer “just” for stateless web services. In the CNCF annual survey released in January 2026, 82% of container users report running Kubernetes in production, and 66% of organizations hosting generative AI models use Kubernetes for some or all inference workloads.
The conversation has fundamentally shifted from stateless web applications to distributed data processing, distributed training jobs, LLM inference, and autonomous AI agents. This isn’t just evolution; it’s platform convergence driven by a practical reality: running data processing, model training, inference, and agents on separate infrastructure multiplies operational complexity, while Kubernetes provides a unified foundation for all of them.
Three eras, one platform
The Kubernetes journey mirrors how software has evolved.
- Microservices era (2015–2020): hardened stateless services, rollout patterns, and multi-tenant platforms.
- Data + GenAI era (2020–2024): brought distributed data processing and GPU-heavy training/inference into the mainstream.
- Agentic era (2025+): shifts workloads from request/response APIs to long-running reasoning loops.
Each wave builds on the last, creating a single platform where data processing, training, inference, and agents coexist.
Foundation: Data processing at scale
Before models train, data must be prepared. Kubernetes is now the unified platform where data engineering and machine learning converge, handling both steady-state ETL and burst workloads scaling from hundreds to thousands of cores within minutes. According to the 2024 Data on Kubernetes community report, nearly half of organizations now run 50% or more of their data workloads on Kubernetes in production, with leading organizations surpassing 75%.
Apache Spark remains the gold standard for large-scale data processing. The Kubeflow Spark Operator enables declarative Spark management within Kubernetes. Organizations run Spark at massive scale: thousands of nodes, 100k+ cores on single clusters, spanning hundreds of clusters. Spark preprocesses petabytes of training data and triggers downstream training jobs, all orchestrated through native Kubernetes primitives.
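A declaratively managed Spark job looks like the sketch below. This is an illustrative SparkApplication for the Kubeflow Spark Operator; the name, image, file path, and sizing are placeholders, not values from the text.

```yaml
# Hypothetical preprocessing job managed by the Kubeflow Spark Operator.
# Image, application file, and resource sizes are illustrative.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: prep-training-data
spec:
  type: Python
  mode: cluster
  image: spark:3.5.1
  mainApplicationFile: s3a://datalake/jobs/prep.py   # placeholder path
  sparkVersion: "3.5.1"
  driver:
    cores: 2
    memory: 4g
    serviceAccount: spark
  executor:
    instances: 50     # scale out for burst preprocessing
    cores: 4
    memory: 16g
```

Because the job is a Kubernetes resource, a downstream training job can be triggered by watching its status rather than by polling an external scheduler.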
Orchestration: Connecting the pipeline
With petabytes of prepared training data and models needing retraining on schedules, coordinating multi-step workflows becomes critical. A typical ML pipeline involves Spark preprocessing, distributed training across thousands of GPUs, model validation, and model deployment. Running these manually doesn’t scale.
Kubeflow Pipelines provides portable ML workflows with experiment tracking. Argo Workflows enables complex DAGs spanning Spark jobs, PyTorch training, and KServe deployments. The orchestration layer transforms ad-hoc scripts into production pipelines that trigger retraining when data drift is detected.
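The preprocess-train-deploy chain described above can be sketched as an Argo Workflows DAG. Template names, images, and commands here are placeholders standing in for real Spark, PyTorch, and KServe steps.

```yaml
# Illustrative Argo Workflows DAG: preprocess, then train, then deploy.
# All images and commands are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    dag:
      tasks:
      - name: preprocess
        template: spark-preprocess
      - name: train
        dependencies: [preprocess]   # train only after data is ready
        template: pytorch-train
      - name: deploy
        dependencies: [train]        # deploy only a validated model
        template: kserve-deploy
  - name: spark-preprocess
    container:
      image: example/spark-submit:latest
      command: [sh, -c, "echo submit Spark preprocessing job"]
  - name: pytorch-train
    container:
      image: example/train:latest
      command: [sh, -c, "echo launch distributed training"]
  - name: kserve-deploy
    container:
      image: example/deploy:latest
      command: [sh, -c, "echo roll out InferenceService"]
```

The `dependencies` fields encode the pipeline ordering; a drift-detection step could be added as an upstream task that gates the whole DAG.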
Training: Gang scheduling and resource coordination
Once orchestration triggers a training job, distributed training’s fundamental challenge emerges: resource coordination. Request 120 GPUs but only 100 available? Those 100 sit idle, burning money and blocking work. This is the default state in shared clusters where multiple teams compete for GPUs.
Gang scheduling became table stakes. Projects like Volcano and Apache YuniKorn pioneered the pattern where multi-node training jobs only start when all requested resources are available.
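The gang-scheduling pattern can be sketched with a Volcano Job: `minAvailable` tells the scheduler to place all replicas at once or none at all. The job name, image, and replica count are illustrative.

```yaml
# Sketch of gang scheduling with Volcano: all 8 workers start together,
# or the job waits. Image and sizes are placeholders.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: dist-train
spec:
  schedulerName: volcano
  minAvailable: 8        # gang size: never start a partial job
  tasks:
  - name: worker
    replicas: 8
    template:
      spec:
        containers:
        - name: trainer
          image: example/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
```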
Kueue is emerging as the community standard for batch workload management on Kubernetes. It brings quota management, fair-share scheduling, and multi-tenancy controls, solving the problem of multiple teams competing for limited GPU resources. JobSet complements Kueue by providing a native API for managing distributed Job groups with coordinated failure handling.
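A minimal Kueue setup for the multi-team scenario might look like this: one cluster-wide GPU quota, and a per-team queue that draws from it. The flavor name, namespace, and quota values are assumptions for illustration.

```yaml
# Minimal Kueue sketch: a shared GPU quota plus one team's queue.
# Flavor, namespace, and quota numbers are illustrative.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}   # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 1000
      - name: "memory"
        nominalQuota: 4Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 128
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a
  namespace: ml-team-a
spec:
  clusterQueue: gpu-cluster-queue
```

Jobs opt in by carrying the `kueue.x-k8s.io/queue-name: team-a` label; Kueue holds them until quota is available rather than letting partial allocations sit idle.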
Serving: Inference at scale
After training completes, serving predictions to users requires a fundamentally different approach. Training is batch and GPU-saturated. Inference is online, latency-sensitive, cost-critical, and must handle unpredictable traffic.
vLLM and SGLang have become standards for high-throughput LLM serving on Kubernetes, using PagedAttention and continuous batching to maximize GPU utilization for inference workloads.
KServe provides a standardized model serving layer with autoscaling, versioning, and traffic splitting. KServe integrates with Knative for scale-to-zero GPU workloads. For multi-host inference serving models with 400B+ parameters, LeaderWorkerSet treats pod groups as a single unit.
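A KServe InferenceService wrapping a vLLM server could be sketched as follows; the service name, image tag, and model path are placeholders, and `minReplicas: 0` illustrates the Knative scale-to-zero integration mentioned above.

```yaml
# Hypothetical KServe InferenceService running vLLM as a custom predictor.
# Name, image, and model path are placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-demo
spec:
  predictor:
    minReplicas: 0          # scale-to-zero via Knative when idle
    containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest
      args: ["--model", "/mnt/models"]   # model mounted from storage
      resources:
        limits:
          nvidia.com/gpu: 1
```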
Agentic workloads: Building the agent operating system
With inference serving predictions at scale, the newest pattern emerges: autonomous agents. Unlike single predictions, agents make chains of LLM calls, maintain conversation state, access external tools, and run for minutes or hours. They’re long-running reasoning loops needing orchestration, state management, and security boundaries.
Can you build and orchestrate AI agents on Kubernetes? Absolutely. Frameworks like LangGraph provide stateful agent orchestration with durable execution. KEDA enables event-driven autoscaling, critical when 100 user requests need 100 agent pods, scaling to zero when idle. StatefulSets provide persistent volumes for agent state, while vector databases handle semantic memory.
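The "100 requests need 100 agent pods" pattern maps naturally onto a KEDA ScaledObject that scales an agent-worker Deployment on queue depth. The Deployment name, queue, and connection string below are illustrative assumptions.

```yaml
# Sketch: KEDA scales agent workers with pending tasks, down to zero
# when idle. Deployment name, queue, and host are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-workers
spec:
  scaleTargetRef:
    name: agent-worker       # Deployment running the agent loop
  minReplicaCount: 0         # no idle GPUs or pods
  maxReplicaCount: 100
  triggers:
  - type: rabbitmq
    metadata:
      queueName: agent-tasks
      host: amqp://guest:guest@rabbitmq.default:5672/
      queueLength: "1"       # roughly one pod per pending task
```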
Security requires defense in depth. Workload identity via SPIFFE/SPIRE gives every agent a verifiable identity. Sandboxed execution using gVisor or Kata Containers helps isolate untrusted code paths. Policy enforcement with OPA or Kyverno defines runtime guardrails enforced at the pod admission layer.
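An admission-layer guardrail of this kind might look like the following Kyverno policy, which requires agent pods to run under a gVisor runtime class. The namespace and runtime class name are assumptions for illustration.

```yaml
# Illustrative Kyverno policy: pods in the "agents" namespace must
# declare the gVisor runtime class. Names are placeholders.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sandboxed-agents
spec:
  validationFailureAction: Enforce   # reject non-compliant pods
  rules:
  - name: require-gvisor
    match:
      any:
      - resources:
          kinds: ["Pod"]
          namespaces: ["agents"]
    validate:
      message: "Agent pods must use the gVisor runtime class."
      pattern:
        spec:
          runtimeClassName: "gvisor"
```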
Optimizing for the GPU economy
Across all these workloads, GPU availability and cost dominate. The bottleneck isn’t CPU or memory, it’s accessing GPUs when needed and maximizing utilization.
GPU sharing evolved. Multi-Instance GPU (MIG) partitions GPUs into isolated instances. Time-slicing interleaves execution. Multi-Process Service (MPS) enables concurrent kernels. Dynamic Resource Allocation (DRA) in Kubernetes moves beyond Device Plugins, allowing runtime GPU partitioning and reassignment.
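From a workload's perspective, MIG sharing is just a resource request: with the NVIDIA device plugin in mixed strategy, a pod asks for a named slice rather than a whole GPU. The profile name below (an A100 `1g.5gb` slice) and image are illustrative.

```yaml
# A pod requesting one MIG slice instead of a full GPU.
# The mig-1g.5gb profile is specific to A100-class hardware.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
  - name: worker
    image: example/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
```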
At the infrastructure layer, Karpenter (kubernetes-sigs) provisions exact node types and aggressively deprovisions idle capacity to optimize costs. Container image acceleration using Seekable OCI (SOCI) reduces startup time for large images—especially relevant for model-serving containers.
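The provision-exactly-then-consolidate behavior can be sketched with a Karpenter NodePool. This example assumes Karpenter v1 on AWS (hence the `EC2NodeClass` reference); pool name, node class, and timings are placeholders.

```yaml
# Sketch of a Karpenter v1 NodePool (AWS): allow spot or on-demand
# capacity and consolidate idle/underutilized nodes quickly.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu            # placeholder node class
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m     # deprovision idle capacity aggressively
```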
Multi-cluster orchestration and AI conformance
As AI workloads scaled, even optimized single clusters hit limits. Teams now run hundreds of clusters for batch processing, distributed training, and inference. When a 100-GPU training job saturates capacity, inference queues back up and data processing stalls.
Multi-cluster scheduling became critical. Solutions like Armada (CNCF Sandbox) treat multiple clusters as a single resource pool with intelligent workload distribution, global queue management, and gang scheduling across boundaries.
As Kubernetes becomes the AI substrate, the ecosystem is also formalizing portability expectations. The CNCF community has launched work on Kubernetes “AI conformance,” aiming to define baseline capabilities for running AI workloads consistently across conformant clusters.
What’s next: Innovations driven by AI scale
AI scale is pushing innovation into areas few anticipated. Control plane scalability is being reimagined as standard etcd becomes a bottleneck at ultra-scale. Cloud providers are innovating beyond etcd with custom replication systems and in-memory storage. While upstream etcd v3.6.0 delivered 50% memory reduction, 100k+ node clusters require rethinking the control plane datastore.
Unified agent operators are emerging to simplify agent deployment with built-in scaling, security, and lifecycle management. Multi-cluster workload-aware scheduling is evolving to treat hundreds of clusters as a single intelligent resource fabric where workloads land based on GPU availability, network topology, and cost.
The path forward
Platform metrics are changing. Success is increasingly tokens-per-second-per-dollar, not pod density. Reliability includes detecting output drift and degraded model quality. Observability must trace reasoning loops, tool calls, and prompt/context paths.
The good news: much of this is being built in the open—across CNCF and Kubernetes SIG projects—turning Kubernetes into the platform where AI teams build end-to-end systems, not just deploy containers.
---
Ready to explore these patterns? The community maintains hands-on workshops and reference architectures for Data and AI workloads on Kubernetes. Check out the Kubeflow documentation, CNCF landscape guides, and cloud-specific examples for your platform.