The demand for scalable and cost-efficient inference of large language models (LLMs) is outpacing the capabilities of traditional serving stacks. Unlike conventional ML workloads, LLM inference brings unique challenges: long prompts, token-by-token generation, bursty traffic patterns, and the need for consistently high GPU utilization. These factors make request routing, deterministic scheduling, and autoscaling significantly more complex than typical model serving scenarios. In this presentation, we will discuss KServe, an open source project for LLM inference at scale, covering:

- Scalable model serving on Kubernetes using KServe (a minimal deployment sketch follows this list).
- Seamless integration with Kubernetes, enabling reproducible, resilient, and cost-efficient deployments.
- Deterministic scheduling and token-aware request handling through the Kubernetes inference scheduler (using the Gateway API Inference Extension) and execution strategies.
- Distributed and disaggregated inferencing with LLM Inference Service for advanced serving.
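
To give a concrete flavor of the first point, here is a minimal sketch that uses the KServe Python SDK to declare a GPU-backed InferenceService for a Hugging Face model. The service name, namespace, model id, and resource figures are illustrative placeholders, and the exact runtime arguments depend on the KServe version and serving runtimes installed in your cluster.

```python
# Minimal sketch, assuming the `kserve` and `kubernetes` Python packages and a
# cluster with KServe installed. Names, model id, and resources are placeholders.
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
)

isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(name="llama-demo", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                # Use the Hugging Face serving runtime; the model id below is
                # only an example and requires access to the model weights.
                model_format=V1beta1ModelFormat(name="huggingface"),
                args=[
                    "--model_name=llama3",
                    "--model_id=meta-llama/meta-llama-3-8b-instruct",
                ],
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                    limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                ),
            )
        )
    ),
)

# Submit the InferenceService to the cluster; KServe reconciles it into a
# served endpoint with routing and autoscaling.
KServeClient().create(isvc)
```

From this baseline, the talk shows how the inference scheduler and distributed, disaggregated serving extend the same declarative workflow.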