2025 brought significant developments to the cloud native landscape, with a strong focus on AI but also new projects and end user reports in many other areas. As always, KubeCon is one of the key places to see that progress, whether through project updates, end users showcasing their stacks and use cases, or vendors in the sponsor hall.

In this blog post, we start 2026 with a look back at the latest KubeCon + CloudNativeCon North America in Atlanta, highlighting selected sessions from members of the CNCF End User Technical Advisory Board (TAB).

Alolita Sharma

So many interesting sessions made it hard to short-list my favorite talks 😀 The ones that made my list included:

Session Title: Benchmarking GenAI Foundation Model Inference Optimizations on Kubernetes

“Minimizing inference costs for LLM deployments is a critical component of cost observability, cost management and infra capacity management for any organization operating inference pipelines. Speakers Sachin Mathew Varghese of Capital One and Brendan Slabe of Google focused this talk on optimization techniques for measuring and benchmarking inference performance in a standardized way, and walked through a Kubernetes SIG project to benchmark GenAI foundation model inference.”
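To give a flavor of the kind of standardized measurement the talk was about, here is a minimal, hypothetical sketch (not the SIG project's actual harness) that times inference calls and reports the usual serving metrics: throughput and latency percentiles.

```python
import time

def benchmark(infer, requests, warmup=2):
    """Time each call to `infer` and report standard serving metrics.

    `infer` is any callable standing in for a model endpoint; a real
    benchmark would drive an actual inference server instead.
    """
    for r in requests[:warmup]:          # warm-up calls are excluded
        infer(r)
    latencies = []
    start = time.perf_counter()
    for r in requests[warmup:]:
        t0 = time.perf_counter()
        infer(r)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    n = len(latencies)
    return {
        "requests": n,
        "throughput_rps": n / elapsed,
        "p50_ms": 1000 * latencies[n // 2],
        "p99_ms": 1000 * latencies[min(n - 1, int(n * 0.99))],
    }

# A stand-in "model" that just sleeps briefly per request.
report = benchmark(lambda prompt: time.sleep(0.001), ["q"] * 52)
```

The value of efforts like the SIG project is that everyone computes these numbers the same way, so results are comparable across serving stacks and optimizations.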

Session Title: Discover Cortex: High Scalability Metrics in 2025

“Cortex is a key technology in the CNCF observability project ecosystem, providing a highly scalable, multi-tenant storage solution for OpenTelemetry and Prometheus metrics. This maintainer talk by Friedrich Gonzalez, Charlie Le and Alolita Sharma from Apple and Anand Rajagopal from AWS zoomed in on key features in Release 1.19, what’s next on the project roadmap, and community updates.”

Session Title: Project Lightning Talk: Perses: Update

“Perses is an exciting sandbox project for observability visualization to watch out for in the cloud-native observability landscape. Core maintainer Augustin Husson provided a progress update about new features and community growth in this session.”

Session Title: Retrofitting OTEL Collectors & Prometheus – How To Overcome Scale/Design Limitations

“Scaling observability data collection is key to handling trace span cardinality from Kubernetes clusters across multiple regions. Using multiple OpenTelemetry connectors and processors for each Collector while still maintaining an efficient memory footprint is more art than science. This session, by Vijay Samuel and Sandeep Raveesh of eBay, explored some solutions to achieve this optimization, leveraging long-term retention examples.”

Chad Beaudin

Session Title: Maintainers Summit

“Yes, I know this is a bit of a cheat since it wasn’t a session. However, this was my first time attending the Maintainers Summit and it was an exciting experience. Hearing the conversations from the maintainer community about the things they liked, and more importantly, the struggles they were dealing with, was great. Additionally, the famous Hallway Track was a lot of fun, just getting to meet the various project maintainers. Lots of passion all around.”

Kenta Tada

KubeCon + CloudNativeCon North America 2025 presented a compelling lineup of sessions tackling the evolving challenges of networking, runtime customization, and long-term sustainability in large-scale cloud-native environments. As Kubernetes operations mature, themes such as IPv6 adoption, runtime extensibility, and Long-Term Support (LTS) are becoming pivotal for scalability, reliability, and operational resilience. The following three sessions stood out as particularly valuable for end‑user practitioners and platform engineers:

Session Title: TikTok’s IPv6 Journey To Cilium: Pitfalls and Lessons Learned

“Migrating to IPv6‑only environments is a milestone for hyperscale operators, and TikTok’s move to Cilium provides a rare, first-hand look at this transformation. This session detailed the technical journey of deploying Cilium on IPv6‑only Kubernetes clusters, highlighting pitfalls and engineering workarounds that enabled production readiness. Attendees gained insights into debugging Cilium network policies, handling IPv6‑specific DNS and NDP traffic behaviors, and overcoming kernel‑level challenges such as NodePort timeouts — essential knowledge for teams modernizing their networking stack.”

Session Title: Container Runtime Customization at Netflix: A Case Study With NRI and OCI Hooks

“Runtime extensibility remains one of the most powerful yet underexplored aspects of Kubernetes operations. In this case study, Netflix engineers revealed how they evolved *Titus*, their global-scale container platform, by integrating containerd’s Node Resource Interface (NRI) and OCI hooks to support complex, specialized workloads while preserving Kubernetes compatibility. The session offered an inside view of custom lifecycle management, network and storage adaptations, and sidecar orchestration at scale. It was essential for platform teams building or maintaining custom runtime layers and seeking the balance between flexibility and standardization.”

Session Title: Shaping LTS Together: What We’ve Learned the Hard Way

“As Kubernetes penetrates regulated and mission‑critical environments, LTS is emerging as a key concern for platform teams and vendors. This cross‑vendor panel gathered members to share operational lessons from maintaining Kubernetes over extended timelines. Discussions covered defining LTS scope (security vs. stability), managing upgrade paths, aligning dependencies, and fostering ecosystem coordination. For attendees tasked with maintaining clusters beyond the upstream support window—or interested in the future of Kubernetes lifecycle management—this was a strategic session.”

Ricardo Rocha

Session Title: Benchmarking GenAI Foundation Model Inference Optimizations on Kubernetes

“Model inference is slowly becoming a bottleneck in many environments. Consistent benchmarking is key to optimizing all the different components and layers involved, making this an effort worth following. This was one of many sessions at this KubeCon + CloudNativeCon around inference optimization, which helped end users struggling in this area.”

Session Title: Slurm Bridge: Slurm Scheduling Superpowers in Kubernetes

“In the age of the GPU, interfacing cloud native environments (on-premises and public cloud) with existing supercomputers and other HPC environments is becoming essential. This effort comes from SchedMD, the company behind Slurm, making it worth following.”

Session Title: DRA is GA! Kubernetes WG Device Management – GPUs, TPUs, NICs and More With DRA

“DRA will be key to interfacing new-generation accelerators and other specialized environments with the cloud native infrastructures we have all become used to. Extremely happy to see this in GA, and great to see all the updates and developments.”
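For readers new to DRA: the core idea is that workloads describe the devices they need as claims, and a driver satisfies the claim as a whole rather than counting opaque extended resources. Here is a hypothetical toy allocator in Python illustrating that claim-then-allocate flow (the real mechanism is Kubernetes's `resource.k8s.io` ResourceClaim and DeviceClass objects, not this code; the data shapes below are invented for illustration):

```python
def allocate(claim, nodes):
    """All-or-nothing device allocation, loosely mirroring how a DRA
    driver satisfies a claim: find a node whose free devices of the
    requested class cover the claim, then reserve them.

    `claim` = {"class": str, "count": int}; `nodes` maps node name to
    a dict of free device counts per class. Both shapes are hypothetical.
    """
    for name, free in nodes.items():
        if free.get(claim["class"], 0) >= claim["count"]:
            free[claim["class"]] -= claim["count"]   # reserve on success
            return name
    return None  # claim stays pending, as an unschedulable pod would

nodes = {"node-a": {"gpu": 1}, "node-b": {"gpu": 4}}
placed = allocate({"class": "gpu", "count": 2}, nodes)   # lands on node-b
```

The all-or-nothing step is the point: a claim for two GPUs is never half-satisfied across nodes, which is what makes DRA suitable for accelerators with topology constraints.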

Session Title: Managing a Million Infra Resources at Spotify: Designing the Platform To Manage Change at Scale

“Spotify is a previous winner of the Top End User award and has always been one of the key end users in our community, also bringing forward new tools such as Backstage. Listening to how changes are managed at their scale helped everyone, no matter the size of their deployment.”

Joseph Sandoval

Session Title: Evolving Kubernetes Scheduling


“The scheduler is where the rubber meets the road for AI/ML on Kubernetes. Right now, teams are cobbling together third-party extensions to make GPU workloads actually work: managing gang scheduling, topology awareness, and preemption policies. Tune and Tyczyński are core contributors who presented the community’s roadmap for moving this functionality upstream. The key insight from the talk: “The Kubernetes scheduler is designed to schedule one pod on one node at a time, and we now need it to schedule groups of pods on groups of nodes at a time.” That fundamental architectural shift is what separates batch/ML workloads from the microservices patterns Kubernetes was built for, and signals where the community is investing to make Kubernetes the invisible substrate for AI workloads.”
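The "groups of pods on groups of nodes" shift can be sketched in a few lines. This is a hypothetical, greedy first-fit illustration of gang semantics, not any real scheduler's algorithm: the whole group is planned against a copy of the cluster state and committed only if every pod fits.

```python
def gang_schedule(pods, node_capacity):
    """Place a *group* of pods atomically: either every pod in the gang
    fits on some node, or none are placed. Contrast with the default
    one-pod-at-a-time scheduling loop, which can strand a partial gang.

    `pods` is a list of CPU requests; `node_capacity` maps node -> free CPU.
    Greedy first-fit sketch; real gang schedulers are far more sophisticated.
    """
    trial = dict(node_capacity)                  # plan against a copy
    placement = {}
    for i, cpu in enumerate(pods):
        node = next((n for n, free in trial.items() if free >= cpu), None)
        if node is None:
            return None                          # one pod can't fit: place nobody
        trial[node] -= cpu
        placement[f"pod-{i}"] = node
    node_capacity.update(trial)                  # commit only on full success
    return placement

caps = {"n1": 4, "n2": 4}
ok = gang_schedule([2, 2, 2], caps)              # all three pods placed together
```

Without the all-or-nothing commit, a distributed training job could grab some of its GPUs, block on the rest, and deadlock against another half-placed job, which is exactly the failure mode gang scheduling exists to prevent.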


Session Title: Keynote: Supply Chain Reaction: A Cautionary Tale in K8s Security


“Supply chain attacks are the nightmare scenario because your cluster passes every security scan while the build pipeline itself is compromised. Stacey and Adolfo ditched the standard slide deck for a live skit that dramatized exactly how this happens: an engineer watches helplessly as cryptomining malware spreads through their secure cluster. Firewalls worked. mTLS was configured correctly. Vulnerability scanners came back clean. None of it mattered because the compiler itself was tainted. The theatrical format worked because it made the abstract threat tangible and kept you engaged through what could have been another dry security talk. Their demonstration of the OSPS Baseline with Sigstore, SLSA attestation, and gittuf shows the defensive playbook: verify provenance at every step, not just the final artifact. If you’re building platforms that developers trust, this is why supply chain security has moved from a nice-to-have to a table-stakes requirement.”

Henrik Blixt

Session Title: The Evolution of Platform APIs in the Age of LLMs

“As a platform product manager at Intuit, I found this session stood out because it tackled how platform APIs are changing in the era of large language models (LLMs), shifting from rigid definitions toward more dynamic, conversational, and “wizard-style” interactions where developers and automation agents can consume the platform in new ways. This is highly relevant to me: it highlighted how platform engineering is evolving to offer model-driven interfaces rather than only static APIs, making service consumption, management, and troubleshooting far more intuitive.”

Session Title:  Progressive Configuration Delivery for Zero-Downtime Cloud Workloads

“Being an Argo maintainer, this session hit a soft spot because it addressed progressive delivery of configurations (not just code or images), with a CRD-based approach that decouples image versioning from configuration delivery and enables batch- or traffic-based rollout of config changes. This kind of capability is directly aligned with our need to manage configuration drift, rollout risk, and seamless service updates. I was thrilled when they detailed how this can reduce the blast radius of config changes across multiple services and clusters, which resonated with many other audience members as well!”
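The blast-radius idea is easy to see in miniature. Below is a hypothetical Python sketch of batch-based config rollout with a health gate (the session's approach used a CRD, not Python, and the function names and data shapes here are invented for illustration):

```python
def rollout(services, new_config, batch_size, healthy):
    """Roll a config change out in batches, reverting the current batch
    and halting on the first health-check failure, to limit blast radius.

    `services` maps service -> current config; `healthy(svc, cfg)` is a
    stand-in health probe. Both shapes are hypothetical.
    """
    names = list(services)
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        old = {s: services[s] for s in batch}
        for s in batch:
            services[s] = new_config
        if not all(healthy(s, services[s]) for s in batch):
            services.update(old)                 # revert just this batch
            return {"updated": names[:i], "failed_batch": batch}
    return {"updated": names, "failed_batch": None}

svcs = {f"svc-{i}": "v1" for i in range(5)}
# A probe that rejects the new config on svc-3.
result = rollout(svcs, "v2", batch_size=2,
                 healthy=lambda s, cfg: not (s == "svc-3" and cfg == "v2"))
```

A bad config here only ever reaches one batch of services before the rollout stops and that batch is reverted, which is the property the speakers demonstrated across services and clusters.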

Session Title: Creating and Maintaining Ephemeral Runtime Environments for 18,000 Developers

“This session resonated strongly since it featured a real-world case of scaling ephemeral environments to 18,000 developers, which directly touches on the kind of demand and features we see at Intuit. It aligns with our focus on developer experience and velocity: providing on-demand, isolated, production-like runtime environments supports feature velocity, reduces risk, and improves confidence in deployments.”