Reaching for Strategic Autonomy by Building On-Premises Kubernetes with Cluster API and ArgoCD
Challenge: Closing the Agility Gap Between Public Cloud and Datacenters
As France’s national railway operator, SNCF runs critical transportation systems serving millions of passengers daily. Ensuring sovereignty, reliability, and operational resilience for the platforms powering these systems is a strategic priority.
After years of successfully operating Kubernetes in public clouds (Azure and AWS), SNCF faced a major hurdle: providing that same “Managed Kubernetes” experience within its own datacenters. Its first attempt at an on-premises platform was functional but “naïve”: it relied on manual processes and custom tooling.
The pain points were clear:
- Sluggish provisioning: Delivering a new cluster took up to a month.
- Stagnant growth: Only 14 clusters were deployed in four years due to operational overhead.
- Day 2 friction: Upgrades, patching, and scaling were constant struggles, leading to “snowflake” clusters and configuration drift.
- Feature gap: Essential cloud features like Node Autoscaling were missing on-premises.
SNCF needed a platform that could compete with Azure AKS or AWS EKS while maintaining full control over critical business data.
Solution: Modern Problems Require Modern Infrastructure
SNCF decided to stop applying incremental fixes to a legacy foundation and instead rebuild a state-of-the-art platform from the ground up.
“We soon concluded that applying incremental fixes on unsteady foundations would prove time-consuming without any guarantee of fixing the underlying issues. We needed to start again entirely.”
Yann Rotilio, Senior Staff Engineer – Kubernetes Specialist, SNCF
The Technical Stack: “Non-Toxic” and Interchangeable
The new production platform was designed using a modular, cloud-native stack:
- IaaS Layer: OpenStack (Canonical) provides the base compute, networking, and storage primitives.
- Immutable OS: Talos Linux (Sidero Labs) was chosen as the API-driven, security-hardened OS. By eliminating SSH and manual configuration, SNCF removed a major source of human error.
- Declarative Lifecycle: Cluster API (CAPI) is the heart of the architecture. It treats clusters as Kubernetes resources, allowing the team to manage them using a reconciliation loop.
- Networking & Security: Cilium (eBPF-based) provides high-performance networking and observability, while Kyverno enforces policy-as-code for compliance across the fleet.
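The reconciliation loop mentioned in the Cluster API item above can be illustrated with a toy model. This is a minimal sketch, not Cluster API code: `ClusterState`, `plan_actions`, and `reconcile` are hypothetical names, and the infrastructure is simulated so that every action succeeds.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Toy stand-in for a cluster's spec or status (hypothetical fields)."""
    kubernetes_version: str
    worker_replicas: int


def plan_actions(desired: ClusterState, observed: ClusterState) -> list[str]:
    """Compare desired and observed state and list the corrective actions."""
    actions = []
    if observed.kubernetes_version != desired.kubernetes_version:
        actions.append(f"upgrade to {desired.kubernetes_version}")
    delta = desired.worker_replicas - observed.worker_replicas
    if delta > 0:
        actions.append(f"create {delta} worker(s)")
    elif delta < 0:
        actions.append(f"delete {-delta} worker(s)")
    return actions


def reconcile(desired: ClusterState, observed: ClusterState) -> ClusterState:
    """One pass of the loop: report each planned action, then converge."""
    for action in plan_actions(desired, observed):
        print(f"reconcile: {action}")
    # Assumption: in this toy model every action succeeds immediately,
    # so the observed state becomes equal to the desired state.
    return ClusterState(desired.kubernetes_version, desired.worker_replicas)


desired = ClusterState(kubernetes_version="v1.30.0", worker_replicas=3)
observed = ClusterState(kubernetes_version="v1.29.0", worker_replicas=5)  # drifted

observed = reconcile(desired, observed)
# A second pass finds nothing to do: the loop is idempotent once converged.
assert plan_actions(desired, observed) == []
```

The point of the pattern is that drift is corrected by re-running the same comparison, rather than by remembering which manual steps were applied to which cluster.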
The Game Changer: Node Autoscaling On-Premises
Cluster API was the central architectural decision. Beyond eliminating configuration drift, it allowed SNCF to activate node autoscaling in its own datacenters, a capability the team had previously associated only with managed public cloud services. By using the Cluster API provider for OpenStack, SNCF achieved a unified cluster lifecycle workflow across its hybrid environments.
“Cluster API was a game-changer. It gave us node autoscaling in our datacenters—something we thought was only possible with AKS or EKS.”
Yann Rotilio, Senior Staff Engineer – Kubernetes Specialist, SNCF
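As a rough illustration of how autoscaling bounds are usually expressed with Cluster API, the sketch below builds the min/max size annotations that the Kubernetes Cluster Autoscaler's Cluster API provider reads from a MachineDeployment, then clamps a requested replica count to those bounds. The pool name `workers-pool-a`, the bounds, and the helper functions are hypothetical; the source does not describe SNCF's actual settings, and the metadata shown is only a fragment of a manifest.

```python
# Annotation keys read by the Cluster Autoscaler's Cluster API provider.
MIN_SIZE_ANNOTATION = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"
MAX_SIZE_ANNOTATION = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"


def autoscaling_annotations(min_size: int, max_size: int) -> dict[str, str]:
    """Build the annotations that mark a node group as autoscalable."""
    if min_size < 0 or max_size < min_size:
        raise ValueError("expected 0 <= min_size <= max_size")
    # Annotation values are always strings in Kubernetes metadata.
    return {MIN_SIZE_ANNOTATION: str(min_size), MAX_SIZE_ANNOTATION: str(max_size)}


def clamp_replicas(wanted: int, annotations: dict[str, str]) -> int:
    """Keep a requested replica count inside the annotated bounds."""
    low = int(annotations[MIN_SIZE_ANNOTATION])
    high = int(annotations[MAX_SIZE_ANNOTATION])
    return max(low, min(wanted, high))


# Hypothetical worker pool: metadata fragment only, not a complete manifest.
machine_deployment_metadata = {
    "name": "workers-pool-a",
    "annotations": autoscaling_annotations(min_size=2, max_size=6),
}

bounds = machine_deployment_metadata["annotations"]
print(clamp_replicas(10, bounds))  # a request above the maximum is capped
print(clamp_replicas(0, bounds))   # a request below the minimum is raised
```

The clamp function only mirrors the bounding behavior; the real autoscaler also decides when to scale based on pending pods and node utilization.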
Extending GitOps Across the Hybrid Fleet
SNCF already operated hundreds of clusters in the public cloud via ArgoCD. By extending this existing GitOps implementation to the new datacenter clusters, the team created a unified management layer. ORAS was integrated into the supply chain to package and distribute Cluster API providers as OCI artifacts, so that the lifecycle of every high-level primitive is managed through GitOps.
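To make the idea of a unified management layer concrete, the sketch below generates one ArgoCD `Application` per cluster from the same template, whether the cluster runs in a public cloud or in a datacenter. The field names follow the ArgoCD `Application` resource; the cluster names, API server addresses, repository URL, and path are hypothetical, and this is not SNCF's actual configuration.

```python
import json


def argocd_application(cluster_name: str, api_server: str,
                       repo_url: str, path: str) -> dict:
    """Build an ArgoCD Application manifest for one destination cluster."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{cluster_name}-platform", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": "HEAD"},
            "destination": {"server": api_server, "namespace": "kube-system"},
            # Automated sync with self-heal reverts manual changes (drift).
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# Hypothetical fleet: one cloud cluster and one datacenter cluster.
fleet = {
    "cloud-prod-01": "https://cloud-prod-01.example.com:6443",
    "dc-prod-01": "https://dc-prod-01.example.internal:6443",
}

applications = [
    argocd_application(name, server,
                       repo_url="https://git.example.com/platform/fleet.git",
                       path="addons/base")
    for name, server in fleet.items()
]

print(json.dumps(applications[1], indent=2))
```

Because both clusters are described by the same function and the same Git path, adding a datacenter cluster to the fleet is a data change rather than a new procedure.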
Impact: Transformational Velocity
The shift from manual management to a declarative, CNCF-aligned stack resulted in a step change in efficiency.
Notable Metrics:
| Metric | Legacy On-Premises | New Cloud-Native Platform |
| --- | --- | --- |
| Cluster Provisioning | Up to 1 month | 30 minutes |
| Fleet Growth | 14 clusters in 4 years | 10 clusters in 6 months |
| Day 2 Operations | Manual & Challenging | On-demand & Automated |
| Configuration Drift | High | Zero-drift guarantee |
Every Kubernetes cluster in the fleet is now updated monthly. This “zero-drift” posture ensures that the production environment is always in sync with the desired state, providing the reliability required for national rail operations.
“We set out to build a platform that could compete with AKS or EKS, but running in our own datacenters. With Cluster API, ArgoCD, and the CNCF ecosystem, we achieved exactly that—and the metrics speak for themselves.”
Thomas Comtet, Head of Container and Cloud Native Platforms, SNCF
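To put the provisioning and fleet growth figures reported above on a common scale, the short calculation below converts them into ratios. It assumes a 30-day month and treats "up to a month" as a full month, so the results are order-of-magnitude estimates rather than measured values.

```python
# Assumption: a month is 30 days; "up to a month" is treated as a full month.
MINUTES_PER_MONTH = 30 * 24 * 60

# Provisioning: 1 month before, 30 minutes now.
provisioning_speedup = MINUTES_PER_MONTH / 30

# Fleet growth, expressed as clusters delivered per month.
legacy_rate = 14 / (4 * 12)  # 14 clusters in 4 years
new_rate = 10 / 6            # 10 clusters in 6 months
growth_ratio = new_rate / legacy_rate

print(f"provisioning speedup: ~{provisioning_speedup:.0f}x")
print(f"legacy: {legacy_rate:.2f} clusters/month, new: {new_rate:.2f} clusters/month")
print(f"fleet growth rate ratio: ~{growth_ratio:.1f}x")
```

Under these assumptions, provisioning is roughly 1,440 times faster and clusters are delivered about 5.7 times more often per month.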
Alignment and Future Horizons
SNCF's platform now builds on projects from the Linux Foundation and Open Infrastructure Foundation ecosystems. By choosing projects with transparent governance, SNCF has strengthened its long-term technological sovereignty.
What’s Next?
The journey doesn’t end here. SNCF continues to expand its on-premises Kubernetes fleet and is exploring:
- Deeper integration with the CNCF and Open Infra ecosystems
- Next-Gen Abstraction: Integrating KCP and Crossplane to further simplify how infrastructure is consumed across the entire organization.
SNCF has proven that with the right CNCF building blocks, the “managed cloud experience” isn’t a location—it’s an operational standard that can be achieved anywhere.