As we shared in our earlier post on FluxCD, RBC Capital Markets has been on a deliberate journey to modernize our Kubernetes platform. GitOps with FluxCD gave us a solid deployment foundation. But as our platform grew (today we operate over 50 clusters spanning on-premises VMware environments and multiple clouds), we hit a set of problems that no single off-the-shelf tool was designed to solve together: How do you manage the lifecycle of the clusters themselves? How do you ensure every node is reproducible and tamper-evident at boot? And how do you integrate Kubernetes service discovery with enterprise DNS infrastructure without every record change going through a ticket queue?

This post is about the projects that answered those questions for us, and what we learned building with them inside a regulated financial institution.

The challenge: Platform engineering at scale in a regulated environment

Managing 50+ Kubernetes clusters across hybrid infrastructure is not just an operational challenge; in capital markets, it is also a compliance challenge. SOX, PCI-DSS, and Basel III create real requirements around auditability, configuration drift prevention, and network segmentation. Our platform teams cannot afford snowflake nodes, undocumented cluster state, or manual DNS records that accumulate over years.

When we stepped back and looked at what we were spending engineering effort on, three gaps stood out:

  1. Node configuration drift: VM-based nodes that had been patched and mutated over time were becoming impossible to reason about.
  2. Cluster provisioning: spinning up new clusters for trading desks or risk teams was a multi-day manual exercise with no single source of truth.
  3. DNS integration: every new service or ingress endpoint required a manual ticket to our network team, creating a bottleneck and an audit trail that lived outside our GitOps workflow.

We decided to solve each of these from the ground up, using cloud-native projects where they existed and building our own where they did not.

Kairos: Immutable OS for nodes you can trust

The first piece of the puzzle was node immutability. We evaluated several approaches, but Kairos, a CNCF Sandbox project, aligned most directly with what we needed: a Linux distribution designed from first principles to be immutable, declaratively configured, and reproducible.

With Kairos, every node in our fleet boots from an OCI image. That image is built from a known base (in our case RHEL-derived), baked with our approved security configuration, and published to our internal registry. The cloud-config model lets us define node behaviour (SSH keys, network configuration, SSSD authentication against our Active Directory, Kubernetes agent registration) as versioned YAML that flows through FluxCD just like any other platform component.
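To make that concrete, here is an illustrative sketch of the kind of cloud-config we version. The hostname, user, and stage names are simplified placeholders rather than our production configuration:

```yaml
#cloud-config
# Illustrative Kairos cloud-config; values are placeholders, not our production config.
hostname: platform-node-{{ trunc 8 .MachineID }}
users:
  - name: platform-admin              # example local user; production auth flows through SSSD/AD
    ssh_authorized_keys:
      - github:example-platform-team
install:
  device: auto
  auto: true
  reboot: true
stages:
  boot:
    - name: "Enable enterprise authentication"
      systemctl:
        enable:
          - sssd
```

Because this file lives in Git and flows through FluxCD, a change to node behaviour is a reviewed pull request, not a manual change on a running host.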

A CI/CD pipeline for operating system images

One of the less-discussed challenges of immutable infrastructure is the discipline it demands around image build and validation. We treat our Kairos images exactly like application container images: every change triggers a GitHub Actions pipeline that builds the image, runs integration tests against a live VM, and publishes a new OCI tag only on a clean pass. Nightly builds catch upstream regressions in base packages or the Kairos framework itself before they reach production.

This means our node image pipeline has the same properties we expect from application CI: every change is versioned, tested against a running VM, and promoted to a new OCI tag only on a passing build.
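As a rough illustration of the shape of such a workflow (the job names, registry path, and test script below are placeholders, not our actual pipeline):

```yaml
# Illustrative GitHub Actions workflow for building and testing a node image.
# Registry paths and script names are placeholders.
name: node-image
on:
  push:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"        # nightly build to catch upstream regressions
jobs:
  build-test-publish:
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v4
      - name: Build OCI image from Kairos base
        run: docker build -t registry.example.com/platform/kairos-node:${{ github.sha }} .
      - name: Run integration tests against a live VM
        run: ./ci/run-vm-tests.sh registry.example.com/platform/kairos-node:${{ github.sha }}
      - name: Publish tag on clean pass
        if: success()
        run: docker push registry.example.com/platform/kairos-node:${{ github.sha }}
```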

Kubernetes-native VM provisioning with VirtRigaud

The other half of the VMware story is how we actually provision VMs from our Kairos images. Rather than reaching for imperative vSphere tooling, we use VirtRigaud, a Kubernetes operator that provides declarative VM management across multiple hypervisors (vSphere, Libvirt/KVM, and Proxmox) through a unified CRD API.

The model is straightforward: our Kairos-built OCI image is registered as a VMImage CRD, and VMs are expressed as VirtualMachine CRDs referencing that image. FluxCD reconciles these manifests like any other platform resource. The result is that provisioning a new Kairos node on vSphere is semantically identical to deploying a workload: a pull request is opened, reviewed, merged, and reconciled automatically.
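A simplified sketch of what this looks like in Git follows. The API group, version, and field names are illustrative assumptions; the exact VirtRigaud schema may differ:

```yaml
# Illustrative VirtRigaud manifests (API group and field names are assumptions)
apiVersion: virtrigaud.example.io/v1alpha1
kind: VMImage
metadata:
  name: kairos-node-v1
spec:
  source:
    oci: registry.example.com/platform/kairos-node:v1.2.3
---
apiVersion: virtrigaud.example.io/v1alpha1
kind: VirtualMachine
metadata:
  name: trading-node-01
spec:
  provider: vsphere-prod          # remote provider pod holding the vSphere credentials
  imageRef:
    name: kairos-node-v1
  cpu: 8
  memoryGiB: 32
```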

VirtRigaud’s remote provider architecture also fits our security requirements well: provider credentials are isolated to their own pods, and the controller communicates with them over gRPC/TLS rather than embedding hypervisor credentials centrally.

The operational shift this created was significant: node provisioning moved from imperative vSphere tooling to the same pull-request workflow we use for every other platform change.

The learning curve was real: getting kernel modules, NetworkManager, and enterprise authentication (SSSD/AD) right inside an immutable image took iteration. But once solved, the result is a node foundation we can genuinely trust, which matters when regulators ask questions.

k0rdent: Cluster lifecycle management as a platform

Immutable nodes solved the “what is running” problem. But we still needed to answer “how do clusters get created, updated, and decommissioned?” consistently across our entire fleet.

k0rdent, built on Cluster API (CAPI), gave us a Kubernetes-native control plane for managing Kubernetes clusters. Rather than treating cluster provisioning as a bespoke scripting exercise, k0rdent models clusters as CRDs. Combined with k0smotron for in-cluster control planes, we can now express our entire cluster topology declaratively, and FluxCD reconciles that state continuously.

Our choice of Kubernetes distribution for workload clusters was k0s, a CNCF Sandbox project. k0s is a fully self-contained, single-binary Kubernetes distribution with no host OS dependencies beyond the kernel. That property matters a great deal when your nodes are running an immutable OS: k0s installs cleanly into a Kairos image without requiring package managers, systemd unit file manipulation at runtime, or any of the host-level assumptions that kubeadm-based installs make. The combination of Kairos and k0s gives us a full node-to-cluster stack where every component is declaratively expressed, OCI-packaged, and reproducible from a clean boot.
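To illustrate how a workload cluster is expressed in this model, a cluster definition in Git looks roughly like the sketch below. The template name, credential, and configuration keys are simplified assumptions rather than exact k0rdent schema:

```yaml
# Illustrative k0rdent cluster definition (field names simplified, not exact schema)
apiVersion: k0rdent.mirantis.com/v1alpha1
kind: ClusterDeployment
metadata:
  name: risk-analytics-prod
  namespace: kcm-system
spec:
  template: vsphere-k0s-template-1-0-0   # hypothetical ClusterTemplate pairing the CAPI provider with k0s
  credential: vsphere-prod-credential
  config:
    controlPlaneNumber: 3
    workersNumber: 6
    vsphere:
      datacenter: dc-primary              # hypothetical values
```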

k0smotron extends this further by allowing Kubernetes control planes to run as workloads inside the management cluster, meaning even the control plane is expressed as a CRD, reconciled by FluxCD, with no out-of-band state.

The architecture we settled on organizes clusters into a hub-and-spoke model: a central management cluster hosts k0rdent, the CAPI providers, and the k0smotron-hosted control planes, while the workload clusters it manages, on vSphere and in the clouds, remain spokes reconciled from the same Git source of truth.

Beyond day-one provisioning, this approach transformed how we handle day-two operations: upgrades, scaling, and decommissioning are now declarative changes reviewed in Git rather than bespoke runbooks.

We are also using k0rdent as the foundation for a spot-computing scheduler that allows donated physical server capacity to be absorbed dynamically into our platform, a capability we plan to share more about in a future post.

bindy: Kubernetes-native DNS operations

The last gap, and the one where no existing project fully covered our requirements, was DNS. In capital markets, DNS is not a commodity concern. Our trading applications, market data feeds, and risk systems use DNS extensively, and the enterprise infrastructure that serves them has been built and maintained over decades.

At RBC Capital Markets, that infrastructure is Infoblox, an enterprise DDI platform that is deeply integrated into our network operations. The integration model, however, was built for a world before Kubernetes: every DNS record request went through a ticketing workflow, routed to the network team, and processed on a timescale measured in hours or days. As our platform scaled to 50+ clusters, each spinning up dozens of services and ingress endpoints, that provisioning lag became a genuine operational bottleneck, and the paper trail for DNS changes lived entirely outside our GitOps audit trail.

bindy, a Kubernetes operator written in Rust using kube-rs, was built by Erick Bourgeois to bridge this gap: it manages DNS zones and records as first-class Kubernetes resources. The core design philosophy was to make DNS a GitOps citizen, with the same reconciliation guarantees we apply to everything else on the platform.
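As a sketch of what DNS as a GitOps citizen means in practice (the kind names, API group, and fields below are illustrative assumptions, not bindy's exact schema), a zone and a record might be expressed as:

```yaml
# Illustrative DNS manifests (CRD names and fields are assumptions, not bindy's exact schema)
apiVersion: dns.example.io/v1alpha1
kind: DNSZone
metadata:
  name: trading-internal
spec:
  zone: trading.internal.example.com
---
apiVersion: dns.example.io/v1alpha1
kind: DNSRecord
metadata:
  name: market-data-gateway
spec:
  zoneRef:
    name: trading-internal
  type: A
  ttl: 300
  targets:
    - 10.20.30.40
```

The operator reconciles these resources against the enterprise DNS backend, so the desired state in Git and the records actually being served can no longer drift apart silently.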

The impact has been immediate. DNS records for new services are created automatically as part of the same GitOps workflow that deploys the service itself, provisioning time drops from hours to seconds, and the audit trail is Git history, not a ticket system. The rigid integration boundary that previously required human coordination on every DNS change is replaced by a reconciliation loop.

bindy is currently being expanded to support compliance scoring (a CRD-based model for zone health) and a future MCP server interface for integration with AI-driven platform tooling.

How the three fit together

What makes this stack coherent is that each layer builds on the same foundational principle: everything is code, reconciled continuously, with no manual state.

Git (source of truth)
  └── FluxCD (reconciliation engine)
        ├── k0rdent / CAPI manifests → cluster lifecycle
        ├── Kairos cloud-config → node configuration
        └── bindy CRDs → DNS records

Kairos ensures every node boots from a known, auditable image. k0rdent ensures every cluster is expressed and managed declaratively. bindy ensures every DNS record is a versioned artefact. FluxCD ties them together as the single reconciliation plane. The result is a platform where drift, at the node, cluster, or network level, is structurally prevented rather than operationally managed.
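For illustration, each of those layers is simply another path that FluxCD reconciles. A minimal Kustomization for one of them looks like the following (the repository name and path are placeholders):

```yaml
# Minimal Flux Kustomization reconciling the DNS layer (names and paths are placeholders)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform-dns
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./dns
  prune: true              # resources removed from Git are removed from the cluster
```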

Challenges and lessons learned

Building this platform taught us several things we wish we had known earlier: immutable infrastructure demands real discipline around image build and validation; the enterprise pieces, kernel modules, NetworkManager, and SSSD/AD authentication inside an immutable image, take the most iteration; and the integrations that sit outside Kubernetes entirely, like enterprise DNS, are where the biggest bottlenecks hide.

Looking ahead

Our platform continues to evolve. Some of the areas we are actively developing include the spot-computing scheduler built on k0rdent, bindy's compliance scoring model, and its MCP server interface for AI-driven platform tooling.

We are proud to be building on and contributing back to the CNCF ecosystem, and we look forward to continuing to share what we learn. If you are working through similar challenges in a regulated environment, we would love to connect: find us in the Kairos, k0rdent, and FluxCD Slack communities, or reach out directly on LinkedIn.

Erick Bourgeois is Director and Head of Kubernetes Platform Engineering at RBC Capital Markets, managing 50+ Kubernetes clusters across multi-cloud and on-premises environments. He is a KubeCon and FluxCon speaker, FINOS Common Cloud Control member, and open-source developer at github.com/firestoned.