Scaling Kubernetes Operations to 100+ Clusters with GitOps and Argo CD
Galaxy FinX provides a digital banking platform serving customers globally, beginning with the launch of a pure-play digital bank in Vietnam. To support rapid growth and meet regulatory requirements, we adopted a microservices architecture early and standardized on Kubernetes using Amazon EKS. Today, our platform engineering team supports hundreds of developers across multiple regions and operates more than 100 Kubernetes clusters, with the goal of keeping our infrastructure fast to scale and reliable.
The challenge: Managing 100+ clusters by hand
As our business grew, our infrastructure footprint expanded from a few small environments to over 100 Kubernetes clusters. However, our management tools didn’t evolve at the same pace. We found ourselves trapped in a cycle of manual work that created several major hurdles.
The primary issue was human error. Because we set up clusters manually with scripts and command-line tools, no two clusters were exactly alike. This led to configuration drift: a fix applied to one cluster could easily be forgotten on the other 99. The inconsistency made every global security update or version upgrade a high-stakes gamble.
Furthermore, we were hampered by slow deployment speeds. Setting up a single production-ready cluster, complete with security policies, monitoring, and networking, took nearly 4 hours of dedicated engineer time. Across 100 clusters, that adds up to roughly 400 engineer-hours of setup work alone, an overhead that made further scaling impractical.
Finally, we suffered from operational blind spots. With clusters scattered across different regions, our team lacked a central source of truth. We had no way to quickly verify which clusters were up to date and which were falling behind. We were essentially “flying blind,” which significantly increased our exposure to security vulnerabilities and operational outages. We realized that to continue growing, we had to stop treating our clusters like individual projects and start treating them as a single, automated fleet.
By the numbers
100+ Kubernetes clusters managed across multiple global regions
< 20 minutes to bootstrap a production-ready cluster from scratch
90% reduction in manual engineering hours per cluster deployment
The solution: A unified GitOps framework

The turning point came when we decided to stop managing clusters and start engineering a platform. We chose Argo CD as the centerpiece of our strategy, moving toward a declarative, GitOps-driven architecture.
Instead of manual kubectl commands or disparate Terraform runs, we established a single “infrastructure-live” Git repository as the definitive source of truth. We adopted the “App of Apps” pattern, in which a parent Argo CD application manages a set of child applications. This allowed us to treat an entire cluster’s configuration, from service meshes to ingress controllers, as a single manageable unit.
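To make the “App of Apps” idea concrete, here is a minimal Python sketch that builds a root Argo CD Application and its child Applications as plain dictionaries. It illustrates the pattern only and is not taken from the actual “infrastructure-live” repository: the repository URL, the directory layout (clusters/<name>/apps, addons/<name>), the add-on names, and the sync settings are all assumptions chosen for the example.

```python
import json

# Hypothetical repository URL; the real one is not given in this article.
REPO_URL = "https://git.example.com/platform/infrastructure-live.git"


def application(name: str, path: str, namespace: str = "argocd") -> dict:
    """Build an Argo CD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Application objects themselves live in the Argo CD namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "HEAD",
                "path": path,
            },
            "destination": {
                # In-cluster API server address used by Argo CD.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Assumed policy: sync automatically, prune and self-heal.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


def app_of_apps(cluster: str, addons: list[str]) -> tuple[dict, list[dict]]:
    """Return (root app, child apps) for one cluster.

    The root app points at a directory that holds the child Application
    manifests; each child points at the manifests of one add-on.
    The directory layout here is hypothetical.
    """
    root = application(f"{cluster}-root", f"clusters/{cluster}/apps")
    children = [
        application(f"{cluster}-{addon}", f"addons/{addon}", namespace=addon)
        for addon in addons
    ]
    return root, children


if __name__ == "__main__":
    root, children = app_of_apps("demo", ["ingress-nginx", "istio-system"])
    # JSON is valid YAML, so this output could be committed as a manifest.
    print(json.dumps(root, indent=2))
    print(f"{len(children)} child applications generated")
```

Only the root Application has to be applied to a new cluster; Argo CD then discovers and syncs the children from Git, which is what lets the whole cluster configuration be handled as one unit.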
Our new workflow begins with Terraform for the initial cloud provisioning. The moment the cluster is live, Argo CD is installed automatically and pointed at our Git repository. Each cluster follows an identical bootstrap sequence: cloud provisioning → Argo CD installation → Git-driven synchronization of platform services, security policies, and application configurations.
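The three-stage sequence can be sketched as an ordered pipeline in which each stage runs only if the previous one succeeded. The step functions below are stand-ins that merely record what happened; in a real setup they would wrap a Terraform run, an Argo CD installation, and a wait for the first sync. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Cluster:
    name: str
    completed: list[str] = field(default_factory=list)


# A step takes a cluster and reports whether it succeeded.
Step = Callable[[Cluster], bool]


def provision_cloud(cluster: Cluster) -> bool:
    # Stand-in for the Terraform run that creates the cluster.
    cluster.completed.append("cloud provisioning")
    return True


def install_argo_cd(cluster: Cluster) -> bool:
    # Stand-in for installing Argo CD and pointing it at the Git repository.
    cluster.completed.append("Argo CD installation")
    return True


def sync_from_git(cluster: Cluster) -> bool:
    # Stand-in for waiting until Argo CD has synced the cluster from Git.
    cluster.completed.append("Git-driven synchronization")
    return True


BOOTSTRAP_SEQUENCE: tuple[Step, ...] = (
    provision_cloud,
    install_argo_cd,
    sync_from_git,
)


def bootstrap(cluster: Cluster, steps: tuple[Step, ...] = BOOTSTRAP_SEQUENCE) -> bool:
    """Run the steps in order and stop at the first failure."""
    for step in steps:
        if not step(cluster):
            return False
    return True


if __name__ == "__main__":
    demo = Cluster("demo")
    ok = bootstrap(demo)
    print(ok, demo.completed)
```

Because every cluster runs the same tuple of steps in the same order, the sequence itself becomes something that can be reviewed and versioned rather than remembered.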
From that point forward, the cluster bootstraps itself. It pulls down all necessary add-ons, security configurations, and application manifests without further human intervention. This shift meant that our engineers no longer spent their time configuring individual clusters by hand; they spent it defining the desired state of the entire fleet.
The impact: Reliability through automation
The transition to a GitOps stack built on CNCF projects, centered on Kubernetes and Argo CD, redefined our operational capabilities. The most immediate win was speed: the time required to bring a new, production-ready cluster online dropped from nearly 4 hours of manual labor to under 20 minutes of automated synchronization.
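As a quick check on these figures, the short calculation below works through the arithmetic. It treats “nearly 4 hours” as 240 minutes and “under 20 minutes” as exactly 20, so both inputs are upper bounds rather than measured values.

```python
MANUAL_MINUTES = 4 * 60    # "nearly 4 hours", taken as an upper bound
AUTOMATED_MINUTES = 20     # "under 20 minutes", taken as an upper bound
CLUSTERS = 100


def reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Total manual setup effort if every cluster were built by hand.
fleet_manual_hours = MANUAL_MINUTES * CLUSTERS / 60

# Relative drop in time per cluster.
per_cluster_reduction = reduction(MANUAL_MINUTES, AUTOMATED_MINUTES)

if __name__ == "__main__":
    print(f"Manual setup for the fleet: {fleet_manual_hours:.0f} engineer-hours")
    print(f"Time reduction per cluster: {per_cluster_reduction:.1%}")
```

With these inputs the per-cluster time drops by a little under 92%, which is broadly in line with the 90% figure quoted for manual engineering hours, although that figure measures engineer effort rather than elapsed time.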
Reliability also improved markedly. Because Argo CD continuously compares the live state with the Git repository, configuration drift no longer accumulates unnoticed. If a manual change is made at the cluster level, Argo CD detects the discrepancy and automatically reconciles it, so our security and compliance standards are enforced around the clock.
Perhaps the most significant business impact is our improved SRE-to-cluster ratio. A small team of three engineers now manages over 100 clusters. This efficiency has allowed the team to pivot away from repetitive maintenance and toward high-value platform features that help our developers ship code faster.
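The detect-and-reconcile loop can be illustrated with a toy model in which the desired state (Git) and the live state (cluster) are flat dictionaries. This is a simplification for explanation, not how Argo CD is implemented: Argo CD compares rendered manifests with live Kubernetes objects, and reverting manual changes automatically corresponds to its automated sync policy with the self-heal option enabled. The keys and values below are made up.

```python
# Sentinel marking a key that is absent on one side of the comparison.
MISSING = object()


def detect_drift(desired: dict, live: dict) -> dict:
    """Return {key: (desired_value, live_value)} for every differing key."""
    drift = {}
    for key in desired.keys() | live.keys():
        want = desired.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


def reconcile(desired: dict, live: dict) -> dict:
    """Return a new live state that matches the desired state.

    Keys present only in the live state are dropped, which loosely
    mirrors pruning resources that are no longer defined in Git.
    """
    return dict(desired)


if __name__ == "__main__":
    desired = {"replicas": 3, "image": "api:1.4"}
    # Someone scaled the workload and added a debug flag by hand.
    live = {"replicas": 5, "image": "api:1.4", "debug": "true"}

    print(sorted(detect_drift(desired, live)))
    live = reconcile(desired, live)
    print(detect_drift(desired, live))
```

The point of the model is that Git always wins: any difference is treated as drift and resolved in favor of the declared state, never the other way around.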
Looking ahead
Our journey with CNCF projects doesn’t end here. The success of Argo CD has laid the foundation for us to explore even deeper automation. We are currently testing Crossplane (a CNCF Incubating project) to bring cloud resources under the same GitOps workflow. For Galaxy FinX, GitOps is more than just a technical choice; it is the foundation of our ability to scale with confidence.