How a Cloudchipr client reduced cloud costs by 45%
Challenge
The customer’s focus on growth led to a rapid increase in cloud spend. The team lacked unified cost visibility, attribution, anomaly detection, and FinOps automation. Engineers spent valuable time manually hunting down idle resources instead of shipping features. To scale efficiently, they needed to modernize their infrastructure by adopting Cloud Native Computing Foundation (CNCF) projects including Kubernetes for orchestration, Prometheus and Grafana for observability, Argo CD for GitOps, and Helm for package management—while ensuring this transformation would reduce costs rather than increase them.
Solution
The customer migrated to Amazon EKS (Kubernetes) as their orchestration foundation, adopting a full CNCF stack: Argo CD for GitOps-based deployments, Helm for package management, Prometheus and Grafana for monitoring, OpenTelemetry for distributed tracing, KEDA for event-driven autoscaling, Cilium for networking and security policies, and containerd as the container runtime.
Cloudchipr deployed automation workflows to handle routine FinOps tasks that engineers previously did manually—catching idle resources, managing off-hours schedules, and enforcing cost policies across the new cloud native stack. DevOps and FinOps engineers partnered with the customer to set up both the platform and the cost automation, sharing best practices that enabled the team to stay focused on building and growth.
Impact
By combining their CNCF stack migration—Kubernetes (EKS), Argo CD, Helm, Prometheus, Grafana, OpenTelemetry, KEDA, Cilium, and containerd—with Cloudchipr’s automated FinOps platform, the customer achieved measurable results across cost, visibility, and efficiency. They improved cost visibility and accountability across teams, attributed 95% of all cloud costs to the right dimensions, implemented real-time anomaly detection, and automated routine FinOps hygiene work. The cloud native transformation delivered a 40% compute efficiency gain through better bin-packing and autoscaling, while automation reduced overall AWS spend by 45% and helped prevent end-of-month surprises.
By the numbers
45%
AWS cloud savings
95%
of cloud costs attributed across 26 different SaaS and cloud providers
> 95
anomalies detected and prevented with real-time anomaly detection
How an enterprise-grade AI copilot company saved 45% on cloud costs, achieved 95% cost attribution, and prevented over 95 anomalies
“We needed a FinOps automation platform that could track and act on routine engineering tasks, allowing our team to focus on shipping products and scaling our infrastructure.”
Artem, CEO, Cloudchipr’s Client
Building cloud native resilience with Kubernetes and GitOps
The customer needed to scale their workflow automation platform without unpredictable cloud costs eating into their growth. Their existing Amazon ECS setup was reaching its limits, and they knew a migration to Kubernetes would give them better control over both performance and spend. The question was how to adopt cloud native infrastructure without creating a maintenance nightmare or losing visibility into costs.
They decided to build on CNCF projects because of the strong community support, battle-tested patterns, and portability. Instead of building custom tooling or betting on proprietary solutions, they chose projects with proven adoption and active ecosystems.
Why Kubernetes: Orchestration that scales predictably
The team migrated from Amazon ECS to Kubernetes on Amazon EKS to gain declarative infrastructure, better workload isolation, and more granular control over resource allocation. Kubernetes gave them the foundation to run stateless services, background workers, and scheduled jobs with consistent patterns across all environments.
They chose containerd as the container runtime for its stability and lighter resource footprint compared to alternatives. For networking and security, they adopted Cilium, which not only handled the network layer but also enforced network policies between services, giving them zero-trust communication and better visibility into how traffic flowed through the platform.
Multi-AZ deployments and gradual rollouts kept the platform resilient during updates. PodDisruptionBudgets and anti-affinity rules ensured capacity stayed available during node maintenance.
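The placement effect of those anti-affinity rules can be illustrated with a small sketch. This is not how the Kubernetes scheduler is implemented; it is a minimal model of the outcome, and the `spread_replicas` helper and zone names are hypothetical:

```python
# Minimal sketch of replica spreading across availability zones, modeling
# the outcome anti-affinity rules produce (not the real scheduler).
from collections import Counter

def spread_replicas(replicas: int, zones: list[str]) -> dict[str, int]:
    """Assign replicas round-robin so no zone holds more than its fair share."""
    placement = Counter()
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return dict(placement)
```

With an even spread like this, losing a single zone during maintenance removes at most a proportional slice of capacity rather than a whole service.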
Why KEDA: Event-driven autoscaling to cut waste
One of the biggest cost drains was running background workers at full capacity even when queues were empty. The team adopted KEDA to handle event-driven autoscaling based on external signals like queue depth and stream lag. Services now scale out during bursts and scale down to zero when idle, cutting waste without manual intervention.
For general application scaling, Kubernetes’ built-in Horizontal Pod Autoscaler handled CPU and memory-based scaling. For node provisioning, they added Karpenter to spin up nodes when needed and consolidate underused capacity automatically, mixing on-demand and spot instances while respecting disruption budgets.
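The scale-out/scale-to-zero decision KEDA makes from an external signal can be sketched roughly like this (a simplified model of the desired-replica calculation; the `desired_replicas` helper and its parameters are illustrative, not KEDA’s actual code):

```python
import math

def desired_replicas(queue_depth: int, msgs_per_replica: int, max_replicas: int) -> int:
    """Scale workers in proportion to queue depth; an empty queue scales to zero."""
    if queue_depth == 0:
        return 0  # idle: no workers needed, no cost incurred
    return min(max_replicas, math.ceil(queue_depth / msgs_per_replica))
```

The key cost property is the zero branch: unlike a fixed worker pool, an empty queue means zero replicas and zero spend until the next burst arrives.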
Why Argo CD and Helm: GitOps for consistent deployments
The team wanted deployments to be consistent, traceable, and repeatable across dev, staging, and production. They chose Argo CD for GitOps and Helm for package management.
All configuration now lives in Git as the single source of truth. Argo CD watches the clusters and automatically syncs them to match what’s in Git. If something drifts, it catches it.
Each environment has its own branch and Helm values file, so changes move through environments in a controlled way. If a deployment goes wrong, the team can roll back to a known-good version automatically.
This setup eliminated manual deployments and made every release predictable and auditable.
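The drift detection described above can be sketched as a comparison between the desired state in Git and the live state in the cluster (an illustrative model only; the `detect_drift` helper and the flat key/value shape are assumptions, not Argo CD’s implementation):

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Return each field whose live value differs from, or is missing versus, Git."""
    return {
        key: {"desired": value, "live": live.get(key)}
        for key, value in desired.items()
        if live.get(key) != value
    }
```

An empty result means the cluster matches Git; anything else is drift that a GitOps controller would flag or automatically sync back.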
Why Prometheus, Grafana, and OpenTelemetry: Observability for performance and cost
CNCF observability projects helped the team keep tabs on both performance and spend.
They deployed Prometheus for metrics collection and Grafana for dashboards, giving the team real-time visibility into service performance, autoscaling activity, and cluster health. Engineering, finance, and leadership all share the same dashboards, so everyone can see how infrastructure decisions affect both reliability and cost.
Key services are instrumented with OpenTelemetry, sending metrics and traces into the monitoring stack. This shows the team exactly how requests flow through the system, where latency comes from, and how code or infrastructure changes affect performance and cost. Cilium adds another layer of observability by showing which services communicate and where network policies might impact availability.
The team set up alerts for latency, error rates, and resource usage that tie directly to their service-level objectives. Anomaly detection, daily and weekly spend reports, and tag-based insights were added, so teams know who owns what. With operational metrics and cost data in one place, the team can make smarter scaling decisions and keep cloud spend predictable.
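One simple way to implement spend anomaly detection of this kind is a z-score check against a trailing window of daily costs. This is a minimal sketch under that assumption; the `is_spend_anomaly` helper and its threshold are hypothetical, not the product’s actual algorithm:

```python
from statistics import mean, stdev

def is_spend_anomaly(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag today's spend if it sits more than z_threshold standard deviations
    above the trailing daily mean."""
    if len(history) < 2:
        return False  # not enough history to estimate variance
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase is suspicious
    return (today - mu) / sigma > z_threshold
```

Checks like this run daily per team or tag dimension, which is how surprises surface mid-month instead of on the invoice.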
Automated cost hygiene across the stack
While the CNCF stack handled orchestration, autoscaling, and observability, Cloudchipr automated the routine FinOps work that kept costs under control. The team connected the customer’s AWS accounts, set idle thresholds and grace periods, defined working hours and exception lists, and turned on automations that watch resources, compare them against policies, and notify owners or stop resources when idle.
Automated workflows find underutilized EC2 instances and shut them down, power off non-production resources during nights and weekends, catch unattached EBS volumes and unused load balancers, and clean up idle Elastic IPs and expensive NAT gateways. Resource owners are notified and, once they approve, snapshots are taken before the automated cleanup runs. Every automation runs on a schedule with approval flows, maintenance windows, and exception handling built in.
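The threshold-plus-grace-period policy described above boils down to a simple rule: act only when a resource has stayed idle for the entire grace window. A minimal sketch (the `idle_action` helper, its parameters, and the sampling cadence are illustrative assumptions, not the platform’s implementation):

```python
def idle_action(cpu_samples: list[float], idle_threshold: float,
                grace_hours: int, samples_per_hour: int = 1) -> str:
    """Return 'stop' when CPU has stayed below idle_threshold for the whole
    grace period, otherwise 'keep'."""
    needed = grace_hours * samples_per_hour
    recent = cpu_samples[-needed:]
    if len(recent) < needed:
        return "keep"  # not enough history yet; never act on partial data
    return "stop" if all(s < idle_threshold for s in recent) else "keep"
```

The grace period is what makes the automation safe: a single busy sample anywhere in the window resets the decision to "keep", so bursty but legitimate workloads are never stopped.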
Open source learnings and community engagement
Throughout this journey, the team leaned heavily on the CNCF ecosystem and open source community, learning from others who had walked similar paths.
What the team learned from the community:
- CNCF project documentation and architecture patterns guided their migration strategy
- Community Slack channels and GitHub discussions helped troubleshoot complex issues with Cilium networking and KEDA autoscaling
- Conference talks and blog posts from other teams informed their approach to GitOps, cost optimization, and observability
- Real-world examples from the Kubernetes community showed them what worked and what to avoid
How they engaged with the community:
- Shared their migration experience and lessons learned through internal documentation and team retrospectives
- Participated in CNCF community discussions, answering questions and sharing insights where they could help others
- Documented their architecture decisions and challenges to help future team members and inform other teams facing similar problems
- Built relationships with other practitioners through meetups and online forums, creating a network they could learn from
The team found that adopting well-documented CNCF projects made hiring easier. Engineers were excited to work with familiar, battle-tested tools rather than proprietary systems. And when problems came up, there was a whole ecosystem to learn from.
Results
- 45% AWS cloud savings
- 95% of cloud costs attributed across 26 different SaaS and cloud providers
- Over 95 anomalies detected and prevented with real-time anomaly detection
- 40% compute efficiency gain after migrating to Amazon EKS (better bin-packing and autoscaling reduced idle compute and over-provisioning)
Timeline and Status
- Deployment: Oct–Dec 2024
- Production: Dec 2024–Present
Contact / CTA
Questions about the approach? Contact Cloudchipr — FinOps Platform & DevOps Services https://cloudchipr.com