Community post by Saqib Jan

The cloud offers on-demand access to compute resources, but that ready availability also makes cost a far more dynamic problem to forecast. The problem compounds as companies expand their cloud footprints and adopt more cloud technologies: the potential for waste grows with them.

As the 2024 State of FinOps report underscores, organizations are focused on reducing waste. With efficiency top of mind, savings and cost optimization should be the top priority for engineering leaders considering a FinOps model for Kubernetes. Because estimating and optimizing costs is a black hole for platform engineering and finance teams, the biggest challenge for stakeholders is figuring out where costs originate. Addressing it starts with understanding the cost of ownership.

Big tech engineering teams with mature finance and product management practices are solving these challenges with cost models that help measure the total cost of ownership of their services and applications. Laurent Gil, Cloud Neutrality Advocate and Co-Founder of Cast AI, explains that a cost model is often sufficient only as an informed starting point; the aim is to reach a good-enough point and avoid over-investment.

The first thing to consider is the cost drivers: the elements that contribute to the overall cost, and how to incorporate them into calculations. CPU, memory, and storage are allocated to each service and execution in Kubernetes. Workloads also grow over time, and there are a variety of costs involved in hosting, integrating, running, managing, and securing cloud workloads. While some charges relate directly to compute, data transfer, and storage consumption, other factors add complexity. Tooling and integrations with other cloud services must also be factored into total cost of ownership (TCO) calculations.
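To make the cost drivers concrete, here is a minimal sketch of how they might be combined into a monthly estimate for one workload. The unit prices, the `Workload` fields, and the example service are all hypothetical assumptions for illustration; real rates depend on provider, region, and instance type.

```python
from dataclasses import dataclass

# Hypothetical unit prices in USD; real rates vary by provider and region.
PRICE_PER_VCPU_HOUR = 0.03
PRICE_PER_GIB_RAM_HOUR = 0.004
PRICE_PER_GIB_STORAGE_MONTH = 0.10
HOURS_PER_MONTH = 730


@dataclass
class Workload:
    name: str
    vcpu_requested: float
    ram_gib_requested: float
    storage_gib: float
    data_transfer_usd: float = 0.0  # monthly data transfer charges
    tooling_usd: float = 0.0        # monthly share of tooling and integrations


def monthly_cost(w: Workload) -> float:
    """Sum the cost drivers for one workload over a month."""
    compute = w.vcpu_requested * PRICE_PER_VCPU_HOUR * HOURS_PER_MONTH
    memory = w.ram_gib_requested * PRICE_PER_GIB_RAM_HOUR * HOURS_PER_MONTH
    storage = w.storage_gib * PRICE_PER_GIB_STORAGE_MONTH
    total = compute + memory + storage + w.data_transfer_usd + w.tooling_usd
    return round(total, 2)


# Hypothetical service: 2 vCPU, 4 GiB RAM, 50 GiB of storage.
checkout = Workload("checkout", 2, 4, 50, data_transfer_usd=12.0, tooling_usd=20.0)
print(checkout.name, monthly_cost(checkout))
```

Even a rough model like this shows which driver dominates a given service, which is usually enough for an informed starting point.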

Architect for Efficient Cloud Usage

If you run Kubernetes yourself, you need to have a strong engineering team. And unless you are in a business close to containerization and microservices technology, it’s essentially just a cost center and an inefficient use of resources.

Building out Kubernetes yourself is very challenging: you have to understand the nuances of all the different components that need to be set up and configured properly. Anyone who is not in the business of running infrastructure will generally benefit from managed Kubernetes, because you don’t have to hire a team to run it at a reasonable level of reliability.

Richard Hartmann, Director of Community for Grafana Labs, shares two fundamental approaches to efficient cloud usage. One is to “go all-in on bespoke services, leveraging whatever you can to reduce undifferentiated heavy lifting and focus on solving problems that drive your business forward.” Alternatively, you can “use as few bespoke services as possible, relying solely on the baseline across all providers.” The second approach lets you maintain control over how your platform operates and facilitates easy migration between clouds. “Both approaches have merit,” he says, but he cautions that being in between is usually not ideal, as it exposes you to the drawbacks of both.

Both solutions have similar problems. Cloud is expensive, and there is zero incentive for cloud providers to offer great cost controls. Hartmann points out the inherent conflict of interest: “that would literally enable users to pay less,” something no cloud provider would want, particularly amid current macroeconomic uncertainty. This adds more pressure on engineering leaders already contending with shrinking budgets and heightened expectations for cost efficiency.

Companies are also keenly interested in finding the right model, one that balances knowing what’s running in production with setting up dev-test environments effectively.

“A lot of our customers today are looking at different models internally for their projects that are running in the cloud through managed Kubernetes offerings, whether it be a showback or chargeback type model,” commented GitLab Field CTO Lee Faus. He added, “We’ve had a few customers who tried to implement quotas around what they’re allowed to spend using high water, low water marks. But in doing so, they’ve realized that because of the way most managed Kubernetes clusters work, they incentivize you to build things like auto-scaling.”

There are more reasons why organizations end up in situations with over-provisioned clusters, which not only lead to poor cycle times from a CPU and memory perspective but also ultimately result in a negative experience for end-user interactions with the applications.

To counteract the risk of uncontrolled spending, Hartmann says “we have implemented deep control over what specifically we do and built our cost controls for self-managed clusters and as well managed platforms.” This approach helps scrutinize operations, ultimately enforcing a chargeback program that encourages a shared sense of accountability across stakeholders.

Both Hartmann and Faus highlight challenges in managing costs and finding the right balance between control and cost efficiency. FinOps practices, they affirm, help organizations to anticipate, control, check, and optimize their cloud investments on a proactive and reactive basis.

FinOps for Kubernetes Cost Control Strategies

Cloud cost management can get out of control, but much of that stems from a lack of rigor in the software development lifecycle, where things are pushed into production before they’re actually ready or before they have been adequately tested for performance. Considering the business value, there has never been a more important time to adopt the FinOps phases of inform, optimize, and operate, because existing solutions do not capture the nuances needed to balance cost and performance economically.

FinOps is a discipline that promotes shared responsibility and brings together all stakeholders (tech, business, and finance people) to establish policies and best practices for usage that are programmatically enforced. Adopting a FinOps approach can give platform engineering teams the visibility they need to find ways to reduce costs without affecting performance.

When DevOps gives developers tools and guardrails to build, deploy, and fully own an application, for instance, it’s important to also educate them about overall cost management, because empowering teams to take action is the top challenge. It’s usually not until the bill comes due at the end of the month that finance teams realize there is an issue with sudden spikes in costs.

I can tell you from supporting clients that most organizations using Kubernetes struggle to manage their cloud expenses because they have no proper cycle for reviewing and refining their processes, and because the pool of skilled workers in this segment is very shallow.

Faus, in our conversation, noted, “There’s a term that we’re starting to see a lot of companies use, which revolves around value streams.” Value streams allow us to map back to key performance indicators (KPIs). These KPIs are defined at the CEO, CIO, and CFO level, where budgets are drawn, resource hiring is planned, and new product lines are decided. “This provides a high-level mapping back to those elements and around those value streams. When we drive throughout the given year, we need to have a way to ensure that we are actively tracking these aspects throughout the SDLC and in our cost management.”

Whatever you may call it, empowering development teams becomes imperative when using Kubernetes. Taking responsibility for informed decisions will, in turn, make Kubernetes cost management timely, proactive, and cost-effective. As budgets get tighter, there is a great need for cost control strategies: build your own cost controls from a knowledgeable foundation and implement a third-party solution, whether commercial or open source, to avoid linear cost increases. These strategies are effective for any cloud provider and even for Kubernetes on bare-metal infrastructure.

Case Study: LambdaTest

A good example of the lasting impact of FinOps practices is LambdaTest. This young company, which provides a cloud platform for online browser and operating system testing, quickly scaled up its services after securing initial funding but then encountered sudden spikes in cloud costs during subsequent rounds of funding.

As the senior DevOps engineering leader, Shahid Ali Khan led the development of LambdaTest’s Kubernetes infrastructure and overall infrastructure system. He shared insights on navigating platform engineering challenges and on adopting FinOps principles to optimize cloud resources and reduce cloud costs.

This case study highlights LambdaTest’s journey to FinOps maturity, with an emphasis on cost optimization, and outlines a systematic approach, with insights from notable leaders, to navigating these hurdles proactively. It discusses technology solutions, strategies, outcomes, and lessons from my one-on-one interviews with these leaders.

— The Challenges of Managing Infrastructure

As LambdaTest expanded its offerings, the complexity of its infrastructure, which relied on AWS and self-managed Kubernetes to support data-heavy customers, also increased. This architecture allowed the company to scale rapidly, and Mudit Singh, Head of Growth & Marketing, reflects on their initial decision. “When we started off with Kubernetes, no cloud provider was offering a static, stable solution for it. And at that time, in around 2017, AWS released Managed Kubernetes, which remained in testing for an extended period. As a startup with a talent shortage, we were unsure about managing our own cluster.”

As their usage increased, each month ended with sudden cost spikes that raised more questions about spend. ‘How much are we spending?’ ‘Is this normal?’ ‘How should cost be divided between teams, applications, and business units?’ and ‘What is the problem: over-provisioning, or using too much compute or memory?’ all remained unanswered. This situation, Singh shares, “drained our DevOps leaders, platform engineering and Finance teams (including the Founders) to invest a significant amount of time and attention in understanding the hefty incoming invoices.”

These issues escalated over time: as usage grew, so did the risk of losing cost control. The company sourced tools to report cloud consumption but struggled to identify and address cost drivers. Like many companies, LambdaTest faced the balancing act of building a team with a FinOps culture while striving for better cost visibility.

Singh stressed how identifying and addressing the underlying problems driving up costs proved to be a struggle for them while managing data centers across continents. And Khan expounds on the cross-functional initiatives they took to gain clarity on cost drivers, achieving visibility and transparency into spending and cloud usage.

— Create a Tagging Framework

It is difficult to align reports to business context without insight into workload allocations, and the industry has seen the adoption of more structured approaches to resource management.

Khan detailed, “We have multiple products that are running, and there are services shared among those products. It was getting hard for us to identify which service is contributing, or which particular service is contributing, to the cost of each product.” Tagging with labels also helps identify over-provisioned resources. He continued, “We began by implementing tagging and labeling, along with utilizing different node pools. This enabled us to precisely determine the cost allocation for each product in terms of specific resources, understand how their requirements have scaled over time, and effectively address those needs.”

It has now become common practice to use a namespace for each product or service in Kubernetes, with a clear bifurcation of services. This approach not only lays the groundwork for resource management but also supports isolation, resource quotas, and simplified access control, which improves operational efficiency. Most importantly, it makes cost analysis, reporting, and optimization easier for individual teams, services, or business lines.
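The allocation idea behind tagging can be sketched in a few lines. This is an illustrative assumption, not LambdaTest’s actual implementation: the `product` label key, the `shared` label value, the resource names, and the rule of splitting shared costs in proportion to each product’s direct spend are all hypothetical choices.

```python
from collections import defaultdict


def allocate_costs(resources, shared_label="shared"):
    """Attribute costs to products by label.

    Shared services are split in proportion to each product's direct cost;
    resources without a product label are reported separately as untagged.
    """
    direct = defaultdict(float)
    shared_total = 0.0
    untagged_total = 0.0
    for r in resources:
        product = r["labels"].get("product")
        if product is None:
            untagged_total += r["cost"]
        elif product == shared_label:
            shared_total += r["cost"]
        else:
            direct[product] += r["cost"]

    total_direct = sum(direct.values())
    allocation = {}
    for product, cost in direct.items():
        share = shared_total * cost / total_direct if total_direct else 0.0
        allocation[product] = round(cost + share, 2)
    return allocation, round(untagged_total, 2)


# Hypothetical monthly costs per resource, in USD.
resources = [
    {"name": "web-a", "labels": {"product": "alpha"}, "cost": 300.0},
    {"name": "web-b", "labels": {"product": "beta"}, "cost": 100.0},
    {"name": "ingress", "labels": {"product": "shared"}, "cost": 80.0},
    {"name": "orphan-volume", "labels": {}, "cost": 15.0},
]
allocation, untagged = allocate_costs(resources)
print(allocation, untagged)
```

Reporting the untagged remainder separately is deliberate: it shows how much spend the tagging framework does not yet cover.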

— From Data to Action

The volume of data subject to analysis for cost optimization is always considerable. Being able to vectorize that data and understand where there are errors, memory spikes, or CPU spikes lets you not only optimize the cost structure of managing applications but also provide feedback to the engineering teams. Faus underlines, “This involves actions like automatically promoting an issue or a ticket on the product side to ensure that something is going to be done about that cost as part of a current sprint.”

This process should also involve analyzing time-series data, which greatly helps to identify inefficiencies, make informed decisions about resource allocation, and find potential automation opportunities. There are other strategies and optimization tactics you could adopt, but at a basic level, it’s good to consider first what you are optimizing for.
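As one simple form of time-series analysis, the sketch below flags CPU or memory spikes by comparing each sample with the mean and standard deviation of a trailing window. The window size, threshold, `min_sigma` floor, and sample values are illustrative assumptions; production monitoring systems typically use more robust methods.

```python
from statistics import mean, pstdev


def find_spikes(samples, window=5, threshold=3.0, min_sigma=0.05):
    """Return indices of samples that sit more than `threshold` standard
    deviations above the mean of the preceding `window` samples.

    `min_sigma` is a floor on the standard deviation so that a nearly flat
    history does not turn tiny fluctuations into spikes.
    """
    spikes = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = max(pstdev(history), min_sigma)
        if (samples[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical CPU usage in cores, sampled at a fixed interval.
cpu_cores = [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 1.40, 0.52, 0.50]
print(find_spikes(cpu_cores))
```

The flagged indices could feed the kind of automatic ticket creation Faus describes, so that a spike becomes work in the current sprint.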

So what is optimizing for cost? It’s really just another metric to consider. Say, for example, I already manage CPUs, pods, memory, storage, and compute capabilities. Each one, then, is a piece of the larger puzzle I’m piecing together. So, adding cost into the mix doesn’t change the fundamental approach; it’s just integrating another element into the array of resources we’re already balancing.

Khan emphasizes the key focus is on enabling “our tech personnel to efficiently extract and manipulate data to align with our business objectives, rather than the other way around. This strategy, which we also implement internally at LambdaTest, underscores the critical importance of fostering internal collaboration and knowledge exchange to effectively bridge the technological and business divides within our organization.”

— Cost Visibility

Monitoring is the most important aspect of optimizing cost and running the cluster without impacting performance and usage. It is the core pillar for building awareness and informing FinOps objectives for optimization strategies.

“We tried a lot of solutions and multiple plugins, but we could not get a clear understanding of the volume of requests, the performance of the cluster, or the overall system status,” Khan specified. “We implemented distributive tracking (and distributed tracing) inside the cluster to monitor each and every request, which helped us to identify how services are being used and pinpoint optimization opportunities within the system. This tremendously helped us to identify inefficiencies, which increased accountability – informing service owners to take action, while also enabling things like internal chargeback and showback models.”

Visibility underpins FinOps metrics (idle resources, under-optimized infrastructure) for tracking progress. But the key metric to consider is normalized cost, which, when adjusted for your operating business metrics, provides a more holistic view of your cloud spending relative to your business activities.
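Normalized cost is straightforward to compute: divide cloud spend by a business metric for the same period. The sketch below uses test sessions as the business metric and invented monthly figures; both are assumptions for illustration only.

```python
def normalized_cost(cloud_spend_usd: float, business_units: float) -> float:
    """Cloud spend per unit of business activity (for example, per session)."""
    if business_units <= 0:
        raise ValueError("business_units must be positive")
    return cloud_spend_usd / business_units


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Hypothetical figures: spend rises, but activity rises faster.
january = {"spend": 40_000.0, "sessions": 2_000_000}
february = {"spend": 50_000.0, "sessions": 3_125_000}

jan_unit = normalized_cost(january["spend"], january["sessions"])
feb_unit = normalized_cost(february["spend"], february["sessions"])
spend_change = percent_change(january["spend"], february["spend"])
unit_change = percent_change(jan_unit, feb_unit)
print(f"spend {spend_change:+.1f}%, cost per session {unit_change:+.1f}%")
```

In this made-up example the bill grows by a quarter while the cost per session falls by a fifth, which is the distinction raw spend alone would hide.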

— Drill-Down Granularity

What you do next will depend on where your baseline is. But allowing issues to persist makes controlling costs at a later stage difficult. Even if you attempt to control them later, the effort required from your team will be very high, diverting focus from implementing features.

It is here that a tagging framework with a Kubernetes cost management tool becomes helpful. This helps you drill down into the layers of your environment so you can see exactly how each application is impacting your costs, enabling proactive recommendations for cost savings.

Khan shared their approach, “We began observing all attributes that influence pricing, and based on that, we looked deeper into why there has been an increase, what could have caused it, and then took rather difficult — educational path to show individual teams how their environment impacts their department’s resources.”

The ideal solution, according to Khan’s recommendation, “should provide time-saving features.” An example would be a prioritized list of your environment’s most expensive components, ranked by cost. This allows you to focus on the areas that will yield the most significant cost savings first. “We realized this and implemented a proactive approach across teams, ensuring work could proceed in a manner that does not affect production.”
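A prioritized list of this kind is simple to produce once costs are attributed. Here is a minimal sketch that ranks components by cost and adds a cumulative share of total spend; the component names and amounts are hypothetical.

```python
def rank_by_cost(costs, top_n=None):
    """Return (name, cost, cumulative_percent) tuples, most expensive first."""
    total = sum(costs.values())
    if total <= 0:
        return []
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    report = []
    cumulative = 0.0
    for name, cost in ranked[:top_n]:  # top_n=None keeps every component
        cumulative += cost
        report.append((name, cost, round(100 * cumulative / total, 1)))
    return report


# Hypothetical monthly cost per component, in USD.
component_costs = {
    "grid-nodes": 5200,
    "video-storage": 2100,
    "logging": 900,
    "ci-runners": 1800,
}
for name, cost, cumulative_pct in rank_by_cost(component_costs):
    print(f"{name:<15} {cost:>6} {cumulative_pct:>6}%")
```

The cumulative column makes it easy to see how few components account for most of the bill, which is where optimization effort pays off first.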

— Real-Time Alerting

In cloud-native environments with auto-scaling enabled, your cluster or nodes can scale up or down. Implementing budgets and alerts within the cloud system is therefore imperative, because untracked scaling can lead to significant expenses that the solution you are providing won’t justify. Applying custom rules programmatically lets you receive notifications when costs increase, so you can take corrective action on specific requests.

Khan remarked, “FinOps practices have significantly changed the way we work.” Through extensive data analysis, “we measure costs to a significant extent and set budgets for each product. For example, each product has clusters, and some of them share services. With simple tagging, we can set specific budgets for each product. And when we allocate a certain amount to a product, we get alerts if spending goes over a predetermined threshold.” A small increase triggers an amber alert, and a big jump triggers a red alert.
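The amber and red alert levels can be expressed as a small rule. The thresholds below (amber once spend exceeds the budget, red at 120% of the budget or more) and the product names and budgets are assumptions for illustration; the source does not state the actual values LambdaTest uses.

```python
def budget_alert(spend, budget, amber_ratio=1.0, red_ratio=1.2):
    """Classify spend against a budget as 'ok', 'amber', or 'red'.

    The ratios are hypothetical defaults: amber above 100% of budget,
    red at 120% of budget or more.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    if ratio >= red_ratio:
        return "red"
    if ratio > amber_ratio:
        return "amber"
    return "ok"


# Hypothetical per-product budgets and month-to-date spend, in USD.
budgets = {"alpha": 1000.0, "beta": 500.0, "gamma": 2000.0}
spend = {"alpha": 1050.0, "beta": 650.0, "gamma": 1500.0}

alerts = {product: budget_alert(spend[product], budgets[product]) for product in budgets}
print(alerts)
```

Because budgets are keyed by product, this only works once tagging attributes spend to products reliably.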

The optimal solution should also alert you to abnormal cost spikes in real-time so you can examine the issue right away and remediate it, rather than waiting for weekly or monthly reports. And sometimes, these spikes may serve as an early warning sign of a cyberattack, which requires an immediate and proactive response to safeguard your infrastructure and data integrity.

Encourage FinOps Practices

The more intentionally you plan for change, the sooner you’ll target the places where change will be most effective. But the worst thing you could do as an organization is to say, ‘we’re going to inform’ without understanding the extent to which overspending is ingrained in your Ops and, importantly, where the key drivers are coming from.

The best way to manage Kubernetes at scale is to take a holistic and intentional approach, which also helps in calculating the total cost of ownership and allocating budgets, but most companies are not doing this. Nor are most companies bifurcating their resources. Many manage huge infrastructures but lack a dedicated FinOps team. And the reactive approach they take to cost incidents leads to significant financial burdens.

Cloud lets you accelerate, but it can also be a double-edged sword without a proactive approach. According to the CNCF microsurvey report, over-provisioning, or having more resources than necessary, is one of the most common factors leading to overspending.

“We’ve analyzed usage data on thousands of applications, and there are three primary reasons companies overspend: over-provisioning, pod requests being set too high, and low usage of Spot instances,” Gil enumerated. The biggest source of overspend, however, is an overestimation of the real CPU/memory usage. “For more than 97% of the applications we analyzed, the pre-optimized utilization of CPU is only 12%. That means that, on average, nearly 90% of compute is paid for, but goes unused.” And these percentages, he laid out, “are consistent across application sizes and cloud providers.”
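The arithmetic behind these figures is worth spelling out. The sketch below computes utilization and the share of spend that goes unused, then suggests a lower CPU request from observed usage. The percentile-plus-headroom heuristic, the sample values, and the monthly cost are illustrative assumptions, not the method Gil’s team uses.

```python
import math


def utilization(requested_cores: float, used_cores: float) -> float:
    """Fraction of the requested CPU that is actually used."""
    return used_cores / requested_cores


def wasted_spend(monthly_cost: float, util: float) -> float:
    """Portion of the bill that pays for unused capacity."""
    return monthly_cost * (1.0 - util)


def rightsized_request(usage_samples, percentile=0.95, headroom=1.2):
    """Suggest a request: a high percentile of observed usage (nearest-rank
    method) multiplied by a safety headroom. Both parameters are hypothetical."""
    ordered = sorted(usage_samples)
    rank = math.ceil(percentile * len(ordered))
    return round(ordered[rank - 1] * headroom, 2)


# Hypothetical pod: 4 cores requested, 0.48 cores used on average.
util = utilization(requested_cores=4.0, used_cores=0.48)
waste = wasted_spend(monthly_cost=1000.0, util=util)
usage = [0.40, 0.50, 0.45, 0.60, 0.55, 0.50, 0.48, 0.52, 0.47, 0.90]
print(f"utilization {util:.0%}, wasted ${waste:.0f}, "
      f"suggested request {rightsized_request(usage)} cores")
```

At 12% utilization, 88% of the spend on that pod buys capacity nobody uses, which matches the “nearly 90%” figure in the quote.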

The underlying reason most companies overspend is a lack of education and empowerment. Tech, DevOps, and infrastructure teams often lack cost awareness. And change is not easy, because building a culture of transparency and openness requires sharing pricing information with engineers and creating a safe space for open communication.

This is difficult for nearly all companies because people are first concerned about not stepping on anyone else’s toes. And it can be very unhelpful if fewer people are bold enough to be involved. The key is to get everyone on the same page regarding the business objectives. This means sharing the plan, how things are looking in the near future, what kind of services are planned for wider rollout, and even the company’s gross margin. “It is transparency that spurs on shared understanding,” Hartmann remarks. When everyone sees the bigger picture, they can feel the “real pain” of overspending and how their work directly impacts it. This shared understanding empowers team members to contribute to cost control strategies.

Building on the strategies discussed, a Kubernetes governance platform can serve as an initial step to gain clarity into resource utilization and enable you to drill down into the various layers of your environment. It can also provide policy-based control for cloud-native environments, empowering teams to make informed financial decisions regarding Kubernetes by allowing them to grasp and adopt cost-control strategies.

Author: Saqib Jan



BIO: Saqib Jan is a freelance analyst with experience in application development, cloud technologies, and consulting.