Automation is the future of cloud cost optimization

Posted on September 29, 2021 by Laurent Gil


Guest post by Laurent Gil, Co-Founder, Chief Product Officer at CAST.AI

Managing cloud costs is a balancing act. Scaling cloud resources is so easy that companies often lose control over their cloud spend. And as engineers come to accept overprovisioning as the price to pay for the cloud’s scalability, as much as 30% of resources can go to waste.

Is there a way out? Fortunately, there is – automated cloud cost optimization. Keep reading to learn how automation is already helping companies cut their cloud bills.

How to control cloud costs? 4 approaches

To regain control, companies are now applying cost management tactics that we can roughly divide into four approaches:

Cost visibility – Knowing where the costs come from thanks to a plethora of cost allocation, monitoring, and reporting tools. Real-time cost monitoring is particularly valuable since it notifies you when your cloud spend starts spiraling out of control. One of Adobe’s teams once generated an unplanned cloud bill of over $500k because of a computing job left running on Azure. One alert could have prevented this.

Cost forecasting – Once you have enough historical data to crunch and an idea of your future requirements, you can estimate how many cloud resources you’ll need and plan your budget. Sounds easy, right? Trust me, it’s a bit of guesswork – even for cloud-native tech giants. During one holiday season, Pinterest’s bill went way beyond the initial estimates because people used the platform so much. By then, Pinterest had already committed to paying $170 million to AWS – and then had to get extra capacity at a much higher pricing tier.

Legacy cost optimization – This is when you bring all your findings from previous steps together to create a detailed snapshot of your cloud spend and identify potential candidates for optimization. There are plenty of tools out there that help with this, like Cloudability or CloudHealth by VMware, but most of the time they only give static recommendations for engineers to implement.

Cloud native cost optimization – Optimizing cloud costs is often a point-in-time activity that requires a lot of time and expertise to balance cost vs. performance just right. Tools like CAST AI have the capability to react to changes in resource demands or provider pricing immediately, opening the doors to greater savings.
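To make the alerting idea from the cost visibility point concrete, here is a minimal sketch of a threshold check on hourly spend. The `SpendSample` shape, the baseline, and the 3x factor are all assumptions made for illustration; a real setup would read from your provider's billing data and use whatever alerting it offers.

```python
from dataclasses import dataclass


@dataclass
class SpendSample:
    """One hourly spend reading (hypothetical data shape)."""
    hour: int
    cost_usd: float


def find_spend_alerts(samples, baseline_usd_per_hour, factor=3.0):
    """Return the samples whose hourly cost exceeds factor * baseline."""
    threshold = baseline_usd_per_hour * factor
    return [s for s in samples if s.cost_usd > threshold]


# Made-up readings: a forgotten job starts burning money at hour 2.
samples = [
    SpendSample(hour=0, cost_usd=40.0),
    SpendSample(hour=1, cost_usd=42.5),
    SpendSample(hour=2, cost_usd=3300.0),
    SpendSample(hour=3, cost_usd=3350.0),
]

alerts = find_spend_alerts(samples, baseline_usd_per_hour=45.0)
for alert in alerts:
    print(f"ALERT: hour {alert.hour} cost ${alert.cost_usd:,.2f}")
```

Even a check this simple flags the runaway job within the first hour instead of at the end of the billing period.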

Should we let engineers keep on doing all the work manually? To answer this question, let’s zoom in on the optimization process.

Manual vs. automated approach to optimization

The typical manual cost optimization sequence looks something like this:

1. Take a snapshot of your cloud infrastructure costs at a specific point in time
2. Understand where costs come from by allocating them to teams/departments
3. Analyze usage and growth patterns to decide which of the costs are here to stay
4. Explore your infrastructure in detail to check whether any costs can be eliminated (think resources for abandoned shadow IT projects or unused instances left running)
5. Examine the instances used by your teams and check for overprovisioning
6. Once you come up with a plan, reach out to engineering and have them approve it
7. At the same time, try to convince engineers that cloud costs are just as important as performance
8. Implement the infrastructure changes to cut your costs
9. Analyze your future requirements and plan how you’re going to get that extra capacity
10. Make some reservations or negotiate volume discounts with the cloud provider
11. Establish governance to make sure that teams use those discounted resources to the fullest
12. And then hope that you don’t get cloud bill shock at the end of the month!
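As a rough illustration of the overprovisioning check in the sequence above, the sketch below flags instances whose peak CPU use stays well under what was provisioned. The fleet data, field names, and the 40% cutoff are hypothetical; in practice the numbers would come from your monitoring system.

```python
def find_overprovisioned(instances, max_utilization=0.4):
    """Return (name, utilization) for instances whose peak CPU use is below the cutoff."""
    flagged = []
    for inst in instances:
        utilization = inst["peak_cpu_used"] / inst["cpus_provisioned"]
        if utilization < max_utilization:
            flagged.append((inst["name"], round(utilization, 2)))
    return flagged


# Made-up fleet: provisioned CPUs vs. the peak number of CPUs actually used.
fleet = [
    {"name": "api-1", "cpus_provisioned": 8, "peak_cpu_used": 6.4},
    {"name": "batch-1", "cpus_provisioned": 16, "peak_cpu_used": 2.4},
    {"name": "cache-1", "cpus_provisioned": 4, "peak_cpu_used": 1.0},
]

flagged = find_overprovisioned(fleet)
for name, utilization in flagged:
    print(f"{name}: peak CPU utilization {utilization:.0%}")
```

This is one point-in-time check for one resource dimension; doing it by hand means repeating it for memory, storage, and network, and again whenever usage changes.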

Allocating, understanding, analyzing, and forecasting cloud costs takes time. And your job isn’t done yet because then you need to apply infrastructure changes, research pricing plans, spin up new instances, and do many other things to build a cost-efficient infrastructure.

Automation can solve this for you by taking many of these tasks off your plate.

Legacy consulting services can solve only a part of this equation. There are some great tools on the market that cover points 1-4 for you.

But only cloud native cost optimization can solve it all. With a good tool, you always have full control over what happens but don’t have to do anything proactively to save costs.

An automated solution:

Selects the most cost-efficient instance types and sizes to match the requirements of your applications,
Autoscales your cloud resources to handle spikes in demand,
Removes resources that aren’t being used,
Takes advantage of Spot Instances and handles potential interruptions gracefully.
And does so much more to help you avoid expenses in other areas – it automates storage and backups, security and compliance management, and changes to configurations and settings.
And most importantly – it applies all of these changes in real time, mastering the point-in-time nature of cloud optimization.

Here’s an example of automated optimization

We were running our application on a mix of AWS On-Demand instances and Spot Instances. We used CAST AI to analyze our setup and look for the most cost-effective Spot Instance alternatives. We needed a machine with 8 CPUs and 16 GB of memory.

The platform decided to run our workload on an Inf1 instance, which comes with AWS Inferentia, a chip built specifically for machine learning inference. It’s a specialized machine that is usually quite expensive.

Why did CAST AI pick this instance?

We checked the pricing and it turned out that at that time, Inf1 just happened to be cheaper than the usual general-purpose compute we used. We would never have thought to look for Spot Instances in this category and would have missed out on this gem.
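The selection logic behind this example can be sketched in a few lines: filter every offer by the minimum requirements, then pick the cheapest one regardless of its category. This is not how CAST AI is implemented; the offer names and Spot prices below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotOffer:
    """A Spot offer with hypothetical names and prices."""
    name: str
    vcpus: int
    memory_gib: int
    spot_usd_per_hour: float


def cheapest_offer(offers, min_vcpus, min_memory_gib):
    """Return the cheapest offer meeting the minimum CPU and memory needs, or None."""
    eligible = [
        o for o in offers
        if o.vcpus >= min_vcpus and o.memory_gib >= min_memory_gib
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.spot_usd_per_hour)


offers = [
    SpotOffer("general-8x32", vcpus=8, memory_gib=32, spot_usd_per_hour=0.15),
    SpotOffer("compute-8x16", vcpus=8, memory_gib=16, spot_usd_per_hour=0.13),
    SpotOffer("ml-accel-8x16", vcpus=8, memory_gib=16, spot_usd_per_hour=0.11),
    SpotOffer("general-4x16", vcpus=4, memory_gib=16, spot_usd_per_hour=0.07),  # too small
]

best = cheapest_offer(offers, min_vcpus=8, min_memory_gib=16)
print(best.name if best else "no offer meets the requirements")
```

Because the filter only looks at requirements and price, a specialized machine wins whenever it happens to be the cheapest fit, which is the kind of option a human browsing general-purpose instances would skip.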

If you’re still doubting whether you really need automation to make your cloud budget work, here are a few more reasons why it’s time to say goodbye to manual configuration.

4 reasons why manual cost optimization just doesn’t work in the cloud

1. Cloud billing is just too complicated for humans

Start with the cloud bill and you’re bound to get lost

Take a look at a bill from a cloud vendor and we guarantee that it will be long, complicated, and hard to understand. Each service has a defined billing metric, so understanding your usage to the point where you can make confident predictions about it is just overwhelming.

Now try billing for multiple teams or clouds

If several teams or departments contribute to one bill, you need to know who is using which resources to make them accountable for these costs. And cost allocation is no small feat, especially for dynamic Kubernetes infrastructures.

Now imagine doing it all manually for more than one cloud service. A report released in August 2021 showed that 76% of companies already work in multi-cloud environments, so this is a valid problem that will only grow in the future.
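Here is a minimal sketch of what cost allocation boils down to once billing line items carry a team tag. The line items, field names, and the "unallocated" bucket are assumptions for illustration; real bills have far more dimensions, and each provider exports them in its own format.

```python
from collections import defaultdict

# Made-up billing line items from two providers; one resource is untagged.
line_items = [
    {"provider": "aws", "team": "checkout", "cost_usd": 1200.00},
    {"provider": "aws", "team": "search", "cost_usd": 800.50},
    {"provider": "gcp", "team": "checkout", "cost_usd": 300.25},
    {"provider": "gcp", "team": None, "cost_usd": 150.00},
]


def allocate_by_team(items):
    """Sum costs per team across providers; untagged costs go to 'unallocated'."""
    totals = defaultdict(float)
    for item in items:
        team = item["team"] or "unallocated"
        totals[team] += item["cost_usd"]
    return dict(totals)


totals = allocate_by_team(line_items)
for team, cost in sorted(totals.items()):
    print(f"{team}: ${cost:,.2f}")
```

The summing is the easy part. The hard part is getting every resource tagged correctly in the first place, which is why the "unallocated" bucket tends to grow in dynamic environments.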

2. Forecasting often relies on guesswork

To estimate your future resource demands, you need to do a few things:

Start by gaining visibility – analyze your usage reports to learn about patterns in spend,
Identify peak resource usage scenarios – you can do that by using periodic analytics and running reports over your usage data,
Consider other sources of data like seasonal customer demand patterns. Do they correlate with your peak resource usage? If so, you might have a chance of identifying them in advance,
Monitor resource usage reports regularly and set up alerts,
Measure application- or workload-specific costs to develop an application-level cost plan,
Calculate the total cost of ownership of your cloud infrastructure,
Analyze the pricing models of cloud providers and accurately plan capacity requirements over time,
Aggregate all of this data in one location to understand your costs better.

Many of the tasks we listed above aren’t one-off jobs, but activities you need to engage in on a regular basis. Imagine how much time they take when carried out manually.
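To show why forecasting is partly guesswork, here is a minimal sketch that fits a straight line to past monthly spend and extrapolates one month ahead. The spend figures are made up, and a linear trend is an assumption chosen for simplicity; it knows nothing about seasonal peaks or sudden surges in demand.

```python
def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_spend(ys, months_ahead=1):
    """Extrapolate the fitted trend months_ahead past the last observation."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + months_ahead)


# Made-up monthly cloud spend in USD.
monthly_spend = [1000.0, 1200.0, 1400.0, 1600.0]
next_month = forecast_spend(monthly_spend)
print(f"Forecast for next month: ${next_month:,.2f}")
```

A model like this works only as long as the future resembles the past, which is exactly the assumption that breaks during an unexpectedly busy season.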

3. Picking the right instance types and sizes is a nightmare

AWS has almost 400 different instance types. Good luck trying to analyze them all manually.

To choose the best instance for the job, you usually need to do the following:

Define your minimum requirements across all compute dimensions, including CPU (architecture, count, and the choice of processor), memory, SSD storage, and network connectivity.
Select the right instance type from various combinations of CPU, memory, storage, and networking capacities.
Choose the size of your instance so that it provides enough resources to match your workload’s requirements.
Once you know which instances you’d like to get, consider different pricing models – for AWS, you’ll be looking at On-Demand, Reserved Instances, Savings Plans, Spot Instances, and Dedicated Hosts. Each comes with its pros and cons – and your choice here will have a massive impact on your cloud bill (which is mostly compute).

4. Using Spot Instances manually is risky

Buying unused capacity from AWS and other major cloud providers is a smart move. Spot Instances are up to 90% off the On-Demand price. But there’s a catch – the vendor might reclaim these resources at any time. Your application needs to be prepared for that.

Managing Spot Instances manually may look like this:

Check whether your workload is ready for a Spot Instance. Can it handle interruptions? How long does the job take to finish? Is the workload mission-critical? Answering these and other questions helps to qualify a workload for Spot Instances.
Next, take a look at the cloud provider’s offer. It’s smart to consider less popular instances because they usually have a lower chance of interruptions and can run stably for a longer time.
Before deciding on an instance, check its frequency of interruption.
Now it’s time to bid. Set the maximum price you’re willing to pay for that Spot Instance – it will run only as long as the marketplace price matches your bid (or is lower). The rule of thumb here is setting the maximum price at the level of On-Demand pricing, so you need to check that as well. If you set a custom amount and the price of that Spot Instance rises above it, your application will get interrupted.
Manage Spot Instances in groups to request multiple instance types at the same time and boost your chances of snatching a Spot Instance.

But to make it all work, prepare for a lot of manual configuration, setup, and maintenance tasks.
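The manual steps above can be sketched roughly as follows: qualify the workload, drop instance types that get interrupted too often, and request several types at once with the maximum price set to the On-Demand price. All names, prices, interruption rates, and the 10% cutoff are hypothetical, and the code only builds a plan; it doesn't call any cloud provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    tolerates_interruption: bool
    mission_critical: bool


@dataclass(frozen=True)
class SpotCandidate:
    instance_type: str
    on_demand_usd_per_hour: float
    interruption_rate: float  # hypothetical fraction of instances reclaimed


def is_spot_ready(workload):
    """A workload qualifies only if it survives interruptions and isn't critical."""
    return workload.tolerates_interruption and not workload.mission_critical


def plan_spot_request(workload, candidates, max_interruption_rate=0.10):
    """Return (instance_type, max_price) pairs, most stable type first."""
    if not is_spot_ready(workload):
        return []
    stable = [c for c in candidates if c.interruption_rate <= max_interruption_rate]
    stable.sort(key=lambda c: c.interruption_rate)
    # Rule of thumb from the text: cap the price at the On-Demand rate.
    return [(c.instance_type, c.on_demand_usd_per_hour) for c in stable]


candidates = [
    SpotCandidate("type-a", on_demand_usd_per_hour=0.40, interruption_rate=0.05),
    SpotCandidate("type-b", on_demand_usd_per_hour=0.35, interruption_rate=0.20),
    SpotCandidate("type-c", on_demand_usd_per_hour=0.38, interruption_rate=0.08),
]

batch_job = Workload("nightly-report", tolerates_interruption=True, mission_critical=False)
plan = plan_spot_request(batch_job, candidates)
print(plan)
```

Even this simplified version needs fresh interruption and pricing data every time it runs, and it says nothing about what happens to the workload when an instance is actually reclaimed.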

Automated cost optimization – case study

La Fourche is an e-commerce startup that chose to develop its infrastructure on Kubernetes and Amazon EKS to have the scalability in place for sudden surges in traffic.

In the beginning, the company’s cloud bill hovered around $1k but soon started growing. By March 2021, it was more than $10k – with $5,708.47 of compute costs.

“When I joined the company, the AWS bill was less than $1k. But seeing it gradually rise made me realize that we need to get interested in cost optimization. We were overprovisioning our Kubernetes nodes and fixing that would be difficult. If we managed to save up on the cloud, I could hire an additional developer and grow our team,” said Martin Le Guillou, CTO at La Fourche.

Le Guillou’s search led him to the CAST AI Savings report – he created an account and ran the read-only CAST AI agent in the company’s EKS infrastructure. The Savings report showed that moving to different virtual machines would result in lower costs and higher performance.

La Fourche was using 15 t3.2xlarge and 2 t3.xlarge instances. At the time of running the analysis, they generated a cost of $4,349.95. CAST AI suggested moving workloads to 5 c5a.2xlarge instances to cut the bill by 69.9%, to $1,310.40.
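A quick check of the arithmetic, reading the $1,310.40 figure as the projected cost after the move (that reading is an assumption, but it reproduces the 69.9% quoted above):

```python
current_cost_usd = 4349.95   # 15 t3.2xlarge + 2 t3.xlarge, as reported
proposed_cost_usd = 1310.40  # 5 c5a.2xlarge, as suggested

savings_usd = current_cost_usd - proposed_cost_usd
savings_pct = savings_usd / current_cost_usd * 100

print(f"Savings: ${savings_usd:,.2f} ({savings_pct:.1f}%)")
```

In other words, the suggested setup costs less than a third of the original one.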

This shows automated cost optimization in action. To generate this recommendation, CAST AI carried out extensive testing of over 100 instance types across cloud service providers to identify the best performance vs. price combinations.

Just imagine how much time this task would take for a human to complete. Not to mention the fact that CAST AI implements its recommendations automatically, so there’s no added optimization workload.

Conclusion: Automation is becoming the new normal

Manually scaling, deploying, and configuring cloud services may lead to mistakes that compromise your availability or performance.

By reducing the amount of human work required to manage cloud-based activities, you’re bound to accelerate your processes thanks to less time spent on diagnosing and debugging.

Automated optimization brings results immediately and guarantees a certain level of savings. At the same time, it’s not a black box, and engineers have full control over its workings. There’s no reason why they should waste their precious time on low-level tasks when they could be developing new products and driving innovation.
