Guest post originally published on Elastisys’s blog by Lars Larsson, Senior Cloud Architect and Branch Manager at Elastisys

All companies that use cloud services do so for a reason. But those reasons may change. Whether motivated by the need for a multi-cloud strategy, expenditure minimization, legislative or regulatory demands, or simply to get closer to end users, many organizations find themselves migrating from one cloud to another. Cloud-to-cloud migration for a non-trivial application contains a lot of unknown unknowns. This causes stress and uncertainty for a CTO. To help shed some light based on years of experience in the field, we have asked senior cloud architect Lars Larsson at Elastisys to list some of these issues.

Executive summary

1. Pricing models vary wildly

Most companies have been hit by that one cloud bill that surprised them. Some services or features wound up costing far more than originally anticipated. 

Are you considering a cloud-to-cloud migration to reduce operational costs? If so, take a long, hard look at the various costs in each of the clouds. Each service has a different cost model, which makes accurate comparisons difficult. You really need to get to the bottom of it.

Compute and storage costs are often easy enough to compare, because those are the two most obvious ones. But what about other services? How much are you spending on, e.g., log handling or monitoring? Good services such as AWS CloudWatch come at a cost, especially if you also use them for log handling (AWS CloudWatch Logs). Are you heavily using a managed database service? A queueing or pub/sub service?

Do your homework this time, and avoid reliving the moment that surprise cloud bill hit you in the face. You are wiser from the experience, after all.

Network transfers in and out of the cloud can also vary in cost by quite a large amount. The three major providers will give you incoming network traffic for free, but charge you for outgoing traffic. Smaller, regional cloud providers will often have higher compute and storage costs, but not charge you for network traffic, or include a much larger amount of it in a free tier offering.

Overwhelming? I get it. It sure looks that way at first glance! But it doesn’t have to be. My trick is to look at your past few detailed billing statements and map those costs to your new cloud provider’s options.
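
As a loose illustration of that mapping, here is a minimal Python sketch. The billing CSV layout, the service-to-category mapping, and the quoted prices are all hypothetical placeholders; substitute your own billing export and the quotes you get from the candidate provider.

```python
# Hypothetical sketch: group line items from a detailed billing export (CSV)
# into cost categories, so they can be compared against the new provider's
# quoted prices. Service names, rates, and the CSV columns are made up
# for illustration.
import csv
from collections import defaultdict

CATEGORY_BY_SERVICE = {          # assumption: how you choose to group your own bill
    "AmazonEC2": "compute",
    "AmazonS3": "storage",
    "AmazonCloudWatch": "observability",
    "AWSDataTransfer": "egress",
}

def summarize(billing_csv_path):
    """Sum the 'cost' column per category, based on the 'service' column."""
    totals = defaultdict(float)
    with open(billing_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            category = CATEGORY_BY_SERVICE.get(row["service"], "other")
            totals[category] += float(row["cost"])
    return dict(totals)

if __name__ == "__main__":
    current = summarize("last_month_billing.csv")
    # Quoted monthly prices per category from the candidate provider (made up).
    quoted = {"compute": 4200.0, "storage": 800.0, "observability": 350.0, "egress": 0.0}
    for category, cost in sorted(current.items()):
        print(f"{category:>14}: now ${cost:,.2f} vs quoted ${quoted.get(category, 0.0):,.2f}")
```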

2. Cloud vendor-specific integrations

Tell me, did you adopt Kubernetes to make yourself less dependent on cloud providers? To reduce vendor lock-in? Are you now surprised to find that you are still locked in, just on a different level?

What I’ve seen is that most organizations will make sure they have highly portable application definitions. By relying on Kubernetes, the application definitions work across cloud providers. 

But what I’ve also seen is that if you are using a managed Kubernetes service, your user and permission handling is perhaps tied not to Kubernetes role-based access control (RBAC) features, but to cloud-specific offerings such as AWS Identity and Access Management (IAM). A great service, but one that ties you to the AWS platform.

Fully managed Kubernetes services offered by the cloud vendors themselves provide a highly integrated experience. The cost of that integration is that migration to another cloud provider becomes just that much more difficult.

As a community, we’ve tried to fix this. But as a community of engineers, those fixes are technical. Kubernetes dictates standards for certain components and aspects: networking has to work according to the Container Network Interface (CNI) standard, storage according to the Container Storage Interface (CSI), and so on. Great. But the business people have come up with far more clever ways of locking you to the platform. This means that other, less obvious aspects are more difficult to freely migrate from one cloud to another.

So what is the alternative? Do you have to manage Kubernetes yourself? No. Of course not. But it may make sense to investigate managed Kubernetes offerings that are not tied to a particular cloud provider, to reduce the risk of vendor lock-in. Without having to take on the task of day-to-day operations on your own, of course.

3. The devil is in the (technical) details

Almost all organizations I’ve talked to say the same thing. They went to the cloud because they wanted to get infrastructure or platform functionality on an as-a-Service basis. That is the whole point of the cloud, is it not?

So, of course, all cloud providers will offer certain services. It used to be just infrastructure as a service (virtual machines, network, and storage), but we are starting to also take, e.g., object storage and queuing services for granted. Many such services will claim to offer an “S3-compatible API” or similar. And that is a great starting point! But beware: what does such a compatibility claim really mean?

AWS S3, for example, was only eventually consistent for overwrites and deletes until December 2020. Since then, it offers strong read-after-write consistency for all operations. Which level of consistency would a service that is “S3 compatible” have? The old one? Or the new one? And do you have aspects of your applications that depend on that answer being one or the other? Would you know off-hand?
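
To make that concrete, here is a minimal sketch of a pattern that quietly depends on strong read-after-write consistency. It uses boto3 against S3; the bucket and key names are made up, and an “S3-compatible” store may or may not behave the same way.

```python
# A minimal sketch of code that silently assumes strong read-after-write
# consistency. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")  # or point at a compatible store via endpoint_url=...

# 1. Write a new version of a configuration object.
s3.put_object(Bucket="my-app-config", Key="settings.json", Body=b'{"flag": true}')

# 2. Immediately read it back somewhere else in the system.
response = s3.get_object(Bucket="my-app-config", Key="settings.json")
print(response["Body"].read())
# Strong consistency: always the new value.
# Eventual consistency: possibly the old value, for a while.
```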

If you don’t, by the way, you’re in excellent company. Most people’s eyes glaze over when consistency guarantees are discussed. But then again, you don’t want to get bitten by a bug caused by wrongful assumptions, so somebody has to stay awake to figure this stuff out. Hopefully not past midnight!

Ready for another unexciting example? Queuing services, such as AWS SQS, offer certain delivery guarantees. SQS standard queues offer “at least once” delivery. That means a message can be delivered to your application more than once, and your application must have logic in place for dealing with duplicates. An application that is not prepared for this will start showing strange behavior, especially under heavy load, because that is when the risk of multiple deliveries is higher: high load means there is less time for the queuing service to do its housekeeping. (Note that SQS also offers FIFO queues with “exactly once” processing guarantees, but even that does not imply exactly-once delivery.) So confusing for an application that was coded with assumptions of RabbitMQ’s “at most once” delivery guarantees!
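
What such duplicate handling can look like in practice: below is a minimal sketch of an idempotent SQS consumer in Python with boto3. The queue URL is hypothetical, and the in-memory set of processed message IDs is a simplification; a production system would typically persist that state in a database or cache.

```python
# A minimal sketch of an idempotent consumer for an at-least-once queue.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-north-1.amazonaws.com/123456789012/orders"  # hypothetical

processed_ids = set()  # remember which messages we have already handled

def handle(body):
    print("processing:", body)

while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        if message["MessageId"] not in processed_ids:  # skip duplicates
            handle(message["Body"])
            processed_ids.add(message["MessageId"])
        # Delete in all cases; otherwise the message becomes visible again
        # and will be redelivered.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```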

The point I am making here is that you must make an inventory of all the cloud services you use, and of which of their features are key to your applications working the way they are intended to. Because your use case is the one that matters.

Cloud providers offer services that make certain trade-offs to ensure the scalability and availability of their service. From the perspective of individual customers and applications, the ideal trade-off might have been different. If you use software such as RabbitMQ, you can configure it perfectly for your use case and requirements, not those of the cloud provider. There are companies that offer managed services in a cloud-agnostic way, especially on top of Kubernetes, and they deserve your consideration.

4. Tools supporting your processes (may) change

Industry wisdom, as a rule of thumb, says that about 70% of software costs are in maintenance. Not development. Just keeping the thing running as intended. How do you address that? My take on it is to rely on smart tools and automation as much as possible. The less your staff has to work on rote, menial tasks, the better. Everything they do is a process. So let’s talk about those processes.

Operating your mission-critical cloud application? A bunch of processes. And most, if not all, of these are supported by tools: continuous integration and deployment, monitoring, notifications and alerting. These are the tools your operations staff uses to analyze and optimize your application deployment.

Great tools like AWS CloudWatch offer monitoring, logging (CloudWatch Logs), and insight into containerized workloads (CloudWatch Container Insights). Your team probably depends on them. But they are specific to a particular cloud vendor.

If you are like many of the organizations I’ve talked to over the years, you may have taken a deep dive into them. And now your processes are dependent on cloud provider-specific services. Sound familiar? If so, you may have a harder time migrating all those processes over to new tooling.

So then, yet another deep inventory of all services you use is required. This time, for the role they play in supporting your operational processes. And once you have that list, you have to decide where to go from there.

My advice is to choose a future-proof solution to the problem. That is, to use tools that work across all cloud vendors. Store application logs not with the cloud vendor, but in a hosted Elasticsearch environment. Monitor your application with Prometheus and Grafana, rather than with what the cloud vendor offers. The cloud native landscape has tons of observability tools that help you approach this problem in a way that will work just as well at a mega cloud, a small regional one, or even in your own data center in the company basement. Since the problem being addressed is the same, why wouldn’t the tooling be as well?
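
As a small illustration of how portable such tooling is, here is a minimal sketch of exposing application metrics with the prometheus_client Python library instead of a vendor-specific agent. The metric names and the port are arbitrary examples; the same code runs unchanged on a hyperscaler, a regional cloud, or on-premises.

```python
# A minimal sketch of exposing application metrics to Prometheus.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request():
    with LATENCY.time():            # record how long the work took
        time.sleep(random.random() / 10)  # stand-in for real work
    REQUESTS.inc()                  # count the request

if __name__ == "__main__":
    start_http_server(8000)         # Prometheus scrapes http://<host>:8000/metrics
    while True:
        handle_request()
```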

5. Beware the “hidden” costs of the migration

If you took seriously my caution about costs varying wildly, you may feel like you’re in shape to deal with the total cost of ownership question. Namely: what will all these inventories cost to establish? What will it cost to investigate exactly which features your applications need?

Like the answer to any sufficiently good question, the answer begins with “it depends”. In this case, mainly on the size of your development team and complexity of your application. Only you can answer those.

But don’t forget that there are technical sources of cost in the migration as well. One-time costs such as data migration: how much data will have to move from one cloud to the other? Are we talking giga-, tera-, or petabytes?

If you need to center your operations around a new set of tools to support your processes, will there be training involved? Custom software development? Creation of new data exporters or dashboards in your brand new Prometheus and Grafana-based observability solution?

Summary

Cloud-to-cloud migration can be a significant undertaking. And it can be a bit scary, too, because it is full of unknown unknowns. My advice is to face that slightly scary situation head on. Establish an inventory. Get a good, clear understanding of which services and features your application needs, and which ones support your organizational processes around operating and maintaining the application.

Armed with such an inventory, you can start making good decisions. Well-informed ones. It always helps to walk into this type of strategic move with your eyes open, rather than closed.

And the most important advice of all? To the extent your new cloud home will require change, consider moving toward tools that are not tied to a particular cloud provider. That way, you have less to worry about if you need to migrate cloud-to-cloud yet again at a later time. You have a wealth of cloud native tooling to support you. With a workflow based on such tools, you can have a modern cloud deployment in any hosting setting. You’ve got this.

Read more by Lars by following him on LinkedIn or by heading to the Elastisys blog, which covers both high-level cloud topics and engineering HOWTOs.

Elastisys is a CNCF Silver member and the creator of the open source, CNCF-certified Kubernetes distribution Compliant Kubernetes.