“Multi-cloud is the worst practice” – is it, really?
Earlier in August, Last Week In AWS published an interesting and insightful article written by Corey Quinn, the Chief Cloud Economist at The Duckbill Group. The Duckbill Group specializes in reducing the AWS bills for their customers, a much-needed service in today’s cloud environment.
In this article, we take a look at the arguments Corey made in his post and dive into the details.
Before we begin, let’s just say that it’s great to be able to have this type of discussion. A year or two ago, we would have wholeheartedly agreed with Corey. But as times change and technology evolves, some problems that previously seemed insurmountable are no longer an impediment to progress.
We agree that true multi-cloud means that the same workload is running across two or more clouds, in an active-active or active-passive (DR) configuration. Corey is spot on here: we’re not talking about various SaaS and IaaS offerings for disparate applications and use cases and then calling that type of usage multi-cloud. If that were our definition, most organizations would already be there today.
We also agree that multi-cloud is really hard to achieve, or at least it was before we built CAST AI. Corey outlines some of the toughest problems in multi-cloud and makes a great case for why the world needs a tool like ours.
At CAST AI, we start with cloud primitives: compute instances, disks, and networks, including networking primitives such as load balancers, VPN services, and Direct/Express Connects (dedicated network interconnects). These are the fundamental components we need to build reliable clusters of computers, especially if we choose Kubernetes for orchestration, scaling, and management of containerized applications.
Corey is right: if you haven’t containerized your application(s), going multi-cloud is going to be very hard. As an organization, you will need to re-architect and modernize your apps before taking advantage of multiple cloud providers. As you move to containers and Kubernetes, the goal becomes much more attainable.
Of course, you should have access to and use the cloud-native services that your organization has come to rely on. If you’re using RDS, BigQuery, Azure Machine Learning, ElastiCache, or any number of services provided by AWS, Google, or Azure, it would be very hard to cut those dependencies. We believe that you should have access to ALL of them in a secure way, over a private IP network. These services are definitely not closed to you. We call this a Multi-Cloud Architecture for Services.
However, this is the beginning of the multi-cloud journey where you can take some of your components and distribute them across clouds. Just as all technologies evolve, it’s going to take time for the market to develop multi-cloud solutions. Vendors like Snowflake and Yugabyte already offer a choice (A or B) of clouds for deployment, and others will follow. If one doesn’t take the first step, it’s impossible to ever complete a journey.
For completeness, all clouds have good Load Balancers these days, so there’s no need to run your own Nginx instances.
At CAST AI, we orchestrate these things for you, and load balancers become part of the Kubernetes ingress story.
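As a rough illustration (and not CAST AI’s actual implementation), here’s what that looks like with the official Kubernetes Python client: you declare an Ingress, and the cluster’s ingress controller provisions and wires up the cloud load balancer for you. The hostname and the `web-svc` Service below are placeholders:

```python
# Minimal sketch using the official `kubernetes` Python client: declare an
# Ingress and let the cluster's ingress controller provision the cloud load
# balancer. The host and the `web-svc` Service are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context

ingress = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1IngressSpec(
        rules=[client.V1IngressRule(
            host="app.example.com",
            http=client.V1HTTPIngressRuleValue(paths=[
                client.V1HTTPIngressPath(
                    path="/",
                    path_type="Prefix",
                    backend=client.V1IngressBackend(
                        service=client.V1IngressServiceBackend(
                            name="web-svc",
                            port=client.V1ServiceBackendPort(number=80),
                        ),
                    ),
                ),
            ]),
        )],
    ),
)

client.NetworkingV1Api().create_namespaced_ingress("default", ingress)
```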
We agree that you shouldn’t be spending your time on these low-level primitives and should be focusing on your code instead.
In his article, Corey says, “I have some bad news for you: You’re already locked in.” We agree: you probably are today! I do a lot of martial arts (Brazilian Jiu-Jitsu), and if someone puts me in a really bad position (I don’t like getting choked out), I try to get out of that bad spot, not just accept the inevitable and tap.
We need to pause here and talk about open-source strategy for a moment. If your application stack uses proprietary services and APIs (DynamoDB, Spanner, and Redshift, to name a few) that have no open-source equivalent, it’s going to be harder to move that component away from a single cloud.
However, if you’re using RDS MySQL or RDS PostgreSQL, things become much easier.
New-generation distributed databases such as YugabyteDB and CockroachDB speak the PostgreSQL wire protocol, for example.
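To make the portability point concrete, here’s a minimal sketch: because all three backends speak the PostgreSQL wire protocol, the same driver code runs against RDS PostgreSQL, CockroachDB, or YugabyteDB, and only the connection string changes (hosts and credentials below are placeholders; the ports are each product’s defaults):

```python
# Minimal sketch: one PostgreSQL driver and one query, three backends.
# Hosts and credentials are placeholders; ports are the defaults.
import psycopg2

DSNS = {
    "rds-postgres": "postgresql://app:secret@mydb.rds.amazonaws.com:5432/app",
    "cockroachdb":  "postgresql://app:secret@crdb.example.com:26257/app",
    "yugabytedb":   "postgresql://app:secret@yb.example.com:5433/app",
}

def order_count(dsn: str) -> int:
    # Identical SQL and driver code regardless of which backend we target.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM orders")
        return cur.fetchone()[0]
```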
Take a look at your open-source strategy: are there parts of your stack that are completely tied to a provider, with no open-source equivalent? If so, it’s not the end of the world. You can still take advantage of multi-cloud for the parts that are portable, and you can still use cloud services as well.
Here Corey makes a compelling argument that your team knows one cloud (like AWS) and prefers to stick to what it knows well. He actually wrote another article that discusses this topic in detail (The Lock-In You Don’t See).
We agree here that adding a second cloud in the traditional sense would require an entirely new team of cloud ninjas who speak and understand the new set of cloud APIs, Terraform, and other nuances. And people who have spent a good chunk of their career understanding AWS aren’t going to be super excited about learning a completely new cloud in a crunch.
However, the buy-in we are seeing in the industry is what is driving the Cloud Native Computing Foundation movement. People want to learn how to deploy their apps on Kubernetes and other cloud-native projects. It’s not about cloud-specific Terraform, CLI commands, or API specs.
Just like very few people need to code in assembler these days, there will come a time when software development engineers no longer need to know about low-level cloud APIs to provision compute, network, storage, and other resources.
As technology evolves, so do levels of abstraction… This is progress.
If you offer your team the opportunity to learn Kubernetes and other CNCF projects, they’re not going to quit. They’re going to thank you and take those courses. And… they will have a higher market value as a result.
Here is Corey’s core argument with respect to sticking to one cloud provider:
“Every cloud provider of substance (and also Google Cloud, zing) negotiate discounting percentages based upon percentage of spend. Cutting your spend in half reduces your negotiating base.”
He goes on to advise:
“Every time we’ve seen this happen with our clients, the discounting achieved from that threat is less than the discount that the customer would get simply by committing to higher spend levels.”
Committing to multi-year deals with reserved instances or savings plans sounds a lot like dealing with the telcos and data centers of the past. We live in the age of just-in-time cloud computing, so why are we still negotiating deals with a used-car salesperson?
Our position at CAST AI is that reserved instances and savings plans are evil. They lock our customers into a particular cloud for years, and in cloud-years that feels like decades. Technology develops so quickly that we can’t predict what will happen in 6 months, never mind over a 3-year term.
Let’s take AWS as an example. With reserved pricing on a standard 3-year, all-upfront term, you can save around 60%, depending on your instance family. AWS claims up to 72%, but we see that most options deliver far lower savings. If you spend $4,000,000 to $10,000,000, you can get an extra 10% discount! That is a lot of commitment at $4 million.
Let’s now consider another approach using Spot or Preemptible instances. What if you had a platform that could automatically buy the spot instances you need and move your applications seamlessly in and out of those instance types? What could you save? Using the AWS Spot Instance Advisor, we see savings of 70% to 85% (US East), and even more in some cases. That’s all without paying a dollar in advance, and with no commitment!
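The back-of-the-envelope arithmetic is easy to check. Here’s a quick sketch using the discount figures above and a hypothetical $10,000/month on-demand baseline:

```python
# Back-of-the-envelope comparison using the discount figures cited above.
# The $10,000/month on-demand baseline is a hypothetical example.
on_demand_monthly = 10_000.00

# 3-year reserved, all upfront: ~60% off, paid for 36 months in advance.
ri_discount = 0.60
ri_upfront = on_demand_monthly * 36 * (1 - ri_discount)   # $144,000 up front

# Spot: ~70-85% off, pay as you go, no commitment.
spot_monthly_low  = on_demand_monthly * (1 - 0.85)  # $1,500/month at 85% off
spot_monthly_high = on_demand_monthly * (1 - 0.70)  # $3,000/month at 70% off

print(f"Reserved: ${ri_upfront:,.0f} committed up front")
print(f"Spot:     ${spot_monthly_low:,.0f}-${spot_monthly_high:,.0f}/month, no commitment")
```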
Our argument assumes that you have a platform that can move containers and workloads between regular and spot instances, and that your application is tolerant of instance interruption. AWS gives you a 2-minute notice; GCP and Azure give you less than a minute.
Kubernetes is a great orchestration tool to move pods/containers between instances (nodes) and our optimization engine catches preemption notifications and automatically reschedules work to nodes that are available. There is a lot more technical detail under the hood here, and not all workloads are going to be super tolerant to preemption.
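To give a feel for what catching a preemption notification involves on AWS, here is a deliberately simplified sketch of a node agent (not our production engine): it polls the documented EC2 spot instance-action metadata endpoint and drains the node when a notice appears. IMDSv1-style access is shown for brevity; IMDSv2 additionally requires a session token.

```python
# Deliberately simplified node agent: poll the documented EC2 spot
# "instance-action" metadata endpoint and drain this Kubernetes node
# when an interruption notice appears.
import subprocess
import time

import requests

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def drain(node_name: str) -> None:
    # Hand pods back to Kubernetes so they are rescheduled elsewhere.
    subprocess.run(
        ["kubectl", "drain", node_name,
         "--ignore-daemonsets", "--delete-emptydir-data"],
        check=True,
    )

def watch(node_name: str) -> None:
    while True:
        # Returns 404 until an interruption is scheduled, then a JSON
        # body with the action ("terminate"/"stop") and its time.
        resp = requests.get(NOTICE_URL, timeout=2)
        if resp.status_code == 200:
            drain(node_name)  # roughly two minutes remain on AWS
            return
        time.sleep(5)
```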
The nice thing is, you can deploy this optimization strategy on a single cloud, without even considering other cloud providers.
You can get even more benefit from preemptible/spot instances when you leverage multiple clouds. What if AWS is experiencing super-high demand in your region and spot instances aren’t available? You have at least three options to move forward, with a toy policy sketch after the list:
- You can buy an on-demand instance for a short time while market conditions normalize; that may increase some instance costs for a few hours.
- You can buy a different instance type that meets your requirements; Kubernetes will automatically reschedule your workloads based on the new cluster topology.
- You can buy a preemptible/spot instance on an adjacent cloud, for a short-term window or for longer.
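None of these options is exotic in isolation; the hard part is evaluating them continuously as markets move. Here is a toy sketch of such a fallback policy, where `Offer` and its prices are illustrative stand-ins rather than a real cloud pricing or capacity API:

```python
# Toy fallback policy for the three options above. `Offer` and its fields
# are illustrative stand-ins, not a real cloud pricing or capacity API.
from dataclasses import dataclass

@dataclass
class Offer:
    provider: str        # e.g. "aws", "gcp", "azure"
    instance_type: str   # e.g. "m5.xlarge"
    spot: bool           # spot/preemptible vs. on-demand
    hourly_price: float  # current market price in USD (illustrative)

def pick_offer(offers: list[Offer]) -> Offer:
    # Prefer spot capacity on any cloud; while the spot market is tight,
    # fall back to the cheapest on-demand instance that fits.
    spot_offers = [o for o in offers if o.spot]
    pool = spot_offers or offers
    return min(pool, key=lambda o: o.hourly_price)

# Example: AWS spot is unavailable, so a GCP preemptible node wins.
offers = [
    Offer("aws", "m5.xlarge", spot=False, hourly_price=0.192),
    Offer("gcp", "n2-standard-4", spot=True, hourly_price=0.047),
    Offer("azure", "D4s_v3", spot=False, hourly_price=0.192),
]
print(pick_offer(offers))  # the GCP preemptible offer
```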
Without an automation platform to do the work, the choices are difficult to consider and reason about. Luckily, CAST AI does the work for you here and allows setting policies to govern optimization and overall cluster behavior.
In short, we would much rather see customers use the open market to leverage just-in-time best pricing and not see them locked into multi-year painful contracts where they need to fork out truckloads of cash to obtain some savings.
As Corey suggests, we don’t want you to have an adversarial relationship with your cloud providers. But we prefer that decisions be made based on the best price, performance, or technology, rather than on locked-in, multi-year, pre-negotiated contracts.
We agree with Corey that cloud costs tend to get lower over time. That’s why locking in pricing for 3 years is potentially very harmful: if you know that things will be cheaper next year, your negotiated discount is going to be less meaningful.
Overall long-term pricing trends aside, you want to protect yourself from market-driven price spikes and take advantage of market-driven price drops.
This is where real-time purchasing of compute across more than one cloud provider can really help.
Having a dynamic, self-adjusting platform that can look for optimizations across one or more cloud providers is going to help with both small and large price changes. If a particular instance type drops in price and becomes much more attractive from a price/performance perspective, your application should start benefiting from that lower-priced instance automatically. Again, this is not easy to achieve without automation.
I personally love this argument. It’s kind of like saying “automobiles don’t exist in reality” in the year 1880. Sure, the first car wasn’t built until 1885, but that doesn’t mean people weren’t already working towards realizing that vision.
Corey says “… there are no articles or conference talks of companies talking about their successful multi-cloud strategies paying off.”
Well, that’s not entirely true. Even prior to his article, published on August 5th, Zoom announced that they were moving part of their platform to Oracle Cloud Infrastructure (OCI), and today they run Zoom on both clouds in parallel. In fact, Corey wrote about this in his April 28th blog post and praised Zoom for making that choice. In that case, the price/performance of compute and egress costs sealed the deal, so Corey is arguing against his own position here.
Caveat: I used to work for Oracle Cloud. Zoom is an amazing example that has been well covered.
We’ve also seen other examples where a prominent US retailer wanted to take advantage of the Oracle Database technology while using Azure for other parts of their stack. In this case, Oracle and Microsoft announced an interconnect between the two clouds to enable this type of multi-cloud collaboration.
In this post, we haven’t touched on some obvious discussion points. Egress costs and latency across clouds are two very important topics that we will cover next, and data gravity is another important, closely related topic.
While multi-cloud and cost optimization are in their early stages, we believe that they can potentially bring tremendous growth and savings opportunities for customers that adopt cloud-native and modern architecture approaches. It’s going to take a while for customers to realize the full benefits of these capabilities, and that’s why we need to get started now.
We launched CAST AI to help developers realize the benefits of multi-cloud and cost-optimized Kubernetes. We hope that you follow our journey.
You can try CAST AI now by following this link and let us know what you think - we look forward to hearing from you.
Co-founder and CTO, CAST AI. Formerly Vice President of Security Products (OCI) at Oracle, Leon has professional experience spanning tech companies such as IBM, Truition, and HostedPCI.
He founded and served as the CTO of Zenedge, an enterprise security company protecting large enterprises with a cloud WAF. Leon has 20+ years of experience in product management, software design, and development, all the way through to production deployment. He is an authority on cloud computing, web application security and Payment Card Industry Data Security Standard (PCI DSS), e-commerce, and web application architecture.