There are a handful of providers that large parts of the internet rely on, such as Google, AWS, Fastly, and Cloudflare. While these providers can boast five or even six nines of availability, they're not perfect, and - like everyone - they occasionally experience downtime.
For customers to get value from your product or service, it has to be available. That means that all the systems required to deliver the service are working: your own infrastructure, the third-party providers you build on top of, and the customer's own connection and device.
There's a great paper called The Calculus of Service Availability. It tries to apply maths to how people interpret availability. It points out that, in order for a system to provide a certain availability, any third parties it depends on need to have an order of magnitude higher availability than the system itself (e.g. for a system to provide 99.99%, its dependencies need to offer ~99.999%).
In practice, this means that there are some services that need significantly higher availability than others.
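A quick back-of-the-envelope calculation makes the point. This is a minimal sketch (the figures are illustrative, not taken from the paper) showing how the availability of a chain of hard dependencies compounds:

```python
# Best-case availability of a system with several critical dependencies in
# series is the product of their individual availabilities.
# All figures below are illustrative.

def combined_availability(*availabilities: float) -> float:
    """Best-case availability of a chain of hard dependencies."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

own_service = 0.9999   # four nines on your own

# A dependency that's "only" as good as you drags you below four nines:
print(combined_availability(own_service, 0.9999))    # ~0.99980

# A dependency an order of magnitude better barely dents your target:
print(combined_availability(own_service, 0.99999))   # ~0.99989
```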
As a consumer-grade service provider (e.g. an e-commerce site), 99.99% availability is likely to be sufficient. Above this, the consumers' own dependencies (over which you have no control), such as their internet connection or device, are collectively less reliable. This means that investment to significantly improve availability beyond this point isn't particularly valuable.
By contrast, if you're a cloud provider, your customers are relying on you having a significantly higher availability guarantee so that they can serve their customers while building on top of your platform.
In general, most consumer systems can afford a small amount of unexpected downtime without world-ending consequences: in fact, most customers won't notice, as their connection and devices are likely less reliable. Given that achieving more reliability is extremely expensive, it's important you know when to stop, as the time you save can be invested in delivering product features that your customers will genuinely value.
Multi-cloud is a great example. Multi-cloud is shorthand for building a platform that runs on multiple cloud providers (e.g. AWS, GCP, and Azure). This is the only way to be resilient to a full cloud provider outage - you need a whole second cloud provider that you can lean on instead.
This is an incredibly expensive thing to do. It increases the complexity of your system, meaning that engineers have to understand multiple platforms whenever they're thinking about infrastructure. You become limited to just the feature set that is shared by both cloud providers, meaning that you end up missing out on the full benefits of both.
You've also introduced a new component - whatever handles the routing/load balancing between the two cloud providers. To improve your availability using multi-cloud, this new component has to have significantly higher availability than the underlying cloud providers: otherwise, you're simply replacing one problem with another.
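To see why, here's a rough, illustrative calculation (the numbers are assumptions, not measurements). With two independent clouds you only go fully down when both are down at once, but every request still has to pass through the routing layer, so its availability caps yours:

```python
# Illustrative multi-cloud arithmetic. Assumes provider outages are
# independent, which is generous: real outages sometimes correlate.

cloud_a = 0.9995
cloud_b = 0.9995
router = 0.999   # the DNS / load-balancing layer you now own

# Probability that at least one cloud is up:
either_cloud_up = 1 - (1 - cloud_a) * (1 - cloud_b)   # ~0.99999975

# But every request goes through the routing layer first:
overall = router * either_cloud_up                     # ~0.9990 -> worse than a single cloud

print(either_cloud_up, overall)
```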
Unless you have very specific needs, you'll do better purchasing high-availability products from the best-in-class providers than building your own.
If you're interested in reading more, there's a great write-up from Corey Quinn on the trade-offs on multi-cloud.
Being on the receiving end of a big provider outage is stressful: you can be completely down with very limited recovery options apart from 'wait until the provider fixes it'.
In addition, it's likely that some of your tools are also down as they share dependencies on the third party. When Cloudflare goes down, it takes a large percentage of the internet with it. AWS is the same. That can increase panic and further complicate your response.
So how should we think about these kinds of incidents, and how do we manage them well?
Your site is down. Instead of desperately trying to fix things to bring your site back up, you are ... waiting. What should you be doing?
As we discussed above, availability is something that cloud providers are very good at. The easiest thing you can usually do to improve availability is to use the products that cloud providers build for exactly this reason.
Most cloud providers offer multi-zone or multi-region features which you can opt into (for a price) to vastly decrease the likelihood of these outages taking you down.
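As a concrete example, here's roughly what opting in looks like on AWS with boto3 (the database name, size, and credentials are placeholders); most providers expose an equivalent switch in their console or Terraform provider:

```python
# A sketch of opting into multi-AZ on AWS RDS via boto3. MultiAZ=True keeps a
# standby replica in another availability zone and fails over automatically
# if the primary zone has problems.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",      # placeholder name
    Engine="postgres",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,
    MasterUsername="app",
    MasterUserPassword="change-me",        # use a secrets manager in practice
    MultiAZ=True,                          # the opt-in that buys zone-level resilience
)
```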
As with all incidents, it's important to understand the impact of the outage on your customers. Take the time to figure out what is and isn't working - perhaps it's not a full outage, but a service degradation. Or there are some parts of your product that aren't impacted.
If you can, find a way to tell your customers what's going on. Ideally via your usual channels, but if those are down then find another way: social media or even old-fashioned emails.
Translate the impact into something your customers can easily understand. What can they do? What can't they do? Where can they follow along to find out more (maybe the third party's status page)?
Can you change anything about your infrastructure to bypass the broken component? Provide a temporary gateway for someone to access a particular critical service? Ask someone to email you a CSV file that you can manually process?
This is your chance to think outside the box: it's likely to be for a short time period so you can do things that won't scale.
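For instance, a throwaway script like the one below (entirely hypothetical, written for a single incident) is often enough to keep a critical workflow limping along while the real integration is down:

```python
# Hypothetical stopgap: a customer emails us a CSV of orders because the
# usual integration sits behind the broken provider. Quick, manual, and
# deliberately not built to scale beyond this incident.
import csv

def create_order(customer_id: str, sku: str, quantity: int) -> None:
    # Stand-in for the real order-creation call in your codebase.
    print(f"created order: {customer_id} {sku} x{quantity}")

def process_emailed_orders(path: str) -> None:
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            create_order(
                customer_id=row["customer_id"],
                sku=row["sku"],
                quantity=int(row["quantity"]),
            )

# Example usage with a placeholder filename:
process_emailed_orders("orders-from-acme-2024-06-12.csv")
```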
What's going to happen when the third-party outage ends? Is it business as usual? Have you got a backlog of async work that you need to get through, which might need to be rate-limited? Are you going to have data inconsistencies that need to be reconciled?
Ideally, you'd have some tried and tested methods for disaster recovery that the team is already familiar with and that are frequently rehearsed.
In the absence of that, try to forecast as much as you can, and take steps to mitigate the impact. Maybe scale up queues ready for the thundering herd, or apply some more aggressive rate limiting. Keep communicating, giving your customers all the information they need to make good decisions.
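As a sketch of what that more aggressive rate limiting might look like while you work through the backlog (the numbers and job handler are made up):

```python
# Hypothetical backlog drain: work through queued jobs at a capped rate so the
# recovering third party (and your own systems) aren't hit by a thundering
# herd the moment service is restored.
import time

MAX_JOBS_PER_SECOND = 5   # deliberately conservative during recovery

def process(job):
    # Stand-in for the real job handler.
    print("processed", job)

def drain(backlog):
    interval = 1.0 / MAX_JOBS_PER_SECOND
    for job in backlog:
        process(job)
        time.sleep(interval)  # crude but effective pacing

drain(["job-1", "job-2", "job-3"])
```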
After the incident is over, what can we learn?
Writing a debrief document after a third-party outage doesn't feel good:
What happened? Cloudflare went down
What was the impact? No-one could visit our website
What did we learn? It's bad when Cloudflare goes down 🤷♀️
Incidents that you can control often feel better than third party incidents where you can't control the outcome. After the incident, you can write a post-mortem, learn from it, and get a warm fuzzy feeling that you've improved your product along the way.
However, in the cold light of day, the numbers are unlikely to support this theory. Unless you have the best SRE team in the world, you aren't going to ship infrastructure products with better availability than a cloud provider.
Instead, we should again focus on the things that are within our control.
Learn more about post-mortems and incident debriefs in our Incident Management Guide.
It's pretty stressful to be trying to figure out what is impacted by a third-party outage in the middle of an incident. To avoid that, you need to understand the various dependency chains in advance.
This is tricky to do as a pen-and-paper exercise: often the most reliable way is to spin up a second environment (that customers aren't using) and start turning bits of the system off.
Once you've got an understanding of the dependencies, when an incident does happen, you'll be able to focus your attention on the relevant parts of your system.
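Even a crude, hand-maintained map pays off here. A minimal sketch (the feature and provider names are made up) that answers "what's affected?" the moment a provider goes down:

```python
# Hypothetical dependency map: which user-facing features have a hard
# dependency on which external provider. Kept deliberately simple so it can
# be checked in, reviewed, and reached for mid-incident.
DEPENDENCIES = {
    "checkout": {"aws", "stripe"},
    "marketing-site": {"cloudflare"},
    "dashboard": {"aws", "cloudflare"},
    "email-notifications": {"aws", "sendgrid"},
}

def affected_by(provider: str) -> list[str]:
    return sorted(f for f, deps in DEPENDENCIES.items() if provider in deps)

print(affected_by("cloudflare"))   # ['dashboard', 'marketing-site']
```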
As part of this, you can also run game days to help train responders in disaster recovery. These are the exercises that can produce the disaster recovery plans (and familiarity) which can be so valuable when bringing your systems back online.
Sometimes, often for historic reasons, you'll end up relying on multiple third parties where one would really do the job. Every critical dependency you add eats into your availability, so if you can consolidate on fewer, appropriately reliable dependencies, it will significantly improve your overall availability.
We can also consider blast radius here: are there ways to make some of your products work while a certain provider is down? This doesn't necessarily mean using another provider, but perhaps you could boot service [x] even if service [y] is unavailable.
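One way that can look in practice (a sketch, with made-up service names) is treating some dependencies as optional at boot, so service [x] starts in a degraded mode rather than refusing to start because service [y] is down:

```python
# Hypothetical degraded-mode boot: the recommendations service is treated as
# optional, so the storefront still starts and serves orders when it's
# unavailable, just without personalised recommendations.
def connect_to_recommendations():
    raise ConnectionError("provider outage")   # simulate the dependency being down

def boot_storefront():
    features = {"orders": True, "recommendations": False}
    try:
        connect_to_recommendations()           # stand-in for the real client setup
        features["recommendations"] = True
    except ConnectionError:
        print("recommendations unavailable; starting in degraded mode")
    return features

print(boot_storefront())   # {'orders': True, 'recommendations': False}
```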
Reducing the number of components is likely to reduce your exposure to these kinds of outages.
Your availability is always, at best, the combined availability of all your critical providers. Be honest with yourselves and your customers about what a realistic availability target is within those constraints, and make sure your contracts reflect that.