Not Every Metric Needs a Goal—The Truth About Setting Smart SLOs

The Internet is a complicated distributed system consisting of millions of services that continuously talk to each other and process billions of requests every second. Most of these requests succeed, but not all of them do, and the Internet still works pretty well. When systems depend on other systems, it’s important that they adhere to agreements about how well each one will work, so that the show can go on.

Since it’s impossible for a system to work correctly 100% of the time, our goal is to determine how much of the time the system should work correctly and how to monitor that the system in production does what it is supposed to do. A few formal terms will help in understanding how to set goals for how well your systems should work.

SLIs: Service Level Indicators

SLIs, or service level indicators, are the measurements you observe about your service to judge its health. They are emitted either from within the service or from systems interacting with it (clients of the service).

Common traditional service level indicators include availability (what percentage of requests succeed) and latency (how long it takes for most requests, typically 95% or 99% of them, to complete). Other indicators can also help determine the health of a service, such as CPU usage percentage, memory usage percentage and network load.
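To make these two indicators concrete, here is a minimal sketch of computing availability and a latency percentile from a batch of request records. The `Request` record, the sample data and the nearest-rank percentile method are assumptions chosen for illustration; a real monitoring system would compute these from its own metrics pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request succeeded
    latency_ms: float  # time taken to serve the request


def availability(requests):
    """Percentage of requests that succeeded."""
    return 100.0 * sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile: the smallest latency that covers pct% of requests."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast successes, 1 slow success, 1 failure.
sample = [Request(True, 20.0)] * 98 + [Request(True, 900.0), Request(False, 50.0)]

print(availability(sample))            # 99.0
print(latency_percentile(sample, 95))  # 20.0
print(latency_percentile(sample, 99))  # 50.0
```

Note how the single 900 ms request does not show up in the 99th percentile here; which percentile you track decides which outliers your indicator can see.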

In the world of AI infrastructure, where your service is responsible for dealing with data and models, indicators like feature freshness and model freshness or correctness also become very important. According to the research outlined in this paper, “the quality of data which a recommender system is based on may have important impact on recommendation quality.” So in the AI world it’s important to think about your SLIs holistically.

SLIs are meant for the team that owns the service and can help it set service level objectives (SLOs), which we will discuss next. However, it is important to note that not all SLIs need to be translated into SLOs. A rough way of thinking about this is that SLIs are like metrics and SLOs are like goals. You don’t need a goal for every metric; some metrics are more critical than others, and those are the ones that translate into goals.

SLOs: Service Level Objectives

SLOs are the critical goals that you set for your service. Not all SLIs turn into SLOs. It is generally important to align SLOs with the business goals of the service, and each SLO should translate to some business metric. You should not be pedantic about maintaining an unnecessarily high bar for your service. A common example of an SLO is high availability: 99% (referred to as two 9s), 99.9% (three 9s) or 99.99% (four 9s).

When setting an availability SLO for your service, you should clearly understand what you gain by raising it from 99% to 99.9%. Setting an SLO is a tradeoff between the operational cost of the service and business loss. It is not practical to debug every failed request. In the case of AI models, other SLOs like model freshness and data freshness may be just as significant as availability.
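One way to see what each extra 9 costs is to convert the availability target into the downtime it allows. The sketch below assumes a 30-day window and treats availability as time-based uptime; the function name is hypothetical.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of downtime an availability target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} minutes of downtime per 30 days")
```

Going from 99% to 99.9% shrinks the allowance from about 7.2 hours to about 43 minutes per month, which is the operational cost you are weighing against the business gain.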

The SLOs that you need for your service depend on your specific business needs, and there is no golden set of SLOs that you must have. Traditionally common SLOs cover service availability and latency, but if your service is not latency sensitive, it is okay not to have a latency SLO. Conversely, a less traditional SLO may matter more. For example, if you are running an e-commerce website and your product recommendations are based on a catalog that stops refreshing, or on user signals that are not being updated, the quality of your recommendations will start declining. Your loss from a 0.1% drop in availability will probably be much lower than your loss from a one-day delay in data freshness.
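A data freshness SLO can be checked in much the same way as availability. The following is a minimal sketch, assuming the catalog records a last-refresh timestamp; the six-hour threshold and the timestamps are hypothetical values, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def freshness_age(last_refresh, now):
    """The freshness SLI: how old the data currently is."""
    return now - last_refresh


def meets_freshness_slo(last_refresh, now, max_age):
    """The freshness SLO: the data must be no older than max_age."""
    return freshness_age(last_refresh, now) <= max_age


# Hypothetical values: the catalog was last refreshed a day ago,
# and the target is a refresh at least every six hours.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_refresh = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(hours=6)

print(freshness_age(last_refresh, now))                 # 1 day, 0:00:00
print(meets_freshness_slo(last_refresh, now, max_age))  # False
```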

SLAs: Service Level Agreements

The only difference between SLOs and SLAs is that SLAs are external commitments that you make for your service based on your SLOs. SLAs can be a bit looser than SLOs because there is often a monetary penalty if you miss an SLA. For example, it is reasonable to have an SLO of 99.9% availability over a 15-minute window and an SLA of 99.9% availability over a 60-minute window. It is also okay to offer different tiers of service, and some clients may be willing to pay extra for a better SLA. The cost of maintaining a better SLA often goes towards extra redundancy in the system and stronger operational teams.
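The effect of the window length can be illustrated with a small sketch. It assumes you have per-minute success and total request counts; the traffic numbers and the five-minute blip are hypothetical. The same 99.9% target is evaluated over sliding 15-minute and 60-minute windows.

```python
def worst_window_availability(per_minute, window):
    """per_minute: list of (successes, total) pairs, one per minute.
    Returns the lowest availability (%) across all sliding windows."""
    worst = 100.0
    for start in range(len(per_minute) - window + 1):
        chunk = per_minute[start:start + window]
        ok = sum(s for s, _ in chunk)
        total = sum(t for _, t in chunk)
        worst = min(worst, 100.0 * ok / total)
    return worst


# Hypothetical hour of traffic: 1000 requests per minute,
# with a five-minute blip in which 1% of requests fail.
hour = [(1000, 1000)] * 60
for minute in range(20, 25):
    hour[minute] = (990, 1000)

slo_met = worst_window_availability(hour, 15) >= 99.9  # internal SLO
sla_met = worst_window_availability(hour, 60) >= 99.9  # external SLA
print(slo_met, sla_met)  # False True
```

The short blip breaches the tighter internal SLO, so the team is alerted, while the external SLA still holds. That gap is the buffer a looser SLA gives you.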

These external commitments can be to your partner team or external customers. For example, you can guarantee to your clients that 99.9% of requests over a period of one hour will receive a response and you will actively manage any incidents that happen if your service doesn’t meet this agreement.

All the major cloud providers, including AWS, Azure and GCP, publish SLAs for their services. An example from AWS is 99.9% monthly uptime for the S3 standard tier. This means that over a period of 720 hours (30 days), only about 43 minutes of downtime is accepted. If monthly uptime drops below 95%, which means S3 is down for more than 36 hours over the month, 100% of the amount is refunded in credits for that month.
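The arithmetic behind those two thresholds can be sketched as follows. This models only the 99.9% commitment and the 95% full-credit threshold mentioned above; any intermediate credit tiers in the published SLA are deliberately not modeled, and the function names are hypothetical.

```python
HOURS_IN_MONTH = 720  # 30-day month, as in the example above


def monthly_uptime_percent(downtime_hours):
    """Uptime percentage for a month with the given hours of downtime."""
    return 100.0 * (HOURS_IN_MONTH - downtime_hours) / HOURS_IN_MONTH


def sla_status(downtime_hours):
    uptime = monthly_uptime_percent(downtime_hours)
    if uptime >= 99.9:
        return "within SLA"
    if uptime >= 95.0:
        return "SLA missed"  # partial credit tiers are not modeled here
    return "below 95%: full service credit"


print(sla_status(0.5))  # within SLA
print(sla_status(1))    # SLA missed
print(sla_status(40))   # below 95%: full service credit
```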

Understanding SLI, SLO and SLA with a pizza example

Imagine you're ordering pizza:

SLI (Service Level Indicator): This is like measuring how long it takes for the pizza to be delivered. For example, "Average pizza delivery time is 30 minutes." It's a specific measurement of the service.
SLO (Service Level Objective): This is your target for the delivery time. For example, "We aim for 90% of pizzas to be delivered within 45 minutes." It's the goal you want to achieve for your service performance.
SLA (Service Level Agreement): This is the promise the pizza place makes to you. For example, "If your pizza isn't delivered within 60 minutes, you get a free pizza next time." It's the agreement with consequences if the objective isn't met.
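
The same three ideas can be written as a short sketch over a list of delivery times. The delivery times are made-up sample data; the 45-minute and 60-minute thresholds come from the example above.

```python
# Hypothetical delivery times, in minutes, for ten orders.
delivery_minutes = [25, 30, 28, 50, 35, 22, 40, 65, 31, 27]

# SLI: what we measure.
average_minutes = sum(delivery_minutes) / len(delivery_minutes)
percent_within_45 = 100.0 * sum(m <= 45 for m in delivery_minutes) / len(delivery_minutes)

# SLO: the internal target, 90% of pizzas delivered within 45 minutes.
slo_met = percent_within_45 >= 90.0

# SLA: the promise with consequences, a free pizza for any order over 60 minutes.
free_pizzas_owed = sum(m > 60 for m in delivery_minutes)

print(average_minutes)    # 35.3
print(percent_within_45)  # 80.0
print(slo_met)            # False
print(free_pizzas_owed)   # 1
```

Here the average looks healthy at 35.3 minutes, yet the SLO is missed and one customer is owed a free pizza, which is why the objective is stated as a percentage within a threshold rather than as an average.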

Recap and tips on how to set the right SLOs for your system

Set SLOs based on your business requirements
1. You should set SLOs that help you meet your business goals.
2. SLOs that are set out of pedantry about high availability or low latency cause a lot of operational workload and must be avoided.

Define realistic metrics based on your current and past state of the system
1. Setting an SLO which your service is not currently meeting is a terrible idea. An SLO that constantly triggers alarms for your service will lead to a lot of churn.
2. Setting an availability SLO of 99.9% over 1 hour when your average availability in the last quarter was 99.91% is a bad idea, because any minor issue will trigger an SLO alarm.
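
A quick way to check the headroom of a proposed SLO is to compare the errors it allows with the errors you already see. This sketch uses the numbers from the example above; the function name and the alternative 99.5% target are hypothetical.

```python
def error_budget_used(slo_percent, observed_percent):
    """Fraction of the allowed errors already consumed at the observed availability."""
    budget = 100.0 - slo_percent
    spent = 100.0 - observed_percent
    return spent / budget


# Observed availability last quarter: 99.91%.
print(f"{error_budget_used(99.9, 99.91):.0%}")  # 90%
print(f"{error_budget_used(99.5, 99.91):.0%}")  # 18%
```

With a 99.9% target the service is already spending about 90% of its budget on a normal day, so there is almost no room for a minor incident.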

Look at your external dependencies
1. You can run fault injection experiments to understand the impact of each dependency on your service. Running fault injection continuously keeps you always prepared, an approach popularized by Netflix’s Chaos Monkey.
2. If you run an AI service for recommendations, you have a lot of external dependencies like data freshness, model freshness and model availability. You must understand what each of those contributes to your business metrics and make sure you have set SLAs on them. In the AI world, it is important to look past traditional SLOs like availability and latency and also consider factors like freshness and their business impact.

Revisit your SLOs periodically
1. It’s very important to understand that SLOs are not set in stone and you must revisit them in your quarterly or annual reviews.
2. Each SLO should be reconsidered periodically to see if it still makes sense for your business goals.
