Service Level Objectives (SLOs) are powerful decision-making tools that reach well beyond the team coalface while still providing value there. SLOs as Code in Reliably, the reliability automation platform for developers, provide executable, versionable artifacts that help you capture, frame, and collaborate on essential reliability conversations at any point in a system’s evolution.
I have a confession: I love Service Level Objectives (SLOs). In my experience, SLOs have risen to be one of the most important parts of Site Reliability Engineering (SRE) adoption. Time and again, I’ve seen huge value in having SLOs even if you are not planning to apply all the aspects of SRE.
SLOs tell us what we care about and what good looks like for a system’s users. For this reason, SLOs can be incredible decision-making tools way beyond the team coalface (while providing value at the coalface as well!). Service Level Indicators (SLIs) tell you what can be measured; SLOs tell you what matters (primarily, what matters to the system’s users).
This is why SLOs are the first concept that has been defined in code as part of Reliably, the new reliability toolkit for developers. In this article, I’m going to talk about why “SLOs as Code” is such an important step on our journey towards “Reliability as Code” (#reliabilityascode).
Firstly, SLOs are great conversation starters. Even before one line of code has been written, it’s possible to talk about how facets of the future system should behave to deliver the right reliability experience to the system’s future users.
Many systems die in early implementation because reliability is an afterthought. Bringing the SLO conversation to the forefront early gives everyone an opportunity to collaborate. Even more importantly, SLOs help in understanding what the users will care about and how reliable the system needs to be.
This doesn’t mean that SLOs only enable valuable conversations for new, greenfield systems. SLOs can encourage the same conversations for pretty much any system, whether it is greenfield or a slightly muddy “heritage” system (I prefer “heritage” to “legacy”, as for some reason we sometimes look down on legacy systems).
SLOs can encourage everyone involved to ask, “What do we care about?”, “What’s the right level of reliability we need?”, “What does reliable look like to our users?” or even, “How do we balance cost and reliability?”.
Regardless of when these SLO conversations happen, they can add huge value by bringing reliability to the top table in the architecture and design process.
Reliably’s SLO code artifact captures, frames, and supports these conversations. Using the SLOs artifact, you can develop and evolve your SLOs, even before you have any means of measuring those SLOs for real with Service Level Indicators (SLIs):
services:
- name: website
  service-levels:
  - name: 95th of requests response time under 100ms
    type: latency
    criteria:
      threshold: 100ms
    slo: 95
    sli: []
    window: PT1H
  - name: 99th of requests response time under 500ms
    type: latency
    criteria:
      threshold: 500ms
    slo: 99
    sli: []
    window: PT1H
  - name: 99th of requests responses not 5xx
    type: availability
    slo: 99
    sli: []
    window: PT1H
In the above code snippet, we’ve described three SLOs for a simple website service.
NOTE: You can create your own SLO definitions using the reliably slo init command. More information is available in the Reliably docs.
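To make the first definition concrete, here is a minimal Python sketch of what “95% of requests under 100ms” means when checked against raw request latencies. It is not how Reliably computes the result; the helper name `latency_slo_met`, the choice to treat “under” as strictly below the threshold, and the sample latencies are all assumptions for illustration.

```python
def latency_slo_met(latencies_ms, threshold_ms, objective_pct):
    """Return (observed_pct, met) for a latency SLO.

    A request counts as "good" when its latency is strictly below the
    threshold. Hypothetical helper for illustration; not part of Reliably.
    """
    if not latencies_ms:
        raise ValueError("need at least one request to evaluate an SLO")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    observed_pct = 100.0 * good / len(latencies_ms)
    return observed_pct, observed_pct >= objective_pct


# Made-up sample: 20 requests, 19 of them faster than 100ms.
sample = [40] * 19 + [250]
observed, met = latency_slo_met(sample, threshold_ms=100, objective_pct=95)
print(f"{observed:.1f}% of requests under 100ms, SLO met: {met}")
```

One more slow request in this sample would drop the observed value to 90% and the objective would no longer be met, which is exactly the kind of trade-off the SLO conversation is meant to surface.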
SLOs are frequently defined and captured in monitoring and observability tools on the market. There’s nothing wrong with this. It just often means that the SLOs are not as visible to all the different collaborators involved as they could be, especially across an organization where there may be different monitoring and observability systems in play.
It’s also common for SLOs to follow a lifecycle that includes versioning and releasing while remaining open for collaboration. Sound familiar? It should! This is the exact set of requirements we have for working with code generally. It is another reason why Reliably has codified SLOs as code artifacts that can be created, managed, versioned, and collaborated on using the same (or similar) processes you use for working with other system-critical artifacts.
Over time you can enrich your SLOs with Service Level Indicators (SLIs), as shown in the snippet:
services:
- name: website
  service-levels:
  - name: 95th of requests response time under 100ms
    type: latency
    criteria:
      threshold: 100ms
    slo: 95
    sli:
    - id: myprojectid/google-cloud-load-balancers/myloadbalancer-name
      provider: gcp
    window: PT24H
  - name: 99th of requests response time under 500ms
    type: latency
    criteria:
      threshold: 500ms
    slo: 99
    sli:
    - id: myprojectid/google-cloud-load-balancers/myloadbalancer-name
      provider: gcp
    window: PT24H
  - name: 99th of requests responses not 5xx
    type: availability
    slo: 99
    sli:
    - id: myprojectid/google-cloud-load-balancers/myloadbalancer-name
      provider: gcp
    window: P7D
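The window values in these snippets (PT1H, PT24H, P7D) look like ISO 8601 durations. As a rough sketch of how to read them, the following Python converts that style of string to a `timedelta`. The function name `parse_window` is hypothetical, it handles only the day, hour, minute, and second designators, and it says nothing about how Reliably itself parses the field.

```python
import re
from datetime import timedelta

# Covers only the day/hour/minute/second subset of ISO 8601 durations
# (enough for PT1H, PT24H and P7D); years, months and weeks are not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_window(value):
    """Convert a window string such as 'PT1H' or 'P7D' to a timedelta."""
    match = _DURATION.match(value)
    # Reject strings with no components, e.g. "P", "PT" or "P1DT".
    if not match or value.endswith(("P", "T")):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {name: int(num) for name, num in match.groupdict().items() if num}
    return timedelta(**parts)


for window in ("PT1H", "PT24H", "P7D"):
    print(window, "->", parse_window(window))
```

Read this way, the first two latency objectives are evaluated over the last 24 hours and the availability objective over the last seven days.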
SLIs are measurements that, collected over a given window, give you “good” and “bad” events. These roll up into the overall calculation of whether the SLO is still being met, is trending dangerously close to being missed, or has been broken completely.
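The roll-up from good and bad events to those three outcomes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Reliably’s calculation: the function name `slo_status` and the `at_risk_fraction` cut-off (flagging an SLO once 80% of its allowed bad events are used) are hypothetical.

```python
def slo_status(good, bad, objective_pct, at_risk_fraction=0.8):
    """Classify an SLO from good/bad event counts in one window.

    The error budget is the share of events allowed to be bad
    (100 - objective_pct). `at_risk_fraction` is a hypothetical cut-off:
    once that share of the budget is used, the SLO is flagged as at risk.
    """
    total = good + bad
    if total == 0:
        return "no data"
    allowed_bad = total * (100 - objective_pct) / 100
    if bad > allowed_bad:
        return "broken"
    if bad > 0 and bad >= allowed_bad * at_risk_fraction:
        return "at risk"
    return "met"


# With a 99% objective over 1000 events, up to 10 bad events are allowed.
for bad in (5, 9, 11):
    print(bad, "bad events ->", slo_status(1000 - bad, bad, objective_pct=99))
```

Framing the result as a budget of allowed bad events makes the “trending dangerously close” state explicit instead of leaving it to a reader’s judgment of a percentage.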
SLOs coded using Reliably, once they include some SLIs, can be reported against at any time by anyone with the right permissions, using the slo report command:
$ reliably slo report
You can even watch your SLOs with live updates using the --watch switch:
$ reliably slo report --watch
There’s much more to dig into with the reliably slo report command; check out the docs for more.
In this article, I’ve shared why SLOs are a powerful concept in SRE and beyond. SLOs provide a crucial conversation enabler regarding what matters in terms of reliability in a given system. This is why they are the first concept captured in code using Reliably as part of our #ReliabilityAsCode mission.