The popular messaging service Slack recently experienced a global outage that impacted millions of users, many of whom were working or learning remotely due to the Coronavirus pandemic. The outage lasted for an extended period, potentially breaching their service level agreements (SLAs) and damaging the brand, and it came hot on the heels of the announcement of their $27.7 billion acquisition by Salesforce. Outages like the one Slack experienced are increasingly common in a technology-focused society; so how do we avoid costly issues like this?
An outage (also known as downtime) is a period of time when a given service or system is unavailable and fails to perform its primary function. Completely removing the possibility of experiencing an outage is almost impossible; however, there are a number of ways in which the odds of an outage can be significantly reduced. Some of the things that can be done to try to avoid an incident are:
Having high test coverage is only part of the story when it comes to testing an application's robustness. High unit test coverage can be deceiving, as it simply proves that the software functions as expected in isolation.
Unit testing should be combined with other forms of testing including integration testing, security testing, and performance testing (which should also cover various subtypes such as load testing and stress testing). Including a wider array of testing types ensures that both the functional and non-functional factors that could impact the overall system are being checked under various conditions that could occur in production.
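As a rough illustration of what testing beyond the unit level can look like, here is a minimal load test sketch using Locust. The endpoints and request weighting are assumptions for the sake of the example, not part of any particular application.

```python
# locustfile.py - a minimal load test sketch (endpoints are hypothetical)
from locust import HttpUser, task, between


class ApiUser(HttpUser):
    # Simulated users wait 1-3 seconds between requests
    wait_time = between(1, 3)

    @task(3)
    def list_items(self):
        # Weighted 3x: most traffic hits the read path
        self.client.get("/items")

    @task(1)
    def create_item(self):
        # Exercise the write path under load as well
        self.client.post("/items", json={"name": "example"})
```

Running this with `locust -f locustfile.py --host=https://your-service` ramps up simulated users against the service, surfacing performance problems that unit tests alone would never reveal.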
Horizontal scaling improves an application's stability and robustness. It simply means that each component in the system can increase its capacity by adding more instances, rather than by making any single instance larger.
For example, a simple CRUD application that uses RDS in AWS might horizontally scale by adding additional MySQL RDS instances as read replicas in multiple availability zones (AZs).
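A sketch of that read-replica approach using boto3 is shown below. The instance identifiers, instance class, and availability zone are placeholders rather than real resources.

```python
# A sketch of adding an RDS read replica in another AZ with boto3.
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

response = rds.create_db_instance_read_replica(
    DBInstanceIdentifier="app-db-replica-1",      # new replica name (hypothetical)
    SourceDBInstanceIdentifier="app-db-primary",  # existing primary instance (hypothetical)
    AvailabilityZone="eu-west-1b",                # place the replica in a different AZ
    DBInstanceClass="db.t3.medium",
)
print(response["DBInstance"]["DBInstanceStatus"])
```

Spreading replicas across AZs means that losing a single zone, or a single instance, degrades read capacity rather than taking the whole application down.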
Chaos engineering is a form of resiliency testing created by Netflix when they built their Chaos Monkey application, which turns off random components in the architecture to simulate outages of particular services. This aided Netflix in their migration to cloud-hosted infrastructure by exposing potential pitfalls, in particular dependencies that, if they failed, would have caused incidents in a production environment. Doing this regularly in a non-production environment allowed these incidents to occur in a controlled setting that does not impact customers, mitigating the potential impact on the business.
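In the spirit of Chaos Monkey, a very small chaos experiment might look like the sketch below: pick one running instance from a non-production group at random and stop it, then watch how the rest of the system copes. The tag name and environment value are assumptions for illustration.

```python
# A chaos-engineering-style sketch: randomly stop one staging instance.
import random
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Find running instances tagged as part of the staging environment
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["staging"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Stopping {victim} to observe how the rest of the system copes")
    ec2.stop_instances(InstanceIds=[victim])
```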
There are a number of resiliency improvements that can be made to an application in order to reduce the chances of it causing or intensifying an outage. Some simple improvements include adding an exponential backoff retry strategy to calls to other services, building circuit breaker functionality into the system, and implementing queues between services in order to provide a smoother, more consistent throughput to each service.
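A minimal sketch of the exponential backoff idea is shown below; the service URL is hypothetical, and the delays and attempt count would be tuned for the real dependency.

```python
# A minimal sketch of exponential backoff with jitter around a call
# to another service (the URL is hypothetical).
import random
import time

import requests


def call_with_backoff(url, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=2)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Double the wait on each attempt and add jitter to avoid retry storms
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)


data = call_with_backoff("https://inventory.internal/items")
```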
Implementing queues also helps decouple components from each other, meaning that each part of the system is more independent and has fewer dependencies. These strategies can also help to reduce load and enhance the fault tolerance of the application, both of which result in a higher level of resiliency.
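To make the decoupling concrete, here is a sketch using an SQS queue between a producer and a consumer; the queue URL and message contents are placeholders.

```python
# A sketch of decoupling two components with a queue (SQS here), so the
# producer is not tied to the consumer's availability or throughput.
import json

import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"  # placeholder

# Producer: hand the work off to the queue and return immediately
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"order_id": 42}))

# Consumer: pull work at its own pace, smoothing out spikes in load
messages = sqs.receive_message(
    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
)
for message in messages.get("Messages", []):
    order = json.loads(message["Body"])  # handle the order here
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```

If the consumer slows down or goes offline briefly, work simply accumulates in the queue instead of failing, which is exactly the smoothing effect described above.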
Successful deployment of a component is only part of the story. Many deployments appear successful at first before running into issues, such as excessive use of resources like memory, an unexpected failure that occurs only irregularly, or even security issues that are not apparent straight away.
This is why observability is incredibly important. Logs, metrics, and traces allow you to continually monitor your system and, through alerts, notify your technical team when a component begins to show signs of an underlying problem.
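As a small sketch of the metrics side, the example below exposes a request counter and latency histogram with the prometheus_client library, which an alerting rule (for example on error rate or latency) could then act on. The metric names and the simulated handler are illustrative.

```python
# A sketch of exposing basic metrics so alerts can fire before users notice.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests", ["outcome"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")


@LATENCY.time()
def handle_request():
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
    if random.random() < 0.05:             # simulate occasional failures
        REQUESTS.labels(outcome="error").inc()
        raise RuntimeError("downstream call failed")
    REQUESTS.labels(outcome="success").inc()


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at :8000/metrics
    while True:
        try:
            handle_request()
        except RuntimeError:
            pass
```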
Good processes should underpin the entire engineering practice and ensure that human error is avoided or that any incidents that occur are rectified swiftly. Code reviews should be carried out to mitigate as many potential issues as possible, whereas processes like on-call policies and incident post-mortems foster a culture of continually iterating and learning from mistakes when they do occur.
While these methods do not necessarily prevent outages, they are strong foundations for reducing the odds of an incident occurring and escalating into a total outage. Such incidents are rare, but when they do occur they can prove costly to an organization, so preparation should be in place to minimize both those costs and the chances of such an occurrence.
Previously published at https://kylejones.io/how-to-reduce-the-chances-of-an-outage