The circuit breaker is a design pattern used extensively in distributed systems to prevent cascading failures. In this post, we’ll look at the problem of cascading failures and at how the circuit breaker pattern addresses it.
Before jumping into the circuit breaker pattern, let’s understand the problem it tries to solve.
When service A communicates with service B, it allocates a thread to make that call. Two kinds of failures can occur while making the call. We use the example of a user service making a call to a friends service.
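''' user service '''
def get_user_info(user_id: str):
    try:
        # The calling thread blocks here until friends service answers or fails
        return friends_service.get_friends(user_id)
    except Exception as e:
        # Chain the original error so the root cause stays visible
        raise InternalServerError from e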
Immediate Failures: In an immediate failure, an exception (such as Connection Refused) is raised right away and the service A thread is freed.
Timeout Failures: If service B takes a long time to respond, every new request to service A adds another thread that is waiting on service B. If enough requests arrive while earlier ones are still waiting to time out, this can exhaust service A’s thread-pool and bring down service A.
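To make the timeout case concrete, here is a minimal sketch that simulates thread-pool exhaustion using only Python’s standard library. The names slow_dependency, fast_request and time_fast_request_behind_slow_calls are hypothetical, and time.sleep stands in for a service B that is slow to answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_dependency(delay: float) -> str:
    """Stand-in for a call to service B that hangs for `delay` seconds."""
    time.sleep(delay)
    return "friends"


def fast_request() -> str:
    """Stand-in for a request that needs no downstream call at all."""
    return "pong"


def time_fast_request_behind_slow_calls(pool_size: int, delay: float) -> float:
    """Fill service A's pool with slow calls, then time one fast request."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Every worker thread ends up stuck waiting on the slow dependency.
        for _ in range(pool_size):
            pool.submit(slow_dependency, delay)
        start = time.monotonic()
        # The fast request sits in the queue until a worker thread is free.
        pool.submit(fast_request).result()
        return time.monotonic() - start


if __name__ == "__main__":
    waited = time_fast_request_behind_slow_calls(pool_size=4, delay=0.5)
    print(f"fast request waited {waited:.2f}s for a free thread")
```

Even though fast_request does no work of its own, it cannot start until one of the slow calls finishes. With a real dependency that never answers, it would never start at all.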
“Your code can’t just wait forever for a response that might never come, sooner or later, it needs to give up. Hope is not a design method.” -Michael T. Nygard, Release It!
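One way to act on this advice is to put an upper bound on every network wait. The sketch below uses the standard socket module; fetch_with_timeout is a hypothetical helper, the 0.2 second timeout is arbitrary, and the silent listener merely stands in for a dependency that accepts connections but never replies.

```python
import socket


def fetch_with_timeout(host: str, port: int, timeout: float) -> bytes:
    """Read one reply from host:port, giving up after `timeout` seconds."""
    # The timeout applies to connecting and to each later send/recv call.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ping\n")
        return sock.recv(1024)  # raises socket.timeout if no reply arrives


if __name__ == "__main__":
    # A listener that never answers, standing in for an unresponsive service.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    host, port = silent.getsockname()
    try:
        fetch_with_timeout(host, port, timeout=0.2)
    except socket.timeout:
        print("gave up waiting, the calling thread is free again")
    finally:
        silent.close()
```

A timeout turns a hang into an ordinary exception, but each caller still spends the full timeout waiting. That remaining cost is what the circuit breaker removes.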
Let’s walk through an example of a social media application to understand this better. Here we have an aggregator service, which is what the client interacts with; it aggregates results from a number of services, including the user service. User service calls photo service and friends service, which in turn calls friends_db.
Here, friends service tries to make requests to friends_db. However, friends_db does not fail immediately; instead, it keeps the threads from friends service waiting. Friends service retries, thereby using more threads. As new requests come in, more and more threads are waiting on friends_db to respond.
Friends service has now become the source of timeouts for user service. User service exhausts its thread-pool waiting for responses from friends service, just as friends service was waiting for friends_db. A failure in friends_db has caused a cascading failure in services that depend on it only indirectly.
Eventually, the aggregator service will also come down for the same reason. The client calls the aggregator service, so our system is effectively shut down for the users. One error in one component of our architecture has caused a cascading failure that brings all other services down.
The circuit breaker is usually implemented as an interceptor, chain of responsibility, or filter that intercepts all requests from user service to friends service. It consists of 3 states: closed, open, and half-open.
Closed: In the “closed” state, the circuit breaker allows all requests to pass through to friends service.
Open: The circuit breaker switches to the “open” state when the number of failed calls to friends service exceeds the failure threshold. It no longer allows requests from user service to reach friends service; instead, it responds immediately with a default response.
Half-open: After a set “recovery timeout” period has passed, the circuit breaker switches to the “half-open” state, where it allows some of the requests to reach friends service and answers the others with the default response. If the requests that get through succeed, the breaker closes again; if they fail, it returns to the open state.
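The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration under stated assumptions, not the implementation of any particular library: the names SimpleCircuitBreaker and CircuitOpenError, the constructor parameters, and the injectable clock are all invented for the example. For simplicity it counts consecutive failures and lets every call through while half-open, so the first result decides whether the circuit closes or re-opens.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock        # injectable so the timeout is easy to test
        self.failures = 0         # consecutive failures seen so far
        self.opened_at = None     # None means the circuit has not tripped

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast: the downstream service is not contacted at all.
            raise CircuitOpenError("circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.state == "half-open"
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # trip, or restart the timeout
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch CircuitOpenError and return the default response, for example an empty friends list, instead of waiting on a service that is known to be failing.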
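Let’s look at a Python example for a circuit breaker. With the circuitbreaker library, you can create your own circuit breaker by subclassing CircuitBreaker:
from circuitbreaker import CircuitBreaker
from requests.exceptions import RequestException  # assumes friends_service uses requests

class MyCircuitBreaker(CircuitBreaker):
    FAILURE_THRESHOLD = 20                 # failures before the circuit opens
    RECOVERY_TIMEOUT = 60                  # seconds to stay open before retrying
    EXPECTED_EXCEPTION = RequestException  # only these count as failures

@MyCircuitBreaker()
def get_friends(user_id):
    # Let RequestException propagate so the breaker can count the failure
    return friends_service.get_friends(user_id)

def get_user_info(user_id):
    try:
        return get_friends(user_id)
    except Exception as e:
        # Also reached when the breaker is open and rejects the call
        raise InternalServerError from e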
We can also leverage the sidecar pattern for this. In this approach, we don’t have to modify our services by wrapping calls in circuit breakers; instead, we ship our applications with a sidecar like Envoy. All outbound traffic from the service is proxied through Envoy, which supports circuit breaking out of the box. The following is an example circuit-breaking configuration for Envoy:
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1000
      max_requests: 1000
    - priority: HIGH
      max_connections: 2000
      max_requests: 2000
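Note that these thresholds cap how many connections and requests may be in flight to the upstream at once, rather than counting failures as the three-state breaker described earlier does. The Python sketch below illustrates only that idea of rejecting work above a cap; it is not how Envoy is implemented, and RequestLimiter and LimitExceeded are hypothetical names.

```python
import threading


class LimitExceeded(Exception):
    """Raised when the cap on in-flight requests is already reached."""


class RequestLimiter:
    def __init__(self, max_requests: int):
        # One semaphore slot per request allowed in flight at the same time.
        self._slots = threading.BoundedSemaphore(max_requests)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject right away instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise LimitExceeded("too many requests in flight")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Because requests over the cap are rejected immediately, a slow upstream can tie up at most max_requests callers instead of the whole thread-pool.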