My team has a Lambda function that is scheduled to run every hour. It succeeds 90% of the time but fails the other 10% because of network errors.
When it fails, it fails silently, so we have to check its logs regularly and manually make up for whatever was missed. This is quite inconvenient.
We wanted a better way to handle this: the lambda should automatically retry a few times after it fails, and if it still fails after all attempts, we should be notified by email.
We achieved this with AWS Step Functions. It saved us tons of time, and we like how it simplifies the logic and reduces the amount of code (and bugs) that we would otherwise have to write.
This post will show you how to do that.
We will first see how to create a step function in the AWS console, and then how to do that through an infrastructure-as-code tool such as Serverless.
Go to AWS console > Step Functions > click on Create state machine.
Select Design your workflow visually, choose the Standard Type, and hit Next.
In Workflow Studio, drag a Lambda: Invoke block into the first state.
Under Configuration > API Parameters > Function name, choose the target lambda in the dropdown.
Under Configuration > Additional configuration > Next state, choose Go to end.
A retrier defines a set of retry rules such as max retry attempts and retry interval. A retrier reruns the lambda after it fails with a certain error.
Step Functions allows you to add multiple retriers to handle different errors. To keep it simple, we will add one retrier that runs on all errors.
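To make the idea of multiple retriers concrete, here is a small JavaScript sketch of how a list of retriers could be matched against an error name. It is a simplified model, not Step Functions' actual implementation: it assumes retriers are checked in order, that the first one whose ErrorEquals list contains the error name (or States.ALL) wins, and it ignores other wildcards and the errors that States.ALL cannot match. The `findRetrier` helper and the `NetworkError` name are hypothetical.

```javascript
// Simplified model: return the first retrier whose ErrorEquals list
// contains the error name or the States.ALL wildcard.
function findRetrier(retriers, errorName) {
  return retriers.find(
    (retrier) =>
      retrier.ErrorEquals.includes(errorName) ||
      retrier.ErrorEquals.includes('States.ALL')
  );
}

// Two retriers: a specific one for timeouts, then a catch-all.
// The catch-all goes last so that it does not shadow the specific rule.
const retriers = [
  { ErrorEquals: ['States.Timeout'], IntervalSeconds: 10, MaxAttempts: 1, BackoffRate: 1 },
  { ErrorEquals: ['States.ALL'], IntervalSeconds: 5, MaxAttempts: 2, BackoffRate: 1 },
];

console.log(findRetrier(retriers, 'States.Timeout').IntervalSeconds); // 10
console.log(findRetrier(retriers, 'NetworkError').IntervalSeconds); // 5
```

With a single States.ALL retrier, as in this post, every error falls through to the same rule.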
Under Error handling > Retry on errors, click Add new retrier.
Under Retrier # 1 > Errors, select States.ALL. This means this retrier will apply to all errors.
Set the Interval to be 5 seconds, Max attempts to be 2, and the Backoff rate to be 1.
Interval and max attempts are self-explanatory. The backoff rate is the multiplier applied to the retry interval after each attempt. For example, if the interval is 5 seconds and the backoff rate is 2, the lambda will wait 5 seconds before retrying after the first failure, 10 seconds after the second failure, 20 seconds after the third, and so on. With a backoff rate of 1, as in our setup, every retry waits the same 5 seconds.
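The arithmetic behind the backoff rate can be sketched in a few lines of JavaScript. The `retryDelays` helper is hypothetical and only models the three settings discussed here (interval, max attempts, backoff rate); it ignores any other retry options Step Functions may offer.

```javascript
// Wait time (in seconds) before each retry attempt:
// interval * backoffRate^(attempt index), starting at index 0.
function retryDelays(intervalSeconds, maxAttempts, backoffRate) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(intervalSeconds * Math.pow(backoffRate, attempt));
  }
  return delays;
}

console.log(retryDelays(5, 3, 2)); // [ 5, 10, 20 ]
console.log(retryDelays(5, 2, 1)); // [ 5, 5 ]
```

The second call corresponds to the settings chosen in the console above: two retries, each after a fixed 5-second wait.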
A catcher defines how the error is handled if the lambda still fails after all retries.
In our case, we want to send an email through Amazon Simple Notification Service (SNS) when all retries have failed.
Under Error handling > Catch errors, click Add a new catcher.
Under Catcher # 1 > Errors, select States.ALL. This means the catcher can be triggered by all errors.
Under Catcher # 1 > Fallback state, click Add new state. This will create a new error handling branch in the workflow.
Search for SNS in the search bar on the left, and drag an Amazon SNS Publish block into the fallback state.
Next, click on the SNS: Publish block to edit it.
Under Configuration > API Parameters > Topic, select a topic. For example, the HelloFuncFailed topic here will send an email to me. See this documentation on how to set up SNS to send emails.
Now that we have added the Lambda and defined the retry and catch rules in the step function, you can click Next to review the definition, and then create the state machine.
To make it easier to share and maintain the step function configuration, you can also deploy the same step function with an infrastructure-as-code tool. Below is the Serverless definition for the step function that we created above.
# serverless.yml
service: myService

provider:
  name: aws
  runtime: nodejs12.x

functions:
  hello:
    handler: hello.handler # required, handler set in AWS Lambda
    name: hello-function

stepFunctions:
  stateMachines:
    helloStepFunc:
      name: helloStepFunc
      definition:
        StartAt: HelloLambda
        States:
          HelloLambda:
            Type: Task
            Resource:
              Fn::GetAtt: [hello, Arn]
            End: true
            Retry:
              - ErrorEquals:
                  - States.ALL
                IntervalSeconds: 5 # 5 seconds
                MaxAttempts: 2 # same value as in the console setup
                BackoffRate: 1
            Catch:
              - ErrorEquals:
                  - States.ALL
                Next: SNSNotification
          SNSNotification:
            Type: Task
            Resource: arn:aws:states:::sns:publish
            Parameters:
              Subject: Hello Lambda failed after retries
              Message.$: $
              TopicArn: xxx:HelloFuncFailed # your topic arn here
            End: true

plugins:
  - serverless-step-functions # install first: npm install --save-dev serverless-step-functions
The above template assumes that the lambda code is defined in a hello.js file in the same directory. You can also refer to an existing Lambda by its Amazon Resource Name (ARN). See the Serverless documentation for more details.
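For reference, a minimal hello.js could look like the sketch below. This is not our actual function: `doWork` is a hypothetical stand-in for the real network call, and the `simulateNetworkError` flag exists only to make the failure path easy to try out. The point it illustrates is that the handler should rethrow errors rather than swallow them, because the retry and catch rules only run when the task fails.

```javascript
// hello.js (sketch). `doWork` is a hypothetical placeholder for the real job.
async function doWork(event) {
  if (event && event.simulateNetworkError) {
    throw new Error('Network error');
  }
  return { status: 'ok' };
}

const handler = async (event) => {
  try {
    return await doWork(event);
  } catch (err) {
    // Log for debugging, then rethrow so the invocation is reported as failed.
    console.error('hello-function failed:', err.message);
    throw err;
  }
};

module.exports = { handler };
```

If the handler caught the error and returned normally, the state machine would treat the run as a success and neither the retrier nor the catcher would fire.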
So this is how to use AWS Step Functions to add retry-on-error and notification logic to a lambda function. You can create a step function through the AWS console or with an infrastructure-as-code tool such as Serverless.
Step Functions has saved my team lots of time. It simplifies our error handling logic and allows us to implement a set of rather complex rules with a few lines of code.
Hopefully, there is something you can take away and apply to your project. And please let me know if you have any questions. 🙂