Creating Custom Incident Response Workflows with n8n 🚨 [A How To Guide]

by Tanay Pant, May 9th, 2020

I’ve been involved in the DevOps world for a while, and yet I finished reading The Phoenix Project only recently. This piqued my interest in how teams execute their incident response playbooks. It’s enlightening to see the different approaches teams take to hone what works best for them.

(Disclaimer: The Author is the Head of Developer Relations at n8n)

I wanted to see what automating a minimalist incident response playbook would look like, and I decided to try it out with three of my favorite tools: n8n, PagerDuty and Mattermost. Here’s a quick introduction to the three tools, in case you aren’t familiar with them:

  1. n8n is a fair-code licensed tool that helps you automate tasks, sync data between various sources, and react to events all via a visual workflow editor.
  2. PagerDuty is a SaaS incident response platform for IT departments in companies.
  3. Mattermost is a flexible and open-source messaging alternative to Slack.

To avoid panic during an incident, a lot of companies have an incident response playbook. I created a minimalist six-step playbook for this tutorial. Whenever a service goes down or something unexpected happens, the on-call team would follow this high-level protocol:

  1. Triage issue in Jira
  2. Create auxiliary channel
  3. Invite the on-call team to the channel
  4. Acknowledge the issue
  5. Fix the issue
  6. Resolve the ticket

We will automate this playbook with three workflows in n8n, and this is what the result will look like once we are done.

Workflow 1: Make sure everyone knows what happened

Our first workflow will cover the first three steps of the playbook. Whenever a service goes down and creates an incident report on PagerDuty, we want the workflow to automate the following tasks for us:

  1. A webhook gets triggered and informs a general incidents channel on Mattermost that something is wrong.
  2. Create an auxiliary channel for the specific incident, invite the on-call team to it and share its link for those interested in the incident.
  3. Triage an issue on Jira.
  4. Share the links of the auxiliary channel, PagerDuty incident and the Jira issue in the Incidents channel, and the auxiliary channel.
  5. Share action buttons in the auxiliary channel to acknowledge and resolve the incident.

Let’s get started with the nodes of the first workflow. I have also submitted Workflow 1 on n8n.io, in case you’d like to skim through this workflow. Please note that you’ll still need to configure a couple of things like your credentials, the channels on Mattermost, as well as the settings of the nodes. You can find information on how to set up n8n in the documentation.

1. Webhook node: Get data from PagerDuty

First of all, we need to pull in the new incident reports from PagerDuty. To do that, start n8n with the tunnel parameter:

n8n start --tunnel

Note: Make sure that you don’t forget to add the --tunnel parameter.

Add a new node by clicking on the + button on the top right of the Editor UI. Select the Webhook node under the Triggers section.

In the Node Editor view, set the HTTP method to POST. For the Path, I have entered webhook, but feel free to add something else here according to your preferred convention. Now, you’ll need to save the workflow. I named it ‘Incident Response Workflow’. Once the workflow is saved, click on Webhook URLs, select Test, and then click on the URL to copy it to the clipboard.

Note: Don’t forget to save the workflow before copying the Webhook URLs.

Here’s a GIF of me following the steps mentioned above.

Now that we have our Webhook node ready on n8n, we’ll need to configure the settings on PagerDuty, so that it sends the new incident reports to the webhook.

Unless your team already uses PagerDuty, you can create a free trial account on PagerDuty. If you are creating a new account, you’ll also have to create a service that PagerDuty will be monitoring.

PagerDuty integrates with a lot of services so that it can monitor them in case something goes wrong. Once you have created your service, let’s configure the webhooks for the service.

To do that, select the Configuration menu at the top and click on Services. Click on the More button on the right side and select View Integrations from the menu (do this for the service that you want to configure the webhook for). Now, under the section called Extensions, click on the New Extension button and select ‘Generic V2 Webhook’ as the Extension Type. I entered n8n as the name and entered the URL that I copied from the Webhook node. Click on the Save button and we are done!

Here’s a GIF of me following the steps mentioned above.

Now, click on the Execute Workflow button to register the webhook. Once you’ve done that, you can create a new incident on PagerDuty. Your Webhook node will receive all the details. Keep in mind that the Test webhooks are only valid for 120 seconds. The output should look something like the following image.

At times, when you are sending too many requests from PagerDuty, it will disable the webhook. You’ll have to re-enable it by going to the list of extensions and clicking on the Re-enable button.
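If you would rather not create a real incident every time you want to exercise the Test webhook, one option is to POST a hand-built body to it yourself. The following is a minimal Python sketch of that idea, not something the tutorial itself uses. The URL is a placeholder, and the body is a trimmed, made-up stand-in that only contains the fields this tutorial reads later on.

```python
import json
import urllib.request

# Placeholder URL (assumption): copy the real Test URL from your Webhook node.
TEST_WEBHOOK_URL = "https://example-tunnel.invalid/webhook-test/webhook"

# Trimmed sample body. Every value here is made up for illustration.
SAMPLE_BODY = {
    "messages": [
        {
            "id": "sample-message-id",
            "incident": {
                "id": "PSAMPLE",
                "summary": "Checkout service is down",
                "html_url": "https://example.pagerduty.com/incidents/PSAMPLE",
            },
            "log_entries": [
                {"incident": {"summary": "Checkout service is down"}}
            ],
        }
    ]
}


def build_test_request(url, body):
    """Return a POST request carrying `body` as JSON. Nothing is sent yet."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_request(TEST_WEBHOOK_URL, SAMPLE_BODY)
print(request.get_method(), request.full_url)

# To actually send it while the workflow is waiting for the Test webhook:
# urllib.request.urlopen(request)
```

Sending stays commented out so that the sketch runs without a network connection. Remember that the Test webhook only listens for a short while after you click Execute Workflow.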

2. Mattermost node: Create an auxiliary channel

Now, we need to create a Mattermost node that will create an auxiliary channel so that the on-call team can coordinate on a fix for the incident.

To do that, click on the + button and click on the Mattermost node. In the Node Editor, enter your Mattermost credentials. Here’s some detailed information on how to create an access token for the credentials. I have used an access token from a bot account, but you can also use the access token from your account.

Note: Throughout the tutorial, please make sure that the nodes are connected properly before you start the configuration in the Node Editor. If you don’t do this, the variables mentioned in the tutorial might not be visible to you.

Once you are all sorted out with the credentials, select ‘Channel’ as the Resource in the Node Editor. Now select your team as the Team ID (in case you are unable to acquire that, please check with your system admin). We now need to enter a Display Name for the channel. Since this is a dynamic piece of information, click on the gears icon next to the field and select Add Expression. Select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > log_entries > [Item: 0] > incident > summary

Quite some indentation, I know! This will make sure that the display name of the channel is the same as the incident summary on PagerDuty, to keep things coherent. Now you need to enter a Name. This needs to be a unique value, so we’ll select the id from the incident report. Click on Add Expression and select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > id

Perfect, now click on Execute Node and this will create an auxiliary channel on Mattermost. Here’s a GIF of me following the steps mentioned above.
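If the two Variable Selector paths feel opaque, it can help to read them as a series of dictionary keys and list indexes. Here is a small Python sketch that walks the same two paths over a trimmed, made-up stand-in for the Webhook node output. The helper name pick and all the sample values are only for illustration.

```python
# Trimmed, made-up stand-in for the item that the Webhook node outputs.
webhook_item = {
    "body": {
        "messages": [
            {
                "id": "sample-message-id",
                "log_entries": [
                    {"incident": {"summary": "Checkout service is down"}}
                ],
            }
        ]
    }
}


def pick(data, *path):
    """Follow dict keys and list indexes in order, like the Variable Selector."""
    for step in path:
        data = data[step]
    return data


# Display Name: body > messages > [Item: 0] > log_entries > [Item: 0] > incident > summary
display_name = pick(
    webhook_item, "body", "messages", 0, "log_entries", 0, "incident", "summary"
)

# Name: body > messages > [Item: 0] > id
channel_name = pick(webhook_item, "body", "messages", 0, "id")

print(display_name)
print(channel_name)
```

Each [Item: 0] in the selector is simply index 0 of a list, which is why the paths look so deeply nested.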

3. Mattermost node: Add on-call team to auxiliary channel

Once the auxiliary channel has been created, we need to make sure that all the on-call team members have been added to the channel. However, right now we’ll add a single user to the channel.

To do that, create another Mattermost node. Select the credentials that you entered earlier. Select ‘Channel’ as the Resource and click on ‘Add User’ for Operation. Now we have to specify the Channel ID where the user should be added. Since this is another dynamic piece of information, click on Add Expression and in the Variable Selector, select the following:

Nodes > Mattermost > Output Data > JSON > id

Now we will specify a user by selecting ourselves from the dropdown list for User ID. Click on the Execute Node button and you will notice that you will be added to the channel. This node ensures that the specified user is always added to the auxiliary channel created by the workflow.

Here’s a GIF of me following the steps mentioned above.

As an exercise, try using the PagerDuty API to pull a list of the email IDs of the people who are on-call and add them to the auxiliary channel in Mattermost. Feel free to pick this up once you are finished with the tutorial.
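As a starting point for that exercise, here is a rough Python sketch of the data-handling half only: turning an on-call listing into a de-duplicated list of email addresses. It assumes that the listing has an "oncalls" array whose entries carry a "user" object with an "email" field. Those field names are an assumption you should verify against a real PagerDuty response, and the sample data is made up.

```python
# Made-up stand-in for an on-call listing. The field names ("oncalls", "user",
# "email") are assumptions; check them against a real PagerDuty response.
sample_oncalls = {
    "oncalls": [
        {"escalation_level": 1, "user": {"id": "PUSER1", "email": "alice@example.com"}},
        {"escalation_level": 2, "user": {"id": "PUSER2", "email": "bob@example.com"}},
        {"escalation_level": 3, "user": {"id": "PUSER1", "email": "alice@example.com"}},
    ]
}


def oncall_emails(oncalls_response):
    """Return the unique on-call email addresses in first-seen order."""
    emails = []
    for entry in oncalls_response.get("oncalls", []):
        email = (entry.get("user") or {}).get("email")
        # The same person can be on call at several escalation levels.
        if email and email not in emails:
            emails.append(email)
    return emails


print(oncall_emails(sample_oncalls))
```

Each address would then have to be matched to a Mattermost user before it can be passed to the Add User operation, which is the part left for you to wire up.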

4. Jira Software node: Triage the issue in Jira

Since the playbook specifies that the issue should also be triaged in Jira, we’ll need to add a node that creates a ticket in Jira. To do that, create a Jira node by clicking on the + button on the top right.

In the Node Editor, enter the Credentials for Jira. Here’s detailed information on how you can create a new API Token for the credentials.

Once you are sorted out with the Credentials, select the Project where the tickets will be created. I selected a test project that I created specifically for this tutorial. In the Issue Type, I selected ‘Story’ but feel free to select ‘Bug’ or something else. Since Summary is a dynamic piece of information, select Add Expression and pick the summary variable, just like you did for the Display Name while configuring the Mattermost node to create a channel.

Click on Execute Node and this will create a Jira ticket for you. Here’s a GIF of me following the steps mentioned above.
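For a sense of what the Jira node sends on your behalf, here is a minimal Python sketch of an issue-creation body in the shape that Jira's REST API generally accepts. The n8n node builds the real request for you, so treat this purely as an illustration. The project key TEST and the summary are made up.

```python
def build_jira_issue(project_key, summary, issue_type="Story"):
    """Return a create-issue body with the three fields set in the node."""
    return {
        "fields": {
            "project": {"key": project_key},  # the Project picked in the node
            "summary": summary,               # the incident summary expression
            "issuetype": {"name": issue_type},
        }
    }


# "TEST" is a hypothetical project key; use the key of your own project.
issue = build_jira_issue("TEST", "Checkout service is down")
print(issue)
```

Switching the ticket to a bug would just mean passing issue_type="Bug", which mirrors changing the Issue Type dropdown in the node.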

5. Mattermost node: Post details in the Incidents channel

The next thing that needs to be done is to post the details of the incident in the Incidents channel. We will need to share the following information in the channel:

  1. Summary of the incident
  2. Link to the Auxiliary channel
  3. Link to the PagerDuty incident
  4. Link to the Jira ticket

Sharing these pieces of information will ensure that if someone outside of the on-call team is interested in checking out what is going on, they can get this information from the Incidents channel.

To do this, create a new Mattermost node. In the Node Editor, select your Credentials. Now we need to enter the Channel ID. Since this is not a dynamic piece of information (the Incidents channel will always be there and hence, the ID will remain the same), we need to grab its Channel ID.

If you don’t already have a channel like this for the tutorial, you can manually create a new channel on Mattermost. To get its ID, click on the down arrow next to the channel name and click on the View Info option. This will reveal the ID of the channel. You can then copy and paste that in the Channel ID field in the node. In the message section, I entered the following expression to include the information that we mentioned in the list above.

🚨 New incident: {{$node["Webhook"].json["body"]["messages"][0]["incident"]["summary"]}}

Auxiliary Channel -> https://mattermost.internal.n8n.io/test/channels/{{$node["Mattermost"].json["name"]}}
PagerDuty Incident -> {{$node["Webhook"].json["body"]["messages"][0]["incident"]["html_url"]}}
Jira Issue -> https://n8n.atlassian.net/browse/{{$node["Jira Software"].json["key"]}}

Finally, click on the Execute Node button to send this information to your Incidents channel. Here’s a GIF of me following the steps mentioned above.
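To make it clearer what the expression produces once n8n has filled in the placeholders, here is a small Python sketch that assembles the same message from plain values. The two base URLs are the ones from the expression above and will differ for your own instances. The incident data, channel name and Jira key are made up.

```python
# Base URLs taken from the expression above; replace them with your own.
MATTERMOST_CHANNELS_URL = "https://mattermost.internal.n8n.io/test/channels/"
JIRA_BROWSE_URL = "https://n8n.atlassian.net/browse/"


def incident_message(incident, channel_name, jira_key):
    """Build the Incidents channel message from the three nodes' outputs."""
    lines = [
        f"🚨 New incident: {incident['summary']}",
        "",
        f"Auxiliary Channel -> {MATTERMOST_CHANNELS_URL}{channel_name}",
        f"PagerDuty Incident -> {incident['html_url']}",
        f"Jira Issue -> {JIRA_BROWSE_URL}{jira_key}",
    ]
    return "\n".join(lines)


# Made-up sample values standing in for the Webhook, Mattermost and Jira nodes.
sample_incident = {
    "summary": "Checkout service is down",
    "html_url": "https://example.pagerduty.com/incidents/PSAMPLE",
}
message = incident_message(sample_incident, "sample-message-id", "TEST-1")
print(message)
```

Note that the channel link uses the channel's Name (the unique id from the earlier step), not its Display Name.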

6. Mattermost node: Post details and action buttons in the auxiliary channel

As a last step of this workflow, we need to provide the information that we talked about in the previous node to the auxiliary channel as well. Moreover, we will need to provide the following two buttons in the channel:

  1. Acknowledge: Clicking this button will change the status of the incident on PagerDuty from ‘Triggered’ to ‘Acknowledged’.
  2. Resolve: Clicking this button will change the status of the incident on PagerDuty from ‘Acknowledged’ to ‘Resolved’ and mark the ticket in Jira as ‘Done’.

To do this, create a new Mattermost node and connect it to the Jira node. This will ensure that this and the previous Mattermost node can run in parallel. In the Node Editor, select your Credentials. Next, you’ll need to enter the Channel ID of the auxiliary channel. You can follow the steps mentioned in Workflow 1, Step 3 to do that. In the Message section, I entered the following expression (this is quite similar to the Message from the previous node):

⚠️ {{$node["Webhook"].json["body"]["messages"][0]["log_entries"][0]["incident"]["summary"]}}

PagerDuty incident: {{$node["Webhook"].json["body"]["messages"][0]["log_entries"][0]["incident"]["html_url"]}}
Jira issue: https://n8n.atlassian.net/browse/{{$node["Jira Software"].json["key"]}}

Now, we need to create the buttons which will trigger the actions that we talked about. To do that, under Attachments, click on the Add attachment button, click on Add attachment item, and select Actions. Then click on the Add Actions button and name it Acknowledge.

Now click on the Add Integration button. This will allow us to give the URL of the webhook that this button will trigger when clicked. We’ll leave this empty for now.

We’ll also need to send details (to the next workflow) about the PagerDuty incident to mark as acknowledged when the button is clicked. To do that, click on the Add Context to Integration button under the Context section. We’ll enter pagerduty_incident as the Property Name. Since the Property Value is a dynamic piece of information, click on Add Expression. In the Variable Selector, select the following:

Nodes > Webhook > Output Data > JSON > body > messages > [Item: 0] > incident > id

Now, add another button called Resolve by following the same steps mentioned above. For this button, we’ll need to add the context of the PagerDuty incident and the Jira ticket key. I’ll leave this as an exercise for you. For the sake of uniformity, you can name the Property Name jira_key.

In case you were wondering, it is important to send the context with the buttons as there might be multiple auxiliary channels at any given time and multiple people clicking on different Acknowledge and Resolve buttons. We need the correct context so that we don’t close the wrong PagerDuty incidents and Jira tickets by mistake.

Click on the Execute Node button to send all this information to the auxiliary channel. Here’s a GIF of me following the steps mentioned above.
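To picture what the two buttons carry, here is a Python sketch of one possible shape for the actions, modelled on how Mattermost interactive message buttons are usually described (a name plus an integration with a url and a context). The exact structure that the n8n node sends is an assumption here, the webhook URLs are placeholders, and the id values are made up.

```python
def build_actions(incident_id, jira_key, ack_url, resolve_url):
    """Return the Acknowledge and Resolve buttons with their context."""
    return [
        {
            "id": "acknowledge",
            "name": "Acknowledge",
            "integration": {
                "url": ack_url,
                # Only the incident is needed to acknowledge it.
                "context": {"pagerduty_incident": incident_id},
            },
        },
        {
            "id": "resolve",
            "name": "Resolve",
            "integration": {
                "url": resolve_url,
                # Resolving touches both PagerDuty and Jira.
                "context": {
                    "pagerduty_incident": incident_id,
                    "jira_key": jira_key,
                },
            },
        },
    ]


# Placeholder URLs (assumption): these come from the Webhook nodes of the
# second and third workflows, which are created later in the tutorial.
actions = build_actions(
    "PSAMPLE",
    "TEST-1",
    "https://example-tunnel.invalid/webhook-test/acknowledge",
    "https://example-tunnel.invalid/webhook-test/resolve",
)
print(actions)
```

Because every button carries its own context, a click in one auxiliary channel can never be mistaken for a click in another.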

Workflow 2: Make sure that the incident is acknowledged

Our second workflow will cover the fourth step of the playbook. Once all the people responsible get notified that an incident has occurred, we need to make sure that there is a quick and easy way to acknowledge the incident so that it is clear that someone in the on-call team has got it.

Let’s get started with the nodes of the second workflow. I have also submitted Workflow 2 on n8n.io, in case you’d like to skim through this workflow. Please note that you’ll still need to configure a couple of things like your credentials as well as the settings of the nodes.

1. Webhook node: Get data from the Acknowledge button

We now need to set up a Webhook node that listens to the event when somebody clicks on the Acknowledge button in the auxiliary channel.

Create a Webhook node the same way you did in Workflow 1, Step 1. Now copy the link of the Test webhook from this Webhook node, go to the node from Workflow 1, Step 6 and paste it in the URL field in the Integration section of the Acknowledge button under Actions.

Once you are done with that, click on the Execute Node button to register the webhook and test it by clicking on the Acknowledge button in the auxiliary channel. Here’s a GIF of me following the steps mentioned above.

2. PagerDuty node: Acknowledge the incident on PagerDuty

Now we need to get the ID of the incident from the webhook node to know which incident to mark as acknowledged. We get this information from the context that we added to the Integration of the button.

Add a PagerDuty node by clicking on the + button on the right side. In the Node Editor view, first of all, you’ll have to enter the Credentials for PagerDuty. Here’s detailed information on how you can create a new API Token for the credentials. Once you are done with that, select ‘Update’ as the Operation. Since the Incident ID is a dynamic piece of information, click on Add Expression and select the following in the Variable Selector:

Nodes > Webhook > Output Data > JSON > body > context > pagerduty_incident

In the Email field, I have just entered my email. In the Update Fields section, click on the Add Field button and select Status. From the dropdown list in the Status field, select ‘Acknowledged’. Now, click on the Execute Workflow button. Go to the auxiliary channel and click on the Acknowledge button. This will change the status of your incident report from ‘Triggered’ to ‘Acknowledged’. Here’s a GIF of me following the steps mentioned above.
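Here is a hedged Python sketch of what this step boils down to: read the incident ID from the button's context and describe a status update for it. The request shape (a PUT to the incident with a From header and an incident_reference body) follows the general pattern of PagerDuty's REST API, but the PagerDuty node builds the real call for you, so take the details as an assumption. The button body and the email address are made up.

```python
def build_incident_update(button_body, status, from_email):
    """Describe the status update for the incident named in the button context."""
    incident_id = button_body["context"]["pagerduty_incident"]
    return {
        "method": "PUT",
        "path": f"/incidents/{incident_id}",
        # The Email field of the node: who is making the change.
        "headers": {"From": from_email},
        "json": {
            "incident": {"type": "incident_reference", "status": status}
        },
    }


# Made-up stand-in for Webhook > JSON > body after the Acknowledge click.
sample_button_body = {"context": {"pagerduty_incident": "PSAMPLE"}}

update = build_incident_update(
    sample_button_body, "acknowledged", "oncall@example.com"
)
print(update["method"], update["path"])
```

The same helper with the status "resolved" would describe the change needed for the Resolve button.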

3. Mattermost node: Confirm the acknowledgment

Now we just need to confirm the change of status of the PagerDuty incident by sending a message to the auxiliary channel. I’ll leave this as an exercise for you. In case you run into any trouble, here’s a GIF of me creating this node.

Workflow 3: Make sure that everything is marked resolved after the fix

Our third workflow will cover the sixth step of the playbook. Once the issue has been fixed, we need to make sure that the incident on PagerDuty has been marked as ‘Resolved’ and the ticket on Jira has been marked as ‘Done’. We also need to ensure that everyone in the Incidents and the auxiliary channels is aware of the resolution as well.

Let’s get started with the nodes of the third workflow. The nodes of this workflow have been left as an exercise for you. I have added GIFs for the nodes and have also submitted Workflow 3 on n8n.io, in case you run into any trouble. Please note that you’ll still need to configure a couple of things like your credentials as well as the settings of the nodes.

1. Webhook node: Get details from the Resolve button

Just like in the last workflow, we need a Webhook node that listens to the event when somebody clicks on the Resolve button in the auxiliary channel. Here’s a GIF of me creating this node.

2. PagerDuty node: Resolve the incident on PagerDuty

Now we need to change the status of the PagerDuty incident from ‘Acknowledged’ to ‘Resolved’. This is very similar to Workflow 2, Step 2. Here’s a GIF of me creating this node.

3. Jira Software node: Resolve the incident on Jira

Now we need to update the status of the Jira ticket to ‘Done’. Here’s a GIF of me creating this node.
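One thing that can be confusing here is that Jira moves a ticket to ‘Done’ through a transition, and transition IDs depend on the project's workflow. Here is a small Python sketch of looking up the right ID by name in a transitions listing. The listing shape is an assumption based on how Jira's REST API usually reports transitions, and the IDs below are made up.

```python
# Made-up stand-in for the transitions available on one ticket. Real IDs
# depend on the workflow configured for your Jira project.
sample_transitions = {
    "transitions": [
        {"id": "11", "name": "To Do"},
        {"id": "21", "name": "In Progress"},
        {"id": "31", "name": "Done"},
    ]
}


def find_transition_id(transitions_response, target_name="Done"):
    """Return the ID of the transition called `target_name`, or None."""
    for transition in transitions_response.get("transitions", []):
        if transition.get("name", "").lower() == target_name.lower():
            return transition["id"]
    return None


done_id = find_transition_id(sample_transitions)
# Body of the request that would move the ticket to Done.
transition_body = {"transition": {"id": done_id}}
print(transition_body)
```

If the lookup returns None, the ticket has no ‘Done’ transition from its current status, and the workflow should not attempt the update.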

4. Mattermost nodes: Announce the resolution in the auxiliary and Incidents channels

Lastly, we need to create two Mattermost nodes:

  1. To acknowledge in the auxiliary channel that the incident report on PagerDuty and the ticket on Jira have been resolved.
  2. To announce in the Incidents channel that the incident has been resolved.

Here’s a GIF of me creating these nodes.

Congratulations, you successfully built an automated incident response workflow using n8n, PagerDuty and Mattermost 🎉

Let’s run the whole system end to end. First of all, you’ll have to click on the Execute Workflow button on all three workflows to register the Webhook nodes. Go ahead and get started by creating a new incident on PagerDuty.

Now, to make sure that the workflows run permanently without you having to press the Execute Workflow button on all three workflows before each incident, we’ll need to use the Production webhooks.

To do that, you’ll just need to get the Production webhook URL from each Webhook node, update the URLs on PagerDuty and in the Mattermost node from Workflow 1, Step 6, save the workflows, and finally activate them. This will make your workflows ready to use.

Note: When working with a Production webhook, please ensure that you have saved and activated the workflow. Don’t forget that the data flowing through the webhook won’t be visible in the Editor UI with the Production webhook.
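If you have several URLs to switch over, a tiny helper can make the change less error-prone. This Python sketch assumes the usual n8n convention that Test URLs contain /webhook-test/ and Production URLs contain /webhook/ on the same host. Please compare the result with the Production URL shown in the Webhook node before relying on it. The URL below is a placeholder.

```python
def to_production_url(test_url):
    """Swap the Test path segment for the Production one (assumed convention)."""
    marker = "/webhook-test/"
    if marker not in test_url:
        raise ValueError("this does not look like a Test webhook URL")
    # Replace only the first occurrence so the rest of the path is untouched.
    return test_url.replace(marker, "/webhook/", 1)


# Placeholder URL (assumption); use the one copied from your Webhook node.
test_url = "https://example-tunnel.invalid/webhook-test/webhook"
print(to_production_url(test_url))
```

The helper raises an error instead of guessing when the URL does not contain the Test segment, so a Production URL is never rewritten by accident.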

Conclusion

Today we created an automated incident response workflow using a variety of n8n nodes. The first-class support for webhooks and APIs allows n8n to integrate a very wide array of services and products and to create powerful workflows in a simplified way. This was an example of automating a minimalist incident response playbook. Which other services are you using for managing incidents in your organization? In case you have created other workflows with n8n that use different nodes, I’d love to check them out, so please consider sharing those workflows with the community.

In case you’ve run into an issue while following the tutorial, feel free to reach out to me on Twitter or ask for help on our forum 💙
