How to Debug and Solve a Big Production Problem With SaaS

by Karol Horosin, July 24th, 2024

Too Long; Didn't Read

In this article I outline a comprehensive guide for software engineers and related roles on how to effectively debug and resolve critical production issues in a SaaS environment. Key steps include clear communication through dedicated channels, such as Slack, providing regular updates using a message template, and escalating when necessary. The article emphasizes gathering essential information, like steps to reproduce the issue and system knowledge, and maintaining thorough documentation of hypotheses and tests. Additionally, it advocates for creating small tools to automate repetitive tasks and conducting post-incident analyses to prevent future occurrences, focusing on process improvement rather than individual blame.

Software development is mostly not actually writing software. Sometimes, it’s debugging a critical issue that cannot wait and is beyond the abilities of first-line support. Keep reading to learn what tools you can and should employ when dealing with long-running hard-to-fix problems.


You may be guarded by countless levels of the corporate hierarchy, but eventually, you will either be asked to debug a production issue or you will be in a job that will require it.


This article is aimed primarily at software engineers but also QAs, managers, leads, and solution engineers.

How Does an Issue Reach You?

You get an invite to a call, and you’re tagged in a Jira ticket, on Slack, or DM’ed. Ok, cool.


Someone is asking you to drop a feature you’ve been working on, switch context, and help. Annoying, I know. Give yourself 90 seconds to grieve and move on.


If someone is asking you for urgent help with a production issue, they mean it.


Before you start, make sure your direct superior/manager/lead knows you are switching priorities.


And if you start investigating, make sure everyone knows it, so other people are not wasting time doing the same.


As you’ll see, comms are the key.

Communication

Before we get to any technical tips, let’s start with communication. I’ll mention some tools. Some of them should already be part of the system in place, ideally as automatic procedures. If they’re not, you should introduce them.

Incident Slack Channel

If you’re reading this in the early 21st century, you have a company chat app.


Make sure there is a channel where all the people who can help with the problem or need information about it have a place to talk. Invite people, and ask them to add anyone interested.


Spam this channel with progress updates.


Example name: inc-2023-01-04-sign-up-down. It can also include a ticket number instead of a date.
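If channels are created often, the naming convention can be generated rather than typed by hand. Here is a minimal Python sketch; the function name `incident_channel_name` and the `inc` prefix default are my own assumptions for illustration, and your chat tool may have extra naming rules this does not check.

```python
import re
from datetime import date


def incident_channel_name(opened_on: date, summary: str, prefix: str = "inc") -> str:
    """Build a channel name in the form <prefix>-<YYYY-MM-DD>-<short-summary>."""
    # Lowercase the summary and collapse every run of characters that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"{prefix}-{opened_on.isoformat()}-{slug}"


print(incident_channel_name(date(2023, 1, 4), "Sign-up down"))
# prints: inc-2023-01-04-sign-up-down
```

To use a ticket number instead of a date, you could pass it in as part of the summary and drop the date from the format string.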

Cadence

Depending on the criticality of the issue, post summaries of what’s going on.


  • If it’s an urgent issue that is classified as P1 (priority 1, critical system down, significant financial impact), you’re likely to post summaries each hour or two.


  • If it’s a long-running issue with lower priority, begin the day with a plan for what’s next and close the day with a progress summary.

Message Template

Here’s what I use:


Summary of [name/ticket] incident investigation as of 4PM, Jan 18th

Resolved: yes/no/partially

∑ Brief summary

  • we know that the cause of the issue is: …
  • fix by … didn’t work, we’re trying …., estimated test at …
  • replication is hard, streamlining it

⏭️ Next steps

  • […]

🧠 Other notes

  • [ideas and resources]
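If you post these updates several times a day, filling in the template from a small script keeps the format consistent. The sketch below is one possible way to do it in Python; the `StatusUpdate` class and its field names are hypothetical, and the example values are placeholders, not real incident data.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    incident: str          # incident name or ticket number
    as_of: str             # e.g. "4PM, Jan 18th"
    resolved: str          # "yes", "no" or "partially"
    summary: list[str]
    next_steps: list[str]
    notes: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the update as plain text, ready to paste into the channel."""
        def bullets(items: list[str]) -> list[str]:
            return [f"  • {item}" for item in items]

        lines = [
            f"Summary of {self.incident} incident investigation as of {self.as_of}",
            "",
            f"Resolved: {self.resolved}",
            "",
            "∑ Brief summary",
            *bullets(self.summary),
            "",
            "⏭️ Next steps",
            *bullets(self.next_steps),
        ]
        # The notes section is optional, so skip it when there is nothing to say.
        if self.notes:
            lines += ["", "🧠 Other notes", *bullets(self.notes)]
        return "\n".join(lines)


update = StatusUpdate(
    incident="sign-up-down",
    as_of="4PM, Jan 18th",
    resolved="partially",
    summary=["replication is hard, streamlining it"],
    next_steps=["test the second fix on staging"],
)
print(update.render())
```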


These updates may not get reactions but many people will read them and quietly thank you for your thoroughness. If you are a recipient of those, like them, react with 👀 emoji, or whatever. Feedback is always good.


Forward daily updates on the main channel for issue discussions or on the team channel, so people less involved but still interested or able to help can see them.

Escalation

Whenever you are stuck, flag it and ask for help. You have an important task, so staying quiet is not worth it. If your direct manager isn’t listening, try their manager.


When asking people for help, state how urgent the question is and when you need an answer.


Don’t sit alone on an unresolved issue. You work with other specialists, so present your idea to someone and get feedback.

Getting Essential Information

There are a few types of information you are going to need, apart from what comes with your programming experience.


  1. How to reproduce the issue? Steps taken, environment details.


  2. Input and output data - input that is causing the issue, the erroneous output, the right output.


  3. Logs with real-world examples of the issue in the wild.


  4. System knowledge - familiarity with the part of the product in question, documentation.


  5. The initial estimate of the impact on customers.


If you lack any of these at the start, work hard to get them first. A few times, I’ve been asked to work on a bug that no one could reproduce. The customers who encountered it didn’t have time to jump on a call. Imagine how hard it was to work on such an elusive problem.


You are OK to push back in these situations. Make sure to be collaborative with your customer-facing colleagues who will help gather this information. If you can, tell them exactly what you need. You can use the list above.
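One way to tell colleagues exactly what you need is to track the five items from the list as a checklist and report what is still missing. This is only a sketch: the `IncidentBrief` class and its field names are hypothetical, and the values below are placeholders.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentBrief:
    # One field per item on the list; None means "not provided yet".
    reproduction_steps: Optional[str] = None
    input_output_data: Optional[str] = None
    logs: Optional[str] = None
    system_knowledge: Optional[str] = None
    impact_estimate: Optional[str] = None

    def missing(self) -> list[str]:
        """Return human-readable names of the items that are still empty."""
        return [
            f.name.replace("_", " ")
            for f in fields(self)
            if not getattr(self, f.name)
        ]


brief = IncidentBrief(
    logs="placeholder: link to a saved log search",
    impact_estimate="placeholder: first rough estimate",
)
print("Still needed:", ", ".join(brief.missing()))
# prints: Still needed: reproduction steps, input output data, system knowledge
```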


Regarding point 5, estimating impact: in the end, it will need to be done properly, either by you or by someone else, because proper customer communication depends on it.

Working on the Issue

Finding and fixing a bug is a little bit like doing science. You are likely going to follow a process like this:


  1. Construct a hypothesis


  2. Test it by doing experiments


  3. Analyze your data and draw a conclusion


  4. Come up with a fix


  5. Test the fix


  6. If it doesn’t work → repeat


Make sure to write down your current hypothesis and log any additional ideas. Keep track of what you checked and what you didn’t. If the work ends up taking a long time or you are forced to hand it over to someone else, these notes will be extremely useful.


In general, maintain a log of what you experimented with. This will help you avoid running in circles and provide good updates for the team.
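A shared document works fine for this log. If you prefer something you can append to from scripts, the sketch below keeps one JSON object per line in a file. The function names and the entry fields (`hypothesis`, `experiment`, `result`) are assumptions for illustration, not a standard format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(log_path: Path, hypothesis: str, experiment: str, result: str) -> dict:
    """Append one experiment to the log file and return the stored entry."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "hypothesis": hypothesis,
        "experiment": experiment,
        "result": result,
    }
    # Append mode means earlier entries are never overwritten.
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_log(log_path: Path) -> list[dict]:
    """Load all entries, oldest first, e.g. to write a progress summary."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Before starting a new experiment, a quick look through `read_log(...)` shows whether the same hypothesis was already tested, which is also what a colleague taking over the issue would read first.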

Create Tools!

Whenever an experiment requires manual labor and you have to do it multiple times, create small tools to help you do it. As an example, if experimenting requires decrypting some values stored in your system, create a piece of code that will do that en masse instead of using online tools and console programs all the time.
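As a rough illustration of such a tool, the Python sketch below applies a decoding function to a whole batch of values and keeps going when one of them fails. Base64 decoding is used purely as a stand-in, since the actual decryption depends on your system; `decode_all` and `base64_to_text` are hypothetical names.

```python
import base64
from typing import Callable, Iterable


def decode_all(values: Iterable[str], decode: Callable[[str], str]) -> dict[str, str]:
    """Apply `decode` to every value and map each input to its result."""
    results: dict[str, str] = {}
    for value in values:
        try:
            results[value] = decode(value)
        except Exception as error:
            # Record the failure and continue, so one bad value
            # does not stop the whole batch.
            results[value] = f"<failed: {error}>"
    return results


def base64_to_text(value: str) -> str:
    """Stand-in for real decryption: strict Base64 decode to UTF-8 text."""
    return base64.b64decode(value, validate=True).decode("utf-8")


for original, decoded in decode_all(["aGVsbG8=", "!!"], base64_to_text).items():
    print(original, "->", decoded)
```

Swapping in your own function for `base64_to_text` is the only change needed to reuse the loop.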


If you need to parse some files and you are opening tens of them and looking for clues - create a small JS website that will help you extract the right information.
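The same idea also works as a small script instead of a website. The sketch below, written in Python rather than JS, scans every matching file in a folder for a regular expression and reports where it was found. The function name `find_clues` and the `*.log` default are assumptions.

```python
import re
from pathlib import Path


def find_clues(folder: Path, pattern: str, glob: str = "*.log") -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for every line matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(folder.glob(glob)):
        # errors="replace" keeps the scan going if a file has odd encoding.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, line_number, line.rstrip("\n")))
    return hits
```

Calling `find_clues(Path("logs"), r"timeout|refused")` would list every line mentioning either word, assuming a local `logs` folder exists.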


Mini-tools will help you move faster and you won’t need to repeat the same work. Of course, make them only when it is necessary.


On many of my projects, these tools later became a part of regular testing and development processes.

After the Incident

Your company will likely have a process for this, usually called RCA (Root Cause Analysis). Someone should schedule a meeting with all of the people involved in the work and prepare a document that is later distributed to everyone interested.


In addition to this document, work that prevents such incidents in the future should be scheduled.


The point of the whole exercise is to learn from mistakes.


This meeting should not be geared towards blaming individuals. It should point out where the processes failed.


When everything is fixed, make sure everyone knows it.


If you spent some extra hours working on it, ask to be compensated or to be able to work less in the coming days. Make sure to note down what you did and how it helped the company and use it in your next evaluation meeting.

Summary

If you have dealt with production issues at your job, this guide probably sounds familiar. If you are yet to be asked to fix something, this article gives you the proper frame of mind.


Remember:


  • communication is the key, don’t neglect it


  • fixing a bug is like doing scientific research: form and test hypotheses, note everything down


  • create tools that will help you with manual work


  • take and give credit, forget about the blame


What are your experiences with troubleshooting live issues?



Don't forget to follow me on Twitter @horosin, and subscribe to my blog’s newsletter for more tips and insights!

If you don't have Twitter, you can also follow me on LinkedIn.