There is something weird about this title, something creepy. But it is actually true: doing post-mortems at our workplace improved our sleep quality and made our lives better.
Our company develops products available worldwide, and we have an audience from all around the globe. Global user traffic means that many time zones are on the table. Most of our customers are active late in the evening. Sometimes that is very convenient for us: we do not need to get up in the middle of the night for planned maintenance, because we can do it first thing in the morning. But there is a significant drawback as well: if anything goes wrong in production, it is almost always at night or during the weekend. So we have to get out of bed, or put down our poker chips and drinks, right away. And that sucks! Well, sucked.
Such emergencies are very rare for us now. It took many experiments and strict procedures to achieve that, but I believe that introducing post-mortems made the biggest impact.
The concept of a post-mortem in software engineering was introduced to us by our CTO, and it was entirely new to me. It took some time for me to understand all of the advantages of such a procedure and what rules should be followed to get the most out of it.
If you have never heard of it, it is basically a meeting held after a major issue to analyze what happened, why it happened, what we could have done better in resolving it, and how to prevent it in the future. The idea behind it is to make a specific mistake only once. As we like to say: if you make a mistake, it is an experiment; if you make the same mistake again, it is a failure.
Here are a few advantages of post-mortems from my personal experience:
The very first issue that we discovered with the help of post-mortems was that we sucked at reacting fast. We did not have alerts accurate enough to notify us about an emergency; we did not know which of us should take action; we did not know how to communicate during this stressful period; and so on. We managed to resolve all of this by documenting a detailed timeline for each incident, from the first appearance of the issue to its full resolution, and by collecting all conversations (Slack, phone, etc.) in one place.
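As an illustration of what "one timeline in one place" can look like, here is a minimal sketch. It is not the tooling we actually use; the names `TimelineEntry`, `build_timeline`, and `time_to_resolution` are made up for this example, and it assumes every event can be reduced to a timestamp, a source, and a short note.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEntry:
    """One event in an incident (hypothetical structure for this sketch)."""
    at: datetime
    source: str  # e.g. "alert", "slack", "phone"
    note: str


def build_timeline(*sources: list[TimelineEntry]) -> list[TimelineEntry]:
    """Merge entries from several sources into one list ordered by time."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.at)


def time_to_resolution(timeline: list[TimelineEntry]) -> timedelta:
    """Time between the first and the last recorded event."""
    if not timeline:
        raise ValueError("timeline is empty")
    return timeline[-1].at - timeline[0].at
```

With alerts, chat messages, and phone notes each exported as a list of entries, `build_timeline(alerts, chat, calls)` gives a single ordered record to walk through during the meeting.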
After a few post-mortems, we learned how to configure alerts correctly, where to send them (some are worth sending via SMS, others are not), who should react to them, and how to notify everybody involved about what is happening.
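To make the "some alerts deserve an SMS, others do not" idea concrete, here is a small sketch of severity-based routing. The severity levels, channel names, and the `ROUTES` table are assumptions made up for this example, not our real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # assumed levels: "critical", "warning", "info"


# Hypothetical routing table: only critical alerts wake somebody up via SMS.
ROUTES = {
    "critical": ["sms", "chat"],
    "warning": ["chat"],
    "info": [],
}


def channels_for(alert: Alert) -> list[str]:
    """Return the channels an alert should go to.

    Unknown severities fall back to chat so they are never silently dropped.
    """
    return ROUTES.get(alert.severity, ["chat"])
```

Keeping the routing in one small table makes it easy to adjust after a post-mortem shows that an alert was too noisy or too quiet.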
It may not sound like a lot, but we have been sleeping better since we made this discovery. The number of emergencies started to plummet, and we need fewer people to solve issues now.
Pretty much from the start, we decided that post-mortems should be available for everyone in the company to read. All of our products have much in common, so this felt like a smart thing to do. And it is, if you are doing it right. Of course, we made some mistakes here as well, but not for long. The main problem with shared post-mortems was their readability: everyone involved in the emergency could read them like a poem and understand them completely. But to people from other teams, they read as if a drunk man had written them.
The fix was straightforward: we asked those people to read the post-mortems and give us feedback. A few trial-and-error iterations and, boom, it was solved.
From that moment on, all discoveries one team makes are instantly available to all other teams as well. It is hard to measure how much this has reduced the number of problems repeating from team to team, but I bet it made at least some difference.
This one is so obvious that there is not much to tell about it. There is only one question that you have to ask yourself during a post-mortem to achieve quality improvement: is this the only place where such a problem may occur, or are there more?
Once you start looking for and fixing similar potential issues, your product quality improves massively.
My favorite change that this concept brought to my life is the fact that post-mortems can be used everywhere. We started out by implementing them as our default procedure after software-related emergencies. But the idea started to spread to other aspects of our daily job routines.
For example, we decided to improve the quality of our user stories by using post-mortems. There are many ways to share good practices among people, but we chose to track how well user stories work. Whenever a user story causes a problem, we do a post-mortem to find out what exactly went wrong and how to do a better job next time.
We are also trying to use this concept as a tool to improve our communication. The idea is to prevent a miscommunication from happening again by finding its causes with the help of a post-mortem.
Last but not least is the personal aspect. I do not know how many of my coworkers do this, but it has helped me improve myself a lot. This one is pretty simple as well. Every time I encounter a problem at the office, I do a post-mortem in my head. It can be a performance drop, a conflict with colleagues, or even the wrong mindset.
Every single time any of this happens, I sit for a few minutes and analyze my actions to find out where the problem was, what I could have done better, and how not to make that same mistake again.
Remember: you do it once, it is an experiment; you do it twice, it is a failure.