Any developer who has been on live support knows the drill. Tickets flood the dashboard and decision pressure builds. There is no way to resolve all issues in one go. Some tickets will have to take priority over the others and receive attention first.
That’s the uncontroversial bit. It’s a fact of life that not everything can be done at once, and it just manifests here on a dashboard of support tickets. Things get a bit more heated when a priority ticket has to be chosen. Assuming that you’re neither a solo entrepreneur nor an all-powerful dictator, you’ll be bound to discuss this with a couple of people who have slightly different background motivations. What ensues is a kind of round-table debate which some team members may find all too political.
There is a lot to say about the psychology and interpersonal dynamics of such exchanges. But how would we approach a triage session if we weren’t bound by so much humanness? In other words, is there an optimal or near-optimal strategy applicable to the process of assigning priorities to a bunch of tickets that need looking at?
There is. The path to such a strategy lies in recognizing that this commonly faced software engineering problem is in fact an instance of a decision problem under risk and uncertainty.
Let’s get concrete(r).
Given a support ticket, the following actions are available: no fix, external fix, quick fix and full fix.
The seasoned developer might, at this point, quickly associate certain typical scenarios with each of these actions. Firstly, no fix can be chosen when the impact of the issue is extremely low. For example, it’s not customer-facing and only occurs after a series of edge inputs. Secondly, an external fix is chosen when it is clear that the error occurs due to a fault in third-party software. In such cases, it is even imperative to negotiate a fix from the third party so as not to introduce an additional dependency in the existing software system (say system A has a hack to cover for a bug in system B; when system B fixes its bug, system A might have to revert its hack). For example, an API unable to handle requests at a reasonable interval should not be flood-defended on the consumer side. Thirdly, quick fixes can alleviate issues which have a high impact on the users. Lastly, full fixes provide the most robust solution but might take up significantly more development time.
There are already a couple of general lessons to draw from just thinking about the scenarios in which these actions occur.
Great observations, but can we do better? I think so. Let’s get formally concrete(r).
For each of the possible actions, calculate the expected return according to the following formula:
E[action] = P(issue | action) x V(issue) + P(no issue | action) x V(no issue)
where
P(): probability function, assigns a numerical value between 0 and 1 representing how likely an outcome is (here: how likely it is that the issue persists)
P(A|B): the conditional probability, how likely it is that A occurs given that B occurred
V(): value function, assigns a numerical value to each issue representing the gain/loss the issue brings to the software system
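As a minimal sketch, the formula above can be written as a small Python function. The function name `expected_return` and its parameter names are made up for illustration, and the numbers in the usage line are arbitrary rather than taken from any real ticket.

```python
def expected_return(p_issue: float, v_issue: float, v_no_issue: float) -> float:
    """E[action] = P(issue | action) x V(issue) + P(no issue | action) x V(no issue).

    p_issue    -- P(issue | action), the chance the issue persists after the action
    v_issue    -- V(issue), usually negative (a loss)
    v_no_issue -- V(no issue), usually positive (a gain)
    """
    if not 0.0 <= p_issue <= 1.0:
        raise ValueError("p_issue must be a probability between 0 and 1")
    # The two outcomes are complementary, so their probabilities sum to 1.
    p_no_issue = 1.0 - p_issue
    return p_issue * v_issue + p_no_issue * v_no_issue


if __name__ == "__main__":
    # Arbitrary illustration: a coin flip between an equal loss and gain nets out to 0.0.
    print(expected_return(0.5, -10.0, 10.0))
```

Only P(issue | action) needs to be estimated per action, since P(no issue | action) follows from it.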
There is a button on the front page whose font size is wrong. Annoying but not deadly.
We can leave it as is; the site is functional. Is there no external fix because it’s our button? Not necessarily: we’re using a library of UI components, so maybe their developer coded an Easter egg that changes the font size of every 5th button on a page. A quick fix can be devised by adding a bespoke stylesheet which is not transpiled or packaged up. A proper fix involves editing the namespaced stylesheet and pushing the change through the CI/CD pipeline and a release cycle.
Now, imagine the following imaginary stakeholders (any resemblance to real persons is purely coincidental):
Forgive me the tongue-in-cheek depictions of these characters; exaggeration can be a purposeful device to augment an exposition. I am confident that real companies are populated by people far more nuanced.
Before any formal analysis of the decision problem, these 5 people each have their preferred action based on their personal preferences and dispositions:
What would be the outcome of a meeting among those people deciding the right course of action regarding this issue? If all people have equal weight in the decision process, it’d be a quick fix. However, people have different standings. For example, Seasoned Steve might be able to argue his case and convince the others that a full fix is really most appropriate.
What’s happening in the background is that each of them is calculating their version of the expected value of each action and then communicating that to the others. Depending on their level of inventiveness, a diverse set of reasons might be brought forward as to why they believe their preferred action to be the right action for the company as a whole.
At this point, you might think: what is the value of having a formal method to break down a priority decision when such decisions are made individually anyway?
V(issue) in interval [-100, -10], i.e. some negative number depending on how bad the issue is perceived to be (e.g. Boss Betty, concerned about the company’s reputation, might estimate the negative impact of the issue to be higher than Seasoned Steve); let’s assume it to be -50
V(no issue) = 100, some positive number
P(issue | no fix) = 1
P(no issue | no fix) = 1 - P(issue | no fix) = 0
P(issue | external fix) = 0.99
P(no issue | external fix) = 1 - P(issue | external fix) = 0.01
P(issue | quick fix) = 0.2
P(no issue | quick fix) = 1 - P(issue | quick fix) = 0.8
P(issue | full fix) = 0.01
P(no issue | full fix) = 1 - P(issue | full fix) = 0.99
E[no fix] = P(issue | no fix) x V(issue) + P(no issue | no fix) x V(no issue)
= 1 x -50 + 0 x 100 = -50
E[external fix] = P(issue | external fix) x V(issue) + P(no issue | external fix) x V(no issue)
= 0.99 x -50 + 0.01 x 100 = -49.5 + 1 = -48.5
E[quick fix] = P(issue | quick fix) x V(issue) + P(no issue | quick fix) x V(no issue)
= 0.2 x -50 + 0.8 x 100 = -10 + 80 = 70
E[full fix] = P(issue | full fix) x V(issue) + P(no issue | full fix) x V(no issue)
= 0.01 x -50 + 0.99 x 100 = -0.5 + 99 = 98.5
According to the above calculation, we should prioritize the full fix. The quick fix comes second. No fix and the external fix should be avoided.
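The calculation above can be reproduced with a few lines of Python. This is only a sketch under the values assumed in this example; the names `P_ISSUE`, `V_NO_ISSUE`, `expected_returns` and `rank_actions` are made up for illustration. Because V(issue) was given as a range and different stakeholders may pick different values from it, the sketch also re-ranks the actions at both ends of the interval.

```python
# Values assumed in the worked example above.
V_NO_ISSUE = 100.0
P_ISSUE = {  # P(issue | action)
    "no fix": 1.0,
    "external fix": 0.99,
    "quick fix": 0.2,
    "full fix": 0.01,
}


def expected_returns(v_issue: float, v_no_issue: float = V_NO_ISSUE) -> dict:
    """Expected return of every action for a given V(issue)."""
    return {
        action: p * v_issue + (1.0 - p) * v_no_issue
        for action, p in P_ISSUE.items()
    }


def rank_actions(v_issue: float) -> list:
    """Action names ordered from highest to lowest expected return."""
    returns = expected_returns(v_issue)
    return sorted(returns, key=returns.get, reverse=True)


if __name__ == "__main__":
    returns = expected_returns(-50.0)
    for action in rank_actions(-50.0):
        print(f"{action:>12}: {returns[action]:6.1f}")
    # Compare a mild, a middling and a severe estimate of V(issue).
    for v_issue in (-10.0, -50.0, -100.0):
        print(v_issue, rank_actions(v_issue))
```

With these particular probabilities, the order of the actions stays the same for every V(issue) in the interval, so a disagreement about how bad the issue is would not, on its own, change the recommendation. Disagreement about the probabilities could.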
Whether it is desirable to explicate each triaging decision to this level of detail is questionable. Whether the developer who has read this article has an edge in understanding what’s really going on during a triage call? A certainty.