I was calling my mother when WhatsApp died on Oct 4. That in itself wasn't a bad thing, as she was asking me when I would get married :) But truth be told, I waited anxiously for the root cause of the complex distributed-systems issue that caused a global meltdown of three of the biggest software systems known to mankind, used by billions across the world. Imagine my disappointment when this statement came out: "its root cause was a faulty configuration change on our end". Come on, folks… I was really hoping for a bug in the implementation of a provably correct consensus protocol that caused a cascading failure in leader election…
In some ways, this is still a step forward: at least the DevOps community uses honest words like "operator error" or "configuration mistake", which place the root cause where it belongs.
Dijkstra said it right: "… begin with cleaning up our language by no longer calling a bug a bug but by calling it an error. It is much more honest because it squarely puts the blame where it belongs, viz. with the programmer who made the error. The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation."
But the question remains: How could we still be making "operator errors" in a world where software is everything?
Following in Dijkstra's footsteps, let us (the community) begin by rejecting vague root causes like "faulty configuration", "operator error", etc. Someone did something that went horribly wrong. Can we get answers to simple questions: What was the change? How was it tested? How did it reach the whole world before it was caught? What has changed to avoid a recurrence? At Amazon, we did this through the Correction of Error (COE) process. Wouldn't it be great if all those details were published to end users to inspire transparency?
Errors are a reality, so stopping at blame won't cut it. Our community needs to think deeply about why these errors make it to production and how our tooling can be made operator-proof.
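To make "operator-proof" concrete, here is a minimal sketch in Python of one common pattern: a configuration change is refused unless it passes validation and survives a small canary before it can touch the whole fleet. All names here (`ConfigChange`, `guarded_rollout`, the config key) are hypothetical illustrations, not anyone's actual tooling; the point is only that the safety checks live in the tool, not in the operator's memory.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConfigChange:
    """A hypothetical, minimal description of a proposed config change."""
    key: str
    old_value: str
    new_value: str
    author: str

def validate(change: ConfigChange, known_keys: set[str]) -> list[str]:
    """Return a list of reasons to reject the change; empty means it may proceed."""
    problems = []
    if change.key not in known_keys:
        problems.append(f"unknown key: {change.key}")
    if change.new_value == change.old_value:
        problems.append("no-op change; nothing to roll out")
    if not change.author:
        problems.append("change has no accountable author")
    return problems

def guarded_rollout(
    change: ConfigChange,
    known_keys: set[str],
    apply_to_canary: Callable[[ConfigChange], None],
    canary_healthy: Callable[[], bool],
    apply_to_fleet: Callable[[ConfigChange], None],
) -> bool:
    """Apply a change fleet-wide only after validation and a healthy canary."""
    problems = validate(change, known_keys)
    if problems:
        print(f"rejected: {problems}")
        return False
    apply_to_canary(change)          # small blast radius first
    if not canary_healthy():
        print("canary unhealthy; halting rollout")
        return False
    apply_to_fleet(change)           # only now touch everything
    return True

if __name__ == "__main__":
    change = ConfigChange(
        key="network.advertise_dns_routes",   # hypothetical key for illustration
        old_value="true",
        new_value="false",
        author="oncall@example.com",
    )
    ok = guarded_rollout(
        change,
        known_keys={"network.advertise_dns_routes"},
        apply_to_canary=lambda c: print(f"canary got {c.key}={c.new_value}"),
        canary_healthy=lambda: False,         # pretend the canary broke
        apply_to_fleet=lambda c: print("fleet updated"),
    )
    print("rolled out to fleet:", ok)
```

In this sketch the dangerous step (fleet-wide apply) simply cannot be reached without the gates in front of it, which is the property I mean by operator-proof: the human can still make the error, but the tooling stops it from becoming a global outage.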
Graph source: https://www.kentik.com/blog/facebooks-historic-outage-explained/