Well, it all started on the morning of 28th February, 2017. It was a normal day for engineers at Amazon.
Their billing system had become very slow, and an engineering team started debugging the issue as a routine maintenance activity, to give users a better experience.
Amazon always strives for a customer-oriented experience.
PS: I am really sad about what happened to such a flawless company.
While debugging the issue, they had to take some servers (which were responsible for S3 billing) offline so that they no longer served live traffic.
Amazon S3 is a highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites.
For example, Instagram uses Amazon's services to function, so most likely all of the data Instagram has is stored in Amazon S3.
Say the S3 service is served by 100 servers, and Amazon had to take some of them offline (say 5 servers). They wanted the load to be distributed across the remaining 95 servers.
For all the non-tech folks, please have a look at the video that explains how load is distributed across multiple servers to keep a site up and running.
As in the video above, Amazon may have hundreds of servers balanced by a load balancer. They had to take some servers offline.
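To make the 100-server example concrete, here is a minimal sketch of round-robin load balancing, which is one common way to spread requests (I am not claiming it is what Amazon actually uses). The server names and the figure of 9,500 requests are made up for illustration.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, servers):
    """Round-robin: hand each request to the next server in turn."""
    counts = Counter({server: 0 for server in servers})
    ring = cycle(servers)
    for _ in range(requests):
        counts[next(ring)] += 1
    return counts


# 100 hypothetical servers, as in the example above.
servers = [f"server-{i}" for i in range(1, 101)]

before = distribute(9500, servers)      # all 100 servers online
after = distribute(9500, servers[5:])   # 5 servers taken offline, 95 remain

# Requests handled by the busiest server before and after the removal.
print(max(before.values()), max(after.values()))  # prints "95 100"
```

With all 100 servers online each one handles 95 requests. With 5 offline, the same traffic means 100 requests for each of the remaining 95, which is a small and perfectly manageable increase.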
But due to a typo, whoever ran the command mistakenly took far more servers offline than intended.
Just for information (you can skip this):
In Unix, if you have to remove the files in the current directory, the command is something like
rm -rf ./*
( . means the current directory)
However, if you mistype it as something like
rm -rf /
( / means the root directory, the top of the entire filesystem)
This will try to remove every file on the machine if you have root permissions. (Modern GNU rm refuses to do this unless you also pass --no-preserve-root.)
Something similar, with different commands, may have happened while removing servers from the online set.
It was just a small typo, but it had a devastating effect on the internet.
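We don't know the exact command that was run, so the following is purely an illustration. It sketches the kind of safety check a server-removal tool could have: refuse any request that would leave fewer servers online than some minimum. The function name remove_servers, the min_required parameter, and all the numbers are hypothetical.

```python
def remove_servers(online, count, min_required):
    """Return the servers still online after removing `count` of them.

    Refuses the request if it would leave fewer than `min_required`
    servers online.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    remaining = len(online) - count
    if remaining < min_required:
        raise RuntimeError(
            f"refusing to remove {count} servers: only {remaining} would "
            f"stay online, but at least {min_required} are required"
        )
    return online[count:]


# 100 hypothetical servers, of which at least 90 must stay online.
online = [f"server-{i}" for i in range(1, 101)]

# The intended command: take 5 servers offline.
print(len(remove_servers(online, 5, min_required=90)))  # prints 95

# The "typo": 50 instead of 5. The check stops it.
try:
    remove_servers(online, 50, min_required=90)
except RuntimeError as err:
    print(err)
```

The point is that a single extra character (50 instead of 5) is enough to change the outcome completely, and a simple lower bound on remaining capacity turns that typo into an error message instead of an outage.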
This large unintentional removal took two S3 subsystems down.
Very soon it turned into a disaster, as the unavailability of the storage service brought down many other Amazon services.
The hardest hit were sites and apps built on Amazon Web Services: Instagram, Vine, IMDb, Trello, Quora, IFTTT, Medium, Splitwise, websites built with wix.com, and others.
Alexa was struggling to stay online, too.
Amazon S3 is used by around 148,213 websites, and 121,761 unique domains, according to data tracked by SimilarTech.
Nest’s app was unable to connect to thermostats and other devices for a period of time as well, which effectively broke many people’s home appliances for a while.
The blow was so bad that Amazon's own dashboard was also down, so assessing the issues and the current status was impossible, and Amazon couldn’t say when things would be back up.
Amazon S3 is the backbone for all of its services as shown above.
So to get the Amazon services up and running again, they had to restart all the servers that had been taken offline.
Amazon's services have done really well in the past few years and have gained a lot of subscribers. Their infrastructure has grown significantly.
Amazon had not restarted these servers for many years, and hence the unexpected restart took longer than expected.
“We want to apologize for the impact this event caused for our customers. We will do everything we can to learn from this event and use it to improve our availability even further.”
That’s a huge loss! I pray for the guy who executed this command ;-)
Indeed, Amazon is flawless when it comes to the cloud services it offers. It has given tough competition to Google Cloud and Microsoft Azure.
I feel really bad about what happened, but perhaps this will help Amazon build a more robust system, and it will emerge as the number one player in cloud for years to come.
PS: They deserve a promotion from my profile.
Thank you for reading. If you liked it, please recommend it.
Share it with your friends and colleagues so they know about the wonderful world of technology,
the world which seems flawless but has a lot of effort put in under the hood by mysterious software engineers.