In October 2019, we made a critical mistake that led to 10,000 important files disappearing overnight. It was a disaster—one that could have ruined our business. But five years later, that same experience saved our new company from an even bigger crisis.
This is a story about data loss, misconfigurations, and the hard lessons that led us to build a bulletproof backup system. If you're running a system that stores critical data, this could help you avoid making the same mistakes.
Gama (gama.ir) is a K-12 educational content-sharing platform launched in 2014 in Iran, with over 10 million users worldwide, built around user-generated content.
Because users supply the content, maintaining secure file storage was a top priority. We used MooseFS, a distributed file system, running on five nodes with a triple-replication model for redundancy.
Our backup was a simple external HDD where we stored a copy of every file, roughly along the lines of the sketch below. It worked fine, and we rarely needed it. But then we made a dangerous assumption.
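For readers who want something concrete: a job like this can be as small as a nightly script that mirrors new or changed files onto the drive. The sketch below is a minimal illustration, not our original script, and the mount points are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

SOURCE = Path("/mnt/mfs/uploads")         # hypothetical MooseFS mount point
BACKUP = Path("/mnt/backup_hdd/uploads")  # hypothetical external HDD mount point

def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large uploads are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def mirror() -> None:
    """Copy every new or changed file from primary storage onto the backup drive."""
    for src in SOURCE.rglob("*"):
        if not src.is_file():
            continue
        dst = BACKUP / src.relative_to(SOURCE)
        if not dst.exists() or sha256_of(src) != sha256_of(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 preserves timestamps and mode bits

if __name__ == "__main__":
    mirror()
```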
One of our engineers suggested migrating to GlusterFS, a better-known distributed file system. It sounded great: more scalability, wider adoption, and seemingly better performance. After weighing the cost-benefit tradeoff, we decided to switch.
Two months later, the migration was complete. Our team was thrilled with the new system. Everything seemed stable… until it wasn’t.
There was just one small problem:
Our backup HDD was 90% full, and we needed to make a decision.
Because we had never really needed our full backups before, we assumed GlusterFS was reliable enough.
We removed our old backup strategy and trusted GlusterFS replication.
That was a bad decision.
Two months later, one morning, we started receiving reports: some files were missing.
At first, we thought it was a network glitch—something minor. But as we dug deeper, we found that Gluster was showing missing chunks and sync errors.
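For context, GlusterFS replicated volumes report files that still need healing through `gluster volume heal <VOLNAME> info`. A small watchdog along these lines (the volume name is hypothetical, and the output parsing is a best-effort sketch) would have surfaced the problem hours earlier:

```python
import subprocess

VOLUME = "gama-files"  # hypothetical volume name

def pending_heal_entries(volume: str) -> int:
    """Sum the 'Number of entries' counters printed by `gluster volume heal <vol> info`."""
    output = subprocess.run(
        ["gluster", "volume", "heal", volume, "info"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for line in output.splitlines():
        if line.strip().startswith("Number of entries:"):
            value = line.split(":", 1)[1].strip()
            if value.isdigit():
                total += int(value)
    return total

if __name__ == "__main__":
    entries = pending_heal_entries(VOLUME)
    if entries:
        print(f"WARNING: {entries} entries on '{VOLUME}' are waiting to be healed")
```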
3:30 AM: We decided to restart the GlusterFS cluster, believing a fresh bootstrap would fix the problem. At first, it seemed to work!
We thought we had solved it.
Then, a WhatsApp message from the content team came in:
“The files are empty.”
Wait, what? The files existed, but they contained nothing.
We checked manually. The files still had size and metadata, but when we opened them, they were completely blank.
10,000 files were gone.
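In hindsight, a periodic content check, not just a metadata check, would have caught this far sooner. Below is a rough sketch of such a scan; the mount point is hypothetical, and the all-zero-bytes heuristic is an assumption about how this kind of corruption tends to surface, not a description of the tooling we had at the time.

```python
from pathlib import Path

DATA_ROOT = Path("/mnt/gluster/uploads")  # hypothetical GlusterFS mount point

def looks_hollow(path: Path, probe_size: int = 4096) -> bool:
    """Flag files whose metadata claims content but whose bytes read back empty or all zeros."""
    reported_size = path.stat().st_size
    if reported_size == 0:
        return False  # genuinely empty files are a different problem
    with path.open("rb") as handle:
        probe = handle.read(probe_size)
    return len(probe) == 0 or probe.count(0) == len(probe)

if __name__ == "__main__":
    suspects = [p for p in DATA_ROOT.rglob("*") if p.is_file() and looks_hollow(p)]
    print(f"{len(suspects)} files report a size but contain no readable data")
```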
We had a backup HDD. That should have saved us, right?
Wrong. During the migration to GlusterFS we had restructured our directory layout, and every file now had a new hashed path in the database.
Our old backups were useless: the filenames they contained no longer matched the hashed paths our database pointed to.
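In hindsight, the deeper lesson was to index backups by the content itself, not by wherever a file happens to live: a SHA-256 of the bytes survives any rename or directory restructure. Here is a minimal sketch of that idea; the manifest file, paths, and lookup flow are hypothetical, not the recovery tooling we actually had.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

OLD_BACKUP = Path("/mnt/backup_hdd/uploads")      # hypothetical old backup mount
MANIFEST = Path("/mnt/backup_hdd/manifest.json")  # hypothetical hash -> path index

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest() -> None:
    """Map content hash -> backup path, so a file can be found no matter what it was renamed to."""
    index = {sha256_of(p): str(p) for p in OLD_BACKUP.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(index, indent=2))

def restore_candidate(content_hash: str) -> Optional[str]:
    """Look up a lost file by the content hash stored alongside its database record."""
    index = json.loads(MANIFEST.read_text())
    return index.get(content_hash)
```

The key design choice is that the lookup key is derived from the file's bytes, so it stays valid no matter how the paths change.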
We tried multiple recovery methods. Nothing worked.
In the end, we had to email thousands of users, asking them to re-upload their lost files.
It was a nightmare. But it forced us to rethink everything.
After this disaster, we completely redesigned both our storage layer and our backup strategy.
On the backup side, we no longer rely on a single storage system. Instead, every file is protected by a three-layered backup strategy (a simplified sketch of the idea follows).
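The exact layers will look different for every team, but to make the shape of it concrete, here is a simplified sketch of a three-layer routine. The layer choices, hosts, and bucket name are illustrative assumptions, not our production setup.

```python
import subprocess
from datetime import date

SOURCE = "/mnt/storage/uploads/"                       # hypothetical primary data path
ONSITE_MIRROR = "/mnt/backup_hdd/uploads/"             # layer 1: local mirror (illustrative)
OFFSITE_HOST = "backup@offsite.example.com:/backups/"  # layer 2: off-site copy (illustrative)
ARCHIVE_BUCKET = "s3://example-cold-archive"           # layer 3: dated cold archive (illustrative)

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def three_layer_backup() -> None:
    # Layer 1: fast local mirror for everyday restores.
    run(["rsync", "-a", "--delete", SOURCE, ONSITE_MIRROR])
    # Layer 2: off-site copy so one machine room is never the single point of failure.
    run(["rsync", "-a", SOURCE, OFFSITE_HOST])
    # Layer 3: dated cold archive that nothing in production can overwrite in place.
    run(["aws", "s3", "sync", SOURCE, f"{ARCHIVE_BUCKET}/{date.today():%Y-%m-%d}/"])

if __name__ == "__main__":
    three_layer_backup()
```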
Fast forward five years. Gamatrain.com, our new business in the UK, faced another rare incident.
But this time, we didn’t lose a single file.
Why? Because of the lessons we learned in 2019 and the system we built to prevent it.
#devops #backupstrategy #datarecovery #engineeringfailures #disasterrecovery