This is the story of how we cut down our AWS costs by 80% in just under two weeks.
Let me start with some background. We have used AWS for all our projects since 2018, and it has worked wonders for us. We are a fully distributed team, so running our own data center somewhere in the world would be impractical. It is much easier to rent resources from AWS and skip the capital expenses.
The problem with AWS is that developers can create virtually any resource without approval from our financial department. With traditional data centers, this is not the case: buying an additional server means getting an invoice from a vendor and asking the finance department to pay it.
In short, the root of the problem is that with AWS, developers can buy whatever resources they want, whenever they want.
We are not a huge company, and our AWS costs are just a little over $7k per month across all AWS accounts. It is also worth mentioning that we host only DEV and QA environments, as PROD environments are paid for by our customers. Our resources are mostly individual dev machines, test databases, and various custom resources for research projects such as Kinesis Firehose, SageMaker, and so on. In other words, we have a lot of random resources that are hard to categorize, structure, predict, and control.
So, how did we tackle lowering our AWS costs?
First, we started looking into the Cost Explorer and identified the most expensive items:
We found a Bitcoin node that had been running for the last four months, costing us $600/month because it required a large SSD volume with extra provisioned throughput. We had done a small research project on Bitcoin Ordinals and never removed the machine afterwards.
Resolution: we archived the volume (costs $6/month) and terminated the VM.
Savings: $594/month
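For illustration, here is roughly what that resolution could look like in boto3. This is a minimal sketch, assuming that "archiving" means snapshotting the volume and moving the snapshot to the EBS snapshots archive tier; all IDs are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2")

VOLUME_ID = "vol-0123456789abcdef0"    # hypothetical IDs
INSTANCE_ID = "i-0123456789abcdef0"

# Snapshot the volume so the data can be restored later if needed.
snap = ec2.create_snapshot(VolumeId=VOLUME_ID,
                           Description="Bitcoin node archive")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Move the snapshot to the cheaper archive storage tier.
ec2.modify_snapshot_tier(SnapshotId=snap["SnapshotId"], StorageTier="archive")

# Terminate the VM, then delete the now-detached volume.
ec2.terminate_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_terminated").wait(InstanceIds=[INSTANCE_ID])
ec2.delete_volume(VolumeId=VOLUME_ID)
```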
We found an Nvidia Tesla GPU machine that cost us $531/month. We still use it to this day for generative AI experiments: we are thinking of building our own text-to-video app, so we need this machine.
Resolution: moved the volume to a spot instance.
Savings: $360/month
Not the most expensive, but the most remarkable finding was a demo PROD environment we had forgotten to remove in an unused region, where we had deployed our Terraform scripts to test the rollout of PROD “from scratch”.
Savings: $340/month.
Many smaller items.
Resolutions: vary.
Savings: $1700/month
Second, we started moving everything possible to spot instances. The procedure is simple. For an individual machine, you shut it down, detach the volume (remember to write down the mount path), and terminate the machine. Then you create a new spot instance (any AMI will do, as long as the CPU architecture is compatible with your previous volume). Once the spot instance is created, detach (and don't forget to delete!) its new volume and attach the previous volume at the same mount path it had on the original machine. For Beanstalk environments, it's even simpler: we just changed the capacity settings to use only spot instances.
Savings: $1000/month
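Here is a rough boto3 sketch of the per-machine procedure above. The IDs, AMI, and instance type are placeholders, and the cleanup of the spot instance's own fresh volume is left out for brevity; treat this as a sketch, not production code.

```python
import boto3

ec2 = boto3.client("ec2")

OLD_INSTANCE_ID = "i-0123456789abcdef0"   # hypothetical IDs
VOLUME_ID = "vol-0123456789abcdef0"
DEVICE = "/dev/xvdf"                      # write down the original mount path first!

# 1. Stop the on-demand machine, detach its volume, and terminate it.
ec2.stop_instances(InstanceIds=[OLD_INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[OLD_INSTANCE_ID])
ec2.detach_volume(VolumeId=VOLUME_ID, InstanceId=OLD_INSTANCE_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])
ec2.terminate_instances(InstanceIds=[OLD_INSTANCE_ID])

# 2. Launch a spot instance (any AMI with a compatible CPU architecture).
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI
    InstanceType="t3.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={"MarketType": "spot"},
)
new_id = resp["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

# 3. Attach the old volume at the same device path it had before.
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=new_id, Device=DEVICE)
```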
Third, we cleaned out unused S3 buckets (we had built some auto-trading bots that accumulated a lot of streaming data) and set up automatic data expiration on several buckets, so that we don't store trading data for more than a year, after which it becomes completely obsolete and useless.
Savings: $300/month
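In S3, this kind of auto-removal is done with a lifecycle rule. A minimal boto3 sketch, with a hypothetical bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Expire all objects after 365 days; S3 deletes them automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-trading-data",              # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-trading-data-after-1y",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```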
Fourth, we shrank some resources. It's a matter of checking the consumed CPU and RAM: if we see constant utilization below 50%, we move the machine down a tier.
Savings: $300/month (would be 3x more on on-demand instances)
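A sketch of how such a check can be automated with CloudWatch metrics. The instance ID and the two-week window are assumptions, and note that memory metrics require the CloudWatch agent, so only CPU is shown here:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")

INSTANCE_ID = "i-0123456789abcdef0"        # hypothetical instance

# Average CPU over the last two weeks, one datapoint per day.
stats = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=14),
    EndTime=datetime.now(timezone.utc),
    Period=86400,
    Statistics=["Average"],
)

points = [p["Average"] for p in stats["Datapoints"]]
if points and max(points) < 50:
    # Daily average CPU stayed below 50% for the whole window.
    print(f"{INSTANCE_ID}: candidate for a smaller tier")
```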
Fifth, we set up auto-shutdown on individual machines. We created multiple Lambda functions for different types of tasks: shutting down a SageMaker Jupyter VM after one hour of inactivity, and shutting down individual VMs and the DEV and QA environments overnight, when nobody is working. These Lambda functions run daily on CloudWatch scheduled events. There are also Lambdas that bring the DEV and QA environments back up, to make the process painless.
Savings: $500/month
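As an example, here is a minimal Lambda handler in the spirit of our night-shutdown functions. It assumes, hypothetically, that machines to be stopped carry an auto-stop=true tag; the nightly CloudWatch schedule that triggers it is configured separately.

```python
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Find running instances tagged for automatic night shutdown.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:auto-stop", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```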
Also, we implemented some smaller solutions for further savings, but they are not covered in this article.
So far, we have saved about $5500 of our $7000 monthly bill, which is around 80% of all costs! I knew that we were overspending on AWS, but I never realized it was THAT much. Over the course of a year, it adds up to about $66,000 in savings.
Going through cloud cost optimization ourselves made me understand how important it is to track cloud costs carefully. The savings alone can be enough to boost the business if you put the money into marketing. Or you could take it out as dividends and buy a new car. It is a substantial sum, and there are many things that can be done with it.
Since it is beyond question that cloud cost optimization is a necessary endeavor, how do companies approach it? Let's walk through the ways of implementing cloud waste management, from the simplest to the most advanced.
You could approach the problem in the most traditional way possible: deny the countless possibilities provided by AWS and restrict your developers to plain EC2 machines.
SQS? No. DynamoDB? No. Just use EC2 virtual machines and install everything on them.
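As a hedged illustration, such a restriction could be enforced with an IAM policy that allows only EC2 actions (everything else is implicitly denied). The policy name is hypothetical, and a real setup would need a few supporting permissions on top:

```python
import json
import boto3

iam = boto3.client("iam")

# Allow only EC2 actions; all other AWS services are implicitly denied.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:*", "Resource": "*"}
    ],
}

iam.create_policy(
    PolicyName="ec2-only-developers",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```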
Pros:
You can predict the spending very well, as there is a flat rate for each type of EC2 VM
Developers will pack the available machines with the software they need, just like in a traditional physical on-premises data center, which makes the money go further
Cons:
You miss out on the benefits of auto-scaling
Your developers waste time implementing things that are already there
You miss out on the automatic software updates that managed services would apply for you
All in all, it is not a good strategy to treat the cloud as if you were just renting hosting from GoDaddy.
What if you allow the developers to use and scale any resources, but require every resource to be approved by a special department that controls the costs? The developers cannot buy or scale resources themselves, but they can ask a designated person to do it for them.
Let's say a developer needs a Kinesis Firehose endpoint (yes, I deliberately picked a service you have most probably never even heard of). Would it be simple for the developer to explain to the controller what they want? On top of that, the developer would have to explain the reasoning behind the scaling and probably even prove that the architectural choice is sound and not wasteful in terms of cost.
With a concrete example like this, it becomes clear that it just does not work this way. It could only work if the cost management team consisted of experts.
And that’s just the tip of the iceberg. Now consider:
A resource becoming unneeded due to an architectural change
A developer leaving the job and not removing the resources they used for their individual development purposes
An emergency when a resource needs to be scaled quickly to avoid business trouble
Pros:
The developers can take full advantage of AWS-managed resources
The spending is well-controlled
Cons:
Heavy bureaucracy: every resource has to be explained and justified to the controllers
The cost-control team needs real AWS expertise, otherwise the conversation breaks down
Emergency scaling gets stuck waiting for approval
A more advanced way would be to actually find and hire AWS experts to control the spending. They can use the tools that AWS provides out of the box:
a cost explorer
a tagging subsystem
reserved instances
savings plans
cost anomaly detection
much more
These tools are not user-friendly and require well-educated personnel who know what to do with them. With them, however, you can actually start controlling your cloud costs. This approach requires not only tools and highly skilled workers but also a framework for the team to work in: periodic check-ups of underutilized resources, shrink-and-clean procedures, and so on.
A team that is basically DevOps with a finance-conscious approach is called FinOps.
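For a flavor of what programmatic cost control looks like, here is a minimal sketch using the Cost Explorer API via boto3, breaking one month's spend down by a hypothetical "team" cost-allocation tag (the dates are examples):

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# One month's unblended cost, grouped by the "team" cost-allocation tag.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]           # e.g. "team$backend"
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(f"{tag_value}: ${float(amount):.2f}")
```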
Pros:
The developers have the full power of AWS
Little bureaucratic overhead for the developers
The financial team has full control over the spending in various aspects: per project, per team, etc.
The developers consume resources in a conscious manner
Cons:
AWS experts are expensive to hire (or take a long time to grow in-house)
The framework of periodic reviews takes effort to set up and maintain
Once you start thinking seriously about hiring (or growing your own) FinOps team, you should also consider third-party cloud cost optimization software, such as Infinops. It is an automatic FinOps team member that works 24/7 and is not susceptible to human error. Such software automatically scans your cloud for underused resources and other known ways of saving, such as:
Using spot instances
Using reserved instances
Reducing the number of OpenSearch clusters in the QA environment
Disabling personal VMs for the night
Auto-shutting off expensive SageMaker VMs with Jupyter
etc.
All those tips arrive automatically, as your system is constantly scanned for changes. Such advice can save you up to 80% of your monthly bill, which usually means saving at least tens of thousands of dollars over the course of a year.
Pros:
Great tool for the FinOps team
Helps beginner FinOps engineers learn optimization techniques
Reduces the human factor
Enforces periodic reviews of resource consumption
Enforces tags, lifecycle management, etc
Allows tracking multiple AWS accounts at once
Cons:
The tool itself is one more subscription to pay for
In conclusion, I'd like to say that managing AWS costs can be tricky. Our 80% savings show that it is possible to spend less with the right moves. Whether you are setting limits on resources, requiring approvals, relying on expert teams, or using automated tools, it is essential to keep a close eye on expenses. After all, using the cloud should be about making things easier, not pricier.