
Grow Your AI While Cutting Machine Learning Costs: SageMaker Will Now Manage Spot Instances For You.

by Ed Gray, January 19th, 2020

Too Long; Didn't Read

AWS announced in the summer of 2019 that SageMaker can manage Spot instances without additional tooling. Spot instances are Amazon's unused cloud computing capacity. They are exactly the same as On-Demand EC2 instances in terms of capabilities and infrastructure, except that AWS can reclaim a Spot instance and sell it back to On-Demand customers when necessary. The size of the discount and how frequently Spot instances are interrupted depend on the instance type, Region, and Availability Zone. For example, a Spot p3.16xlarge in the US East (Ohio) Region costs 70% less than On-Demand and is interrupted by Amazon less than 5% of the time.


In 2018, OpenAI released a study that found the compute power used by the largest AI training runs has doubled every 3.5 months since 2012. From autonomous vehicles to DNA analysis, there's little doubt the demand for machine learning and AI is driving the supply of increased computing power today.

This can leave many organizations priced out of the AI race. If your competitors can afford the computing power to train models and apply them more quickly than you can, you're at a distinct disadvantage - so why bother? But AWS announced in the summer of 2019 that SageMaker can now manage Spot instances without additional tooling, which can help make effective machine learning more attainable for more organizations.

Spot instances are Amazon's unused cloud computing capacity, which AWS sells at a steep discount. They are exactly the same as On-Demand EC2 instances in terms of capabilities and infrastructure, except that AWS can reclaim a Spot instance and sell it back to On-Demand customers when necessary. The size of the discount and how frequently a Spot instance is interrupted depend on the instance type, Region, and Availability Zone.

To get a better estimate of how much more computing power you could harness at lower prices, check out the Spot Instance Advisor here. For example, a Spot p3.16xlarge in the US East (Ohio) Region costs 70% less than On-Demand and is interrupted by Amazon less than 5% of the time. In the EU (Frankfurt) Region, a Spot g3s.xlarge also costs 70% less than On-Demand but is interrupted much more frequently, more than 20% of the time. Meanwhile, in the newer Middle East (Bahrain) Region, no P instances are offered at all, and G4 instances are the only accelerated computing instances available on Spot.

Amazon regularly updates the Spot Instance Advisor based on market trends, but Spot no longer relies on a complicated, frequently changing bidding process. You simply tell AWS the maximum price you're willing to pay for the Spot instances, and if they're available at a price below that maximum, you get them at the current Spot price. This makes the cost and interruption trends more stable and easier to predict. You can read more about this change in Spot's pricing model on the AWS Blog here.
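The pricing rule can be sketched in a few lines of Python. This is a simplified model of the rule as described in this article, not AWS's actual allocation logic. The function name `spot_hourly_charge` and all prices are hypothetical and chosen only for illustration.

```python
from typing import Optional


def spot_hourly_charge(current_spot_price: float,
                       max_price: float,
                       capacity_available: bool = True) -> Optional[float]:
    """Return the hourly price paid, or None if the request is not fulfilled.

    Simplified model: the request is fulfilled only when spare capacity
    exists and the current Spot price is below your maximum. You then pay
    the current Spot price, not the maximum you set.
    """
    if not capacity_available or current_spot_price >= max_price:
        return None
    return current_spot_price


# Hypothetical prices in USD per hour, for illustration only.
print(spot_hourly_charge(current_spot_price=7.50, max_price=10.00))   # 7.5
print(spot_hourly_charge(current_spot_price=12.00, max_price=10.00))  # None
```

The point of the sketch is that the maximum price acts as a ceiling rather than a bid: setting it higher does not raise what you pay while the Spot price stays below it.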

So depending on the requirements of a workload and how well it can handle interruption, Spot instances may not always be the right fit, but they are almost always a good fit for SageMaker. As the AWS Podcast noted in its discussion of Spot instances, it's a "no-brainer" to use Spot for transient workloads like machine learning. SageMaker already made managing distributed training relatively easy; now, with support for Spot, it is also making training easier to implement and more affordable.

Before, engineers had to do the extra work of checkpointing and restoring interrupted workloads themselves, as described on Hacker Noon by Vadim Fedorov here. Now, with a simple adjustment in the settings wizard or SDK, SageMaker copies checkpoint data to S3 and, after a Spot instance is interrupted, restores that data and resumes training when spare capacity becomes available again.
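As a rough illustration of that "simple adjustment", the sketch below builds the Spot-related fields of a SageMaker `CreateTrainingJob` request as a plain Python dictionary. The helper name `managed_spot_settings`, the bucket name, and the time limits are assumptions made for this example. The field names follow the SageMaker API as I understand it, so check them against the current API reference before relying on them.

```python
import json


def managed_spot_settings(checkpoint_s3_uri: str,
                          max_runtime_seconds: int,
                          max_wait_seconds: int) -> dict:
    """Build the Spot-related fields of a CreateTrainingJob request.

    MaxWaitTimeInSeconds covers both waiting for Spot capacity and the
    training itself, so it must be at least MaxRuntimeInSeconds.
    """
    if max_wait_seconds < max_runtime_seconds:
        raise ValueError("max_wait_seconds must be >= max_runtime_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        "CheckpointConfig": {
            # Where SageMaker keeps checkpoints between interruptions.
            "S3Uri": checkpoint_s3_uri,
            # Directory inside the training container that is synced to S3.
            "LocalPath": "/opt/ml/checkpoints",
        },
    }


settings = managed_spot_settings(
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",  # hypothetical bucket
    max_runtime_seconds=3600,
    max_wait_seconds=7200,
)
print(json.dumps(settings, indent=2))
```

In practice these fields would be merged with the rest of the job definition (job name, algorithm, role, input and output locations) and passed to boto3's `create_training_job`. Recent versions of the SageMaker Python SDK expose the same settings on an Estimator as `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri`. Note that the training script still has to write its own checkpoints to the local path and load the latest one on startup; SageMaker handles moving them to and from S3.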

Recently, the AWS Blog shared how Cinnamon AI migrated its machine learning workloads from on-premises infrastructure to SageMaker. Not only are the batch workloads inherent to machine learning a challenge for traditional on-premises infrastructure, but Cinnamon AI also had separate development environments for its products, which it was able to consolidate onto AWS.

After optimizing its training to run on SageMaker with On-Demand P2 and P3 instances, Cinnamon AI was then able to take advantage of SageMaker's integrated Managed Spot Training feature. Today, almost all of its model training jobs run on Spot, which has reduced its compute costs by 70%. The company was able to invest those savings in running 40% more daily jobs while still lowering its overall machine learning costs.
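A back-of-the-envelope calculation shows how both claims can hold at once. The sketch below assumes, purely for illustration, that every job costs the same and that all jobs run on Spot at the full 70% discount. The function name `relative_compute_cost` is hypothetical, and the result is an estimate under those assumptions, not a figure reported by Cinnamon AI.

```python
def relative_compute_cost(spot_discount: float, job_growth: float) -> float:
    """Compute spend after the change, as a fraction of the original spend.

    Assumes every job costs the same and all jobs run on Spot.
    """
    return (1.0 - spot_discount) * (1.0 + job_growth)


ratio = relative_compute_cost(spot_discount=0.70, job_growth=0.40)
print(f"{ratio:.2f}")  # 0.42
```

Under these assumptions, running 40% more jobs at 30% of the previous per-job cost comes to roughly 42% of the original compute bill, so spend still falls even as the number of training jobs grows.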

So don't assume that you can't get the big insights of machine learning just because you don't have the big bucks. With the discounts of Spot instances and the new ease of their integration with SageMaker, more and more savvy engineers can take advantage of AI. I hope this has been a helpful primer on one way your organization can effectively join the AI race, and run it better than your competitors.

Have any other tips on how to take advantage of machine learning and AI on a budget? Please share in a comment below.