Upgrading a Kubernetes cluster is a challenging task, especially when it is serving a production load. But with a little planning, it is possible to upgrade an EKS cluster without downtime. Yes! You heard it right, WITHOUT DOWNTIME! Here’s a checklist for all the Amazon EKS users who wish to upgrade their EKS clusters without any downtime.
Migrate any removed/deprecated Kubernetes APIs
Kubernetes releases often deprecate certain APIs and remove them in subsequent versions. Some major API removals happened in Kubernetes 1.16 and 1.22. If you’re upgrading to either of these versions or beyond, it is good practice to always check for any removed/deprecated APIs still being used in your cluster and take action on them accordingly, so that there are no scary post-upgrade surprises for you. Use a tool like Silver-surfer to scan through all the Kubernetes native objects and migrate any APIs being removed to the latest apiVersion before the upgrade.
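For example, the Ingress resource served from extensions/v1beta1 and networking.k8s.io/v1beta1 was removed in Kubernetes 1.22 in favour of networking.k8s.io/v1. Here’s a minimal sketch of a migrated manifest, using hypothetical app and host names:

```yaml
# Hypothetical Ingress migrated to networking.k8s.io/v1
# (extensions/v1beta1 and networking.k8s.io/v1beta1 were removed in K8s 1.22)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix        # pathType is mandatory in v1
            backend:
              service:              # serviceName/servicePort became service.name/port
                name: my-app
                port:
                  number: 80
```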
Ensure Pod Disruption Budgets are set in your deployments
Pod disruption budgets allow you to safely drain your pods from one node to another while still meeting an uptime SLA. Consider setting up the Pod Disruption Budget according to how critical the application is and how much load it will be serving during the upgrade activity. Here are a few recommendations (a sample manifest follows the list):
A) Conservative approach
Set the pod disruption budget to maxUnavailable: 1
Recommended for high traffic/volume apps.
Node draining could take more time. This approach ensures only 1 replica of an app is taken down at a time, so that the burden on the remaining replicas doesn’t increase too much at once.
B) Aggressive approach
Set the pod disruption budget to maxUnavailable: 50%
Recommended if you’re doing the activity at a non-peak time.
Node draining would be quicker. This approach will ensure that your pods are migrated to the new nodes as quickly as possible while still maintaining the SLAs of your application in the off-peak hours.
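Here’s a minimal PodDisruptionBudget sketch for the conservative approach, assuming a hypothetical app labelled app: my-app; swap maxUnavailable to 50% for the aggressive variant:

```yaml
# Conservative PDB: at most 1 replica disrupted at a time during node drains
# policy/v1 is available from K8s 1.21; use policy/v1beta1 on older clusters
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1          # use "50%" for the aggressive approach
  selector:
    matchLabels:
      app: my-app            # hypothetical label; match your Deployment's pods
```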
Ensure Liveness and Readiness probes are set
Kubernetes heavily depends on Liveness and Readiness probes to determine the health of a running pod and to exhibit the auto-healing and auto-scaling capabilities K8s is popular for. Please ensure that all your apps have Liveness and Readiness probes in place before you start the migration process for a no-disruption migration.
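A minimal sketch of both probes on a Deployment’s container, assuming a hypothetical HTTP service exposing /healthz and /ready endpoints on port 8080:

```yaml
# Container snippet from a Deployment spec (hypothetical endpoints and port)
containers:
  - name: my-app
    image: my-app:1.0.0
    ports:
      - containerPort: 8080
    livenessProbe:            # kubelet restarts the container if this fails
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:           # pod is removed from Service endpoints if this fails
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```

The readiness probe is what keeps traffic away from pods that have landed on a new node but aren’t serving yet, which is exactly what a zero-downtime drain depends on.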
Plan to jump 2 K8s Node versions at a time if possible
Kubernetes supports a version drift of up to two minor versions between nodes and the control plane, so nodes can safely run up to two minor versions behind the control plane version. You can only upgrade the control plane one minor version at a time, but nodes can jump more than one minor version at once, provided they stay within two minor versions of the control plane. This means that to go from Kubernetes 1.16 to 1.18, you can upgrade the EKS control plane from 1.16 to 1.17 and then from 1.17 to 1.18, and then jump the nodes directly from the 1.16 AMI to the 1.18 AMI. The node draining activity therefore needs to happen only once while jumping two minor versions.
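Before planning the jump, it helps to confirm the current skew between the control plane and the nodes; a quick check with kubectl:

```sh
# Show the control-plane (server) version
kubectl version --short

# Show the kubelet version of every node (see the VERSION column)
kubectl get nodes -o wide
```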
Plan the upgrade during off hours
Execute the upgrade as a scheduled maintenance activity and plan accordingly, even though no downtime is expected. Choose a 2-3 hour off-peak window and proceed as follows: start the EKS control plane upgrade as soon as the scheduled maintenance slot starts. Each control plane upgrade takes roughly 20-40 minutes, so if you’re planning to upgrade the control plane from EKS 1.20 to EKS 1.22, the two sequential upgrades will take roughly 40-80 minutes in total. You can also consider starting the control plane upgrade 30-60 minutes before your scheduled maintenance slot if you think the slot you have taken is not enough to complete the node drain activity, as AWS maintains its EKS control plane SLAs during the upgrade process too.
To upgrade the EKS cluster itself, you can stick to the standard upgrade path using eksctl.
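Here’s a sketch of the eksctl flow for one control plane hop plus a managed nodegroup jump, with hypothetical cluster and nodegroup names; adjust the versions to your own upgrade path:

```sh
# Upgrade the control plane one minor version (repeat once per hop)
eksctl upgrade cluster --name=my-cluster --version=1.17 --approve

# After the second hop (e.g. to 1.18), upgrade the managed nodegroup straight
# to the control-plane version; eksctl cordons and drains the old nodes for you
eksctl upgrade nodegroup --cluster=my-cluster --name=my-nodegroup --kubernetes-version=1.18
```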