How to Scale AI Infrastructure With Kubernetes and Docker

by Natapong Sornprom, February 15th, 2025

Too Long; Didn't Read

Firms increasingly make use of artificial intelligence (AI) infrastructures to host and manage autonomous workloads. Scalability ensures that AI systems can handle increasing workloads without any loss of performance. Organizations use Docker and Kubernetes to meet such needs.

Firms increasingly rely on artificial intelligence (AI) infrastructure to host and manage autonomous workloads. Consequently, there is significant demand for scalable, resilient infrastructure that can meet heterogeneous application and cloud requirements. Organizations turn to Kubernetes and Docker to meet these needs because both have proven highly effective at delivering scalable AI infrastructure.

Deploying AI infrastructure typically requires substantial computational power to execute models and process large datasets. These demands translate into a need for scalable methods that enable AI models to run on large workloads without hurting performance.

Why Companies Need to Scale Up Their AI Infrastructure

AI systems are resource-intensive, typically demanding both high computing capacity and the ability to process large volumes of data. As AI applications grow more advanced and operate at larger scale, scalability becomes more critical: it ensures that AI systems can handle increasing workloads without any loss of performance.

Expanding Data Volumes

Growing data volumes affect AI systems on many fronts. Most AI models, especially those based on deep learning, depend heavily on large amounts of data during training and inference. Without adequately scalable infrastructure, processing and interpreting such enormous quantities of data becomes a roadblock.

Optimized Performance

Scalable AI infrastructure maintains reliable, stable performance even under heavy computational load. With Kubernetes, horizontal scaling of AI jobs is straightforward, and the number of replicas can be resized dynamically as demand requires. Docker containers, in turn, provide lean, isolated environments for running AI models, so resource contention does not become a performance bottleneck.

Effective Resource Management

Efficient use of resources is key to cost-effective and sustainable AI deployment. Kubernetes resource requests and limits allow fine-grained control over CPU and memory, preventing both underprovisioning and overprovisioning; a concrete example appears in the vertical scaling section below. Docker complements this by isolating each container's resources.

Scaling AI Infrastructure With Kubernetes and Docker

Containerization is one of the milestones in the evolution of scalable AI infrastructure. Packaging the AI application and its dependencies into a Docker container ensures consistency across development, testing, and deployment environments.

First, you must define a Dockerfile to set up the environment. A Dockerfile is a series of instructions for building a Docker image: it declares a base image, the required dependencies, and the setup commands that apply to your app. The following is a basic Dockerfile for a Python machine-learning model:

# Use an official Python runtime as a parent image
FROM python:3.9-slim
 
# Set the working directory in the container
WORKDIR /usr/src/app
 
# Copy the current directory contents into the container
COPY . .
 
# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
 
# Expose the port the app runs on
EXPOSE 5000
 
# Define environment variable
ENV NAME World
 
# Run the app
CMD ["python", "./app.py"]
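
The Dockerfile above expects an app.py and a requirements.txt in the build context, which are not part of the original example. As a rough sketch, and assuming a Flask-based inference service listening on port 5000 (with Flask listed in requirements.txt), app.py might look something like this:

# app.py - minimal, hypothetical inference service used only for illustration
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON request body
    payload = request.get_json(force=True)
    # Placeholder for real model-loading and inference logic
    return jsonify({"input": payload, "prediction": "stub"})

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the service is reachable from outside the container
    app.run(host="0.0.0.0", port=5000)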

Once the Dockerfile is ready, you can build the Docker image and run the container with the following commands:

# Build the Docker image
docker build -t ml-model:latest .
 
# Run the container
docker run -p 5000:5000 ml-model:latest
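
Note that an image built this way exists only in your local Docker daemon. Unless your cluster shares that daemon (as some Minikube or kind setups do), you will typically need to push the image to a registry the cluster can pull from before the Kubernetes steps below will work. The commands are a sketch; the registry path is a placeholder you would replace with your own:

# Tag the image for your registry (placeholder path)
docker tag ml-model:latest <your-registry>/ml-model:latest

# Push it so the Kubernetes cluster can pull it
docker push <your-registry>/ml-model:latest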

Deploying the Dockerized AI Model to Kubernetes

Kubernetes provides a wide range of orchestration features that enable efficient application management in containerized infrastructure. Deploying the Docker image on Kubernetes ensures that a specified number of application replicas is always running. The following is an example of a deployment.yaml file that you can use to deploy your Dockerized machine learning model:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3  
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model-container
        image: ml-model:latest
        ports:
        - containerPort: 5000


The above snippet deploys the AI model, but you also need to make it externally accessible by defining a Kubernetes Service. The service.yaml below illustrates an example:

apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
  type: LoadBalancer


Use the kubectl command-line tool to apply the deployment and service configurations:

# Deploy the application
kubectl apply -f deployment.yaml
 
# Expose the service
kubectl apply -f service.yaml
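
After applying both manifests, it is worth checking that the Deployment, its Pods, and the Service are up. A quick verification might look like this (the LoadBalancer's external IP can take a moment to be assigned, depending on your environment):

# Check the Deployment and its Pods
kubectl get deployment ml-model-deployment
kubectl get pods -l app=ml-model

# Check the Service and its external IP
kubectl get service ml-model-service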

Scaling With Kubernetes

Kubernetes provides excellent scaling capabilities for AI environments, maximizing resource utilization and performance. Horizontal scaling adds more containers (replicas), while vertical scaling allocates more resources, such as CPU or memory, to existing containers.

Horizontal Scaling

Horizontal scaling increases the number of replicas (Pods) of an AI system to handle a higher workload. You can adjust the replica count manually with the `kubectl scale` command. The following command sets the deployment to run five replicas:

`kubectl scale --replicas=5 deployment/ml-model-deployment`

The command scales ml-model-deployment to five replicas of the machine-learning model container. Kubernetes then provisions additional Pods to meet the desired count.
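
To confirm the scale-out, a quick check (using the labels from the earlier deployment.yaml) might look like this:

# Watch the new Pods come up
kubectl get pods -l app=ml-model --watch

# Confirm the desired and current replica counts
kubectl get deployment ml-model-deployment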

Automatic Scaling Using the Horizontal Pod Autoscaler (HPA)

Kubernetes supports automatic scaling through the Horizontal Pod Autoscaler (HPA). The HPA dynamically adjusts the number of replicas based on resource utilization, such as CPU or memory, relative to configured targets. The YAML configuration below defines an HPA that scales ml-model-deployment in response to CPU usage:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50


In this setup, scaleTargetRef identifies the Deployment to be scaled, i.e., ml-model-deployment. The minimum replica count is set with minReplicas, while the maximum is controlled with maxReplicas. Finally, the target CPU utilization is set with targetCPUUtilizationPercentage, here 50%.

When average CPU utilization across the Pods exceeds 50%, Kubernetes automatically scales the replica count up, to a maximum of 10. As soon as CPU utilization drops back below the target, Kubernetes reduces the replica count to release resources.
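
Assuming the manifest above is saved as hpa.yaml, you can apply and inspect it as shown below. Note that the HPA needs a metrics source (typically metrics-server) installed in the cluster to read CPU usage:

# Create the autoscaler
kubectl apply -f hpa.yaml

# Check target vs. current CPU utilization and replica counts
kubectl get hpa ml-model-hpa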

Vertical Scaling

Horizontal scaling mainly copes with more traffic, whereas vertical scaling provides more resources (such as CPU or memory) to existing containers. This is done by adjusting resource requests and limits in the Kubernetes Deployment. To raise the CPU and memory allocations of ml-model-deployment, edit the deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model-container
        image: ml-model:latest
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"
          limits:
            cpu: "2"
            memory: "4Gi"

In this updated configuration:

  • requests specify the minimum resources required for the container.
  • limits define the maximum resources the container can use.
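
To roll out the change, re-apply the manifest; Kubernetes replaces the existing Pods with new ones that carry the updated resource settings. The verification commands below are a sketch (kubectl top requires metrics-server to be installed in the cluster):

# Apply the updated resource requests and limits
kubectl apply -f deployment.yaml

# Inspect the configured requests and limits on the Pods
kubectl describe pods -l app=ml-model

# Observe actual CPU and memory consumption (requires metrics-server)
kubectl top pods -l app=ml-model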