I am one of the contributors to the Spark NLP open-source project, and this library recently started supporting end-to-end Vision Transformer (ViT) models. I use Spark NLP and other ML/DL open-source libraries daily for work, so I decided to deploy a ViT pipeline for a state-of-the-art image classification task and provide in-depth comparisons between Hugging Face and Spark NLP.
The purpose of this article is to demonstrate how to scale out Vision Transformer (ViT) models from Hugging Face and deploy them in production-ready environments for accelerated, high-performance inference. By the end, we will scale a ViT model from Hugging Face by 25x (2300%) by using Databricks, Nvidia, and Spark NLP.
In this article I will:
In the spirit of full transparency, all the notebooks with their logs, screenshots, and even the Excel sheet with the numbers are provided here on GitHub.
Back in 2017, a group of researchers at Google AI published a paper that introduced a transformer model architecture that changed all Natural Language Processing (NLP) standards. The paper describes a novel mechanism called self-attention as a new and more efficient model for language applications. For instance, two of the most popular families of transformer-based models are GPT and BERT.
A bit of Transformer history: https://huggingface.co/course/chapter1/4
There is a great chapter about "How Transformers Work" which I highly recommend reading if you are interested.
Although these new Transformer-based models seemed to be revolutionizing NLP tasks, their usage in Computer Vision (CV) remained pretty much limited. The field of Computer Vision had been dominated by convolutional neural networks (CNNs), with popular CNN-based architectures such as ResNet. This had been the case until another team of researchers, this time at Google Brain, introduced the "Vision Transformer" (ViT) in a paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", first posted to arXiv in October 2020 (its latest revision is dated June 2021).
This paper represents a breakthrough in image recognition by using the same self-attention mechanism used in transformer-based models such as BERT and GPT, as we just discussed. In Transformer-based language models like BERT, the input is a sentence (for instance, a list of words). In ViT models, however, we first split an image into a grid of sub-image patches, then embed each patch with a linear projection so that each embedded patch becomes a token. The result is a sequence of embedded patches, which we pass to the model just as we would pass a sequence of tokens to BERT.
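To make the patch step concrete, here is a minimal pure-Python sketch of splitting an image into a grid of patches and flattening each one. It is an illustration only, not the code used by Hugging Face or Spark NLP: the function name `split_into_patches` and the zero-filled image are made up for this example, and the learned linear projection that turns each flattened patch into an embedding is left out.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Patches are returned in row-major grid order; each patch is a flat
    list of patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# A zero-filled 224x224 RGB image stands in for real pixel data.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 patches, 768 values each
```

With 16x16 patches, a 224x224 image becomes a sequence of 14 x 14 = 196 "tokens", which is where the "16x16 words" in the paper title comes from.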
An overview of the ViT model structure as introduced in Google Research's original paper
Vision Transformer focuses on higher accuracy with less compute time. Looking at the benchmarks published in the paper, we can see that the training cost compared with the Noisy Student model (published by Google in June 2020) has decreased by about 80% even though the accuracy is more or less the same. For more information regarding ViT performance today, you should visit its page on Papers With Code:
Comparison with state of the art on popular image classification benchmarks. (https://arxiv.org/pdf/2010.11929.pdf)
It is also important to mention that ViT models follow the same recipe as transformers in NLP: you can pre-train a model on a large dataset and then fine-tune it for your own task. (That's pretty cool, actually!)
If we compare ViT models to CNNs, we can see that they have higher accuracy with a much lower computation cost. You can use ViT models for a variety of downstream tasks in Computer Vision such as image classification, object detection, and image segmentation. This can also be domain-specific: in Healthcare, you can pre-train/fine-tune your ViT models for femur fractures, emphysema, breast cancer, COVID-19, and Alzheimer's disease.¹
I will leave references at the end of this article just in case you want to dig deeper into how ViT models work.
[1]: Deep Dive: Vision Transformers On Hugging Face Optimum Graphcore, https://huggingface.co/blog/vision-transformers
Vision Transformer (ViT) model (vit-base-patch16-224) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224:
https://huggingface.co/google/vit-base-patch16-224
Fine-tuned ViT models used for food classification:
https://huggingface.co/nateraw/food and https://huggingface.co/julien-c/hotdog-not-hotdog
There are, however, limitations and restrictions to any DL/ML model when it comes to prediction. No model has 100% accuracy, so keep that in mind when you are using them for something important like Healthcare:
Image is taken from: https://www.akc.org/expert-advice/lifestyle/do-you-live-in-dog-state-or-cat-state/ (ViT model: https://huggingface.co/julien-c/hotdog-not-hotdog)
Can we use these models from Hugging Face, or fine-tune new ViT models, and use them for inference in real production? How can we scale them by using managed services for distributed computations such as AWS EMR, Azure HDInsight, GCP Dataproc, or Databricks?
Hopefully, some of these will be answered by the end of this article.
Some details about our benchmarks:
1- Dataset: ImageNet mini: sample (>3K) and full (>34K)
I have downloaded ImageNet 1000 (mini) dataset from Kaggle: https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000
I have chosen the train directory with over 34K images and called it imagenet-mini, since all I needed was enough images to run benchmarks that take longer. In addition, I have randomly selected roughly 10% of the full dataset and called it imagenet-mini-sample; it has 3544 images, which I use for my smaller benchmarks and to tune parameters like the batch size.
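A reproducible random subset like this can be drawn in a few lines. The sketch below is only one possible way to do it and is not taken from the benchmark notebooks: the helper `sample_paths`, the seed, and the file names are hypothetical.

```python
import random


def sample_paths(paths, sample_size, seed=42):
    """Return a reproducible random subset of `paths` without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(paths), sample_size))


# Hypothetical file names standing in for the 34745 training images.
all_paths = [f"imagenet-mini/img_{i:05d}.JPEG" for i in range(34745)]
sample = sample_paths(all_paths, sample_size=3544)
print(len(sample), f"{len(sample) / len(all_paths):.1%}")  # 3544 10.2%
```

Fixing the seed means the same sample is benchmarked every time, so differences between runs come from the pipeline rather than from the images.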
2- Model: The "vit-base-patch16-224" by Google
We will be using this model from Google hosted on Hugging Face: https://huggingface.co/google/vit-base-patch16-224
3- Libraries: Transformers 🤗 & Spark NLP 🚀
ViT model on a Dell PowerEdge C4130
What is a bare-metal server? A bare-metal server is just a physical computer that is used by only one user. There is no hypervisor installed on this machine, there is no virtualization, and everything is executed directly on the main OS (Linux, Ubuntu). The detailed specs of the CPUs, GPUs, and memory of this machine are inside the notebooks.
As my initial tests have revealed, along with almost every blog post written by the Hugging Face engineering team comparing inference speed among DL engines, the best inference performance in the Hugging Face library (Transformers) is achieved by using PyTorch over TensorFlow. I am not sure whether this is because TensorFlow is a second-class citizen in Hugging Face (fewer supported features, fewer supported models, fewer examples, outdated tutorials, and yearly surveys for the last 2 years in which users keep asking for more TensorFlow) or because PyTorch just has a lower latency in inference on both CPU and GPU.
TensorFlow remains the most-used deep learning framework
Regardless of the reason, I have chosen PyTorch in the Hugging Face library to get the best results for our image classification benchmarks. This is a simple code snippet to use a ViT model (PyTorch of course) in Hugging Face:
from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
Predicting a single input image this way looks straightforward, but it is not suitable for larger amounts of images, especially on a GPU. To avoid predicting images sequentially and to take advantage of accelerated hardware such as a GPU, it is best to feed the model with batches of images, which is possible in Hugging Face via Pipelines. Needless to say, you can implement your own batching technique either by extending Hugging Face's Pipelines or by doing it on your own.
A simple pipeline for Image Classification will look like this:
from transformers import ViTFeatureExtractor, ViTForImageClassification
from transformers import pipeline
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)
As per the documentation, I have downloaded/loaded google/vit-base-patch16-224 for the feature extractor and model (PyTorch checkpoints, of course) to use them in the pipeline with image classification as the task. There are 3 things in this pipeline that are important to our benchmarks:
> device: If it's -1 (default) it will only use CPUs, while if it's a non-negative integer such as 0 it will run the model on the associated CUDA device id. (It's best to hide the GPUs and force PyTorch to use the CPU rather than rely only on this number.)
> batch_size: When the pipeline will use DataLoader (when passing a dataset, on GPU for a PyTorch model), this is the size of the batch to use; for inference, batching is not always beneficial.
> You have to use either a DataLoader or a PyTorch Dataset to take full advantage of batching in Hugging Face pipelines on a GPU.
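One common way to hide the GPUs from PyTorch for a CPU-only run is the `CUDA_VISIBLE_DEVICES` environment variable. The snippet below is a minimal sketch of that idea rather than the exact setup used in the notebooks; it assumes it runs at the very top of the script or notebook, before `torch` or `transformers` is imported.

```python
import os

# Set this before CUDA is initialized; the simplest way to guarantee that is
# to do it before importing torch or transformers in this process.
# An empty value leaves no CUDA device visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

cpu_only = os.environ["CUDA_VISIBLE_DEVICES"] == ""
print("GPUs hidden from this process:", cpu_only)
```

With the devices hidden, `torch.cuda.is_available()` should report `False`, so the CPU numbers cannot be helped along by a GPU by accident.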
Before we move forward with the benchmarks, you need to know one thing about batching in Hugging Face Pipelines for inference: it doesn't always help. As stated in Hugging Face's documentation, setting batch_size may not increase the performance of your pipeline at all. It may even slow down your pipeline:
https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
To be fair, in my benchmarks I used a range of batch sizes starting from 1 to make sure I can find the best result among them. This is how I benchmarked the Hugging Face pipeline on CPU:
from tqdm.auto import tqdm
from transformers import pipeline
# model and feature_extractor are loaded as in the previous snippet;
# dataset is the image dataset being benchmarked
pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)
for batch_size in [1, 8, 32, 64, 128]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass
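The loop above relies on tqdm's progress bar to report the elapsed time. If you would rather collect the timings programmatically and pick the winner, a small helper along these lines could be used. This is a sketch under assumptions: `time_batch_sizes`, `fastest`, and the placeholder workload `fake_run` are names made up for illustration and are not part of the original notebooks.

```python
import time


def time_batch_sizes(run, batch_sizes):
    """Call run(batch_size) once per size and return {batch_size: seconds}."""
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        run(batch_size)
        timings[batch_size] = time.perf_counter() - start
    return timings


def fastest(timings):
    """Return the batch size with the lowest elapsed time."""
    return min(timings, key=timings.get)


def fake_run(batch_size):
    # Placeholder workload. In the article's setting this would be:
    #     for _ in pipe(dataset, batch_size=batch_size): pass
    time.sleep(0.001 * abs(batch_size - 8))


timings = time_batch_sizes(fake_run, [1, 8, 32])
print(timings, "best:", fastest(timings))
```

A single run per batch size is noisy, so for close results it is worth repeating the measurement before settling on a value.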
Let's have a look at the results of our very first benchmark for the Hugging Face image classification pipeline on CPUs over the sample (3K) ImageNet dataset:
Hugging Face image-classification pipeline on CPUs - predicting 3544 images
As can be seen, it took around 3 minutes (188 seconds) to finish processing the 3544 images from the sample dataset. Now that I know which batch size (8) is the best for my pipeline/dataset/hardware, I can use the same pipeline over a larger dataset (34K images) with this batch size:
Hugging Face image-classification pipeline on CPUs - predicting 34745 images
This time it took around 31 minutes (1,879 seconds) to finish predicting classes for 34745 images on CPUs.
To speed up most deep learning models, especially these new transformer-based models, one should use accelerated hardware such as a GPU. Let's have a look at how to benchmark the very same pipeline over the very same datasets, but this time on a GPU device. As mentioned before, we need to change the device to a CUDA device id like 0 (the first GPU):
from tqdm.auto import tqdm
from transformers import ViTFeatureExtractor, ViTForImageClassification
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model = model.to(device)
# device=0 selects the first CUDA device for the pipeline
pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=0)
for batch_size in [1, 8, 32, 64, 128, 256, 512, 1024]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass
In addition to setting device=0, I also followed the recommended way to run a PyTorch model on a GPU device via .to(device). Since we are using accelerated hardware (GPU), I also increased the maximum batch size in my tests to 1024 to find the best result.
Let's have a look at our Hugging Face image classification pipeline on a GPU device over the sample ImageNet dataset (3K):
Hugging Face image-classification pipeline on a GPU - predicting 3544 images
As can be seen, it took around 50 seconds to finish processing the 3544 images from our imagenet-mini-sample dataset on a GPU device. Batching improved the speed, especially compared to the results coming from the CPUs; however, the improvements stopped around a batch size of 32. Although the results are the same after batch size 32, I have chosen batch size 256 for my larger benchmark to utilize enough GPU memory as well.
Hugging Face image-classification pipeline on a GPU - predicting 34745 images
This time our benchmark took around 8:17 minutes (497 seconds) to finish predicting classes for 34745 images on a GPU device. If we compare the results from our benchmarks on CPUs and a GPU device, we can see that the GPU here is the winner:
Hugging Face (PyTorch) is up to 3.9x faster on GPU vs. CPU
I used Hugging Face Pipelines to load ViT PyTorch checkpoints, load my data into a torch dataset, and use the out-of-the-box batching to feed the model on both CPU and GPU. The GPU is up to ~3.9x faster compared to running the same pipelines on CPUs.
We have improved our ViT pipeline to perform image classification by using a GPU device instead of CPUs, but can we improve our pipeline further on both CPU & GPU in a single machine before scaling it out to multiple machines? Let's have a look at the Spark NLP library.
Spark NLP is an open-source state-of-the-art Natural Language Processing library (https://github.com/JohnSnowLabs/spark-nlp)
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 7000+ pretrained pipelines and models in more than 200 languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (180+ languages), Summarization & Question Answering, Text Generation, Image Classification (ViT), and many more NLP tasks.
Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformer (ViT) not only to Python and R, but also to JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.
ViT models on a Dell PowerEdge C4130
Spark NLP has the same ViT features for Image Classification as Hugging Face, which were added in the recent 4.1.0 release. The feature is called ViTForImageClassification, it has over 240 pre-trained models ready to go, and simple code to use this feature in Spark NLP looks like this:
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.ml import Pipeline
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_base_patch16_224") \
    .setInputCols("image_assembler") \
    .setOutputCol("class") \
    .setBatchSize(8)
pipeline = Pipeline(stages=[
    imageAssembler,
    imageClassifier
])
If we compare Spark NLP and Hugging Face side by side for downloading and loading a pre-trained ViT model for an Image Classification prediction, apart from loading images and using post-calculations like argmax outside the Hugging Face library, they are both pretty straightforward. Also, they can both be saved and served later as a pipeline to reduce these lines to only 1 line of code:
Loading and using ViT models for Image Classification in Spark NLP (left) and Hugging Face (right)
Since Apache Spark has a concept called Lazy Evaluation, it doesn't start executing the process until an ACTION is called. Actions in Apache Spark can be .count() or .show() or .write() and many other RDD-based operations, which I won't get into now and which you won't need to know for this article. I usually choose either to count() the target column or to write() the results to disk to trigger execution over all the rows in the DataFrame. Also, like the Hugging Face benchmarks, I will loop through selected batch sizes to make sure I have all the possible results without missing the best outcome.
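The difference between building the plan and actually running it can be sketched as a small helper. This is an illustration under assumptions, not the benchmark code from the notebooks: `run_and_time` is a hypothetical name, and it expects a `pyspark.ml` Pipeline like the one defined earlier and a Spark DataFrame of images to be passed in, which is why it imports nothing from Spark itself.

```python
import time


def run_and_time(pipeline, images_df, output_col="class"):
    """Fit/transform lazily, then force execution with a count() action.

    Returns (row_count, seconds_to_plan, seconds_to_execute).
    """
    start = time.perf_counter()
    # transform() only describes the work; no image is classified yet.
    predictions = pipeline.fit(images_df).transform(images_df)
    planned = time.perf_counter() - start

    start = time.perf_counter()
    # count() is an action, so this is where the predictions are computed.
    rows = predictions.select(f"{output_col}.result").count()
    executed = time.perf_counter() - start
    return rows, planned, executed
```

As a usage example, `images_df` could come from Spark's built-in image data source, for instance `spark.read.format("image").load(path)` with a path of your own. The second timing is the one that matters for a benchmark, because the first only covers setting up the pipeline.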
Now we know how to load ViT model(s) in Spark NLP, and we also know how to trigger an action to force computation over all the rows in our DataFrame for benchmarking. All that is left to learn is oneDNN, the oneAPI Deep Neural Network Library. Since the DL engine in Spark NLP is TensorFlow, you can also enable oneDNN to improve the speed on CPUs (like everything else, you need to test this to be sure it improves the speed and not the other way around). I will run the CPU benchmarks both with this flag and without oneDNN enabled.
Now that we know all the ViT models from Hugging Face are also available in Spark NLP and how to use them in a pipeline, we will repeat our previous benchmarks on the bare-metal Dell server to compare CPU vs. GPU. Let's have a look at the results of Spark NLP's image classification pipeline on CPUs over our sample (3K) ImageNet dataset:
Spark NLP image-classification pipeline on a CPU without oneDNN - predicting 3544 images
It took around 2.2 minutes (130 seconds) to finish processing the 3544 images from our sample dataset. Having a smaller dataset to try different batch sizes is helpful for choosing the right batch size for your task, your dataset, and your machine. Here it is clear that batch size 16 is the best size for our pipeline to deliver the best result.
I would also like to enable oneDNN to see if, in this specific situation, it improves my benchmark compared to the CPUs without oneDNN. You can enable oneDNN in Spark NLP by setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1. Let's see what happens if I enable this flag and re-run the previous benchmark on the CPU to find the best batch size:
Spark NLP image-classification pipeline on a CPU with oneDNN - predicting 3544 images
OK, so clearly enabling oneDNN for TensorFlow in this specific situation improved our results by at least 14%. Since we don't have to do or change anything, and all it takes is to say export TF_ENABLE_ONEDNN_OPTS=1, I am going to use that for the benchmark with the larger dataset as well to see the difference. Here it is only seconds faster, but 14% on the larger dataset can shave minutes off our results.
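If you start Spark NLP from a Python script or notebook on a single machine instead of a shell, the same flag can be set from Python. This is a sketch that assumes local mode, where the Spark JVM is launched by the Python process and inherits its environment; on a cluster the variable has to be set on the nodes themselves.

```python
import os

# Set the flag before the Spark session (and its JVM) is started, so the
# TensorFlow runtime used by Spark NLP can see it.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

# import sparknlp
# spark = sparknlp.start()  # start the session only after the flag is set

print("TF_ENABLE_ONEDNN_OPTS =", os.environ["TF_ENABLE_ONEDNN_OPTS"])
```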
Now that I know that batch size 16 for CPU without oneDNN and batch size 2 for CPU with oneDNN enabled have the best results, I can continue using the same pipeline over a larger dataset (34K images):
Spark NLP image-classification pipeline on CPUs without oneDNN - predicting 34745 images
This time our benchmark took around 24 minutes (1423 seconds) to finish predicting classes for 34745 images on a CPU device without oneDNN enabled. Now let's see what happens if I enable oneDNN for TensorFlow and use a batch size of 2 (the best result):
Spark NLP image-classification pipeline on CPUs with oneDNN - predicting 34745 images
This time it took around 21 minutes (1278 seconds). As expected from our sample benchmarks, we can see around an 11% improvement in the results, which did shave off minutes compared to not having oneDNN enabled.
Let's have a look at how to benchmark the very same pipeline on a GPU device. In Spark NLP, all you need to do to use a GPU is to pass gpu=True when you are starting the Spark NLP session:
import sparknlp
spark = sparknlp.start(gpu=True)
# you can set the memory here as well
spark = sparknlp.start(gpu=True, memory="16g")
That's it! If something in your pipeline can be run on a GPU, it will do so automatically without the need to do anything explicitly.
Let's have a look at our Spark NLP image classification pipeline on a GPU device over the sample ImageNet dataset (3K):
Spark NLP image-classification pipeline on a GPU - predicting 3544 images
Out of curiosity, to see whether my crusade to find a good batch size on a smaller dataset was correct, I ran the same pipeline with the GPU on the larger dataset to see if batch size 32 would have the best result:
Spark NLP image-classification pipeline on a GPU - predicting 34745 images
Thankfully, it is batch size 32 that yields the best time. It took around 4 and a half minutes (277 seconds).
I will pick the results from CPUs with oneDNN since they were faster, and I will compare them to the GPU results:
Spark NLP (TensorFlow) is up to 4.6x faster on GPU vs. CPU (oneDNN)
This is great! We can see Spark NLP on GPU is up to 4.6x faster than on CPUs, even with oneDNN enabled.
Let's have a look at how these results compare to the Hugging Face benchmarks:
Spark NLP is 65% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images and 47% faster on the larger dataset with 34K images. Spark NLP is also 79% faster than Hugging Face on a single GPU for inference over the larger dataset with 34K images and up to 35% faster on the smaller dataset.
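"X% faster" can be read in more than one way, so here is the arithmetic that reproduces the larger-dataset figures from the timings reported above. The reading used here (the ratio of elapsed times minus one) is my assumption about how the percentages were derived, and `percent_faster` is just an illustrative helper.

```python
def percent_faster(slower_seconds, faster_seconds):
    """Speed gain of the faster run over the slower one, in percent."""
    return (slower_seconds / faster_seconds - 1.0) * 100.0


# Elapsed seconds reported above for the 34745-image dataset:
# Hugging Face first, Spark NLP second (CPU run with oneDNN enabled).
cpu = percent_faster(slower_seconds=1879, faster_seconds=1278)
gpu = percent_faster(slower_seconds=497, faster_seconds=277)
print(f"CPU: {cpu:.0f}%  GPU: {gpu:.0f}%")  # CPU: 47%  GPU: 79%
```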
Spark NLP was faster than Hugging Face in a single machine by using either CPU or GPU - image classification by using Vision Transformer (ViT)
What is Databricks? All your data, analytics, and AI on one platform
Databricks is a cloud-based platform with a set of data engineering & data science tools that is widely used by many companies to process and transform large amounts of data. Users use Databricks for many purposes, from processing and transforming extensive amounts of data to running many ML/DL pipelines to explore the data.
Disclaimer: This was my interpretation of Databricks. It comes with lots of other features and you should check them out: https://www.databricks.com/product/data-lakehouse
Databricks supports AWS, Azure, and GCP clouds: https://www.databricks.com/product/data-lakehouse
Hugging Face in Databricks Single Node with CPUs on AWS
Databricks offers a "Single Node" cluster type when you are creating a cluster, which is suitable for those who want to use Apache Spark with only 1 machine or use non-Spark applications, especially ML and DL-based Python libraries. Hugging Face comes already installed when you choose the Databricks 11.1 ML runtime. Here is what the cluster configuration looks like for my Single Node Databricks (only CPUs) before we start our benchmarks:
Databricks single-node cluster - CPU runtime
The summary of this cluster, which uses an m5n.8xlarge instance on AWS, is that it has 1 Driver (only 1 node), 128 GB of memory, 32 CPU cores, and it costs 5.71 DBU per hour. You can read about "DBU" on AWS here: https://www.databricks.com/product/aws-pricing
Databricks single-node cluster - AWS instance profile
Let's replicate our benchmarks from the previous section (bare-metal Dell server) here on our single-node Databricks (CPUs only). We start with Hugging Face and our sample-sized dataset of ImageNet to find out which batch size is a good one so we can use it for the larger dataset, since this proved to be a sound practice in the previous benchmarks:
Hugging Face image-classification pipeline on Databricks single-node CPUs - predicting 3544 images
It took around two and a half minutes (149 seconds) to finish processing the 3544 images from our sample dataset on a single-node Databricks that only uses CPUs. The best batch size on this machine using only CPUs is 8, so I am going to use that to run the benchmark on the larger dataset:
Hugging Face image-classification pipeline on Databricks single-node CPUs - predicting 34745 images
On the larger dataset with over 34K images, it took around twenty and a half minutes (1233 seconds) to finish predicting classes for those images. For our next benchmark we need a single-node Databricks cluster again, but this time with a GPU-based runtime and a GPU-based AWS instance.
Hugging Face in Databricks Single Node with a GPU on AWS
Let's create a new cluster, and this time we are going to choose a runtime with GPU, which in this case is called 11.1 ML (includes Apache Spark 3.3.0, GPU, Scala 2.12) and comes with all the required CUDA and NVIDIA software installed. The next thing we need is to select an AWS instance that has a GPU; I have chosen g4dn.8xlarge, which has 1 GPU and a similar number of cores/memory as the other cluster. This GPU instance comes with a Tesla T4 and 16 GB of memory (15 GB usable GPU memory).
Databricks single-node cluster - GPU runtime
This is the summary of our single-node cluster. Like the previous one, it is the same in terms of the number of cores and the amount of memory, but it comes with a Tesla T4 GPU:
Databricks single-node cluster - AWS instance profile
Now that we have a single-node cluster with a GPU we can continue our benchmarks to see how Hugging Face performs on this machine in Databricks. I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:
Hugging Face image-classification pipeline on Databricks single-node GPU - predicting 3544 images
It took around a minute (64 seconds) to finish processing the 3544 images from our sample dataset on our single-node Databricks cluster with a GPU device. Batching improved the speed compared to the batch size 1 result; however, after batch size 8 the results pretty much stayed the same. Although the results are the same after batch size 8, I have chosen batch size 256 for my larger benchmark to utilize more GPU memory as well. (To be honest, 8 and 256 both performed pretty much the same.)
Let's run the benchmark on the larger dataset and see what happens with batch size 256:
Hugging Face image-classification pipeline on Databricks single-node GPU - predicting 34745 images
On the larger dataset, it took almost 11 minutes (659 seconds) to finish predicting classes for over 34K images. If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU, we can see that the GPU node here is the winner:
Hugging Face (PyTorch) is up to 2.3x faster on GPU vs. CPU
The GPU is up to ~2.3x faster compared to running the same pipeline on CPUs in Hugging Face on Databricks Single Node.
Now we are going to run the same benchmarks by using Spark NLP in the same clusters and over the same datasets to compare it with Hugging Face.
First, let's install Spark NLP in your Single Node Databricks CPUs:
In the Libraries tab inside your cluster you need to follow these steps:
- Install New -> PyPI -> spark-nlp==4.1.0 -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0 -> Install
- Add `TF_ENABLE_ONEDNN_OPTS=1` to `Cluster->Advanced Options->Spark->Environment variables` to enable oneDNN
How to install Spark NLP in Databricks on CPUs for Python, Scala, and Java
Spark NLP in Databricks Single Node with CPUs on AWS
Now that we have Spark NLP installed on our Databricks single-node cluster, we can repeat the benchmarks for the sample and full datasets on both CPU and GPU. Let's start with the benchmark on CPUs over the sample dataset:
Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN) - predicting 3544 images
It took around 2 minutes (111 seconds) to finish processing 3544 images and predicting their classes on the same single-node Databricks cluster with CPUs we used for Hugging Face. We can see that the batch size of 16 has the best result, so I will use this in the next benchmark on the larger dataset:
Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN) - predicting 34742 images
On the larger dataset with over 34K images, it took around 18 minutes (1072 seconds) to finish predicting classes for those images. Next up, I will repeat the same benchmarks on the cluster with a GPU.
Databricks Single Node with a GPU on AWS
First, install Spark NLP in your Single Node Databricks GPU (the only difference is the use of "spark-nlp-gpu" from Maven):
Install Spark NLP in your Databricks cluster
- In the Libraries tab inside the cluster you need to follow these steps:
- Install New -> PyPI -> spark-nlp==4.1.0 -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0 -> Install
How to install Spark NLP in Databricks on GPUs for Python, Scala, and Java
I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:
Spark NLP image-classification pipeline on Databricks single-node GPU - predicting 3544 images
It took less than a minute (47 seconds) to finish processing the 3544 images from our sample dataset on our single-node Databricks with a GPU device. We can see that batch size 8 performed the best in this specific use case, so I will use it to run the benchmark on the larger dataset:
Spark NLP image-classification pipeline on Databricks single-node GPU - predicting 34742 images
On the larger dataset, it took around 7 and a quarter minutes (435 seconds) to finish predicting classes for over 34K images. If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU, we can see that the GPU node here is the winner:
Spark NLP is up to 2.5x faster on GPU vs. CPU in Databricks Single Node
This is great! We can see Spark NLP on GPU is up to 2.5x faster than on CPUs, even with oneDNN enabled (oneDNN improves results on CPUs by between 10% and 20%).
Let's have a look at how these results compare to the Hugging Face benchmarks in the same Databricks Single Node cluster:
Spark NLP is up to 34% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images and up to 15% faster on the larger dataset with 34K images. Spark NLP is also 51% faster than Hugging Face on a single GPU for the larger dataset with 34K images and up to 36% faster on the smaller dataset with 3K images.
Spark NLP is faster on both CPUs and GPUs vs. Hugging Face in Databricks Single Node
So far we have established that Hugging Face on GPU is faster than Hugging Face on CPUs on a bare-metal server and on Databricks Single Node. This is what you expect when you are comparing GPU vs. CPU with these new transformer-based models.
We have also established that Spark NLP outperforms Hugging Face for the very same pipeline (ViT model), on the very same datasets, on both the bare-metal server and the Databricks single-node cluster, and that it performs better on both CPU and GPU devices. This, on the other hand, was not something I expected. When I was preparing this article I expected TensorFlow inference in Spark NLP to be slightly slower than inference in Hugging Face using PyTorch, or at least neck and neck. I was aiming for this section, scaling the pipeline beyond a single machine. But it seems Spark NLP is faster than Hugging Face even in a single machine, on both CPU and GPU, over both small and large datasets.
Question: What if you want to make your ViT pipeline even faster? What if you have even larger datasets and you just cannot fit them inside one machine, or it just takes too long to get the results back?
Answer: Scaling out! This means that instead of resizing the same machine, you add more machines to your cluster. You need something to manage all those jobs, tasks, scheduling DAGs, failed tasks, and so on, and those have their overheads, but if you need something to be faster, or to be possible at all beyond a single machine, you have to use some sort of distributed system.
Scaling up = making your machine bigger or faster so that it can handle more load.
Scaling out = adding more machines in parallel to spread out a load.
The performance page on Hugging Face's official website suggests that scaling inference is only possible by using multiple GPUs. By our definition of scaling out, this is still stuck in a single machine:
https://huggingface.co/docs/transformers/performance
Not to mention that the Multi-GPU solution for inference in Hugging Face doesn't exist at the moment:
https://huggingface.co/docs/transformers/perf_infer_gpu_many
So it seems there is no native/official way to scale out Hugging Face pipelines. You can implement your own architecture consisting of microservices such as a job queue, messaging protocols, a RESTful API backend, and other required components to distribute each request over different machines, but this scales the requests made by individual users instead of scaling out the actual system itself.
In addition, the latency of such systems is not comparable with natively distributed systems such as Apache Spark (gRPC might lower this latency, but it is still not competitive). Not to mention the single-point-of-failure issue, managing failed jobs/tasks/inputs, and the hundreds of other features you get out of the box from Apache Spark that you now have to implement and maintain yourself.
There is a blog post on the Hugging Face website portraying the very same architecture by scaling REST endpoints to serve more users: "Deploying 🤗 ViT on Kubernetes with TF Serving". I believe other companies are using similar approaches to scale out Hugging Face; however, they are all scaling the number of users/requests hitting the inference REST endpoints. In addition, you cannot scale Hugging Face this way on Databricks.
For instance, inference inside FastAPI is 10x slower than local inference: https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c
Once Hugging Face offers a native solution to scale out, I will re-run the benchmarks. Until then, there is no scaling out when you have to loop through the dataset from a single machine to hit REST endpoints in a round-robin fashion. (Think again about the part where we batched rows/sequences/images to feed the GPU all at once, and you'll get it.)
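To make the round-robin point concrete, here is a minimal sketch of what a single client looping over a dataset looks like. The endpoint URLs and file names are hypothetical placeholders, and no HTTP requests are actually sent; the sketch only shows how work gets assigned. The thing to notice is that every image still passes through the one loop on the one client machine, no matter how many replicas sit behind it.

```python
from itertools import cycle


def round_robin_dispatch(images, endpoints):
    """Assign each image to an endpoint in turn, as a single client loop would."""
    assignments = {endpoint: [] for endpoint in endpoints}
    for image, endpoint in zip(images, cycle(endpoints)):
        # In a real client this would be one HTTP request per image (or per
        # small batch), and every request would originate from this one machine.
        assignments[endpoint].append(image)
    return assignments


# Hypothetical inputs, for illustration only.
images = [f"img_{i}.jpg" for i in range(7)]
endpoints = [
    "http://replica-a/predict",
    "http://replica-b/predict",
    "http://replica-c/predict",
]

plan = round_robin_dispatch(images, endpoints)
for endpoint, assigned in plan.items():
    print(endpoint, assigned)
```

Adding replicas spreads the model execution, but reading, serializing, and sending the data remains the job of the single driver loop.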
Spark NLP is an extension of Spark ML, therefore it scales natively and seamlessly over all platforms supported by Apache Spark, such as (but not limited to) Databricks, AWS EMR, Azure HDInsight, GCP Dataproc, Cloudera, SageMaker, Kubernetes, and many more.
Zero code changes are needed! Spark NLP can scale from a single machine to a cluster with many machines without changing anything in the code!
You also don't need to export any models out of Spark NLP to use them in an entirely different library to speed up or scale the inference.
Spark NLP ecosystem: optimized, tested, and supported integrations
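As a reminder of what "the same code" refers to, here is a minimal sketch of a Spark NLP image-classification pipeline. It assumes Spark NLP 4.1 or later is installed on the cluster and that `spark` is a session started with Spark NLP available; the function names `build_vit_pipeline` and `classify_images` and the `image_dir` argument are placeholders of mine, and the default pretrained ViT model is used instead of a specific one. Nothing in it mentions the number of nodes, which is why the same code can run unchanged on 1 node or on 10.

```python
def build_vit_pipeline():
    """Build an unfitted Spark ML pipeline: image assembler -> ViT classifier."""
    # Imported lazily so this module still loads where Spark NLP is not installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import ViTForImageClassification
    from sparknlp.base import ImageAssembler

    image_assembler = (
        ImageAssembler()
        .setInputCol("image")
        .setOutputCol("image_assembler")
    )
    image_classifier = (
        ViTForImageClassification.pretrained()  # default pretrained ViT model
        .setInputCols(["image_assembler"])
        .setOutputCol("class")
    )
    return Pipeline(stages=[image_assembler, image_classifier])


def classify_images(spark, image_dir):
    """Read images with Spark's image data source and predict a class for each."""
    images = (
        spark.read.format("image")
        .option("dropInvalid", True)  # skip files that cannot be decoded
        .load(image_dir)
    )
    model = build_vit_pipeline().fit(images)
    # Spark distributes the transform over however many executors the cluster has.
    return model.transform(images).select("image.origin", "class.result")
```

On Databricks, `spark` is already defined in the notebook, so a call would look like `classify_images(spark, "/path/to/images")` with your own path.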
Let's create a cluster, and this time we choose Standard inside Cluster mode. This means we can have more than 1 node in our cluster, which in Apache Spark terminology means 1 Driver and N Workers (Executors).
We also need to install Spark NLP in this new cluster via the Libraries tab. You can follow the steps I mentioned in the previous section for Single Node Databricks with CPUs. As you can see, I have chosen the same CPU-based AWS instance I used to benchmark both Hugging Face and Spark NLP, so we can see how it scales out when we add more nodes.
This is what our cluster configuration looks like:
Databricks multi-node (standard) cluster with only CPUs
I will reuse the same Spark NLP pipeline I used in previous benchmarks (no need to change any code), and I will only use the larger dataset with 34K images. Let's begin!
Databricks with 2x Nodes - CPUs only
Let's just add 1 more node and bring the total number of machines doing the processing to 2. Let's not forget the beauty of Spark NLP: when you go from a single-machine setup (your Colab, Kaggle, Databricks Single Node, or even your local Jupyter notebook) to a multi-node cluster setup (Databricks, EMR, GCP, Azure, Cloudera, YARN, Kubernetes, etc.), zero code changes are required! And I mean zero! With that in mind, I will run the same benchmark inside this new cluster on the larger dataset with 34K images:
Spark NLP image-classification pipeline on 2x nodes with CPUs (oneDNN) - predicting 34742 images
It took around 9 minutes (550 seconds) to finish predicting classes for 34K images. Let's compare this result on 2x Nodes with the Spark NLP and Hugging Face results on a Databricks single node (I will keep repeating the Hugging Face results on a Single Node as a reference, since Hugging Face could not be scaled out on multiple machines, especially on Databricks):
Spark NLP is 124% faster than Hugging Face with 2x Nodes
Previously, Spark NLP beat Hugging Face on a Single Node Databricks cluster by using only CPUs by 15%.
This time, with only 2x nodes instead of 1 node, Spark NLP finished processing over 34K images 124% faster than Hugging Face.
Scale Spark NLP on CPUs with 4x nodes
Let's double the size of our cluster like before and go from 2x Nodes to 4x Nodes. This is what the cluster looks like with 4x nodes:
Databricks with 4x Nodes - CPUs only
I will run the same benchmark on this new cluster on the larger dataset with 34K images:
Spark NLP image-classification pipeline on 4x nodes with CPUs (oneDNN) - predicting 34742 images
It took around 5 minutes (289 seconds) to finish predicting classes for 34K images. Let's compare this result on 4x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:
Spark NLP is 327% faster than Hugging Face with 4x Nodes
As can be seen, Spark NLP is now 327% faster than Hugging Face on CPUs while using only 4x Nodes in Databricks.
Now let's double the previous cluster by adding 4 more nodes, for a total of 8x Nodes. Resizing the cluster is pretty easy, by the way: you just increase the number of workers in your cluster configuration:
Resizing Spark Cluster in Databricks
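The screenshots here use the Databricks UI, but the same resize can also be requested through the Databricks Clusters API 2.0, whose `clusters/resize` endpoint takes a `cluster_id` and a `num_workers` value. The sketch below only builds the request and sends nothing; the cluster ID is a made-up placeholder, and whether "8x Nodes" in this article means 8 workers or 1 driver plus 7 workers is not something I can tell from the text, so treat the number as illustrative.

```python
import json

RESIZE_ENDPOINT = "/api/2.0/clusters/resize"  # Databricks Clusters API 2.0


def build_resize_request(cluster_id, num_workers):
    """Return (endpoint, JSON body) for resizing a running cluster."""
    if num_workers < 0:
        raise ValueError("num_workers must not be negative")
    body = json.dumps({"cluster_id": cluster_id, "num_workers": num_workers})
    return RESIZE_ENDPOINT, body


# Hypothetical cluster ID, for illustration only.
endpoint, body = build_resize_request("1234-567890-abcde123", 8)
print(endpoint, body)
```

To actually apply it, you would POST the body to that endpoint on your workspace URL with a valid access token.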
Databricks with 8x Nodes - CPUs only
Let's run the same benchmark, this time on 8x Nodes:
Spark NLP image-classification pipeline on 8x nodes with CPUs (oneDNN) - predicting 34742 images
It took a little over 2 and a half minutes (161 seconds) to finish predicting classes for 34K images. Let's compare this result on 8x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:
Spark NLP is 666% faster than Hugging Face with 8x Nodes
As can be seen, Spark NLP is now 666% faster than Hugging Face on CPUs while using only 8x Nodes in Databricks.
Let's just ignore the number of 6s here! (It was 665.8%, if it makes you feel better.)
To finish scaling out our ViT model predictions on CPUs in Databricks by using Spark NLP, I will resize the cluster one more time and increase it to 10x Nodes:
Databricks with 10x Nodes - CPUs only
Let's run the same benchmark, this time on 10x Nodes:
Spark NLP image-classification pipeline on 10x nodes with CPUs (oneDNN) - predicting 34742 images
It took less than 2 minutes (112 seconds) to finish predicting classes for 34K images. Let's compare this result on 10x Nodes with all the previous results from Spark NLP vs. Hugging Face on CPUs on Databricks:
Spark NLP is 1000% faster than Hugging Face with 10x Nodes
And this is how you scale out the Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline is now 1000% faster than Hugging Face on CPUs.
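One way to read the CPU numbers is to ask how close the scaling is to linear. The sketch below takes the four wall-clock times reported in this section (550, 289, 161, and 112 seconds) and computes the speedup relative to the 2-node run together with a simple parallel efficiency, where 1.0 would mean the time dropped exactly in proportion to the nodes added. The function name and the choice of the 2-node run as the baseline are my own; this is back-of-the-envelope arithmetic on single runs, not a rigorous scalability analysis.

```python
# Wall-clock times (seconds) reported above for the 34K-image dataset on CPUs.
cpu_seconds = {2: 550, 4: 289, 8: 161, 10: 112}


def scaling_table(seconds_by_nodes):
    """Return (nodes, speedup vs. smallest cluster, parallel efficiency) rows."""
    base_nodes = min(seconds_by_nodes)
    base_time = seconds_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(seconds_by_nodes):
        elapsed = seconds_by_nodes[nodes]
        speedup = base_time / elapsed
        # Efficiency 1.0 means time dropped exactly in proportion to nodes added.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, round(speedup, 2), round(efficiency, 2)))
    return rows


cpu_rows = scaling_table(cpu_seconds)
for nodes, speedup, efficiency in cpu_rows:
    print(f"{nodes:>2} nodes: {speedup:.2f}x vs. 2 nodes, efficiency {efficiency:.2f}")
```

With these inputs the efficiency stays between roughly 0.85 and 0.98, which is the overhead of a distributed system showing up in the numbers while the scaling remains close to linear.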
We managed to make our ViT pipeline 1000% faster than Hugging Face, which is stuck in a single node, by simply using Spark NLP, but we only used CPUs. Let's see if we can get the same improvements by scaling out our pipeline on a GPU cluster.
Having a GPU-based multi-node Databricks cluster is pretty much the same as having a single-node cluster. The only difference is choosing Standard and keeping the same ML/GPU Runtime with the same AWS instance specs we chose in our benchmarks for GPU on a single node.
We also need to install Spark NLP in this new cluster via the Libraries tab. Same as before, you can follow the steps I mentioned in Single Node Databricks with a GPU.
Databricks multi-node (standard) cluster with GPUs
Our multi-node Databricks GPU cluster uses the same AWS GPU instance, g4dn.8xlarge, that we used previously to run our benchmarks comparing Spark NLP vs. Hugging Face on a single-node Databricks cluster.
This is a summary of what it looks like this time with 2 nodes:
Databricks with 2x Nodes - with 1 GPU per node
I am going to run the same pipeline in this GPU cluster with 2x nodes:
Spark NLP image-classification pipeline on 2x nodes with GPUs - predicting 34742 images
It took almost 4 minutes (231 seconds) to finish predicting classes for 34K images. Let's compare this result on 2x Nodes with Spark NLP vs. Hugging Face on GPUs in Databricks:
Spark NLP is 185% faster than Hugging Face with 2x Nodes
Spark NLP with 2x Nodes is almost 3x faster (185%) than Hugging Face on a single node while using GPUs.
Let's resize our GPU cluster from 2x Nodes to 4x Nodes. This is a summary of what it looks like this time with 4x Nodes using GPUs:
Databricks with 4x Nodes - with 1 GPU per node
Let's run the same benchmark on 4x Nodes and see what happens:
Spark NLP image-classification pipeline on 4x nodes with GPUs - predicting 34742 images
This time it took almost 2 minutes (118 seconds) to finish classifying all 34K images in our dataset. Let's visualize this to get a better view of what it means for Hugging Face in a single node vs. Spark NLP in a multi-node cluster:
Spark NLP is 458% faster than Hugging Face with 4x Nodes
That's a 458% performance increase compared to Hugging Face. We just made our pipeline 5.6x faster by using Spark NLP with 4x nodes.
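Since this article quotes both "N% faster" and "Nx faster", it may help to spell out how the two relate: being P% faster corresponds to a speedup factor of 1 + P/100, so 458% faster is about 5.6x and 185% faster is just under 3x. The small sketch below encodes that conversion in both directions; the helper names are mine, and the timings in the last example are made up purely for illustration.

```python
def speedup_factor(percent_faster):
    """Convert 'P% faster' into an 'Nx faster' factor."""
    return 1 + percent_faster / 100


def percent_faster(baseline_seconds, new_seconds):
    """How much faster the new run is than the baseline, as a percentage."""
    return (baseline_seconds / new_seconds - 1) * 100


# Percentages quoted in this article for the GPU cluster.
for pct in (185, 458):
    print(f"{pct}% faster = {speedup_factor(pct):.2f}x")

# Made-up timings: a job going from 200 s to 100 s is 100% faster, i.e. 2x.
print(percent_faster(200, 100))
```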
Next, I will resize the cluster to have 8x Nodes in my Databricks with the following summary:
Databricks with 8x Nodes - with 1 GPU per node
Just as a reminder, each AWS instance (g4dn.8xlarge) has 1 NVIDIA T4 GPU 16GB (15GB usable memory). Let's re-run the benchmark and see if we can spot any improvements, since scaling out in any distributed system has its overheads and you cannot just keep on adding machines:
Spark NLP image-classification pipeline on 8x nodes with GPUs - predicting 34742 images
It took about a minute (61 seconds) to finish classifying 34K images with 8x Nodes in our Databricks cluster. It seems we still managed to improve the performance. Let's put this result next to the previous results from Hugging Face in a single node vs. Spark NLP in a multi-node cluster:
Spark NLP is 980% faster than Hugging Face with 8x Nodes
Spark NLP with 8x Nodes is almost 11x faster (980%) than Hugging Face on GPUs.
Similar to our multi-node benchmarks on CPUs, I would like to resize the GPU cluster one more time to have 10x Nodes and match them in terms of the final number of nodes. The final summary of this cluster is as follows:
Databricks with 10x Nodes - with 1 GPU per node
Let's run our very last benchmark in this specific GPU cluster (with zero code changes):
Spark NLP image-classification pipeline on 10x nodes with GPUs - predicting 34742 images
It took less than a minute (51 seconds) to finish predicting classes for over 34K images. Let's put them all next to each other and see how we progressed in scaling out our Vision Transformer model coming from Hugging Face in the Spark NLP pipeline in Databricks:
Spark NLP is 1200% faster than Hugging Face with 10x Nodes
And we are done!
We managed to scale out our Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline is now 13x faster, an improvement of about 1200% (1192%), compared to Hugging Face on GPU.
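Another way to look at the GPU runs is throughput. The sketch below divides the 34742 images from the figure captions by the wall-clock times reported in this section (231, 118, 61, and 51 seconds) to get images per second for the whole cluster and per node. It assumes each reported time covers the full job end to end, and the variable names are my own.

```python
TOTAL_IMAGES = 34742  # dataset size from the figure captions

# Wall-clock times (seconds) reported above for the GPU cluster, by node count.
gpu_seconds = {2: 231, 4: 118, 8: 61, 10: 51}

# Images per second for the whole cluster.
cluster_throughput = {
    nodes: round(TOTAL_IMAGES / seconds, 1)
    for nodes, seconds in gpu_seconds.items()
}

# Images per second per node (1 GPU per node in this setup).
per_node_throughput = {
    nodes: round(TOTAL_IMAGES / (seconds * nodes), 1)
    for nodes, seconds in gpu_seconds.items()
}

for nodes in sorted(gpu_seconds):
    print(
        f"{nodes:>2} nodes: {cluster_throughput[nodes]:>6.1f} images/s total, "
        f"{per_node_throughput[nodes]:.1f} images/s per node"
    )
```

Cluster throughput keeps climbing with every resize, while the per-node figure slips from about 75 to about 68 images per second, which is the distributed overhead mentioned earlier.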
Let's sum up all these benchmarks by first comparing the improvements on CPUs and on GPUs, and then looking at how much faster our pipeline can be by going from Hugging Face on CPUs to 10x Nodes on Databricks by using Spark NLP on GPUs.
Spark NLP 🚀 on 10x Nodes with CPUs is 1000% (11x) faster than Hugging Face 🤗 stuck in a single node with CPUs
Spark NLP 🚀 on 10x Nodes with GPUs is 1192% (13x) faster than Hugging Face 🤗 stuck in a single node with GPU
What about the price differences between our AWS CPU instance and AWS GPU instance? (I mean, you get more if you pay more, right?)
AWS m5d.8xlarge with CPUs vs. AWS g4dn.8xlarge with 1 GPU and similar specs
OK, so the price seems pretty much the same! With that in mind, what improvements do you get if you move from Hugging Face on CPUs stuck in a single machine to Spark NLP on 10x Nodes with 10x GPUs?
Spark NLP on GPUs is 25x (2366%) faster than Hugging Face on CPUs
Spark NLP 🚀 on 10x Nodes with GPUs is 2366% (25x) faster than Hugging Face 🤗 in a single node with CPUs