Nebullvm 0.3.0 features more deep learning compilers and now supports additional optimization techniques, including quantization and half precision.
Nebuly is excited to announce the new major release, nebullvm 0.3.0, which makes nebullvm's AI inference accelerator more powerful, more stable, and able to cover more use cases.
Nebullvm is an open-source library that generates an optimized version of your deep learning model that runs 2 to 10 times faster in inference without performance loss by leveraging multiple deep learning compilers (OpenVINO, TensorRT, ONNX Runtime, TVM, etc.).
With the new release 0.3.0, nebullvm can now accelerate inference up to 30x if you specify that you are willing to trade off a self-defined amount of accuracy/precision to get an even lower response time and a lighter model.
This additional acceleration is achieved by exploiting optimization techniques that slightly modify the model graph to make it lighter, such as quantization, half precision, distillation, sparsity, etc.
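To give a feel for one of these techniques, here is a minimal, framework-free sketch of post-training linear quantization. Real tools quantize model tensors (often per channel) rather than a plain Python list, and the helper names below are illustrative, not part of nebullvm.

```python
def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127  # 127 = int8 max
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

weights = [0.81, -1.27, 0.04, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The restored weights are close to, but not exactly, the originals:
# this small approximation error is the accuracy/speed trade-off.
```

Storing 8-bit integers instead of 32-bit floats shrinks the model and lets the hardware use faster integer arithmetic, at the cost of the small rounding error shown above.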
Find tutorials and examples on how to use nebullvm, as well as installation instructions, in the main readme of the nebullvm library.
And check below if you want to learn more about:
- Overview of Nebullvm 0.3.0
- Benchmarks
- How the new Nebullvm 0.3.0 API Works
- Tutorials
Overview of Nebullvm 0.3.0
With this new version, nebullvm continues in its mission to be:
Easy-to-use. It takes a few lines of code to install the library and optimize your models.
Framework agnostic. nebullvm supports the most widely used frameworks (PyTorch, TensorFlow, ONNX, Hugging Face, etc.) and provides as output an optimized version of your model with the same interface (PyTorch, TensorFlow, etc.).
Deep learning model agnostic. nebullvm supports all the most popular deep learning architectures, such as transformers, LSTMs, CNNs and FCNs.
Hardware agnostic. The library now works on most CPUs and GPUs and will soon support TPUs and other deep learning-specific ASICs.
Secure. Everything runs locally on your hardware.
Leveraging the best optimization techniques. There are many inference optimization techniques, such as deep learning compilers, quantization and half precision, and soon sparsity and distillation, all of which are meant to optimize the way your AI models run on your hardware.
Benchmarks
We have tested nebullvm on popular AI models and hardware from leading vendors.
The table below shows the inference speedup provided by nebullvm. The speedup is calculated as the response time of the unoptimized model divided by the response time of the accelerated model, averaged over 100 experiments. As an example, if the response time of an unoptimized model is on average 600 milliseconds and drops to 240 milliseconds after nebullvm optimization, the resulting speedup is 2.5x, meaning inference is 150% faster.
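The speedup calculation can be written out directly with the example figures above:

```python
# Speedup as defined in the text: unoptimized response time divided by
# optimized response time, using the 600 ms / 240 ms example.
unoptimized_ms = 600
optimized_ms = 240

speedup = unoptimized_ms / optimized_ms   # 2.5x
percent_faster = (speedup - 1) * 100      # 150% faster

print(f"{speedup:.1f}x speedup, {percent_faster:.0f}% faster")
```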
A complete overview of the experiment and findings can be found on this page.
Overall, the library provides great results, with more than 2x acceleration in most cases and around 20x in a few applications. We can also observe that acceleration varies greatly across different hardware-model combinations, so we suggest you test nebullvm on your model and hardware to assess its full potential. You can find the instructions below.
Moreover, across all scenarios, nebullvm stands out for its ease of use, allowing you to take advantage of inference optimization techniques without having to spend hours studying, testing and debugging them.
How the New Nebullvm API Works
With the latest release, nebullvm has a new API and can be deployed in two ways.
Option A: 2ā10x acceleration, NO performance loss
If you choose this option, nebullvm will test multiple deep learning compilers (TensorRT, OpenVINO, ONNX Runtime, etc.) and identify the optimal way to compile your model on your hardware, increasing inference speed by 2ā10 times without affecting the performance of your model.
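The selection loop behind Option A can be illustrated with a toy sketch: benchmark several candidate backends and keep the fastest. nebullvm's real internals are far more involved (actual compilation with TensorRT, OpenVINO, ONNX Runtime, etc.); the "compiled models" here are stand-in callables, and `pick_fastest` is not a nebullvm function.

```python
import time

def pick_fastest(variants, run_inference, n_runs=20):
    """Time each candidate backend and keep the fastest: a toy stand-in
    for what a compiler-selection loop does."""
    best_name, best_time = None, float("inf")
    for name, model in variants.items():
        start = time.perf_counter()
        for _ in range(n_runs):
            run_inference(model)
        elapsed = (time.perf_counter() - start) / n_runs
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time

# Stand-in "compiled models": plain callables with very different costs.
variants = {
    "slow_backend": lambda: sum(range(200_000)),
    "fast_backend": lambda: sum(range(100)),
}
winner, latency = pick_fastest(variants, lambda model: model())
```

Because the outputs of all candidates come from the same original model, picking purely on latency is what makes Option A lossless.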
Option B: 2ā30x acceleration, supervised performance loss
Nebullvm is capable of speeding up inference by much more than 10 times in case you are willing to sacrifice a fraction of your model's performance. If you specify how much performance loss you are willing to sustain, nebullvm will push your model's response time to its limits by identifying the best possible blend of state-of-the-art inference optimization techniques, such as deep learning compilers, distillation, quantization, half precision, sparsity, etc.
Performance monitoring relies on two arguments: perf_loss_ths (the performance loss threshold) and perf_metric (the metric used to estimate performance).
When a predefined metric (e.g. "accuracy") or a custom metric is passed as the perf_metric argument, the value of perf_loss_ths is used as the maximum acceptable loss for that metric evaluated on your datasets (Option B.1).
When no perf_metric is provided as input, nebullvm computes the performance loss with its default precision function. If a dataset is provided, the precision is calculated on 100 sampled data points (Option B.2). Otherwise, the data are randomly generated from the metadata provided as input, i.e. input_sizes and batch_size (Option B.3).
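To make the threshold logic concrete, here is a minimal sketch in plain Python. Only the perf_loss_ths name comes from the actual API; the within_threshold helper is illustrative and not part of nebullvm.

```python
def within_threshold(baseline_metric, optimized_metric, perf_loss_ths):
    """Accept an optimized model only if the metric drop stays within
    the user-defined performance loss threshold."""
    return (baseline_metric - optimized_metric) <= perf_loss_ths

# Accuracy drops from 0.92 to 0.90: a 0.02 loss, accepted with ths=0.03.
accepted = within_threshold(0.92, 0.90, perf_loss_ths=0.03)
# A drop to 0.85 (a 0.07 loss) is rejected with the same threshold.
rejected = within_threshold(0.92, 0.85, perf_loss_ths=0.03)
```

Any optimized candidate whose metric drop exceeds the threshold is discarded, so the user, not the library, decides how much accuracy may be traded for speed.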
Tutorials
We suggest testing the library on your AI models right away by following the installation instructions below. If you want to get a first feel for the library's capabilities, or take a look at how nebullvm can be readily implemented in an AI workflow, we have built 3 tutorials and notebooks where the library can be tested on the most popular AI frameworks: TensorFlow, PyTorch and Hugging Face.
- Notebook: Accelerate fast.ai Resnet34
- Notebook: Accelerate PyTorch YOLO
- Notebook: Accelerate Hugging Face's GPT2 and BERT
Explore more
Check out the GitHub readme if you want to take a look at nebullvm's performance and benchmarks, as well as tutorials and notebooks on how to implement nebullvm with ease. And please leave a star if you enjoy the project, and join the Discord community where we chat about nebullvm and AI optimization.
Also published here.