This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, University of Wisconsin-Madison.
To take advantage of the opportunities described in the previous section, we design PolyThrottle, a system that navigates the tradeoff between latency SLO, batch size, and energy. PolyThrottle searches for the most energy-efficient hardware configuration under performance constraints and schedules on-device fine-tuning.
Figure 1 shows a high-level overview of PolyThrottle’s workflow. In a production environment, sensors on the edge devices continuously collect data and send it to the deployed model for inference. Meanwhile, to adapt to changing environments and data patterns, the data is also saved for later fine-tuning. Because of the limited compute resources on these edge devices, fine-tuning workloads are often scheduled alongside the continuously running inference requests. To address the challenges of model deployment on edge devices, PolyThrottle consists of two key components:
1. An optimization framework that finds the optimal hardware configuration for a given model under a predetermined SLO using only a few samples.
2. A performance predictor and scheduler that dynamically schedules fine-tuning requests and adjusts the hardware configuration while satisfying the SLO.
PolyThrottle tackles these challenges separately. Offline, we automatically find the best CPU, GPU, and memory frequencies, along with the recommended batch size, for serving inference requests that satisfy the latency constraint while minimizing per-query energy consumption. We discuss the details of the optimization procedure in Section 5. We also show that our formulation can find near-optimal energy configurations in a few minutes using just a handful of samples. Compared to the lifespan of long-running inference workloads, this overhead is negligible.
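To make the offline search concrete, the sketch below shows one simple way such a procedure could be structured: enumerate candidate CPU, GPU, and memory frequencies and batch sizes, profile latency and per-query energy for each, and keep the most energy-efficient configuration that meets the latency SLO. The `Config` type and the `measure` callback are illustrative placeholders, and PolyThrottle's actual formulation (Section 5) reaches a near-optimal configuration with only a handful of samples rather than exhaustive enumeration.

```python
import itertools
from typing import NamedTuple

class Config(NamedTuple):
    cpu_freq: int    # Hz
    gpu_freq: int    # Hz
    mem_freq: int    # Hz
    batch_size: int

def find_best_config(cpu_freqs, gpu_freqs, mem_freqs, batch_sizes,
                     latency_slo_ms, measure):
    """Return the feasible configuration with the lowest per-query energy.

    `measure(cfg)` is a hypothetical profiling helper that runs a short
    burst of inference requests at `cfg` and returns
    (latency_ms, energy_per_query_joules).
    """
    best, best_energy = None, float("inf")
    for combo in itertools.product(cpu_freqs, gpu_freqs, mem_freqs, batch_sizes):
        cfg = Config(*combo)
        latency_ms, energy_j = measure(cfg)
        if latency_ms > latency_slo_ms:
            continue  # violates the latency SLO
        if energy_j < best_energy:
            best, best_energy = cfg, energy_j
    return best
```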
The optimal configuration is then installed on the inference server. At runtime, the client program processes the input and sends inference requests to the inference server. Meanwhile, if there are pending fine-tuning requests, the performance predictor estimates the inference latency under concurrent fine-tuning and decides whether the latency SLO can still be satisfied if fine-tuning is scheduled. A detailed discussion of performance prediction can be found in Section 6. The scheduler then determines a new configuration that satisfies the latency SLO while minimizing per-query energy consumption. If such a configuration is attainable, it schedules fine-tuning requests iteration by iteration until all pending requests are finished.
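A minimal sketch of this runtime decision loop is shown below, assuming hypothetical callbacks: `predictor` estimates inference latency under concurrent fine-tuning, `search_config` returns the most energy-efficient configuration whose predicted latency meets the SLO (or `None`), `apply_config` writes the hardware frequencies, and `run_finetune_iteration` executes one fine-tuning step.

```python
def schedule_finetuning(pending_iters, slo_ms, predictor, search_config,
                        apply_config, run_finetune_iteration):
    """Run pending fine-tuning iterations alongside inference when the SLO allows."""
    while pending_iters > 0:
        # Find a configuration whose predicted inference latency, with
        # fine-tuning running concurrently, still satisfies the SLO.
        cfg = search_config(predictor, slo_ms)
        if cfg is None:
            # No configuration can absorb the interference; defer the
            # remaining iterations and keep serving inference only.
            return pending_iters
        apply_config(cfg)
        run_finetune_iteration()  # fine-tuning proceeds iteration by iteration
        pending_iters -= 1
    return 0
```

Scheduling one iteration at a time lets the scheduler re-check the SLO between iterations and back off as soon as the predicted latency no longer fits.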
Online vs. Offline: Adjusting the frequency of each hardware component entails writing to one or more hardware configuration files, and each write takes approximately 17 ms. On Jetson TX2 and Orin, each CPU core, the GPU, and memory have separate configuration files that determine their operating frequencies. As a result, setting the operating frequencies for all CPU cores, the GPU, and memory can require up to 150 ms. This duration alone could exceed the latency SLO of many applications, without even accounting for the additional overhead of the frequency changes taking effect. Since the latency SLO for a specific workload does not change frequently, PolyThrottle determines the optimal hardware configuration before deployment and only performs online adjustments to accommodate fine-tuning workloads.
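For reference, the snippet below illustrates what a single frequency update might look like on a Jetson-class device. The sysfs-style paths are illustrative assumptions that differ between TX2 and Orin, and writing to them requires root privileges.

```python
import time

# Illustrative paths only; the exact configuration files vary by device.
CPU_FREQ_FILE_TEMPLATE = "/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_max_freq"
GPU_FREQ_FILE = "/sys/devices/gpu.0/devfreq/17000000.gp10b/max_freq"

def set_frequency(path, freq_hz):
    """Write one frequency value to one hardware configuration file.

    Each such write takes roughly 17 ms, so touching every CPU core, the
    GPU, and memory separately can add up to ~150 ms -- which is why
    PolyThrottle applies these changes offline rather than per request.
    """
    start = time.monotonic()
    with open(path, "w") as f:
        f.write(str(freq_hz))
    return (time.monotonic() - start) * 1000.0  # elapsed time in ms
```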