Panopticus dynamically schedules the inference configuration of the multi-branch model at runtime to maximize detection accuracy within the latency target $\tau$. Specifically, the goal of the scheduler is to find the optimal selection $S^*$ of image-branch pairs, given the possible inference branches $\{B_j\}_{j=0}^{m}$ and multi-view images $\{I_t^i\}_{i=0}^{n}$ at time $t$. The key idea of the solution is to utilize the performance predictions for image-branch pairs. The scheduler's objective function is formulated as follows:

$$S^* = \operatorname*{arg\,max}_{x} \sum_{i=0}^{n} \sum_{j=0}^{m} x_{ij} \, f_A(B_j, I_t^i) \quad \text{s.t.} \quad \sum_{i=0}^{n} \sum_{j=0}^{m} x_{ij} \, f_L(B_j, I_t^i) \le \tau, \qquad (3)$$

where $f_A$ and $f_L$ denote the branch accuracy and latency predictors, respectively. Here, $x_{ij} \in \{0, 1\}$ indicates a binary decision, where $x_{ij} = 1$ denotes the decision to run the $j$-th branch for the $i$-th image. For each image, only one branch is allocated, i.e., $\sum_{j=0}^{m} x_{ij} = 1, \forall i \in \{0, \ldots, n\}$. Conversely, multiple images can be processed by a single branch, resulting in $m^n$ possible configurations in total. In the following, we describe the method to find $S^*$.
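The selection problem above can be illustrated with a small sketch. The accuracy and latency tables below are hypothetical placeholder values (in the real system they come from the predictors $f_A$ and $f_L$), and brute-force enumeration of the $m^n$ configurations is used only for illustration; the system itself uses an ILP solver, as described in Section 6.2.

```python
from itertools import product

# Hypothetical predicted accuracy f_A and latency f_L (ms) tables for
# n=2 camera views and m=3 branches; the values are illustrative only.
acc = [[0.42, 0.55, 0.61],
       [0.38, 0.49, 0.58]]
lat = [[10.0, 18.0, 30.0],
       [10.0, 18.0, 30.0]]
tau = 40.0  # latency target (ms)

n, m = len(acc), len(acc[0])
best_sel, best_score = None, -1.0
# Enumerate all m^n configurations (one branch j per image i).
for sel in product(range(m), repeat=n):
    total_lat = sum(lat[i][j] for i, j in enumerate(sel))
    if total_lat > tau:
        continue  # violates the latency constraint
    score = sum(acc[i][j] for i, j in enumerate(sel))
    if score > best_score:
        best_sel, best_score = sel, score
```

Here `best_sel[i]` gives the branch index chosen for image `i`; exhaustive search is feasible only because `m` and `n` are tiny in this toy setup.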
6.1 Performance Prediction
To find the optimal selection of image-branch pairs, Panopticus first predicts the expected accuracy and latency of each pair. As described in Section 3, the modelโs detection capabilities are influenced by spatial characteristics, such as the number, location, and movement of objects. Panopticus predicts changes in the spatial distribution considering these factors. The forecasted spatial distribution is then harnessed to predict the expected accuracy and processing time of possible branch-image selections.
Prediction of spatial distribution. To predict the expected spatial distribution at the incoming time ๐ก, Panopticus utilizes the future states of tracked objects estimated via a 3D Kalman, described in Section 5.1. Based on the objectsโ state predictions such as expected locations, each object is categorized considering all property levels described in Table 2. Consequently, objects are classified into one of 80 categories (5 distance levels ร 4 velocity levels ร 4 size levels). For instance, a pedestrian standing nearby has levels of D0, V0, and S0. The predicted spatial distribution vector ๐ท โฒ ๐ก is then calculated as follows:
$$D'_t = [r_{D0V0S0}, r_{D1V0S0}, \ldots, r_{D4V3S3}], \qquad (4)$$
where each element represents the ratio of the number of objects in each category to the total number of tracked objects at time $t$. In practice, Panopticus calculates $D'^v_t$ for each camera view $v$. To do so, Panopticus identifies the camera view that contains the predicted 3D center location of each object. Accordingly, the predicted distribution vectors allow for the relative comparison of expected scene complexities across camera views.
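The categorization and normalization steps can be sketched as follows. The level thresholds below are assumptions for illustration; the actual distance, velocity, and size levels are defined in the paper's Table 2.

```python
def level(value, bounds):
    """Index of the first bound that value falls under (last level otherwise)."""
    for idx, b in enumerate(bounds):
        if value < b:
            return idx
    return len(bounds)

DIST_BOUNDS = (10.0, 20.0, 30.0, 40.0)  # -> levels D0..D4 (assumed, meters)
VEL_BOUNDS = (0.5, 2.0, 5.0)            # -> levels V0..V3 (assumed, m/s)
SIZE_BOUNDS = (2.0, 5.0, 8.0)           # -> levels S0..S3 (assumed, meters)

def spatial_distribution(objects):
    """objects: list of (distance, speed, size) predicted states for one view.
    Returns the 80-dim vector of per-category object ratios."""
    counts = [0] * (5 * 4 * 4)  # 5 distance x 4 velocity x 4 size categories
    for dist, speed, size in objects:
        d = level(dist, DIST_BOUNDS)
        v = level(speed, VEL_BOUNDS)
        s = level(size, SIZE_BOUNDS)
        counts[d * 16 + v * 4 + s] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

For example, a nearby, slow, small object (a standing pedestrian) lands in the first category, D0V0S0.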
Accuracy prediction. The goal of the accuracy predictor $f_A$ is to predict the expected accuracy of each branch-image pair. We propose an approach that models the detection performance of the pairs by utilizing the predicted spatial distribution. We realize $f_A$ with an XGBoost regressor, trained on a validation set from the nuScenes dataset. The purpose of $f_A$ is to predict the detection score (detailed in Section 8.2) of each branch based on the spatial distribution vector. For each pair of a detection branch and camera view $v$, $f_A$ takes the estimated spatial distribution $D'^v_t$ and a one-hot encoded branch type as inputs. As a result, predicted scores for the 16 detection branches are generated for each view $v$. To predict the detection score of the tracker's branch for each view $v$, $f_A$ utilizes $D'^v_t$ and the average confidence level of tracked objects. The rationale behind using the confidence level is that objects with higher certainty are more likely to appear in the near future.
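The input to $f_A$ for a detection branch can be sketched as the concatenation of the 80-dim distribution vector with a one-hot branch encoding; the resulting dimensionality (80 + 16 = 96) is our inference from the counts above, not stated explicitly in the text.

```python
import numpy as np

NUM_CATEGORIES = 80  # 5 distance x 4 velocity x 4 size levels
NUM_BRANCHES = 16    # detection branches scored per view

def accuracy_features(dist_vec, branch_id):
    """Feature vector for f_A: spatial distribution + one-hot branch type."""
    one_hot = np.zeros(NUM_BRANCHES)
    one_hot[branch_id] = 1.0
    return np.concatenate([np.asarray(dist_vec, dtype=float), one_hot])

# A regressor such as xgboost.XGBRegressor would then be trained offline on
# (features, detection score) pairs from the nuScenes validation set, e.g.:
#   f_A = xgboost.XGBRegressor().fit(X_train, y_train)
#   score = f_A.predict(accuracy_features(d_vec, j)[None, :])
```

The tracker-branch variant would swap the one-hot encoding for the average confidence of tracked objects, per the description above.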
Latency prediction. Latency prediction involves estimating the processing time of modules such as the neural networks within detection branches. In general, these modules have consistent latency profiles at runtime. Accordingly, the expected latency of each detection branch can be determined simply by summing up its modules' latency profiles. Recall that forecasting the objects' future states can be performed instantaneously. On the other hand, the latency of updating the states of tracked objects, by associating them with newly detected boxes, depends on the number of objects in a given space. We model the processing time of the state update with a simple linear regressor, trained on the same data used for $f_A$. The linear model predicts the expected update latency as a function of the number of tracked objects.
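A minimal sketch of the linear latency model, with hypothetical profiled samples (the real model is fit on the same nuScenes validation data used for $f_A$):

```python
import numpy as np

# Hypothetical (num_tracked_objects, update_latency_ms) profiling samples.
num_objects = np.array([5, 10, 20, 40, 80])
update_ms = np.array([1.1, 1.9, 3.8, 7.6, 15.1])

# Least-squares linear fit: latency ~= slope * n + intercept.
slope, intercept = np.polyfit(num_objects, update_ms, deg=1)

def predict_update_latency(n):
    """Expected tracker state-update latency (ms) for n tracked objects."""
    return slope * n + intercept

# Total branch latency = sum of static module profiles
#                      + predict_update_latency(current object count).
```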
6.2 Execution Scheduling
Finding the optimal selection between multi-view images and inference branches can be solved using a combinatorial optimization solver. We use integer linear programming (ILP) because the image-branch selection is represented with binary decision variables, and our objective function and constraint are linear. Specifically, we adopt the Simplex and branch-and-cut algorithms to efficiently find the optimal selection. The scheduling process, including performance prediction, takes up to 3 ms even on the Orin Nano with its limited computing power. Algorithm 1 shows the operational flow of our system, which adapts to the surrounding space. At a given time $t$, the system first estimates the spatial distributions for each camera view (line 2), based on the state predictions of tracked objects (line 11). Utilizing the spatial distributions, the system predicts the detection scores of all
possible pairs of branches and camera views (line 3). To ensure a uniform comparison across camera views, each score is normalized against the score of the most powerful branch deployed on the target device. Next, the estimated latency for each branch is acquired using offline latency profiles (line 4). As our model includes modules that operate statically, such as the BEV head and tracker update, the effective latency limit is calculated by subtracting the estimated latencies of these modules from the latency target (lines 5-7). Then, the optimal selection of branch-image pairs is determined using the ILP solver (line 8). Taking the schedulerโs decision and incoming images as inputs, the model generates 3D bounding boxes (lines 12-14). These outcomes are subsequently utilized to update the states of tracked objects (line 15). An outdated object without a matched detection box is penalized by halving the confidence level and removed if the confidence is lower than the threshold. Finally, a downstream application utilizes the information of detected objects to provide its functionality such as obstacle avoidance (line 16).
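The track-pruning step at the end of the loop can be sketched as follows; the confidence threshold value is an assumption for illustration, not taken from the text.

```python
CONF_THRESHOLD = 0.1  # assumed removal threshold

def prune_tracks(tracks, matched_ids):
    """tracks: {track_id: confidence}; matched_ids: tracks with a matched
    detection box this frame. Returns the surviving tracks."""
    survivors = {}
    for tid, conf in tracks.items():
        if tid not in matched_ids:
            conf *= 0.5  # penalize outdated tracks without a matched box
        if conf >= CONF_THRESHOLD:
            survivors[tid] = conf  # keep; drop tracks below the threshold
    return survivors
```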