Autonomous vehicles need to see the world fast and with precision. But real-time perception is tough when your model is drowning in millions of LiDAR points and high-res images. The solution? Sensor fusion, model quantization, and a sprinkle of TensorRT acceleration.
Let me walk you through how I optimized a multi-modal perception model to run 40% faster while keeping it sharp enough to dodge pedestrians and parked cars.
Autonomous vehicles rely on LiDAR and cameras, each with strengths and weaknesses. LiDAR excels at depth estimation, but it spits out millions of points per second. Cameras provide rich textures and colors but lack depth. Fusing both in a single pipeline means balancing speed and precision.
By the time LiDAR finishes scanning, the camera has already captured a new frame. This misalignment means object positions don’t match.
Fix: I used ego-motion compensation with Kalman filters to adjust LiDAR points dynamically. Here’s a snippet of how it works in Python:
import numpy as np
class EgoMotionCompensation:
def __init__(self, initial_pose, process_noise, measurement_noise):
self.state = initial_pose
self.P = np.eye(5) * 0.01
self.Q = np.diag(process_noise)
self.R = np.diag(measurement_noise)
def predict(self, dt):
# Unpack state
x, y, yaw, vx, vy = self.state
x_pred = x + vx * dt
y_pred = y + vy * dt
yaw_pred = yaw
self.state = np.array([x_pred, y_pred, yaw_pred, vx, vy])
F = np.eye(5)
F[0, 3] = dt
F[1, 4] = dt
self.P = F @ self.P @ F.T + self.Q
def update(self, measurement):
#measurement = [x_meas, y_meas, yaw_meas]
H = np.array([
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0]
])
z = np.array(measurement)
y_err = z - H @ self.state
S = H @ self.P @ H.T + self.R
K = self.P @ H.T @ np.linalg.inv(S)
self.state += K @ y_err
I = np.eye(5)
self.P = (I - K @ H) @ self.P
def apply_correction(self, lidar_points, dt):
#predict and update using IMU/odometry
self.predict(dt)
#Suppose we have a measurement from the IMU or odometry
#assume partial measurement [x_meas, y_meas, yaw_meas]
measurement = [self.state[0], self.state[1], self.state[2]]
self.update(measurement)
x, y, yaw, _, _ = self.state
cos_yaw, sin_yaw = np.cos(yaw), np.sin(yaw)
transformed = []
for px, py, pz in lidar_points:
#rotate and translate
x_new = (px * cos_yaw - py * sin_yaw) + x
y_new = (px * sin_yaw + py * cos_yaw) + y
transformed.append((x_new, y_new, pz))
return np.array(transformed)
This tweak reduced misalignment by 85%, boosting object detection accuracy.
Now that we fixed the time lag, we have another issue: LiDAR gives point clouds, while cameras provide RGB images. We needed a single representation for both.
We built a Bird’s Eye View (BEV) model that projects LiDAR points onto the image plane and aligns them with camera pixels.
Here’s how the projection works in Python:
import torch
import torch.nn.functional as F
import cv2
import numpy as np
def voxelize_points(lidar_points, voxel_size=(0.1, 0.1, 0.2)):
#Transform LiDAR points into 3D voxel indices
coords = np.floor(lidar_points / np.array(voxel_size)).astype(int)
coords -= coords.min(0)
voxels = np.zeros((coords[:,0].max()+1, coords[:,1].max()+1, coords[:,2].max()+1))
for (x_idx, y_idx, z_idx) in coords:
voxels[x_idx, y_idx, z_idx] += 1
return voxels
def project_voxel_to_image(voxel_coords, camera_matrix, rvec, tvec, voxel_size):
"""
Projects 3D voxel centers into image space using a known camera matrix (including distortion).
"""
#compute center for each voxel
voxel_centers = []
for (x_idx, y_idx, z_idx) in voxel_coords:
center_x = x_idx * voxel_size[0]
center_y = y_idx * voxel_size[1]
center_z = z_idx * voxel_size[2]
voxel_centers.append([center_x, center_y, center_z])
voxel_centers = np.array(voxel_centers, dtype=np.float32).reshape(-1, 1, 3)
#project to image plane
img_pts, _ = cv2.projectPoints(voxel_centers, rvec, tvec, camera_matrix, None)
return img_pts
def fuse_features(camera_features, lidar_features):
"""
cross-attention: queries come from camera_features,
keys/values from lidar_features, to refine camera representation
based on 3D data.
"""
#camera_features, lidar_features are [Batch, Channels, Height, Width]
fused = torch.cat([camera_features, lidar_features], dim=1)
fused = F.relu(F.conv2d(fused, torch.randn(128, fused.size(1), 3, 3)))
return fused
This improved detection by 29%, helping the model understand its surroundings better.
Even with better fusion, inference time was killing us. Running deep learning models on embedded hardware isn’t the same as cloud GPUs. We needed to shrink the model without losing accuracy.
Converted PyTorch to TensorRT: TensorRT optimizes computation graphs, fuses layers, and reduces memory overhead.
Applied INT8 Quantization: This compressed 32-bit floating points into 8-bit representations making computations 4x faster.
Quantization technique that was used:
import torch
from torch.ao.quantization import (
get_default_qconfig_mapping,
prepare_fx,
convert_fx
)
def apply_quantization(model, calibration_data):
model.eval()
#Use a QConfigMapping that sets INT8 or mixed precision
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
example_inputs = (calibration_data[0].unsqueeze(0), calibration_data[1].unsqueeze(0))
prepared_model = prepare_fx(model, qconfig_mapping, example_inputs)
#Run calibration
with torch.no_grad():
for data in calibration_data:
fused_input = data[0].unsqueeze(0), data[1].unsqueeze(0)
prepared_model(*fused_input)
quantized_model = convert_fx(prepared_model)
return quantized_model
An example of how I converted the PyTorch model:
import tensorrt as trt
import os
def build_tensorrt_engine(onnx_file_path, precision="int8", max_batch_size=4, input_shape=(3, 256, 256)):
trt_logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(trt_logger)
network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags=network_flags)
parser = trt.OnnxParser(network, trt_logger)
if not os.path.exists(onnx_file_path):
raise FileNotFoundError(f"ONNX file not found: {onnx_file_path}")
with open(onnx_file_path, "rb") as f:
if not parser.parse(f.read()):
for i in range(parser.num_errors):
print(parser.get_error(i))
raise ValueError("Failed to parse the ONNX file.")
config = builder.create_builder_config()
if precision == "int8":
config.set_flag(trt.BuilderFlag.INT8)
elif precision == "fp16":
config.set_flag(trt.BuilderFlag.FP16)
profile = builder.create_optimization_profile()
#define min/opt/max batch sizes
min_shape = (1,) + input_shape
opt_shape = (max_batch_size // 2,) + input_shape
max_shape = (max_batch_size,) + input_shape
input_name = network.get_input(0).name
profile.set_shape(input_name, min_shape, opt_shape, max_shape)
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config)
return engine
We’re still exploring end-to-end models, where perception and planning happen in one deep learning pipeline. The idea? Instead of multiple modules for object detection, lane estimation, and path planning, we let a neural network learn everything from scratch.
Some early results show promise but training needs massive datasets and interpretability is a challenge. Still, we’re inching toward fully AI-driven navigation.
By blending sensor fusion, ego-motion correction, and TensorRT acceleration, we made autonomous AI faster and smarter.
The next step? Scaling it beyond research and getting it on the road. If you’re working on AV models, let’s connect, I’d love to hear how you’re tackling inference speed and perception challenges.