
How I Made Autonomous Vehicle AI 40% Faster Without Losing Accuracy

by Vineeth Reddy Vatti, March 8th, 2025

Too Long; Didn't Read

Autonomous vehicles need to see the world fast and with precision. But real-time perception is tough when your model is drowning in millions of LiDAR points and high-res images. The solution is sensor fusion, model quantization, and a sprinkle of TensorRT acceleration.


Autonomous vehicles need to see the world fast and with precision. But real-time perception is tough when your model is drowning in millions of LiDAR points and high-res images. The solution? Sensor fusion, model quantization, and a sprinkle of TensorRT acceleration.

Let me walk you through how I optimized a multi-modal perception model to run 40% faster while keeping it sharp enough to dodge pedestrians and parked cars.


The Bottleneck: A Million Points, A Million Problems

Autonomous vehicles rely on LiDAR and cameras, each with strengths and weaknesses. LiDAR excels at depth estimation, but it spits out millions of points per second. Cameras provide rich textures and colors but lack depth. Fusing both in a single pipeline means balancing speed and precision.

Problem 1: The Lag Between LiDAR and Camera Data

By the time a LiDAR sweep finishes, the camera has already captured a new frame. This temporal misalignment means the same object shows up at different positions in the two sensors.

Fix: I used ego-motion compensation with a Kalman filter to adjust LiDAR points dynamically. Here’s a snippet of how it works in Python:

import numpy as np

class EgoMotionCompensation:
    def __init__(self, initial_pose, process_noise, measurement_noise):
        # State vector: [x, y, yaw, vx, vy]
        self.state = np.array(initial_pose, dtype=float)
        self.P = np.eye(5) * 0.01            # state covariance
        self.Q = np.diag(process_noise)      # process noise (5 values)
        self.R = np.diag(measurement_noise)  # measurement noise (3 values)

    def predict(self, dt):
        # Constant-velocity motion model
        x, y, yaw, vx, vy = self.state
        x_pred = x + vx * dt
        y_pred = y + vy * dt
        yaw_pred = yaw
        self.state = np.array([x_pred, y_pred, yaw_pred, vx, vy])
        # Linearized state-transition matrix
        F = np.eye(5)
        F[0, 3] = dt
        F[1, 4] = dt
        self.P = F @ self.P @ F.T + self.Q

    def update(self, measurement):
        # measurement = [x_meas, y_meas, yaw_meas]
        H = np.array([
            [1, 0, 0, 0, 0],
            [0, 1, 0, 0, 0],
            [0, 0, 1, 0, 0]
        ])
        z = np.array(measurement)
        y_err = z - H @ self.state            # innovation
        S = H @ self.P @ H.T + self.R         # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)   # Kalman gain
        self.state = self.state + K @ y_err
        I = np.eye(5)
        self.P = (I - K @ H) @ self.P

    def apply_correction(self, lidar_points, dt, measurement=None):
        # Predict the ego pose forward to the camera timestamp
        self.predict(dt)
        # Update with a pose measurement [x_meas, y_meas, yaw_meas] from the
        # IMU/odometry; fall back to the predicted pose if none is available
        if measurement is None:
            measurement = [self.state[0], self.state[1], self.state[2]]
        self.update(measurement)

        # Transform LiDAR points into the corrected ego frame
        x, y, yaw, _, _ = self.state
        cos_yaw, sin_yaw = np.cos(yaw), np.sin(yaw)
        transformed = []
        for px, py, pz in lidar_points:
            # Rotate about z by yaw, then translate
            x_new = (px * cos_yaw - py * sin_yaw) + x
            y_new = (px * sin_yaw + py * cos_yaw) + y
            transformed.append((x_new, y_new, pz))
        return np.array(transformed)
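
For context, here is a minimal usage sketch. The pose, noise values, point cloud, and timing below are illustrative placeholders, not values from our pipeline:

# Hypothetical values for illustration only
ego = EgoMotionCompensation(
    initial_pose=[0.0, 0.0, 0.0, 8.0, 0.0],       # x, y, yaw, vx, vy
    process_noise=[0.01, 0.01, 0.001, 0.1, 0.1],
    measurement_noise=[0.05, 0.05, 0.01],
)

lidar_points = np.random.rand(1000, 3) * 50.0     # fake point cloud (x, y, z)
dt = 0.05                                         # ~50 ms between LiDAR sweep and camera frame

aligned = ego.apply_correction(lidar_points, dt)
print(aligned.shape)  # (1000, 3)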

This tweak reduced misalignment by 85%, boosting object detection accuracy.


Problem 2: Sensor Fusion (Making LiDAR and Cameras Play Nice)

Now that we fixed the time lag, we have another issue: LiDAR gives point clouds, while cameras provide RGB images. We needed a single representation for both.


We built a Bird’s Eye View (BEV) model that projects LiDAR points onto the image plane and aligns them with camera pixels.

How We Did It:

  1. Calibrated LiDAR and camera with intrinsic & extrinsic matrices.
  2. Projected LiDAR points into the image space.
  3. Fused features using a cross-attention transformer.


Here’s how the voxelization, projection, and fusion work in Python:

import torch
import torch.nn.functional as F
import cv2
import numpy as np

def voxelize_points(lidar_points, voxel_size=(0.1, 0.1, 0.2)):
    # Transform LiDAR points into 3D voxel indices and count points per voxel
    coords = np.floor(lidar_points / np.array(voxel_size)).astype(int)
    coords -= coords.min(0)
    voxels = np.zeros((coords[:, 0].max() + 1, coords[:, 1].max() + 1, coords[:, 2].max() + 1))
    for (x_idx, y_idx, z_idx) in coords:
        voxels[x_idx, y_idx, z_idx] += 1
    return voxels

def project_voxel_to_image(voxel_coords, camera_matrix, rvec, tvec, voxel_size):
    """
    Projects 3D voxel centers into image space using the camera intrinsics
    (camera_matrix) and extrinsics (rvec, tvec). Lens distortion coefficients
    are omitted here for brevity.
    """
    # Compute the metric center of each voxel
    voxel_centers = []
    for (x_idx, y_idx, z_idx) in voxel_coords:
        center_x = x_idx * voxel_size[0]
        center_y = y_idx * voxel_size[1]
        center_z = z_idx * voxel_size[2]
        voxel_centers.append([center_x, center_y, center_z])

    voxel_centers = np.array(voxel_centers, dtype=np.float32).reshape(-1, 1, 3)
    # Project to the image plane
    img_pts, _ = cv2.projectPoints(voxel_centers, rvec, tvec, camera_matrix, None)
    return img_pts

def fuse_features(camera_features, lidar_features):
    """
    Fuses camera and LiDAR feature maps. In the full model this step is a
    cross-attention transformer (camera features as queries, LiDAR features
    as keys/values); the concatenation + convolution below is a simplified
    stand-in.
    """
    # camera_features, lidar_features are [Batch, Channels, Height, Width]
    fused = torch.cat([camera_features, lidar_features], dim=1)
    # NOTE: random weights are a placeholder; a real model uses a learned
    # nn.Conv2d (or the attention block described above)
    weight = torch.randn(128, fused.size(1), 3, 3)
    fused = F.relu(F.conv2d(fused, weight, padding=1))
    return fused
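
To make the data flow concrete, here’s a usage sketch for the functions above. The calibration matrices, point cloud, and feature maps are dummy values for illustration only:

# Dummy calibration (illustrative, not real calibration data)
camera_matrix = np.array([[1000.0, 0.0, 640.0],
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]], dtype=np.float32)
rvec = np.zeros(3, dtype=np.float32)                 # extrinsic rotation (Rodrigues vector)
tvec = np.array([0.0, 0.0, 2.0], dtype=np.float32)   # extrinsic translation (keeps points in front of the camera)

lidar_points = np.random.rand(5000, 3) * np.array([50.0, 50.0, 4.0])
voxels = voxelize_points(lidar_points, voxel_size=(0.1, 0.1, 0.2))
occupied = np.argwhere(voxels > 0)                   # indices of occupied voxels
pixels = project_voxel_to_image(occupied, camera_matrix, rvec, tvec, (0.1, 0.1, 0.2))

# Dummy feature maps standing in for a camera backbone and a LiDAR encoder
camera_features = torch.randn(1, 64, 128, 128)
lidar_features = torch.randn(1, 64, 128, 128)
fused = fuse_features(camera_features, lidar_features)
print(pixels.shape, fused.shape)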

This improved detection by 29%, helping the model understand its surroundings better.


Slashing Inference Time by 40% With TensorRT and Quantization

Even with better fusion, inference time was killing us. Running deep learning models on embedded hardware isn’t the same as running them on cloud GPUs. We needed to shrink the model without losing accuracy.

The Solution: TensorRT + Quantization

  1. Converted the PyTorch model to TensorRT (via ONNX): TensorRT optimizes computation graphs, fuses layers, and reduces memory overhead.


  2. Applied INT8 Quantization: This compressed 32-bit floating-point values into 8-bit representations, making computations 4x faster.


Here’s the quantization routine that was used:

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

def apply_quantization(model, calibration_data):
    model.eval()

    # QConfigMapping that targets INT8 with the fbgemm (x86) backend
    qconfig_mapping = get_default_qconfig_mapping("fbgemm")
    # calibration_data is a list of (camera_tensor, lidar_tensor) pairs
    example_inputs = (calibration_data[0][0].unsqueeze(0), calibration_data[0][1].unsqueeze(0))
    prepared_model = prepare_fx(model, qconfig_mapping, example_inputs)

    # Run calibration so the observers can record activation ranges
    with torch.no_grad():
        for data in calibration_data:
            fused_input = data[0].unsqueeze(0), data[1].unsqueeze(0)
            prepared_model(*fused_input)

    quantized_model = convert_fx(prepared_model)
    return quantized_model
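
Before TensorRT can consume the model, it has to be exported to ONNX. Here’s a minimal export sketch; the model handle, input names, and shapes are placeholders, and in this flow the FP32 model is exported and TensorRT applies INT8 during engine building:

import torch

# Illustrative shapes; the real model's inputs and sizes will differ
dummy_camera = torch.randn(1, 3, 256, 256)
dummy_lidar = torch.randn(1, 64, 256, 256)

torch.onnx.export(
    model,                                   # the FP32 PyTorch perception model
    (dummy_camera, dummy_lidar),
    "perception_model.onnx",
    input_names=["camera", "lidar"],
    output_names=["detections"],
    # Keep the batch dimension dynamic so the TensorRT optimization profile can vary it
    dynamic_axes={"camera": {0: "batch"},
                  "lidar": {0: "batch"},
                  "detections": {0: "batch"}},
    opset_version=17,
)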


And here’s how the exported ONNX model gets compiled into a TensorRT engine:

import tensorrt as trt
import os

def build_tensorrt_engine(onnx_file_path, precision="int8", max_batch_size=4, input_shape=(3, 256, 256)):
    trt_logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(trt_logger)
    # ONNX models require an explicit-batch network definition
    network_flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags=network_flags)
    parser = trt.OnnxParser(network, trt_logger)

    if not os.path.exists(onnx_file_path):
        raise FileNotFoundError(f"ONNX file not found: {onnx_file_path}")

    with open(onnx_file_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise ValueError("Failed to parse the ONNX file.")

    config = builder.create_builder_config()
    if precision == "int8":
        # INT8 also requires a calibrator (config.int8_calibrator) or explicit
        # per-tensor dynamic ranges; omitted here for brevity
        config.set_flag(trt.BuilderFlag.INT8)
    elif precision == "fp16":
        config.set_flag(trt.BuilderFlag.FP16)

    # Optimization profile for a dynamic batch dimension; the ONNX model must
    # have been exported with a dynamic batch axis for set_shape to succeed
    profile = builder.create_optimization_profile()
    min_shape = (1,) + input_shape
    opt_shape = (max_batch_size // 2,) + input_shape
    max_shape = (max_batch_size,) + input_shape

    # For brevity the profile covers only the first input; a multi-input model
    # needs one set_shape call per input
    input_name = network.get_input(0).name
    profile.set_shape(input_name, min_shape, opt_shape, max_shape)
    config.add_optimization_profile(profile)

    # build_engine is the TensorRT 8.x API; newer releases use
    # builder.build_serialized_network(network, config) instead
    engine = builder.build_engine(network, config)
    return engine
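
Once built, the engine can be serialized to disk and reloaded on the vehicle without rebuilding. A short sketch (the file names are examples; FP16 is used here because INT8 additionally needs a calibrator):

engine = build_tensorrt_engine("perception_model.onnx", precision="fp16")

# Serialize once at build time; deserialize at runtime on the target device
with open("perception_model.engine", "wb") as f:
    f.write(engine.serialize())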

Results:

  • Inference time dropped from 250ms → 75ms per frame.
  • Memory footprint reduced by 40%.
  • Real-time performance improved to 5 FPS.

Next Steps: End-to-End Learning

We’re still exploring end-to-end models, where perception and planning happen in one deep learning pipeline. The idea? Instead of multiple modules for object detection, lane estimation, and path planning, we let a neural network learn everything from scratch.


Some early results show promise, but training needs massive datasets and interpretability is a challenge. Still, we’re inching toward fully AI-driven navigation.


Final Thoughts: AI That Sees and Thinks Fast

By blending sensor fusion, ego-motion correction, and TensorRT acceleration, we made autonomous AI faster and smarter.


The next step? Scaling it beyond research and getting it on the road. If you’re working on AV models, let’s connect; I’d love to hear how you’re tackling inference speed and perception challenges.

