
Turbocharging AI Sentiment Analysis: How We Hit 50K RPS with GPU Micro-services

by Vineeth Reddy Vatti, March 7th, 2025

Too Long; Didn't Read

The sentiment analysis stack was one big codebase for data ingestion, model inference, logging, and storage. It worked great until traffic shot up, forcing us to over-provision every single component. In this post, I'll show you how we pivoted to microservices, leveraging Kubernetes and a streaming ETL pipeline.


I remember the day our single-process sentiment analysis pipeline finally buckled under a surge of requests. The logs were ominous: thread pools jammed, batch jobs stalled, and memory soared. That's when we decided to break free of our monolithic design and rebuild everything from scratch. In this post, I'll show you how we pivoted to microservices, leveraging Kubernetes, GPU-aware autoscaling, and a streaming ETL pipeline to handle massive social data in near real time.


Monoliths Are Fine, Until They’re Not

Originally, our sentiment analysis stack was one big codebase covering data ingestion, tokenization, model inference, logging, and storage. It worked great until traffic shot up, forcing us to over-provision every single component. Updates were worse: re-deploying the entire application just to patch the inference model felt wasteful.


By switching to micro-services, we isolated each function:

  1. API Gateway: Routes requests and handles authentication.
  2. Text Cleanup & Tokenization: CPU-friendly text handling.
  3. GPU-Based Sentiment Service: Actual model inference.
  4. Data Storage & Logs: Keeps final sentiment results and error logs.
  5. Monitoring: Observes performance with minimal overhead.


We can now scale each piece independently, boosting performance at specific bottlenecks.
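
To make the split concrete, here's a minimal sketch of what the gateway tier's request path could look like. This isn't our production gateway; the service URLs, route, and payload fields are illustrative placeholders.

# Hypothetical gateway sketch: forward a request through the tokenization
# service, then the GPU-backed sentiment service. URLs and field names are
# placeholders, not our actual internal contracts.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

TOKENIZER_URL = "http://text-cleanup:8000/tokenize"   # assumed service name
INFERENCE_URL = "http://sentiment-gpu:5000/predict"   # assumed service name

@app.post("/analyze")
async def analyze(req: Request):
    body = await req.json()
    async with httpx.AsyncClient(timeout=5.0) as client:
        # 1) CPU-friendly cleanup and tokenization
        cleaned = await client.post(TOKENIZER_URL, json={"text": body.get("text", "")})
        # 2) GPU-backed sentiment inference
        sentiment = await client.post(INFERENCE_URL, json=cleaned.json())
    return sentiment.json()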


Containerizing for GPU Inference

Our first big step was containerization. Let’s look at a Dockerfile for the GPU-enabled inference service:

FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04

WORKDIR /app

# Install Python and system dependencies
RUN apt-get update && \
    apt-get install -y python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --upgrade pip

# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy project files
COPY . .

EXPOSE 5000
CMD ["python3", "sentiment_inference.py"]

This base image includes the CUDA drivers and libraries needed for GPU acceleration. Once we build the image and push it to a registry, it's ready for orchestration.
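
Before handing the image to Kubernetes, it's worth a quick sanity check that the container actually sees a GPU. A minimal check, assuming torch is in requirements.txt:

# gpu_check.py - confirm CUDA is visible inside the container
# (assumes torch is listed in requirements.txt)
import torch

if torch.cuda.is_available():
    print(f"CUDA OK: {torch.cuda.get_device_name(0)}")
else:
    raise SystemExit("No GPU visible - check the NVIDIA container runtime / device plugin")

Running this inside the container with GPU access enabled catches missing-driver or runtime problems before the pod ever lands on a node.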

Kubernetes: GPU Autoscaling in Action

With Kubernetes (K8s), we can deploy and scale each micro-service. We bind inference pods to GPU-backed node types and auto-scale based on GPU utilization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-inference-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-inference-gpu
  template:
    metadata:
      labels:
        app: sentiment-inference-gpu
    spec:
      nodeSelector:
        kubernetes.io/instance-type: "g4dn.xlarge"
      containers:
      - name: inference-container
        image: myrepo/sentiment-inference:gpu-latest
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "2"
          requests:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "1"

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-inference-gpu
  minReplicas: 2
  maxReplicas: 15
  metrics:
  - type: Pods
    pods:
      metric:
        name: nvidia_gpu_utilization
      target:
        type: AverageValue
        averageValue: "70"

Whenever average GPU utilization across the inference pods hits our 70% threshold, Kubernetes spins up additional pods. This keeps the system snappy under heavy load while avoiding unnecessary costs during quiet periods. Note that a pod-level metric like nvidia_gpu_utilization isn't built into Kubernetes; it has to be exposed through the custom metrics API (for example, via a DCGM exporter scraped by Prometheus and a custom-metrics adapter).

Achieving 50K RPS with Batch Inference & Async I/O

Issuing a separate inference call for each request throttles performance, so we batch multiple requests together for better GPU utilization:

import asyncio
from queue import Empty, Queue
from threading import Thread

import tensorrt as trt
from fastapi import FastAPI, Request

app = FastAPI()

REQUEST_QUEUE = Queue(maxsize=10000)
BATCH_SIZE = 32

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
engine_path = "models/sentiment_model.trt"

def load_trt_engine():
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_trt_engine()

def inference_worker():
    while True:
        batch = []
        try:
            # Block for the first request so the loop doesn't spin while idle
            batch.append(REQUEST_QUEUE.get(timeout=1.0))
        except Empty:
            continue

        # Drain whatever else is already queued, up to BATCH_SIZE
        while len(batch) < BATCH_SIZE:
            try:
                batch.append(REQUEST_QUEUE.get_nowait())
            except Empty:
                break

        texts = [item["text"] for item in batch]
        scores = run_tensorrt_inference(engine, texts)  # TensorRT helper (defined elsewhere) scoring the whole batch at once
        for item, score in zip(batch, scores):
            # asyncio futures aren't thread-safe; hand the result back to the event loop
            item["loop"].call_soon_threadsafe(item["future"].set_result, float(score))

Thread(target=inference_worker, daemon=True).start()

@app.post("/predict")
async def predict(req: Request):
    body = await req.json()
    text = body.get("text", "")
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    REQUEST_QUEUE.put({"text": text, "future": future, "loop": loop})
    result = await future

    return {"sentiment": "positive" if result > 0.5 else "negative"}

This strategy keeps GPU resources humming efficiently, leading to dramatic throughput gains.
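
To see the batching in action, you can fire a burst of concurrent requests at the endpoint. Here's a rough client sketch, assuming the service is reachable at localhost:5000 and httpx is installed; it's a smoke test, not a benchmark:

# Fire a burst of concurrent requests so the worker actually sees full batches.
import asyncio
import httpx

async def main():
    async with httpx.AsyncClient(base_url="http://localhost:5000", timeout=10.0) as client:
        tasks = [
            client.post("/predict", json={"text": f"sample tweet {i}"})
            for i in range(256)
        ]
        responses = await asyncio.gather(*tasks)
    print(responses[0].json())

asyncio.run(main())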

Real-Time ETL: Kafka, Spark, and Cloud Storage

We also needed to handle high-volume social data ingestion. Our pipeline uses Kafka for streaming, Spark Structured Streaming for real-time transformation, and Redshift for storage; the example below writes to the console so it stays self-contained.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("TwitterETLPipeline").getOrCreate()

schema = StructType([
    StructField("tweet_id", StringType()),
    StructField("text", StringType()),
    StructField("user", StringType())
])

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
    .option("subscribe", "tweets") \
    .option("startingOffsets", "latest") \
    .load()

parsed_df = df.select(from_json(col("value").cast("string"), schema).alias("tweet"))

def custom_preprocess(txt):
    # Strip hashtags and lowercase; guard against null tweet text
    return (txt or "").replace("#", "").lower()

udf_preprocess = udf(custom_preprocess, StringType())

clean_df = parsed_df.select(
    col("tweet.tweet_id").alias("id"),
    udf_preprocess(col("tweet.text")).alias("clean_text"),
    col("tweet.user").alias("username")
)

query = clean_df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()

Spark picks up raw tweets from Kafka, cleans them, and sends them on to be stored or scored. We can scale both Kafka and Spark to accommodate millions of tweets per hour.
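
In production the console sink is just a stand-in; the cleaned records ultimately land in Redshift. One way to wire that up is foreachBatch, sketched below with a plain JDBC sink; the connection details are placeholders, and at real volume the dedicated Redshift connector (staging through S3) is usually the better choice.

# Sketch: swap the console sink for a Redshift sink via foreachBatch.
# The JDBC URL, table, and credentials below are placeholders.
def write_to_redshift(batch_df, batch_id):
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/analytics")
        .option("dbtable", "tweet_sentiment_raw")
        .option("user", "etl_user")
        .option("password", "********")
        .mode("append")
        .save())

query = clean_df \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(write_to_redshift) \
    .start()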

Pain Points & Lessons Learned

Early on, we hit puzzling memory issues because our GPU resource limits were mismatched with the physical hardware, and pods crashed seemingly at random under load. We also realized that tuning batch sizes is a balancing act: for internal analytics we want bigger batches, while for end-user requests we keep them modest to minimize latency.
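
One way to keep that trade-off tunable is to cap both the batch size and the time any single request may wait for a batch, and drive both from the environment so the same image can run a throughput profile for analytics and a latency profile for user-facing traffic. A sketch with illustrative names, not a drop-in from the service above:

# Illustrative batching knobs, driven by environment variables.
import os
import time
from queue import Empty, Queue

BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))               # bigger for analytics
MAX_BATCH_WAIT_MS = int(os.getenv("MAX_BATCH_WAIT_MS", "5"))  # smaller for user-facing traffic

def collect_batch(queue: Queue):
    # Gather up to BATCH_SIZE items, but never hold the first request
    # longer than MAX_BATCH_WAIT_MS.
    batch = [queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_BATCH_WAIT_MS / 1000
    while len(batch) < BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(queue.get(timeout=remaining))
        except Empty:
            break
    return batch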

Conclusion

By weaving together microservices, GPU acceleration, and a streaming-first ETL architecture, we transformed our old monolith into a high-octane sentiment pipeline that laughs off 50K RPS. It's not just about speed: batch inference keeps resource waste to a minimum, and flexible ETL pipelines let us adapt to surging data volumes in real time. Gone are the days of over-provisioning or re-deploying everything just to fix a single inference bug. With a robust containerized approach, each service scales on its own terms, keeping the entire stack lean, reliable, and ready for the next traffic spike. If you've been feeling the pinch of a bogged-down monolith, now's the time to rev the engine with microservices and real-time data flows.